[PATCH] serial: sh-sci: remove obsolete latency workaround
Since the transition to hrtimers there is no more need to set a minimum
RX timeout to work around latency issues.

Signed-off-by: Ulrich Hecht
---
 drivers/tty/serial/sh-sci.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
index e3af97a59856..ef37fdf37612 100644
--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -2609,21 +2609,10 @@ static void sci_set_termios(struct uart_port *port, struct ktermios *termios,
 		udelay(DIV_ROUND_UP(10 * 100, baud));
 	}
 
-	/*
-	 * Calculate delay for 2 DMA buffers (4 FIFO).
-	 * See serial_core.c::uart_update_timeout().
-	 * With 10 bits (CS8), 250Hz, 115200 baud and 64 bytes FIFO, the above
-	 * function calculates 1 jiffie for the data plus 5 jiffies for the
-	 * "slop(e)." Then below we calculate 5 jiffies (20ms) for 2 DMA
-	 * buffers (4 FIFO sizes), but when performing a faster transfer, the
-	 * value obtained by this formula is too small. Therefore, if the value
-	 * is smaller than 20ms, use 20ms as the timeout value for DMA.
-	 */
+	/* Calculate delay for 2 DMA buffers (4 FIFO). */
 	s->rx_frame = (1 * bits) / (baud / 100);
 #ifdef CONFIG_SERIAL_SH_SCI_DMA
 	s->rx_timeout = s->buf_len_rx * 2 * s->rx_frame;
-	if (s->rx_timeout < 20)
-		s->rx_timeout = 20;
 #endif
 
 	if ((termios->c_cflag & CREAD) != 0)
-- 
2.20.1
Re: [EXT] [systemd-devel] [SPECIFICATION RFC] The firmware and bootloader log specification
>>> Daniel Kiper wrote on 14.11.2020 at 00:52 in message
<20201113235242.k6fzlwmwm2xqh...@tomti.i.net-space.pl>:
...
> The members of struct bf_log_msg:
> - size: total size of bf_log_msg struct,
> - ts_nsec: timestamp expressed in nanoseconds starting from 0,

Who or what defines t == 0?

...

Regards,
Ulrich Windl
FYI: PoC: Running 100000 processes in 5.3.18 (SLES15 SP2)
Hi!

Just in case someone is interested: as a proof of concept I started 100
thousand processes on a big machine (72 cores). It worked! However,
starting them took more than 30 minutes, and top needs more than 30
minutes to refresh its display. Still, interactive input via SSH works
nicely, but any file-system access seems quite slow (my test processes
just use CPU; they do not do any I/O).

Kernel messages while the processes were created:
kernel: [65648.247688] perf: interrupt took too long (2516 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
kernel: [65997.263218] perf: interrupt took too long (3146 > 3145), lowering kernel.perf_event_max_sample_rate to 63500
kernel: [66790.221057] perf: interrupt took too long (3938 > 3932), lowering kernel.perf_event_max_sample_rate to 50750
kernel: [69884.371426] perf: interrupt took too long (4925 > 4922), lowering kernel.perf_event_max_sample_rate to 40500

Last top output (more than 30 minutes late):
top - 11:16:19 up 19:19, 3 users, load average: 64164.56, 62997.24, 55597.09
Tasks: 101432 total, 60249 running, 41183 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.0 us, 2.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 772127.6+total, 755924.2+free, 14253.01+used, 1950.363 buff/cache
MiB Swap: 773120.0+total, 772958.1+free, 161.816 used. 754248.8+avail Mem
...

That's a load, isn't it? ;-)

# cat /proc/uptime
72084.21 9356423.41
# cat /proc/loadavg
64188.31 64188.81 63636.08 64228/102328 134935

Regards,
Ulrich
Re: [PATCH v3] pinctrl: renesas: r8a7790: Add VIN1-B and VIN2-G pins, groups and functions
Hi! Thanks for the patch. Verified it against data sheet, the magic numbers check out. Reviewed-by: Ulrich Hecht CU Uli > On 09/17/2020 9:59 PM Lad Prabhakar > wrote: > > > Add pins, groups and functions for the VIN1-B [data/sync/field/clkenb/clk] > and VIN2-G8. > > Signed-off-by: Lad Prabhakar > Reviewed-by: Biju Das > --- > v2->v3: > * Included vin1_data4_b, field_b and clkenb_b > * Renamed vin2_data8g to vin2_g8 > * Rebased patch on latest changes > > v1->v2: > * Added complete list of VIN1-B pins > * Renamed vin2_data8_g to vin2_data8g > * Sorted vin1_sync_b pins > > v1 - https://patchwork.kernel.org/patch/11761191/ > --- > drivers/pinctrl/renesas/pfc-r8a7790.c | 132 +- > 1 file changed, 131 insertions(+), 1 deletion(-) > > diff --git a/drivers/pinctrl/renesas/pfc-r8a7790.c > b/drivers/pinctrl/renesas/pfc-r8a7790.c > index 60f973c5dffe..3f48d3d879f7 100644 > --- a/drivers/pinctrl/renesas/pfc-r8a7790.c > +++ b/drivers/pinctrl/renesas/pfc-r8a7790.c > @@ -3866,6 +3866,72 @@ static const unsigned int vin1_data18_mux[] = { > VI1_R4_MARK, VI1_R5_MARK, > VI1_R6_MARK, VI1_R7_MARK, > }; > +static const union vin_data vin1_data_b_pins = { > + .data24 = { > + /* B */ > + RCAR_GP_PIN(3, 0), RCAR_GP_PIN(3, 1), > + RCAR_GP_PIN(3, 2), RCAR_GP_PIN(3, 3), > + RCAR_GP_PIN(3, 4), RCAR_GP_PIN(3, 5), > + RCAR_GP_PIN(3, 6), RCAR_GP_PIN(3, 7), > + /* G */ > + RCAR_GP_PIN(1, 14), RCAR_GP_PIN(1, 15), > + RCAR_GP_PIN(1, 17), RCAR_GP_PIN(1, 20), > + RCAR_GP_PIN(1, 22), RCAR_GP_PIN(1, 12), > + RCAR_GP_PIN(1, 9), RCAR_GP_PIN(1, 7), > + /* R */ > + RCAR_GP_PIN(0, 27), RCAR_GP_PIN(0, 28), > + RCAR_GP_PIN(0, 29), RCAR_GP_PIN(1, 4), > + RCAR_GP_PIN(1, 5), RCAR_GP_PIN(1, 6), > + RCAR_GP_PIN(1, 10), RCAR_GP_PIN(1, 8), > + }, > +}; > +static const union vin_data vin1_data_b_mux = { > + .data24 = { > + /* B */ > + VI1_DATA0_VI1_B0_B_MARK, VI1_DATA1_VI1_B1_B_MARK, > + VI1_DATA2_VI1_B2_B_MARK, VI1_DATA3_VI1_B3_B_MARK, > + VI1_DATA4_VI1_B4_B_MARK, VI1_DATA5_VI1_B5_B_MARK, > + VI1_DATA6_VI1_B6_B_MARK, 
VI1_DATA7_VI1_B7_B_MARK, > + /* G */ > + VI1_G0_B_MARK, VI1_G1_B_MARK, > + VI1_G2_B_MARK, VI1_G3_B_MARK, > + VI1_G4_B_MARK, VI1_G5_B_MARK, > + VI1_G6_B_MARK, VI1_G7_B_MARK, > + /* R */ > + VI1_R0_B_MARK, VI1_R1_B_MARK, > + VI1_R2_B_MARK, VI1_R3_B_MARK, > + VI1_R4_B_MARK, VI1_R5_B_MARK, > + VI1_R6_B_MARK, VI1_R7_B_MARK, > + }, > +}; > +static const unsigned int vin1_data18_b_pins[] = { > + /* B */ > + RCAR_GP_PIN(3, 2), RCAR_GP_PIN(3, 3), > + RCAR_GP_PIN(3, 4), RCAR_GP_PIN(3, 5), > + RCAR_GP_PIN(3, 6), RCAR_GP_PIN(3, 7), > + /* G */ > + RCAR_GP_PIN(1, 17), RCAR_GP_PIN(1, 20), > + RCAR_GP_PIN(1, 22), RCAR_GP_PIN(1, 12), > + RCAR_GP_PIN(1, 9), RCAR_GP_PIN(1, 7), > + /* R */ > + RCAR_GP_PIN(0, 29), RCAR_GP_PIN(1, 4), > + RCAR_GP_PIN(1, 5), RCAR_GP_PIN(1, 6), > + RCAR_GP_PIN(1, 10), RCAR_GP_PIN(1, 8), > +}; > +static const unsigned int vin1_data18_b_mux[] = { > + /* B */ > + VI1_DATA2_VI1_B2_B_MARK, VI1_DATA3_VI1_B3_B_MARK, > + VI1_DATA4_VI1_B4_B_MARK, VI1_DATA5_VI1_B5_B_MARK, > + VI1_DATA6_VI1_B6_B_MARK, VI1_DATA7_VI1_B7_B_MARK, > + /* G */ > + VI1_G2_B_MARK, VI1_G3_B_MARK, > + VI1_G4_B_MARK, VI1_G5_B_MARK, > + VI1_G6_B_MARK, VI1_G7_B_MARK, > + /* R */ > + VI1_R2_B_MARK, VI1_R3_B_MARK, > + VI1_R4_B_MARK, VI1_R5_B_MARK, > + VI1_R6_B_MARK, VI1_R7_B_MARK, > +}; > static const unsigned int vin1_sync_pins[] = { > RCAR_GP_PIN(1, 24), /* HSYNC */ > RCAR_GP_PIN(1, 25), /* VSYNC */ > @@ -3874,24 +3940,50 @@ static const unsigned int vin1_sync_mux[] = { > VI1_HSYNC_N_MARK, > VI1_VSYNC_N_MARK, > }; > +static const unsigned int vin1_sync_b_pins[] = { > + RCAR_GP_PIN(1, 24), /* HSYNC */ > + RCAR_GP_PIN(1, 25), /* VSYNC */ > +}; > +static const unsigned int vin1_sync_b_mux[] = { > + VI1_HSYNC_N_B_MARK, > + VI1_VSYNC_N_B_MARK, > +}; > static const unsigned int vin1_field_pins[] = { > RCAR_GP_PIN(1, 13), > }; > static const unsigned int vin1_field_mux[] = { > VI1_FIELD_MARK, > }; > +static const unsigned int vin1_field_b_pins[] = { > + RCAR_GP_PIN(1, 13), > +}; > +static const 
unsigned int vin1_fie
heads up: clock_gettime(CLOCK_MONOTONIC, ...) is not monotonic with Xen PVM
			.tv_nsec);
	printf("%s: largest delta is 0.%09ld\n", __func__,
	       deltas[CLOCK_SAMPLES - 2].tv_nsec);
	if ( deltas[u].tv_nsec > tsp->tv_nsec )
		tsp->tv_nsec = deltas[0].tv_nsec;
 leave:
	return(result);
}

/* main */
int main(int argc, char *argv[])
{
	int result = 0;
	ts_tres;

	get_res();
	printf("resolution is 0.%09ld\n", res.tv_nsec);
	return(result);
}

- it is intentional that the program aborts on error

Output from a newer machine:
get_res: resolution is 0.1
get_res: smallest delta is 0.00030
get_res: largest delta is 0.00050
resolution is 0.00030

Regards,
Ulrich Windl
(Keep me on CC if I should read your replies)
Q: Getting consistent CPU status rates from /proc/stat
#!/usr/bin/perl
# Check consistency of CPU states in /proc/stat
# written for SLES12 SP5 and PERL 5.18 by Ulrich Windl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET);

my ($rdiff, $rcount, $reading, @readings) = (0, 0, 0, [], []);
my @all_cpu;

if (open(my $fh, '<', my $file = '/proc/stat')) {
    do {
        my $rref = $readings[$reading];

        seek($fh, 0, SEEK_SET) or die "seek() failed SEEK_SET to 0: $!\n";
        while (<$fh>) {
            my @vals;

            if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(@vals, $^N) }))+$/) {
                $cpu = length($cpu) > 0 ? $cpu + 1 : 0;
                $rref->[$cpu] = \@vals;
            }
        }
        if (++$rcount > 1) {
            my ($old, $new) = ($readings[($reading + 1) % 2],
                               $readings[$reading]);

            $rdiff = 0;
            for (my $i = $#$rref; $i >= 0; --$i) {
                my ($ocr, $ncr) = ($old->[$i], $new->[$i]);

                for (my $j = $#$ncr; $j >= 0; --$j) {
                    if ((my $delta = $ncr->[$j] - $ocr->[$j]) != 0) {
                        print "reading $rcount: cpu #$i stat #$j has delta " .
                            "$delta\n";
                        $rdiff += abs($delta);
                    }
                }
            }
        }
        $reading = ($reading + 1) % 2;
    } while ($rcount < 2 || $rdiff > 0);
    @all_cpu = @{$readings[$reading]};          # consistent reading
    close($fh);
    if ((my $n = scalar(@all_cpu)) >= 2) {
        my ($sref, $cref);

        $sref = $all_cpu[0];
        $cref = $all_cpu[$n] = [ map { 0 } (0 .. $#$sref) ];
        for (my $i = 1; $i < $n; ++$i) {
            my $iref = $all_cpu[$i];

            for (my $j = $#$cref; $j >= 0; --$j) {
                $cref->[$j] += $iref->[$j];
            }
        }
        for (my $j = $#$sref; $j >= 0; --$j) {
            if ((my $delta = $sref->[$j] - $cref->[$j]) != 0) {
                print "mismatch for cpu stat #$j: $sref->[$j] - $cref->[$j] " .
                    "is $delta\n";
            }
        }
    } else {
        print "No CPU?\n";
    }
} else {
    die "$file: $!\n";
}
--EOF--

Regards,
Ulrich Windl
Re: [BOOTLOADER SPECIFICATION RFC] The bootloader log format for TrenchBoot and others
On Fri, 29 May 2020 13:27:35 +0200 Daniel Kiper wrote:
> Below you can find my rough idea of the bootloader log format which is
> generic thing but initially will be used for TrenchBoot work. I
> discussed this proposal with Ross and Daniel S. So, the idea went
> through initial sanitization. Now I would like to take feedback from
> other folks too. So, please take a look and complain...
>
> In general we want to pass the messages produced by the bootloader to
> the OS kernel and finally to the userspace for further processing and
> analysis. Below is the description of the structures which will be
> used for this thing.

This should mention that this is about having one contiguous block of
memory which contains a struct bootloader_log.

> struct bootloader_log_msgs

Why the plural with the trailing letter s in the struct name? This
looks like a single message to me, and should thus be singular without
a trailing letter s.

> {
>   grub_uint32_t level;
>   grub_uint32_t facility;
>   char type[];
>   char msg[];
> }

How would this work? How could a compiler know where msg starts? gcc
just reports "error: flexible array member not at end of struct" here,
as there is no way of knowing.

Where are the sizes of type[] and msg[] defined? Only implicitly, by
just having two NUL terminated strings right after each other? That
would mean you need to change the struct to

struct bootloader_log_msg
{
  grub_uint32_t level;
  grub_uint32_t facility;
  char strings[];
}

and have anyone parsing this structure walk through all characters in
strings[] looking for NULs to eventually find out where the msg string
actually starts. This looks at least a bit ugly to me, unless you
absolutely need to save the last bit of code and data memory in the
bootloader.

To help find where the msg actually starts without needing to look for
a NUL character, you could add a struct member showing where the msg
string actually starts within strings[]:

struct bootloader_log_msg
{
  grub_uint32_t level;
  grub_uint32_t facility;
  grub_uint32_t msg_start;
  char strings[];
}

This msg_start value could be defined as a character offset into
strings[]. Then accessing msg->strings or &msg->strings[0] would
access the type string, and &msg->strings[msg->msg_start] would access
the message:

struct bootloader_log_msg *msg = ...;
printf("msg type: %s\n", &msg->strings[0]);
printf("msg: %s\n", &msg->strings[msg->msg_start]);

For quick parsing of the messages, having a "grub_uint32_t size" struct
member to quickly skip the current struct bootloader_log_msg completely
and jump directly to the next struct would be helpful, regardless of
how the strings are actually stored:

struct bootloader_log_msg
{
  grub_uint32_t level;
  grub_uint32_t facility;
  grub_uint32_t size;
  [ ... string storage ... ]
}

> struct bootloader_log
> {
>   grub_uint32_t version;
>   grub_uint32_t producer;
>   grub_uint32_t size;
>   grub_uint32_t next_off;
>   bootloader_log_msgs msgs[];
> }
>
> The members of struct bootloader_log:
> - version: the bootloader log format version number, 1 for now,
> - producer: the producer/bootloader type; we can steal some values
>   from linux/Documentation/x86/boot.rst:type_of_loader,
> - size: total size of the log buffer including the bootloader_log
>   struct,

"size in bytes"?

> - next_off: offset in bytes, from start of the bootloader_log
>   struct, of the next byte after the last log message in the msgs[];
>   i.e. the offset of the next available log message slot,

It appears to me that next_off is only interesting for code which
appends log messages to that structure. For reading such a struct,
next_off is useless and could thus be a private variable inside that
bootloader which is not passed on to a next stage bootloader or an OS
kernel.

So specifying next_off as something other than a "reserved" uint32_t is
for when you have a chain of bootloaders which all append messages to
that buffer, and you want to avoid all the bootloaders having to parse
the previous bootloader's messages in order to find out where to append
new ones. If that is the intention, this procedure should at least be
mentioned somewhere.

I am also missing any mention of memory alignment. With the number of
CPUs in the field which cannot directly read misaligned words
increasing, specifying a 4 byte or 8 byte alignment for these
structures could significantly reduce the code complexity for such CPUs
at the cost of a few bytes of memory.

And while on the topic of memory layout: with all these uint32_t
values, is this only for a 32bit protocol, or will this remain 32bit
even for otherwise 64bit code, or what is the plan here?

> - msgs: the array of log messages.
>
> The members of struct bootloader_log_msgs:
> - level: similar to syslog meaning; can be used to differentiate
>   normal messages from debug messages; exact
Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages
On Tue, May 05, 2020 at 04:03:00PM +0200, Christian Borntraeger wrote:
> > Just looked at
> > commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 mm: add 'try_get_page()' helper function
> >
> > which says:
> >     Also like 'get_page()', you can't use this function unless you already
> >     had a reference to the page. The intent is that you can use this
> >     exactly like get_page(), but in situations where you want to limit the
> >     maximum reference count.
> >
> >     The code currently does an unconditional WARN_ON_ONCE() if we ever hit
> >     the reference count issues (either zero or negative), as a notification
> >     that the conditional non-increment actually happened.
> >
> > If try_get_page must not be called with an existing reference, that means

s/not//

> > that when we call it the page reference is already higher and our freeze
> > will never succeed. That would imply that we cannot trigger this. No?

Well, my understanding is that the "existing" reference may be one of
the references that is expected by our freeze code; in particular in gup
the existing reference is simply the one from the pte. So in this case
our freeze *would* succeed.

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU/Linux compilers and toolchain
ulrich.weig...@de.ibm.com
Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages
On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
> On 5/4/20 6:41 AM, Ulrich Weigand wrote:
> > You're right that there is no mechanism to prevent new references,
> > but that's really never been the goal either. We're simply trying
> > to ensure that no I/O is ever done on a page that is in the "secure"
> > (or inaccessible) state. To do so, we rely on the assumption that
> > all code that starts I/O on a page cache page will *first*:
> > - mark the page as pending I/O by either taking an extra page
> >   count, or by setting the Writeback flag; then:
> > - call arch_make_page_accessible(); then:
> > - start I/O; and only after I/O has finished:
> > - remove the "pending I/O" marker (Writeback and/or extra ref)
>
> Let's ignore writeback for a moment because get_page() is the more
> general case. The locking sequence is:
>
> 1. get_page() (or equivalent) "locks out" a page from converting to
>    inaccessible,
> 2. followed by a make_page_accessible() guarantees that the page
>    *stays* accessible until
> 3. I/O is safe in this region
> 4. put_page(), removes the "lock out", I/O now unsafe

Yes, exactly.

> The key is, though, the get_page() must happen before
> make_page_accessible() and *every* place that acquires a new reference
> needs a make_page_accessible().

Well, sort of: every place that acquires a new reference *and then
performs I/O* needs a make_page_accessible(). There seem to be a lot
of plain get_page() calls that aren't related to I/O.

> try_get_page() is obviously one of those "new reference sites" and it
> only has one call site outside of the gup code:
> generic_pipe_buf_get(), which is effectively patched by the patch that
> started this thread. The fact that this one oddball site _and_ gup
> are patched now is a good sign.
>
> > *But*, I still don't know how that could work nicely:
> >
> > static inline __must_check bool try_get_page(struct page *page)
> > {
> >     page = compound_head(page);
> >     if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> >         return false;
> >     page_ref_inc(page);
> >     return true;
> > }
>
> If try_get_page() collides with a freeze_page_refs(), it'll hit the
> WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
> that warning is _actually_ valid since freeze_page_refs() isn't truly
> a 0 refcount. But, the fact that this hasn't been encountered means
> that the testing here is potentially lacking.

This is indeed interesting. In particular if you compare try_get_page
with try_get_compound_head in gup.c, which does instead:

	if (WARN_ON_ONCE(page_ref_count(head) < 0))
		return NULL;

which seems more reasonable to me, given the presence of the
page_ref_freeze method. So I'm not sure why try_get_page has <= 0.

I think I understand why we haven't seen this in testing: all the
places in gup.c where try_get_page is called hold the pte lock; and in
the usual case, the pte lock itself already suffices to lock out
make_secure_pte before it even tries to use page_ref_freeze. (The
intent of holding the pte lock there was really to ensure that the PTE
entry is and remains valid throughout the execution of the ultravisor
call, which will look at the PTE entry.)

However, I guess we could construct cases where the pte lock doesn't
suffice to prevent the try_get_page warning: if we create a shared
mapping of the secure guest backing store file into a second process.
That doesn't ever happen in normal qemu operation, so that's likely why
we haven't seen that case.

> > We thought we had identified all places where we needed to place
> > arch_make_page_accessible so that the above assumption is satisfied.
> > You've found at least two instances where this wasn't true (thanks!);
> > but I still think that this can be fixed by just adding those calls.
>
> Why do you think that's the extent of the problem? Because the
> crashes stopped?
>
> I'd feel a lot more comfortable if you explained the audits that
> you've performed or _why_ you think that. What I've heard thus far is
> basically that you've been able to boot a guest and you're ready to
> ship this code.

Not sure if you can really call this an "audit", but we were really
coming from identifying places where I/O can happen on a page cache
page, and everything we found (except writeback) went through gup. We
obviously missed the sendfile case here; not sure what the best way
would be to verify nothing else was missed.

> But, with regular RCU, you're right, it _does_ appear that it would
> hit that retry loop, but then it would *succeed* in getting a
> reference. In the end, this just supports the seq
Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages
On Fri, May 01, 2020 at 09:32:45AM -0700, Dave Hansen wrote:
> The larger point, though, is that the s390 code ensures no extra
> references exist upon entering make_secure_pte(), but it still has no
> mechanism to prevent future, new references to page cache pages from
> being created.

Hi Dave, I worked with Claudio and Christian on the initial design of
our approach, so let me chime in here as well.

You're right that there is no mechanism to prevent new references, but
that's really never been the goal either. We're simply trying to
ensure that no I/O is ever done on a page that is in the "secure" (or
inaccessible) state. To do so, we rely on the assumption that all code
that starts I/O on a page cache page will *first*:
- mark the page as pending I/O by either taking an extra page count,
  or by setting the Writeback flag; then:
- call arch_make_page_accessible(); then:
- start I/O; and only after I/O has finished:
- remove the "pending I/O" marker (Writeback and/or extra ref)

We thought we had identified all places where we needed to place
arch_make_page_accessible so that the above assumption is satisfied.
You've found at least two instances where this wasn't true (thanks!);
but I still think that this can be fixed by just adding those calls.

Now, if the above assumption holds, then I believe we're safe:
- before we make any page secure, we verify that it is not "pending
  I/O" as defined above (neither the Writeback flag, nor an extra page
  count)
- *during* the process of making the page secure, we're protected
  against any potential races due to changes in that status, since we
  hold the page lock (and therefore the Writeback flag cannot change),
  and we've frozen page references (so those cannot change).
This implies that before I/O has started, the page was made accessible;
and as long as the page is marked "pending I/O" it will not be made
inaccessible again.

> The one existing user of expected_page_refs() freezes the refs then
> *removes* the page from the page cache (that's what the xas_lock_irq()
> is for). That stops *new* refs from being acquired.
>
> The s390 code is missing an equivalent mechanism.
>
> One example:
>
> 	page_freeze_refs();
> 	// page->_count==0 now
> 		find_get_page();
> 		// ^ sees a "freed" page
> 	page_unfreeze_refs();
>
> find_get_page() will either fail to *find* the page because it will
> see page->_refcount==0 and think it is freed (not great), or it will
> VM_BUG_ON_PAGE() in __page_cache_add_speculative().

I don't really see how that could happen; my understanding is that
page_freeze_refs simply causes potential users to spin and wait until
it is no longer frozen. For example, find_get_page will in the end
call down to find_get_entry, which does:

	if (!page_cache_get_speculative(page))
		goto repeat;

Am I misunderstanding anything here?

> My bigger point is that this patch doesn't systematically stop
> finding page cache pages that are arch-inaccessible. This patch hits
> *one* of those sites.

As I said above, that wasn't really the goal for our approach. In
particular, note that we *must* have secure pages present in the page
table of the secure guest (that is a requirement of the architecture;
note that the "secure" status doesn't just apply to the physical page,
but to a triple of "*this* host physical page is the secure backing
store of *this* guest physical page in *this* secure guest", which the
HW/FW tracks based on the specific page table entry). As a consequence,
the page really also has to remain present in the page cache (I don't
think Linux mm code would be able to handle the case where a
file-backed page is in the page table but not the page cache).

I'm not sure what exactly the requirements for your use case are; if
those are significantly different, maybe we can work together to find
an approach that works for both?

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU/Linux compilers and toolchain
ulrich.weig...@de.ibm.com
Issue with reading sysfs files
es/virtual/net/lo/name_assign_type: Invalid argument
/sys/devices/virtual/net/lo/speed

Found in 4.12.14-95.6-default of SLES SP4 (x86_64)

Regards,
Ulrich
Re: [PATCH v2] i2c: i2c-mt65xx: fix NULL ptr dereference
> On September 30, 2019 at 5:28 PM Fabien Parent wrote:
>
> Since commit abf4923e97c3 ("i2c: mediatek: disable zero-length transfers
> for mt8183"), there is a NULL pointer dereference for all the SoCs
> that don't have any quirk. mtk_i2c_functionality is not checking that
> the quirks pointer is not NULL before starting to use it.
>
> This commit adds a call to i2c_check_quirks which will check whether
> the quirks pointer is set, and if so will check if the IP has the
> NO_ZERO_LEN quirk.
>
> Fixes: abf4923e97c3 ("i2c: mediatek: disable zero-length transfers for
> mt8183")
> Signed-off-by: Fabien Parent

Thank you! Tested successfully on Acer R13 Chromebook (mt8173).

Tested-by: Ulrich Hecht

CU
Uli
Re: [PATCH] thermal: rcar_gen3_thermal: Use devm_add_action_or_reset() helper
> On July 31, 2019 at 2:50 PM Geert Uytterhoeven > wrote: > > > Use the devm_add_action_or_reset() helper instead of open-coding the > same operations. > > Signed-off-by: Geert Uytterhoeven > --- > drivers/thermal/rcar_gen3_thermal.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/thermal/rcar_gen3_thermal.c > b/drivers/thermal/rcar_gen3_thermal.c > index a56463308694e937..2db7e7f8baf939fd 100644 > --- a/drivers/thermal/rcar_gen3_thermal.c > +++ b/drivers/thermal/rcar_gen3_thermal.c > @@ -443,11 +443,10 @@ static int rcar_gen3_thermal_probe(struct > platform_device *pdev) > if (ret) > goto error_unregister; > > - ret = devm_add_action(dev, rcar_gen3_hwmon_action, zone); > - if (ret) { > - rcar_gen3_hwmon_action(zone); > + ret = devm_add_action_or_reset(dev, rcar_gen3_hwmon_action, > +zone); > + if (ret) > goto error_unregister; > - } > > ret = of_thermal_get_ntrips(tsc->zone); > if (ret < 0) > -- > 2.17.1 > Reviewed-by: Ulrich Hecht CU Uli
Re: [PATCH] serial: sh-sci: Use DEVICE_ATTR_RW() for rx_fifo_trigger
> On July 31, 2019 at 2:45 PM Geert Uytterhoeven > wrote: > > > While commit b6b996b6cdeecf7e ("treewide: Use DEVICE_ATTR_RW") converted > the rx_fifo_timeout attribute, it forgot to convert rx_fifo_trigger due > to a slightly different function naming. > > Signed-off-by: Geert Uytterhoeven > --- > drivers/tty/serial/sh-sci.c | 14 ++ > 1 file changed, 6 insertions(+), 8 deletions(-) > > diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c > index d18c680aa64b3427..57638175639e0f3f 100644 > --- a/drivers/tty/serial/sh-sci.c > +++ b/drivers/tty/serial/sh-sci.c > @@ -1092,9 +1092,8 @@ static void rx_fifo_timer_fn(struct timer_list *t) > scif_set_rtrg(port, 1); > } > > -static ssize_t rx_trigger_show(struct device *dev, > -struct device_attribute *attr, > -char *buf) > +static ssize_t rx_fifo_trigger_show(struct device *dev, > + struct device_attribute *attr, char *buf) > { > struct uart_port *port = dev_get_drvdata(dev); > struct sci_port *sci = to_sci_port(port); > @@ -1102,10 +1101,9 @@ static ssize_t rx_trigger_show(struct device *dev, > return sprintf(buf, "%d\n", sci->rx_trigger); > } > > -static ssize_t rx_trigger_store(struct device *dev, > - struct device_attribute *attr, > - const char *buf, > - size_t count) > +static ssize_t rx_fifo_trigger_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > { > struct uart_port *port = dev_get_drvdata(dev); > struct sci_port *sci = to_sci_port(port); > @@ -1123,7 +1121,7 @@ static ssize_t rx_trigger_store(struct device *dev, > return count; > } > > -static DEVICE_ATTR(rx_fifo_trigger, 0644, rx_trigger_show, rx_trigger_store); > +static DEVICE_ATTR_RW(rx_fifo_trigger); > > static ssize_t rx_fifo_timeout_show(struct device *dev, > struct device_attribute *attr, > -- > 2.17.1 > Reviewed-by: Ulrich Hecht CU Uli
Re: [PATCH v2 17/19] drm: rcar-du: crtc: Enable and disable CMMs
> On July 6, 2019 at 4:07 PM Jacopo Mondi wrote: > > > Enable/disable the CMM associated with a CRTC at crtc start and stop > time and enable the CMM unit through the Display Extensional Functions > register at group setup time. > > Signed-off-by: Jacopo Mondi > --- > drivers/gpu/drm/rcar-du/rcar_du_crtc.c | 7 +++ > drivers/gpu/drm/rcar-du/rcar_du_group.c | 8 > drivers/gpu/drm/rcar-du/rcar_du_regs.h | 5 + > 3 files changed, 20 insertions(+) > > diff --git a/drivers/gpu/drm/rcar-du/rcar_du_crtc.c > b/drivers/gpu/drm/rcar-du/rcar_du_crtc.c > index 23f1d6cc1719..3dac605c3a67 100644 > --- a/drivers/gpu/drm/rcar-du/rcar_du_crtc.c > +++ b/drivers/gpu/drm/rcar-du/rcar_du_crtc.c > @@ -21,6 +21,7 @@ > #include > #include > > +#include "rcar_cmm.h" > #include "rcar_du_crtc.h" > #include "rcar_du_drv.h" > #include "rcar_du_encoder.h" > @@ -619,6 +620,9 @@ static void rcar_du_crtc_stop(struct rcar_du_crtc *rcrtc) > if (rcar_du_has(rcrtc->dev, RCAR_DU_FEATURE_VSP1_SOURCE)) > rcar_du_vsp_disable(rcrtc); > > + if (rcrtc->cmm) > + rcar_cmm_disable(rcrtc->cmm); > + > /* >* Select switch sync mode. This stops display operation and configures >* the HSYNC and VSYNC signals as inputs. > @@ -686,6 +690,9 @@ static void rcar_du_crtc_atomic_enable(struct drm_crtc > *crtc, > } > > rcar_du_crtc_start(rcrtc); > + > + if (rcrtc->cmm) > + rcar_cmm_enable(rcrtc->cmm); > } > > static void rcar_du_crtc_atomic_disable(struct drm_crtc *crtc, > diff --git a/drivers/gpu/drm/rcar-du/rcar_du_group.c > b/drivers/gpu/drm/rcar-du/rcar_du_group.c > index 9eee47969e77..d252c9bb9809 100644 > --- a/drivers/gpu/drm/rcar-du/rcar_du_group.c > +++ b/drivers/gpu/drm/rcar-du/rcar_du_group.c > @@ -147,6 +147,14 @@ static void rcar_du_group_setup(struct rcar_du_group > *rgrp) > > rcar_du_group_setup_pins(rgrp); > > + if (rcar_du_has(rcdu, RCAR_DU_FEATURE_CMM)) { > + u32 defr7 = DEFR7_CODE | > + (rgrp->cmms_mask & BIT(1) ? DEFR7_CMME1 : 0) | > + (rgrp->cmms_mask & BIT(0) ? 
DEFR7_CMME0 : 0); > + > + rcar_du_group_write(rgrp, DEFR7, defr7); > + } > + > if (rcdu->info->gen >= 2) { > rcar_du_group_setup_defr8(rgrp); > rcar_du_group_setup_didsr(rgrp); > diff --git a/drivers/gpu/drm/rcar-du/rcar_du_regs.h > b/drivers/gpu/drm/rcar-du/rcar_du_regs.h > index bc87f080b170..fb9964949368 100644 > --- a/drivers/gpu/drm/rcar-du/rcar_du_regs.h > +++ b/drivers/gpu/drm/rcar-du/rcar_du_regs.h > @@ -197,6 +197,11 @@ > #define DEFR6_MLOS1 (1 << 2) > #define DEFR6_DEFAULT(DEFR6_CODE | DEFR6_TCNE1) > > +#define DEFR70x000ec > +#define DEFR7_CODE (0x7779 << 16) > +#define DEFR7_CMME1 BIT(6) > +#define DEFR7_CMME0 BIT(4) > + Reviewed-by: Ulrich Hecht CU Uli
Re: [PATCH] serial: sh-sci: Fix HSCIF RX sampling point calculation
> On April 1, 2019 at 1:25 PM Geert Uytterhoeven > wrote: > > > There are several issues with the formula used for calculating the > deviation from the intended rate: > 1. While min_err and last_stop are signed, srr and baud are unsigned. > Hence the signed values are promoted to unsigned, which will lead > to a bogus value of deviation if min_err is negative, > 2. Srr is the register field value, which is one less than the actual > sampling rate factor, > 3. The divisions do not use rounding. > > Fix this by casting unsigned variables to int, adding one to srr, and > using a single DIV_ROUND_CLOSEST(). > > Fixes: 63ba1e00f178a448 ("serial: sh-sci: Support for HSCIF RX sampling point > adjustment") > Signed-off-by: Geert Uytterhoeven Reviewed-by: Ulrich Hecht CU Uli
leap seconds and 61st second in minute
Hi! I was following some discussion on the topic of leap seconds, and due to the basic role of time in the kernel, I'd like to send a "heads up" ("food for thought") with a proposal (not to start a useless discussion):

The UNIX timescale, running on UTC, had (I suppose) the idea that no timezone switches would affect it. Unfortunately leap seconds were not considered; maybe also because at that time everybody was happy if the system's time was correct to a few seconds. I don't know. However, leap seconds are conceptually a kind of "time zone switch". Unfortunately the UNIX time scale does not consider them (as noted in the time(2) manual page). That's one part of POSIX.

The other part of POSIX claims that during an inserted leap second there should be a 61st second in the minute. Unfortunately (AFAIK) there's no interface from the kernel's leap second handling to glibc that would allow it to actually return 60 as the seconds value (localtime(3) only gets a pointer to time_t).

OTOH the NTPv4 clock model includes a TAI offset (which can be updated by NTP). AFAIK the kernel has also had the timezone offset for some time, to handle RTCs that run on local time. Obviously, if the kernel knew the number of leap seconds (the correction to the time() timescale), conversion from the UNIX timescale to TAI would be easy.

So, roughly: if the kernel exported the time and type of the next scheduled leap second, some future localtime could actually return the 61st second in the minute. As it is now, applications will all see some magic duplication of the 60th second. (Maybe that's why Google does "leap smear".) If the kernel API also exported the TAI offset, one could even offer a TAI-based time, or, maybe even better: the kernel could run on TAI internally and convert to the UNIX time scale as needed. I'll leave exact specification and implementation to the really clever guys.

Regards, Ulrich

P.S.
Not subscribed to the kernel-list so if you want to talk to me keep me on CC: preferably
Re: Licensing of include/linux/hash.h
>>>>> On Mon, 11 Feb 2019, Ulrich Mueller wrote: >>>>> On Mon, 11 Feb 2019, Domenico Andreoli wrote: >> On Mon, Feb 11, 2019 at 12:08:32AM +0100, Kristian Fiskerstrand wrote: >>> It was [pointed out] by one of our license group that [hash.h] is the >>> same that has a GPL-2+ in [fio] which has a signed-off-by. >>> >>> References: >>> [pointed out] >>> https://bugs.gentoo.org/677586#c1 >>> >>> [hash.h] >>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/fio.git/commit/hash.h?id=bdc7211e190482f0c17c109a0d90834a6611be1c >> Yes, the Signed-off-by is from Jens Axboe (in CC) but he's not the >> original author, I guess he just copied the file as Arnaldo did. The >> file he committed has not any reference to the license. >>> [fio] >>> https://metadata.ftp-master.debian.org/changelogs/main/f/fio/fio_3.12-2_copyright >> I'm afraid that this entry in wrong. I'll seek confirmation with >> Martin Steigerwald. > Not sure if this will help, but hash.h originally appeared in > Linux 2.5.7. In the following commit one can see that most of its code > was moved or copied from mm/filemap.c: > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=882ad449046cec136c484dd2b3659fb4c683e0a3 > filemap.c has a copyright line by Linus, but git blame shows that > the relevant code was added in 2002 with a commit by Rik van Riel: > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e5191c50d12621e04f8bc247dcb6a0f4ad76ae26 > The relevant thread in LKML is here: > https://lkml.org/lkml/2002/2/18/1970 Sorry, that link should read: https://lkml.org/lkml/2002/2/18/197
Re: Licensing of include/linux/hash.h
>>>>> On Mon, 11 Feb 2019, Domenico Andreoli wrote: > On Mon, Feb 11, 2019 at 12:08:32AM +0100, Kristian Fiskerstrand wrote: >> It was [pointed out] by one of our license group that [hash.h] is the >> same that has a GPL-2+ in [fio] which has a signed-off-by. >> >> References: >> [pointed out] >> https://bugs.gentoo.org/677586#c1 >> >> [hash.h] >> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/fio.git/commit/hash.h?id=bdc7211e190482f0c17c109a0d90834a6611be1c > Yes, the Signed-off-by is from Jens Axboe (in CC) but he's not the > original author, I guess he just copied the file as Arnaldo did. The > file he committed has not any reference to the license. >> [fio] >> https://metadata.ftp-master.debian.org/changelogs/main/f/fio/fio_3.12-2_copyright > I'm afraid that this entry in wrong. I'll seek confirmation with > Martin Steigerwald. Not sure if this will help, but hash.h originally appeared in Linux 2.5.7. In the following commit one can see that most of its code was moved or copied from mm/filemap.c: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=882ad449046cec136c484dd2b3659fb4c683e0a3 filemap.c has a copyright line by Linus, but git blame shows that the relevant code was added in 2002 with a commit by Rik van Riel: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e5191c50d12621e04f8bc247dcb6a0f4ad76ae26 The relevant thread in LKML is here: https://lkml.org/lkml/2002/2/18/1970 Ulrich
Re: [PATCH] serial: sh-sci: Fix use-after-free on subsequent port startup
On Tue, Jul 17, 2018 at 10:58 AM, Geert Uytterhoeven wrote: > sci_request_irq() checks port->irqstr[j] for a NULL pointer, to decide > if a fallback interrupt name string should be allocated or not. > > While this string is freed during port shutdown, the pointer is not > zeroed. Hence on a subsequent startup of the port, it will still be > pointing to the freed memory, leading to e.g. > > WARNING: CPU: 0 PID: 404 at fs/proc/generic.c:388 __proc_create+0xbc/0x260 > name len 0 > > or to a crash (the latter is more likely with CONFIG_DEBUG_SLAB=y, due > to the poisoning of freed memory). > > Instead of zeroeing the pointer at multiple places, preinitialize > port->irqstr[j] to zero to fix this. > > Fixes: 8b0bbd956228ae87 ("serial: sh-sci: Add support for R7S9210") > Signed-off-by: Geert Uytterhoeven > --- > drivers/tty/serial/sh-sci.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c > index 5808edfe3f7be404..f8e53ac5c17dfb94 100644 > --- a/drivers/tty/serial/sh-sci.c > +++ b/drivers/tty/serial/sh-sci.c > @@ -1872,6 +1872,7 @@ static int sci_request_irq(struct sci_port *port) > } > > desc = sci_irq_desc + i; > + port->irqstr[j] = NULL; > if (SCIx_TEIDRI_IRQ_EXISTS(port)) { > /* > * ERI and BRI are muxed, just register ERI and > -- > 2.17.1 > Reviewed-by: Ulrich Hecht CU Uli
Re: [PATCH/RFT 6/6] i2c: recovery: remove bogus check if SDA GPIO is set to output
On Fri, Jul 13, 2018 at 11:09 PM, Wolfram Sang wrote: > This check did not work as intended. I2C is open drain, so this function > will likely always have presented the GPIO as input because > gpiod_get_direction doesn't know about open drain states. Remove this > check for now. We can add it again once we know how to get more precise > information about the GPIO. > > Signed-off-by: Wolfram Sang > --- > drivers/i2c/i2c-core-base.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/drivers/i2c/i2c-core-base.c b/drivers/i2c/i2c-core-base.c > index 301285c54603..7c5f012f561c 100644 > --- a/drivers/i2c/i2c-core-base.c > +++ b/drivers/i2c/i2c-core-base.c > @@ -261,9 +261,7 @@ static void i2c_init_recovery(struct i2c_adapter *adap) > bri->set_scl = set_scl_gpio_value; > if (bri->sda_gpiod) { > bri->get_sda = get_sda_gpio_value; > - /* FIXME: add proper flag instead of '0' once > available */ > - if (gpiod_get_direction(bri->sda_gpiod) == 0) > - bri->set_sda = set_sda_gpio_value; > + bri->set_sda = set_sda_gpio_value; > } > return; > } > -- > 2.11.0 > Reviewed-by: Ulrich Hecht CU Uli
Re: [PATCH] mux: include compiler.h from mux/consumer.h
On Mon, Jul 31, 2017 at 6:28 PM, Peter Rosin <p...@axentia.se> wrote: > On 2017-07-31 18:00, Greg Kroah-Hartman wrote: >> On Mon, Jul 31, 2017 at 12:04:35PM +0200, Peter Rosin wrote: >>> From: Ulrich Hecht <ulrich.hecht+rene...@gmail.com> >>> >>> Required for __must_check. >>> >>> Signed-off-by: Ulrich Hecht <ulrich.hecht+rene...@gmail.com> >>> Signed-off-by: Peter Rosin <p...@axentia.se> >>> --- >>> include/linux/mux/consumer.h | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> Hi Greg, >>> >>> Please pick this up for 4.14. >> >> What commit is this patch fixing? > > I don't think it's needed for any existing user, I at least do not have any > report about that. So, I don't think it's needed for stable. > > Ulrich, the way I understood this, it was needed for your serdev series [1], > and not for any existing usage. Right? That is correct. CU Uli
Antw: Re: [kernel-hardening] Re: [PATCH v4 06/13] iscsi: ensure RNG is seeded before use
>>> Jeffrey Walton <noloa...@gmail.com> schrieb am 17.06.2017 um 16:23 in >>> Nachricht <cah8yc8nhx2r9cfq0gnejaurgsfas8v16dvhv35brnln-ypr...@mail.gmail.com>: [...] > But its not clear to me how to ensure uniqueness when its based on > randomness from the generators. Even with a perfect random generator non-unique values are possible (that's why it's random). It's unlikely, but it can happen. The question is whether the probability of non-unique values from /dev/urandom is any higher than that for values read from /dev/random. One _might_ be able to predict the values from /dev/urandom. Regards, Ulrich > > Jeff > > -- > You received this message because you are subscribed to the Google Groups > "open-iscsi" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to open-iscsi+unsubscr...@googlegroups.com. > To post to this group, send email to open-is...@googlegroups.com. > Visit this group at https://groups.google.com/group/open-iscsi. > For more options, visit https://groups.google.com/d/optout.
Antw: Re: [kernel-hardening] Re: [PATCH v4 06/13] iscsi: ensure RNG is seeded before use
>>> Stephan Müller <smuel...@chronox.de> schrieb am 26.06.2017 um 19:38 in Nachricht <1678474.gnybdsl...@tauon.chronox.de>: > Am Montag, 26. Juni 2017, 03:23:09 CEST schrieb Nicholas A. Bellinger: > > Hi Nicholas, > >> Hi Stephan, Lee & Jason, >> >> (Adding target-devel CC') >> >> Apologies for coming late to the discussion. Comments below. >> >> On Sun, 2017-06-18 at 10:04 +0200, Stephan Müller wrote: >> > Am Samstag, 17. Juni 2017, 05:45:57 CEST schrieb Lee Duncan: >> > >> > Hi Lee, >> > >> > > In your testing, how long might a process have to wait? Are we talking >> > > seconds? Longer? What about timeouts? >> > >> > In current kernels (starting with 4.8) this timeout should clear within a >> > few seconds after boot. >> > >> > In older kernels (pre 4.8), my KVM takes up to 90 seconds to reach that >> > seeding point. I have heard that on IBM System Z this trigger point >> > requires minutes to be reached. >> >> I share the same concern as Lee wrt to introducing latency into the >> existing iscsi-target login sequence. >> >> Namely in the use-cases where a single node is supporting ~1K unique >> iscsi-target IQNs, and fail-over + re-balancing of IQNs where 100s of >> login attempts are expected to occur in parallel. >> >> If environments exist that require non trivial amounts of time for RNG >> seeding to be ready for iscsi-target usage, then enforcing this >> requirement at iscsi login time can open up problems, especially when >> iscsi host environments may be sensitive to login timeouts, I/O >> timeouts, et al. >> >> That said, I'd prefer to simply wait for RNG to be seeded at modprobe >> iscsi_target_mod time, instead of trying to enforce randomness during >> login. >> >> This way, those of use who have distributed storage platforms can know >> to avoid putting a node into a state to accept iscsi-target IQN export >> migration, before modprobe iscsi_target_mod has successfully loaded and >> RNG seeding has been confirmed. >> >> WDYT..? 
> > We may have a chicken and egg problem when the wait is at the modprobe time. > > Assume the modprobe happens during initramfs time to get access to the root > file system. In this case, you entire boot process will lock up for an > indefinite amount of time. The reason is that in order to obtain events > detected by the RNG, devices need to be initialized and working. Such > devices > commonly start working after the the root partition is mounted as it > contains > all drivers, all configuration information etc. > > Note, during the development of my /dev/random implementation, I added the > getrandom-like blocking behavior to /dev/urandom (which is the equivalent to > > Jason's patch except that it applies to user space). The boot process locked I thought reads from urandom never block by definition. An older manual page (man urandom) also says: "A read from the /dev/urandom device will not block waiting for more entropy." Regards, Ulrich > > up since systemd wanted data from /dev/urandom while it processed the > initramfs. As it did not get any, the boot process did not commence that > could > deliver new events to be picked up by the RNG. > > As I do not have such an iscsi system, I cannot test Jason's patch. But > maybe > the mentioned chicken-and-egg problem I mentioned above is already visible > with the current patch as it will lead to a blocking of the mounting of the > root partition in case the root partition is on an iscsi target. > > Ciao > Stephan > > -- > You received this message because you are subscribed to the Google Groups > "open-iscsi" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to open-iscsi+unsubscr...@googlegroups.com. > To post to this group, send email to open-is...@googlegroups.com. > Visit this group at https://groups.google.com/group/open-iscsi. > For more options, visit https://groups.google.com/d/optout.
Antw: Q: si_code for SIGBUS caused by mmap() write error
I'm sorry for using that address as sender in my previous message; it was an oversight! The CC: address was correct, however. You can drop above address from your replies.
Kernel lockup with Intel Iris graphics
Hi folks, maybe someone has a idea on this: https://bugzilla.opensuse.org/show_bug.cgi?id=1032832 Regards, Ulrich
Re: linux-next: build warning after merge of the tty tree
On Mon, Feb 6, 2017 at 9:50 AM, Greg KH <g...@kroah.com> wrote: > On Mon, Feb 06, 2017 at 03:16:38PM +1100, Stephen Rothwell wrote: >> Hi Greg, >> >> After merging the tty tree, today's linux-next build >> (arm_multi_v7_defconfig) produced this warning: >> >> drivers/tty/serial/sh-sci.c:977:12: warning: 'scif_set_rtrg' defined but not >> used [-Wunused-function] >> static int scif_set_rtrg(struct uart_port *port, int rx_trig) >> ^ >> >> Introduced by commit >> >> a380ed461f66 ("serial: sh-sci: implement FIFO threshold register setting") >> >> Forgot to add a call to this new function? > > I think this is fixed by a patch I just took into my tree, which isn't > in linux-next yet. Right Ulrich? That is correct, it's called in "[PATCH v4 2/4] serial: sh-sci: SCIFA/B RX FIFO software timeout". CU Uli
Antw: 3.0.101: "blk_rq_check_limits: over max size limit."
Hi once more! I managed to get the call traces of involved processes: 1) The process doing read(): Dec 7 13:51:16 h10 kernel: [183809.594434] SysRq : Show Blocked State Dec 7 13:51:16 h10 kernel: [183809.594447] taskPC stack pid father Dec 7 13:51:16 h10 kernel: [183809.594750] randio D 8801703a9d68 0 2762 53250 0x0004 Dec 7 13:51:16 h10 kernel: [183809.594758] 880100887ad8 0046 880100886010 00010900 Dec 7 13:51:16 h10 kernel: [183809.594765] 00010900 00010900 00010900 880100887fd8 Dec 7 13:51:16 h10 kernel: [183809.594772] 880100887fd8 00010900 88016bb6a280 88017670c300 Dec 7 13:51:16 h10 kernel: [183809.594778] Call Trace: Dec 7 13:51:16 h10 kernel: [183809.594806] [] io_schedule+0x9c/0xf0 Dec 7 13:51:16 h10 kernel: [183809.594820] [] __lock_page+0x93/0xc0 Dec 7 13:51:16 h10 kernel: [183809.594834] [] truncate_inode_pages_range+0x294/0x460 Dec 7 13:51:16 h10 kernel: [183809.594845] [] __blkdev_put+0x1d7/0x210 Dec 7 13:51:16 h10 kernel: [183809.594856] [] __fput+0xb3/0x200 Dec 7 13:51:16 h10 kernel: [183809.594868] [] filp_close+0x5c/0x90 Dec 7 13:51:16 h10 kernel: [183809.594880] [] put_files_struct+0x7a/0xd0 Dec 7 13:51:16 h10 kernel: [183809.594889] [] do_exit+0x1d0/0x470 Dec 7 13:51:16 h10 kernel: [183809.594897] [] do_group_exit+0x3d/0xb0 Dec 7 13:51:16 h10 kernel: [183809.594907] [] get_signal_to_deliver+0x247/0x480 Dec 7 13:51:16 h10 kernel: [183809.594919] [] do_signal+0x71/0x1b0 Dec 7 13:51:16 h10 kernel: [183809.594928] [] do_notify_resume+0x98/0xb0 Dec 7 13:51:16 h10 kernel: [183809.594940] [] int_signal+0x12/0x17 Dec 7 13:51:16 h10 kernel: [183809.594988] [<7f64f28cbba0>] 0x7f64f28cbb9f 2) The process trying to modify the queue scheduler: Dec 7 13:51:16 h10 kernel: [183809.594995] blocktune D 88014e048000 0 58867 1 0x Dec 7 13:51:16 h10 kernel: [183809.595000] 880128defd18 0086 880128dee010 00010900 Dec 7 13:51:16 h10 kernel: [183809.595007] 00010900 00010900 00010900 880128deffd8 Dec 7 13:51:16 h10 kernel: [183809.595013] 880128deffd8 00010900 
88012889a3c0 8801767bc1c0 Dec 7 13:51:16 h10 kernel: [183809.595019] Call Trace: Dec 7 13:51:16 h10 kernel: [183809.595026] [] schedule_timeout+0x1b0/0x2a0 Dec 7 13:51:16 h10 kernel: [183809.595040] [] msleep+0x1d/0x30 Dec 7 13:51:16 h10 kernel: [183809.595052] [] __blk_drain_queue+0xbc/0x140 Dec 7 13:51:16 h10 kernel: [183809.595063] [] elv_quiesce_start+0x51/0x90 Dec 7 13:51:16 h10 kernel: [183809.595071] [] elevator_switch+0x4a/0x150 Dec 7 13:51:16 h10 kernel: [183809.595079] [] elevator_change+0x6d/0xb0 Dec 7 13:51:16 h10 kernel: [183809.595086] [] elv_iosched_store+0x27/0x60 Dec 7 13:51:16 h10 kernel: [183809.595096] [] queue_attr_store+0x67/0xc0 Dec 7 13:51:16 h10 kernel: [183809.595106] [] sysfs_write_file+0xcb/0x160 Dec 7 13:51:16 h10 kernel: [183809.595115] [] vfs_write+0xce/0x140 Dec 7 13:51:16 h10 kernel: [183809.595123] [] sys_write+0x53/0xa0 Dec 7 13:51:16 h10 kernel: [183809.595131] [] system_call_fastpath+0x16/0x1b Dec 7 13:51:16 h10 kernel: [183809.595140] [<7f12b7f70c00>] 0x7f12b7f70bff 3) The process trying to read the queue scheduler: Dec 7 13:51:16 h10 kernel: [183809.595149] cat D 880147873718 0 45053 5957 0x0004 Dec 7 13:51:16 h10 kernel: [183809.595155] 880122f7be00 0082 880122f7a010 00010900 Dec 7 13:51:16 h10 kernel: [183809.595161] 00010900 00010900 00010900 880122f7bfd8 Dec 7 13:51:16 h10 kernel: [183809.595167] 880122f7bfd8 00010900 8801154ea1c0 81a11020 Dec 7 13:51:16 h10 kernel: [183809.595174] Call Trace: Dec 7 13:51:16 h10 kernel: [183809.595181] [] __mutex_lock_slowpath+0x160/0x1f0 Dec 7 13:51:16 h10 kernel: [183809.595189] [] mutex_lock+0x1a/0x40 Dec 7 13:51:16 h10 kernel: [183809.595196] [] queue_attr_show+0x49/0xb0 Dec 7 13:51:16 h10 kernel: [183809.595203] [] sysfs_read_file+0xfe/0x1c0 Dec 7 13:51:16 h10 kernel: [183809.595212] [] vfs_read+0xc7/0x130 Dec 7 13:51:16 h10 kernel: [183809.595219] [] sys_read+0x53/0xa0 Dec 7 13:51:16 h10 kernel: [183809.595226] [] system_call_fastpath+0x16/0x1b Dec 7 13:51:16 h10 kernel: 
[183809.595235] [<7fb04560dba0>] 0x7fb04560db9f >>> Ulrich Windl schrieb am 07.12.2016 um 13:23 in Nachricht <5847FF5E.7E4 : >>> 161 : 60728>: > Hi again! > > An addition: Processes doing such I/O seem to be unkillable, and I also > cannot change the queue parameters while this problem occurs (the process > trying to write (e.g.: to queue/scheduler) is also blocked. The process > status of the process doing I/O looks like this: > # cat /proc/2762/
Antw: 3.0.101: "blk_rq_check_limits: over max size limit."
Hi again!

An addition: Processes doing such I/O seem to be unkillable, and I also cannot change the queue parameters while this problem occurs (the process trying to write, e.g. to queue/scheduler, is also blocked). The process status of the process doing I/O looks like this:

# cat /proc/2762/status
Name: randio
State: D (disk sleep)
Tgid: 2762
Pid: 2762
PPid: 53250
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups: 0 105
Threads: 1
SigQ: 5/38340
SigPnd:
ShdPnd: 4102
SigBlk:
SigIgn:
SigCgt: 00018000
CapInh:
CapPrm:
CapEff:
CapBnd:
Cpus_allowed: ,
Cpus_allowed_list: 0-63
Mems_allowed: ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0001
Mems_allowed_list: 0
voluntary_ctxt_switches: 5
nonvoluntary_ctxt_switches: 1

Best regards,
Ulrich

>>> Ulrich Windl schrieb am 07.12.2016 um 13:19 in Nachricht <5847FE66.7E4 : 161 : 60728>:
> Hi again!
>
> Maybe someone can confirm this:
> If you have a device (e.g. a multipath map) that limits max_sectors_kb to
> maybe 64, and then define an LVM LV using that multipath map as PV, the LV
> still has a larger max_sectors_kb. If you send a big request (a read in my
> case), the kernel will complain:
>
> kernel: [173116.098798] blk_rq_check_limits: over max size limit.
>
> Note that this message does not give any clue to the device involved,
> the size of the I/O attempted, or the limit on the I/O size.
>
> My expectation would be that the higher layer reports back an I/O error, and
> the user process receives an I/O error, or, alternatively, that the big request
> is split into acceptable chunks before passing it to the lower layers.
>
> However, none of the above happens; instead the request seems to block the
> request queue, because later TUR checks also fail:
> kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
> kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
> multipathd: 66:384: mark as failed
> multipathd: NAP_S11: remaining active paths: 1
> multipathd: 66:352: mark as failed
> multipathd: NAP_S11: Entering recovery mode: max_retries=6
> multipathd: NAP_S11: remaining active paths: 0
>
> (somewhat later)
> multipathd: NAP_S11: sdkh - tur checker reports path is up
> multipathd: 66:336: reinstated
> multipathd: NAP_S11: Recovered to normal mode
> kernel: [173117.286712] device-mapper: multipath: Could not failover device
> 66:368: Handler scsi_dh_alua error 8.
> (I don't know the implications of this)
>
> Of course this error does not appear as long as all devices use the same
> maximum request size, but tests have shown that different SAN disk systems
> prefer different request sizes (as they split large requests internally to
> handle them in chunks anyway).
>
> Last seen with this kernel (SLES11 SP4 on x86_64): Linux version
> 3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch
> revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)
>
> Regards,
> Ulrich
>
> >>> Ulrich Windl schrieb am 23.08.2016 um 17:03 in Nachricht <57BC65CD.D1A : 161 : 60728>:
> > Hello!
> >
> > While performance-testing a 3PARdata StorServ 8400 with SLES11 SP4, I noticed
> > that I/Os dropped, until everything stood still more or less. Looking into
> > the syslog I found that multipath's TUR checker considered the paths (FC,
> > BTW) as dead. Amazingly, I did not have this problem when I did read-only
> > tests.
> >
> > The start looks like this:
> > Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> > Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> > Aug 23 14:44:58 h10 kernel: [ 880.159425] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [ 880.159611] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [ 880.159615] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [ 880.159623] device-mapper: multipath: Failing path 8:32.
> > Aug 23 14:44:58 h10 kernel: [ 880.186609] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [ 880.186626] blk_rq_check_limits: over max size limit.
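Since the stuck processes above show up as "State: D (disk sleep)" in /proc/<pid>/status, a small script can scan for them. The following is an illustrative sketch (function names are mine, not part of any tool mentioned here) that parses a status file the way shown above:

```python
def parse_proc_status(text):
    """Parse the key/value lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

def is_uninterruptible(fields):
    """True if the State field reports 'D' (uninterruptible disk sleep)."""
    return fields.get("State", "").startswith("D")

# Sample taken from the status output quoted above:
sample = "Name:\trandio\nState:\tD (disk sleep)\nTgid:\t2762\nPid:\t2762\nPPid:\t53250\n"
fields = parse_proc_status(sample)
print(fields["Name"], is_uninterruptible(fields))
```

In practice one would loop over /proc/[0-9]*/status and report every PID for which is_uninterruptible() holds; processes in that state cannot be killed until the blocking I/O completes.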
Antw: 3.0.101: "blk_rq_check_limits: over max size limit."
Hi again!

Maybe someone can confirm this: If you have a device (e.g. a multipath map) that limits max_sectors_kb to maybe 64, and then define an LVM LV using that multipath map as PV, the LV still has a larger max_sectors_kb. If you send a big request (a read in my case), the kernel will complain:

kernel: [173116.098798] blk_rq_check_limits: over max size limit.

Note that this message does not give any clue to the device involved, the size of the I/O attempted, or the limit on the I/O size.

My expectation would be that the higher layer reports back an I/O error and the user process receives an I/O error, or, alternatively, that the big request is split into acceptable chunks before being passed to the lower layers.

However, none of the above happens; instead the request seems to block the request queue, because later TUR checks also fail:

kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
multipathd: 66:384: mark as failed
multipathd: NAP_S11: remaining active paths: 1
multipathd: 66:352: mark as failed
multipathd: NAP_S11: Entering recovery mode: max_retries=6
multipathd: NAP_S11: remaining active paths: 0

(somewhat later)

multipathd: NAP_S11: sdkh - tur checker reports path is up
multipathd: 66:336: reinstated
multipathd: NAP_S11: Recovered to normal mode
kernel: [173117.286712] device-mapper: multipath: Could not failover device 66:368: Handler scsi_dh_alua error 8.
(I don't know the implications of this)

Of course this error does not appear as long as all devices use the same maximum request size, but tests have shown that different SAN disk systems prefer different request sizes (as they split large requests internally to handle them in chunks anyway).

Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)

Regards,
Ulrich

>>> Ulrich Windl schrieb am 23.08.2016 um 17:03 in Nachricht <57BC65CD.D1A : 161 : 60728>:
> Hello!
>
> While performance-testing a 3PARdata StorServ 8400 with SLES11 SP4, I noticed
> that I/Os dropped, until everything stood still more or less. Looking into
> the syslog I found that multipath's TUR checker considered the paths (FC,
> BTW) as dead. Amazingly, I did not have this problem when I did read-only
> tests.
>
> The start looks like this:
> Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> Aug 23 14:44:58 h10 kernel: [ 880.159425] blk_rq_check_limits: over max size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.159611] blk_rq_check_limits: over max size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.159615] blk_rq_check_limits: over max size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.159623] device-mapper: multipath: Failing path 8:32.
> Aug 23 14:44:58 h10 kernel: [ 880.186609] blk_rq_check_limits: over max size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.186626] blk_rq_check_limits: over max size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.186628] blk_rq_check_limits: over max size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.186631] device-mapper: multipath: Failing path 129:112.
> [...]
> It seems the TUR checker plays some ping-pong-like game: paths go up and down.
>
> Now for the Linux part: I found the relevant message in blk-core.c
> (blk_rq_check_limits()).
> First, s/agaist/against/ in "* Such request stacking drivers should check
> those requests agaist", then there's the problem that the message neither
> outputs the blk_rq_sectors(), nor the blk_queue_get_max_sectors(), nor the
> underlying device.
> That makes debugging somewhat difficult if you customize
> the block queue settings per device as I did:
>
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/rotational for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/add_random for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/scheduler for FirstTest-31 (noop)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/max_sectors_kb for FirstTest-31 (128)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/rotational for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/add_random for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/scheduler for FirstTest-32 (noop)
> Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of queue/max_sectors_kb for FirstTest-32 (128)
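The complaint above is that the kernel's check rejects the request without saying which device, what size, or what limit. The essence of the check (a stacked request larger than the lower queue's limit fails) can be sketched in user space like this; this is an illustrative model with names of my choosing, not the kernel code:

```python
def check_request_limits(device, rq_sectors, max_sectors):
    """Model the size comparison done by blk_rq_check_limits(): a request
    larger than the queue's max sectors is rejected.  Unlike the kernel
    message, include the device, the request size, and the limit."""
    if rq_sectors > max_sectors:
        return ("%s: over max size limit (request %d sectors > limit %d sectors)"
                % (device, rq_sectors, max_sectors))
    return None  # request fits; would be dispatched

# An LV issuing 512 KiB requests (1024 sectors) on top of a multipath map
# limited to max_sectors_kb=64 (128 sectors) reproduces the mismatch:
print(check_request_limits("dm-name-NAP_S11", 1024, 128))
print(check_request_limits("dm-name-NAP_S11", 128, 128))
```

A kernel message of this shape would have made it obvious which stacked device's limit was exceeded and by how much.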
Antw: Re: MBR partitions slow?
>>> Mark D Rustad <mrus...@gmail.com> schrieb am 31.08.2016 um 17:32 in Nachricht <e2d72371-913b-4460-a370-c141835ad...@gmail.com>:
> Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> So without partition the throughput is about twice as high! Why?
>
> My first thought is that by starting at block 0 the accesses were aligned
> with the flash block size of the device. By starting at a partition, the
> accesses probably were not so aligned.

Hi!

Thanks for answering. Yes, you are right: Usually I use fdisk to create partitions, and that tool aligns the partitions properly. In my case YaST insisted on having a partition before creating a filesystem, so I created one within YaST, and that partition turned out to be badly aligned (I think YaST uses cfdisk internally). I'm sorry that I didn't think about that earlier!

Stracing fdisk, I also learned about ioctl(BLKIOOPT) and related...

Regards,
Ulrich

> --
> Mark Rustad, mrus...@gmail.com
Antw: MBR partitions slow?
Update: I found out the bad performance was caused by partition alignment, and not by the partition per se (YaST created the partition right next to the MBR). I compared two partitions, number 1 badly aligned and number 2 properly aligned. Then I got these results:

Disk /dev/disk/by-id/dm-name-FirstTest-32: 10.7 GB, 10737418240 bytes
64 heads, 32 sectors/track, 10240 cylinders, total 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 16384 bytes / 16777216 bytes
Disk identifier: 0x00016340

Device                                     Boot   Start       End   Blocks  Id System
/dev/disk/by-id/dm-name-FirstTest-32-part1            1   5242879  2621439+ 83 Linux
Partition 1 does not start on physical sector boundary.
/dev/disk/by-id/dm-name-FirstTest-32-part2      5242880  10485759  2621440  83 Linux

h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.21s
time for fstat(): 0.60s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 2684.4MiB) at 0x7f826a8a1000: 0.38s
time to zap 2684.4MiB: 11.734121s (228.76 MiB/s)
time to sync 2684.4MiB: 3.515991s (763.47 MiB/s)
time to unmap 2684.4MiB at 0x7f826a8a1000: 0.038104s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.673100s

h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part2
time to open /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.20s
time for fstat(): 0.69s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part2 (size 2684.4MiB) at 0x7fe18823e000: 0.44s
time to zap 2684.4MiB: 4.861062s (552.22 MiB/s)
time to sync 2684.4MiB: 0.811360s (3308.47 MiB/s)
time to unmap 2684.4MiB at 0x7fe18823e000: 0.038380s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.265687s

So the correctly aligned partition is two to three times faster than the badly aligned one (write-only case), and it is about as fast as an unpartitioned disk.
Regards,
Ulrich

>>> Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> schrieb am 30.08.2016 um 11:32 in Nachricht <57C552B6.33D : 161 : 60728>:
> Hello!
>
> (I'm not subscribed to this list, but I'm hoping to get a reply anyway)
> While testing some SAN storage system, I needed a utility to erase disks
> quickly. I wrote my own one that mmap()s the block device, memset()s the
> area, then msync()s the changes, and finally close()s the file descriptor.
>
> On one disk I had a primary MBR partition spanning the whole disk, like this
> (output from some of my obscure tools):
> disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512
> (10737418240 bytes)
> partition 1 (1-20971520)
> Total Sectors = 20971519
>
> When wiping, I started (for no good reason) to wipe partition 1, then I
> wiped the whole disk. The disk is 4-way multipathed to an 8Gb FC SAN, and the
> disk system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11
> SP4. For the test I had reduced the amount of RAM via "mem=4G". The machine's
> RAM bandwidth is about 9GB/s.
>
> To my surprise I found out that the partition eats significant performance
> (not quite 50%, but a lot):
>
> ### Partition
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
> time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
> time for fstat(): 0.17s
> time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7GiB) at 0x7fbc86739000: 0.39s
> time to zap 10.7GiB: 52.474054s (204.62 MiB/s)
> time to sync 10.7GiB: 4.148350s (2588.36 MiB/s)
> time to unmap 10.7GiB at 0x7fbc86739000: 0.052170s
> time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s
>
> ### Whole disk
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
> time for fstat(): 0.61s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7GiB) at 0x7fa2434cc000: 0.37s
> time to zap 10.7GiB: 24.580162s (436.83 MiB/s)
> time to sync 10.7GiB: 1.097502s (9783.51 MiB/s)
> time to unmap 10.7GiB at 0x7fa2434cc000: 0.052385s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s
>
> Reproducible:
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
> time for fstat(): 0.65s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7GiB) at 0x7f1cc17ab000: 0.37s
> time to zap 10.7GiB: 24.624000s (436.06 MiB/s)
> time to sync 10.7GiB: 1.199741s (8949.79 MiB/s)
> time to unmap 10.7GiB at 0x7f1cc17ab000: 0.069956s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s
>
> So without partition the throughput is about twice as high! Why?
>
> Regards,
> Ulrich
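The alignment effect can be checked with simple arithmetic from the fdisk numbers quoted in this thread (start sector times 512 bytes against the minimum/optimal I/O size). A small sketch (the helper name is mine):

```python
SECTOR = 512  # logical sector size reported by fdisk

def is_aligned(start_sector, granularity_bytes):
    """True if the partition's byte offset is a multiple of the given
    I/O granularity (minimum or optimal I/O size from fdisk)."""
    return (start_sector * SECTOR) % granularity_bytes == 0

# Values from the fdisk output in this thread:
MIN_IO, OPT_IO = 16384, 16777216

# part1 starts at sector 1 -> byte offset 512: misaligned even to the 16 KiB minimum
print(is_aligned(1, MIN_IO))
# part2 starts at sector 5242880 -> byte offset 2684354560 = 160 * 16 MiB: aligned
print(is_aligned(5242880, OPT_IO))
```

Every I/O through the misaligned partition straddles device boundaries, turning one logical write into read-modify-write cycles underneath, which matches the roughly 2-3x slowdown measured above.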
MBR partitions slow?
Hello!

(I'm not subscribed to this list, but I'm hoping to get a reply anyway.)

While testing some SAN storage system, I needed a utility to erase disks quickly. I wrote my own one that mmap()s the block device, memset()s the area, then msync()s the changes, and finally close()s the file descriptor.

On one disk I had a primary MBR partition spanning the whole disk, like this (output from some of my obscure tools):

disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 (10737418240 bytes)
partition 1 (1-20971520)
Total Sectors = 20971519

When wiping, I started (for no good reason) to wipe partition 1, then I wiped the whole disk. The disk is 4-way multipathed to an 8Gb FC SAN, and the disk system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11 SP4. For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM bandwidth is about 9GB/s.

To my surprise I found out that the partition eats significant performance (not quite 50%, but a lot):

### Partition
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
time for fstat(): 0.17s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7GiB) at 0x7fbc86739000: 0.39s
time to zap 10.7GiB: 52.474054s (204.62 MiB/s)
time to sync 10.7GiB: 4.148350s (2588.36 MiB/s)
time to unmap 10.7GiB at 0x7fbc86739000: 0.052170s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s

### Whole disk
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
time for fstat(): 0.61s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7GiB) at 0x7fa2434cc000: 0.37s
time to zap 10.7GiB: 24.580162s (436.83 MiB/s)
time to sync 10.7GiB: 1.097502s (9783.51 MiB/s)
time to unmap 10.7GiB at 0x7fa2434cc000: 0.052385s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s

Reproducible:
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
time for fstat(): 0.65s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7GiB) at 0x7f1cc17ab000: 0.37s
time to zap 10.7GiB: 24.624000s (436.06 MiB/s)
time to sync 10.7GiB: 1.199741s (8949.79 MiB/s)
time to unmap 10.7GiB at 0x7f1cc17ab000: 0.069956s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s

So without partition the throughput is about twice as high! Why?

Regards,
Ulrich
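The flashzap source was not posted, but the approach described (mmap the device, overwrite the mapping, msync, close) can be sketched in a few lines. This is my own minimal reconstruction of the idea, not the original tool:

```python
import mmap
import os

def zap(path):
    """Zero a file (or block device) by mmap()ing it, overwriting the
    mapping memset-style, and msync()ing the changes -- the same sequence
    of operations the post describes."""
    fd = os.open(path, os.O_RDWR)
    try:
        # For a regular file st_size is the length; for a block device
        # st_size is 0, so one would use os.lseek(fd, 0, os.SEEK_END) there.
        size = os.fstat(fd).st_size
        mm = mmap.mmap(fd, size)
        try:
            mm[:] = b"\x00" * size   # the "zap" (memset)
            mm.flush()               # msync()
        finally:
            mm.close()               # munmap()
    finally:
        os.close(fd)
```

Writing through a shared mapping like this goes through the page cache, so the "zap" phase measures mostly memory bandwidth and the "sync" phase measures actual device writeback, which matches the two separate timings flashzap reports.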
MBR partitions slow?
Hello! (I'm not subscribed to this list, but I'm hoping to get a reply anyway)

While testing some SAN storage system, I needed a utility to erase disks quickly. I wrote my own one that mmap()s the block device, memset()s the area, then msync()s the changes, and finally close()s the file descriptor.

On one disk I had a primary MBR partition spanning the whole disk, like this (output from some of my obscure tools):

disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 (10737418240 bytes)
partition 1 (1-20971520) Total Sectors = 20971519

When wiping, I started (for no good reason) with partition 1, then I wiped the whole disk. The disk is 4-way multipathed to an 8Gb FC-SAN, and the disk system is all-SSD (32x2TB). I am using kernel 3.0.101-80-default of SLES11 SP4. For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM bandwidth is about 9GB/s.

To my surprise I found that the partition eats significant performance (not quite 50%, but a lot):

### Partition
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
time for fstat(): 0.17s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at 0x7fbc86739000: 0.39s
time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s

### Whole disk
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
time for fstat(): 0.61s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 0x7fa2434cc000: 0.37s
time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s

Reproducible:

h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
time for fstat(): 0.65s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 0x7f1cc17ab000: 0.37s
time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s

So without the partition the throughput is about twice as high! Why?

Regards,
Ulrich
3.0.101: "blk_rq_check_limits: over max size limit."
Hello!

While performance-testing a 3PARdata StorServ 8400 with SLES11 SP4, I noticed that I/Os dropped until everything more or less stood still. Looking into the syslog I found that multipath's TUR checker considered the paths (FC, BTW) as dead. Amazingly, I did not have this problem when I did read-only tests. The start looks like this:

Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
Aug 23 14:44:58 h10 kernel: [ 880.159425] blk_rq_check_limits: over max size limit.
Aug 23 14:44:58 h10 kernel: [ 880.159611] blk_rq_check_limits: over max size limit.
Aug 23 14:44:58 h10 kernel: [ 880.159615] blk_rq_check_limits: over max size limit.
Aug 23 14:44:58 h10 kernel: [ 880.159623] device-mapper: multipath: Failing path 8:32.
Aug 23 14:44:58 h10 kernel: [ 880.186609] blk_rq_check_limits: over max size limit.
Aug 23 14:44:58 h10 kernel: [ 880.186626] blk_rq_check_limits: over max size limit.
Aug 23 14:44:58 h10 kernel: [ 880.186628] blk_rq_check_limits: over max size limit.
Aug 23 14:44:58 h10 kernel: [ 880.186631] device-mapper: multipath: Failing path 129:112.
[...]

It seems the TUR checker plays some ping-pong-like game: paths go up and down.

Now for the Linux part: I found the relevant message in blk-core.c (blk_rq_check_limits()). First, s/agaist/against/ in "* Such request stacking drivers should check those requests agaist"; then there's the problem that the message outputs neither blk_rq_sectors(), nor blk_queue_get_max_sectors(), nor the underlying device. That makes debugging somewhat difficult if you customize the block queue settings per device as I did:

Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/rotational for FirstTest-31 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/add_random for FirstTest-31 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/scheduler for FirstTest-31 (noop)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/max_sectors_kb for FirstTest-31 (128)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/rotational for FirstTest-32 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/add_random for FirstTest-32 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/scheduler for FirstTest-32 (noop)
Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of queue/max_sectors_kb for FirstTest-32 (128)

I suspect "queue/max_sectors_kb=128" is the culprit:

# multipath -ll FirstTest-32
FirstTest-32 (360002ac40001b383) dm-7 3PARdata,VV
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 2:0:0:1 sdet 129:80  failed ready running
  |- 2:0:2:1 sdev 129:112 failed ready running
  |- 1:0:0:1 sdb  8:16    failed ready running
  `- 1:0:1:1 sdc  8:32    failed ready running
# cat /sys/block/{dm-7,sd{b,c},sde{t,v}}/queue/max_sectors_kb
128
128
128
128
128

While writing this message, I noticed that I had created a primary partition on dm-7:

# dmsetup ls | grep Fi
FirstTest-32_part1 (253:8)
FirstTest-32       (253:7)
# cat /sys/block/dm-8/queue/max_sectors_kb
1024

After "# echo 128 >/sys/block/dm-8/queue/max_sectors_kb" things still did not get better.

Can't blk_rq_check_limits() do anything more clever than returning -EIO?

Regards,
Ulrich

P.S.: Keep me in CC:, please!
cpuset cleanup race
I came across a problem with code which uses a cpuset CG and tries to be responsible and clean up after itself. The code attached at the bottom illustrates the problem. It's only long because it has no dependencies aside from the basic runtime and should work on all machines. You need to run it with privileges high enough to create a CG.

The code is really simple:

- a (new) CG in cpuset is created
- one of the cores of the root cpuset is selected
- the thread (and therefore the entire process) is switched to the cpuset
- a thread is created which does nothing but terminate immediately
- the parent waits for the thread
- then the parent removes itself from the cpuset
- finally the parent tries to remove the created cpuset

The last part is where things go wrong. Usually* the rmdir() call made to remove the cpuset fails because the cpuset is still busy. The program prints the members of the cpuset CG: it's the child thread.

* I wrote "usually" because slowing down the parent code will help. I.e., there is a race. Successful slowdowns I found:

- compile with -fsanitize=address (seems already enough)
- a very short wait, e.g., 1ns (you can see this by starting the program with the parameter "wait")

You might want to compile the code with optimization. It is a race, after all.

The pthread_join() call made by the parent won't return until the kernel signals through the futex set up at clone() time that the thread has terminated. From the perspective of the userlevel code the thread is gone. But not all bookkeeping related to the terminated thread seems to have been finished. I didn't look at the code, but I can imagine that the futex notification happens as soon as all observable aspects of the thread are gone. This is of course good, to not delay the waiter. Hopefully the cgroup bookkeeping can also be moved before the notification.

I tested it with a recent kernel (4.5.0-0.rc7) but I doubt it's a recent issue.
#define _GNU_SOURCE     /* for asprintf() and CPU_ISSET() */
#include <errno.h>
#include <error.h>
#include <mntent.h>
#include <paths.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static void *tf(void *p)
{
  return NULL;
}

int main(int argc, char *argv[])
{
  const char *csname = argc == 1 ? "test" : argv[1];
  struct mntent *me;
  FILE *fp = setmntent(_PATH_MOUNTED, "r");
  if (fp == NULL)
    error(1, errno, "cannot read mounted filesystem information");
  while ((me = getmntent(fp)) != NULL) {
    if (strcmp(me->mnt_type, "cgroup") == 0
        && hasmntopt(me, "cpuset") != NULL)
      break;
  }
  if (me == NULL)
    error(1, 0, "cpuset filesystem not mounted");
  endmntent(fp);

  char *cshier = NULL;
  asprintf(&cshier, "%s/%s", me->mnt_dir, csname);
  if (mkdir(cshier, 0777) == 0)
    printf("new cpuset control group: %s\n", cshier);
  else if (errno != EEXIST)
    error(1, errno, "cannot create cpuset group %s", cshier);

  char *csrootmems;
  asprintf(&csrootmems, "%s/cpuset.mems", me->mnt_dir);
  fp = fopen(csrootmems, "r");
  if (fp == NULL)
    error(1, errno, "cannot read /cpuset.mems");
  char *val = NULL;
  size_t vallen = 0;
  ssize_t n = getline(&val, &vallen, fp);
  fclose(fp);
  free(csrootmems);

  char *testmems;
  asprintf(&testmems, "%s/cpuset.mems", cshier);
  fp = fopen(testmems, "w");
  if (fp == NULL)
    error(1, errno, "cannot read /%s/cpuset.mems", csname);
  fwrite(val, n, 1, fp);
  fclose(fp);
  free(testmems);
  free(val);

  cpu_set_t cs;
  int first = 0;
  sched_getaffinity(0, sizeof(cs), &cs);
  while (!CPU_ISSET(first, &cs))
    ++first;

  char *testcpus;
  asprintf(&testcpus, "%s/cpuset.cpus", cshier);
  fp = fopen(testcpus, "w");
  if (fp == NULL)
    error(1, errno, "cannot write /%s/cpuset.cpus", csname);
  fprintf(fp, "%d", first);
  fclose(fp);
  free(testcpus);

  char *testtasks;
  asprintf(&testtasks, "%s/tasks", cshier);
  fp = fopen(testtasks, "w");
  if (fp == NULL)
    error(1, errno, "cannot write /%s/tasks", csname);
  fprintf(fp, "%d", (int) getpid());
  fclose(fp);

  pthread_t th;
  pthread_create(&th, NULL, tf, NULL);
  pthread_join(th, NULL);

  char *roottasks;
  asprintf(&roottasks, "%s/tasks", me->mnt_dir);
  fp = fopen(roottasks, "w");
  if (fp == NULL)
    error(1, errno, "cannot write /tasks");
  fprintf(fp, "%d", (int) getpid());
  fclose(fp);
  free(roottasks);

  if (strcmp(csname, "wait") == 0) {
    struct timespec s = { 0, 1 };
    nanosleep(&s, NULL);
  }

  if (rmdir(cshier) != 0) {
    printf("PID = %ld\nremaining = ", (long) getpid());
    fp = fopen(testtasks, "r");
    char *line = NULL;
    size_t linelen = 0;
    while ((n = getline(&line, &linelen, fp)) > 0)
      fputs(line, stdout);
    fclose(fp);
    free(line);
    error(1, errno, "couldn't remove cpuset %s", cshier);
  }
  free(cshier);
  free(testtasks);
  return 0;
}
Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old
Josh,

in https://lkml.org/lkml/2016/3/15/1 you stated that the soft lockup messages do not occur with kernel v4.5. Hence, I believe this should not be reproducible with kernel v4.4 either. The relevant changes in update_watchdog_all_cpus() were introduced in kernel v4.3 by patches that I mentioned in my previous reply. I see that "watchdog: use park/unpark functions in update_watchdog_all_cpus()" and related patches are included in this change log:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/kernel/watchdog.c?id=refs/tags/v4.4.6

In terms of kernel v4.1, it seems that the issue is mitigated by the access rights of the watchdog parameters in /proc/sys/kernel, as only a privileged user should be able to write to, for example, /proc/sys/kernel/nmi_watchdog.

Also, based on the analysis in my previous reply, I think these soft lockup messages are 'false positives', as the repeated cancel/restart of watchdog_timer_fn() prevents the 'watchdog/N' thread from running (i.e. I think the thread is not prevented from running by something actually hogging CPU N).

Regards,
Uli

----- Original Message -----
From: "Josh Hunt" <joh...@akamai.com>
To: "Ulrich Obergfell" <uober...@redhat.com>
Cc: "Don Zickus" <dzic...@redhat.com>, a...@linux-foundation.org, linux-kernel@vger.kernel.org
Sent: Thursday, March 17, 2016 5:08:03 PM
Subject: Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old

[...]

As you mention, my patch will mask this problem for 4.1, which is why I wanted to get it into stable. Do you think there is any way to mitigate this issue for the stable kernels (4.1 to 4.4) if the user changes the values doing something like:

foo=1; while :; do echo $foo > /proc/sys/kernel/nmi_watchdog; foo=$(( ! $foo )); sleep .1; done & sleep 30 && kill %1

? I realize this isn't a real-world use-case (I hope :)), but it shows there is still a way to cause the box to soft lockup with this code path.

[...]
Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old
[...] patch and it did not help. After that I did a git bisect to figure out when the soft lockup was fixed, and it appears to be resolved after one of the commits in this series:

commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28
Author: Ulrich Obergfell <uober...@redhat.com>
Date: Fri Sep 4 15:45:15 2015 -0700

    watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads()

I didn't identify the exact commit. It would be nice to resolve the soft lockup for the stable folks, since 4.1 and 4.4 are longterm stable releases and would see this problem. I did not have time to debug it any more outside of this bisection today. If you have something you'd like me to try which may work for the stable kernels, I'm happy to test it.

For the record, I'm able to reproduce the soft lockup on 4.1 doing:

while :; do echo 1 > /proc/sys/kernel/nmi_watchdog; sleep .1; done & sleep 30 && kill %1 && sleep 5

Josh
<sys/timex.h> and NTP incompatibility (MOD_NANO)
Hi folks!

I'd wish ntp_loopfilter.c would compile without problems. The mess is (I had fixed it about 15 years ago (keyword "PPSkit")) that Linux uses ADJ_* flags to do traditional adjtime() things as well as NTP kernel-related things (that's why the Linux syscall is named adjtimex()). NTP, however, expects the flags to have names like MOD_* to modify settings, and STA_* for status bits to test. When <sys/timex.h> lacks a MOD_NANO but has a STA_NANO, compilation of ntpd/ntp_loopfilter.c fails. I'd wish this to be fixed. I don't know which "Bonehead weakener" is needed to get the fix permanently into Linux. My proposal many years ago was to use ADJ_*-style flags only for the Linux specialities (the adjtime() part of adjtimex()), and use the MOD_*-style flags for the NTP part of adjtimex().

Regards,
Ulrich

---
Appendix: Linux has only these:

/* xntp 3.4 compatibility names */
#define MOD_OFFSET      ADJ_OFFSET
#define MOD_FREQUENCY   ADJ_FREQUENCY
#define MOD_MAXERROR    ADJ_MAXERROR
#define MOD_ESTERROR    ADJ_ESTERROR
#define MOD_STATUS      ADJ_STATUS
#define MOD_TIMECONST   ADJ_TIMECONST
#define MOD_CLKB        ADJ_TICK
#define MOD_CLKA        ADJ_OFFSET_SINGLESHOT /* 0x8000 in original */
Suggestion for improving documentation (/usr/src/linux/Documentation/sysctl/vm.txt)
Hi!

I just read the documentation for the "swappiness" sysctl parameter, and I realized that

1) the documentation does not talk about the valid range of the parameter (0-100?)
2) the documentation does not talk about the units the parameter uses (percent?)

==
swappiness

This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase agressiveness, lower values decrease the amount of swap.

The default value is 60.
==

My proposal to improve the sysctl documentation is obvious, and no, I don't mean "Use the source, Luke!" ;-)

(I'm not subscribed to LKML)

Regards,
Ulrich
Re: [PATCH 2/2] workqueue: implement lockup detector
Tejun,

> Sure, separating the knobs out isn't difficult. I still don't like
> the idea of having multiple set of similar knobs controlling about the
> same thing tho.
>
> For example, let's say there's a user who boots with "nosoftlockup"
> explicitly. I'm pretty sure the user wouldn't be intending to keep
> workqueue watchdog running. The same goes for threshold adjustments,
> so here's my question. What are the reasons for the concern? What
> are we worrying about?

I'm not sure if it is obvious to a user that a stall of workqueues is "about the same thing" as a soft lockup, and that one could thus argue that both should be controlled by the same knob. Looking at this from the perspective of usability, I would still vote for having separate knobs for each lockup detector. For example, /proc/sys/kernel/wq_watchdog_thresh could control the on|off state of the workqueue watchdog and the timeout at the same time (0 means off, > 0 means on and specifies the timeout). Separating wq_watchdog_thresh from watchdog_thresh might also be useful for diagnostic purposes, for example if during the investigation of a problem one would want to explicitly increase or lower one threshold without impacting the other.

>> And another question that comes to my mind is: Would the workqueue watchdog
>> participate in the lockup detector suspend/resume mechanism, and if yes, how
>> would it be integrated into this ?
>
> From the usage, I can't quite tell what the purpose of the mechanism
> is. The only user seems to be fixup_ht_bug() and when it fails it
> says "failed to disable PMU erratum BJ122, BV98, HSD29 workaround" so
> if you could give me a pointer, it'd be great. But at any rate, if
> shutting down watchdog is all that's necessary, it shouldn't be a
> problem to integrate.
The patch post that introduced the mechanism is here:

http://marc.info/?l=linux-kernel&m=143843318208917&w=2

The watchdog_{suspend|resume} functions were later renamed:

http://marc.info/?l=linux-kernel&m=143894132129982&w=2

At the moment I don't see a reason why the workqueue watchdog would have to participate in that mechanism. However, if the workqueue watchdog would be connected to the soft lockup detector as you proposed, I think it should be participating for the 'sake of consistency' (it would seem hard to understand if the interface would only suspend parts of the lockup detector).

Regards,
Uli
Re: [PATCH 2/2] workqueue: implement lockup detector
Tejun, > Sure, separating the knobs out isn't difficult. I still don't like > the idea of having multiple set of similar knobs controlling about the > same thing tho. > > For example, let's say there's a user who boots with "nosoftlockup" > explicitly. I'm pretty sure the user wouldn't be intending to keep > workqueue watchdog running. The same goes for threshold adjustments, > so here's my question. What are the reasons for the concern? What > are we worrying about? I'm not sure if it is obvious to a user that a stall of workqueues is "about the same thing" as a soft lockup, and that one could thus argue that both should be controlled by the same knob. Looking at this from perspective of usability, I would still vote for having separate knobs for each lockup detector. For example /proc/sys/kernel/wq_watchdog_thresh could control the on|off state of the workqueue watchdog and the timeout at the same time (0 means off, > 0 means on and specifies the timeout). Separating wq_watchdog_thresh from watchdog_thresh might also be useful for diagnostic purposes for example, if during the investigation of a problem one would want to explicitly increase or lower one threshold without impacting the other. >> And another question that comes to my mind is: Would the workqueue watchdog >> participate in the lockup detector suspend/resume mechanism, and if yes, how >> would it be integrated into this ? > > From the usage, I can't quite tell what the purpose of the mechanism > is. The only user seems to be fixup_ht_bug() and when it fails it > says "failed to disable PMU erratum BJ122, BV98, HSD29 workaround" so > if you could give me a pointer, it'd be great. But at any rate, if > shutting down watchdog is all that's necessary, it shouldn't be a > problem to integrate. 
The patch post that introduced the mechanism is here:

  http://marc.info/?l=linux-kernel&m=143843318208917&w=2

The watchdog_{suspend|resume} functions were later renamed:

  http://marc.info/?l=linux-kernel&m=143894132129982&w=2

At the moment I don't see a reason why the workqueue watchdog would have to
participate in that mechanism. However, if the workqueue watchdog were
connected to the soft lockup detector as you proposed, I think it should
participate for the 'sake of consistency' (it would seem hard to understand
if the interface would only suspend parts of the lockup detector).

Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH 2/2] workqueue: implement lockup detector
Tejun,

I share Don's concern about connecting the soft lockup detector and the
workqueue watchdog to the same kernel parameter in /proc. I would feel more
comfortable if the workqueue watchdog had its dedicated parameter.

I also see a scenario that the proposed patch does not handle well: The
watchdog_thresh parameter can be changed 'on the fly' - i.e. it is not
necessary to disable and re-enable the watchdog. The flow of execution
looks like this.

  proc_watchdog_thresh
    proc_watchdog_update
      if (watchdog_enabled && watchdog_thresh)
        watchdog_enable_all_cpus
          if (!watchdog_running) {
            ...
          } else {
            //
            // update 'on the fly'
            //
            update_watchdog_all_cpus()
          }

The patched watchdog_enable_all_cpus() function disables the workqueue
watchdog unconditionally at [1]. However, the workqueue watchdog remains
disabled if the code path [2] is executed (and wq_watchdog_thresh is not
updated as well).

  static int watchdog_enable_all_cpus(void)
  {
          int err = 0;

  [1] --> disable_workqueue_watchdog();

          if (!watchdog_running) {
                  ...
          } else {
                  /*
                   * Enable/disable the lockup detectors or
                   * change the sample period 'on the fly'.
                   */
  [2] -->         err = update_watchdog_all_cpus();

                  if (err) {
                          watchdog_disable_all_cpus();
                          pr_err("Failed to update lockup detectors, disabled\n");
                  }
          }

          if (err)
                  watchdog_enabled = 0;

          return err;
  }

And another question that comes to my mind is: Would the workqueue watchdog
participate in the lockup detector suspend/resume mechanism, and if yes, how
would it be integrated into this ?

Regards,

Uli

- Original Message -
From: "Tejun Heo"
To: "Don Zickus"
Cc: "Ulrich Obergfell", "Ingo Molnar", "Peter Zijlstra", "Andrew Morton",
    linux-kernel@vger.kernel.org, kernel-t...@fb.com
Sent: Thursday, December 3, 2015 8:43:58 PM
Subject: Re: [PATCH 2/2] workqueue: implement lockup detector

Hello, Don.

On Thu, Dec 03, 2015 at 12:50:24PM -0500, Don Zickus wrote:
> This sort of looks like the hung task detector..
> I am a little concerned because we just made a big effort to properly
> separate the hardlockup and softlockup paths and yet retain the flexibility
> to enable/disable them separately. Now it seems the workqueue detector is
> permanently entwined with the softlockup detector. I am not entirely sure
> that is the correct thing to do.

The only area they get entwined is how it's controlled from userland. While
it isn't quite the same as softlockup detection, I think what it monitors is
close enough that it makes sense to put them under the same interface.

> It also seems awkward for the lockup code to have to jump to the workqueue
> code to function properly. :-/ Though we have made exceptions for the virt
> stuff and the workqueue code is simple..

Softlockup code doesn't depend on workqueue in any way. Workqueue tags on
touch_softlockup to detect cases which shouldn't be warned about, its
enabledness is controlled together with softlockup, and that's it.

> Actually, I am curious, it seems if you just added a
> /proc/sys/kernel/wq_watchdog entry, you could eliminate the entire need for
> modifying the watchdog code to begin with. As you really aren't using any
> of it other than piggybacking on the touch_softlockup_watchdog stuff, which
> could probably be easily added without all the extra enable/disable changes
> in watchdog.c.

Yeah, except for the touch signal, it's purely an interface thing. I don't
feel too strongly about this, but it seems a bit silly to introduce a whole
different set of interfaces for this. E.g. if the user wanted to disable
softlockup detection, it'd be weird to leave wq lockup detection running.
The same goes for the threshold.

> Again, this looks like what the hung task detector is doing, which I
> struggled with years ago to integrate with the lockup code because in the
> end I had trouble re-using much of it.

So, it's a stall detector and there are inherent similarities, but the
conditions tested are pretty different and it's a lot lighter.
I'm not really sure what you're meaning to say.

Thanks.

--
tejun
[PATCH 1/4] watchdog: avoid race between lockup detector suspend/resume and CPU hotplug
The lockup detector suspend/resume interface that was introduced by commit
8c073d27d7ad293bf734cc8475689413afadab81 does not protect itself against
races with CPU hotplug. Hence, theoretically it is possible that a new
watchdog thread is started on a hotplugged CPU while the lockup detector is
suspended, and the thread could thus interfere unexpectedly with the code
that requested to suspend the lockup detector.

Avoid the race by calling

  get_online_cpus() in lockup_detector_suspend()
  put_online_cpus() in lockup_detector_resume()

Signed-off-by: Ulrich Obergfell
---
 kernel/watchdog.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0a23125..7357842 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -719,6 +719,7 @@ int lockup_detector_suspend(void)
 {
 	int ret = 0;
 
+	get_online_cpus();
 	mutex_lock(&watchdog_proc_mutex);
 	/*
 	 * Multiple suspend requests can be active in parallel (counted by
@@ -759,6 +760,7 @@ void lockup_detector_resume(void)
 		watchdog_unpark_threads();
 
 	mutex_unlock(&watchdog_proc_mutex);
+	put_online_cpus();
 }
 
 static int update_watchdog_all_cpus(void)
--
1.7.11.7
[PATCH 0/4] watchdog: address various races (CPU hotplug, timer expiry)
This patch set addresses various races in relation to CPU hotplug and a race
in relation to watchdog timer expiry. I discovered the corner cases during
code inspection. I haven't seen any of these issues occur in practice.

Ulrich Obergfell (4):
  watchdog: avoid race between lockup detector suspend/resume and CPU
    hotplug
  watchdog: avoid races between /proc handlers and CPU hotplug
  watchdog: remove {get|put}_online_cpus() from
    watchdog_{park|unpark}_threads()
  watchdog: fix race between proc_watchdog_thresh() and watchdog_timer_fn()

 kernel/watchdog.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

--
1.7.11.7
[PATCH 4/4] watchdog: fix race between proc_watchdog_thresh() and watchdog_timer_fn()
Theoretically it is possible that the watchdog timer expires right at the
time when a user sets 'watchdog_thresh' to zero (note: this disables the
lockup detectors). In this scenario, the is_softlockup() function - which is
called by the timer - could produce a false positive. Fix this by checking
the current value of 'watchdog_thresh'.

Signed-off-by: Ulrich Obergfell
---
 kernel/watchdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 84c4744..18f34cf 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -289,7 +289,7 @@ static int is_softlockup(unsigned long touch_ts)
 {
 	unsigned long now = get_timestamp();
 
-	if (watchdog_enabled & SOFT_WATCHDOG_ENABLED) {
+	if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh) {
 		/* Warn about unreasonable delays. */
 		if (time_after(now, touch_ts + get_softlockup_thresh()))
 			return now - touch_ts;
--
1.7.11.7
[PATCH 2/4] watchdog: avoid races between /proc handlers and CPU hotplug
The handler functions for watchdog parameters in /proc/sys/kernel do not
protect themselves against races with CPU hotplug. Hence, theoretically it
is possible that a new watchdog thread is started on a hotplugged CPU while
a parameter is being modified, and the thread could thus use a parameter
value that is 'in transition'. For example, if 'watchdog_thresh' is being
set to zero (note: this disables the lockup detectors) the thread would
erroneously use the value zero as the sample period.

To avoid such races and to keep the /proc handler code consistent, call

  {get|put}_online_cpus() in proc_watchdog_common()
  {get|put}_online_cpus() in proc_watchdog_thresh()
  {get|put}_online_cpus() in proc_watchdog_cpumask()

Signed-off-by: Ulrich Obergfell
---
 kernel/watchdog.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7357842..13fdda1 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -857,6 +857,7 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
 	int err, old, new;
 	int *watchdog_param = (int *)table->data;
 
+	get_online_cpus();
 	mutex_lock(&watchdog_proc_mutex);
 
 	if (watchdog_suspended) {
@@ -908,6 +909,7 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
 	}
 out:
 	mutex_unlock(&watchdog_proc_mutex);
+	put_online_cpus();
 	return err;
 }
 
@@ -949,6 +951,7 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 {
 	int err, old;
 
+	get_online_cpus();
 	mutex_lock(&watchdog_proc_mutex);
 
 	if (watchdog_suspended) {
@@ -974,6 +977,7 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 	}
 out:
 	mutex_unlock(&watchdog_proc_mutex);
+	put_online_cpus();
 	return err;
 }
 
@@ -988,6 +992,7 @@ int proc_watchdog_cpumask(struct ctl_table *table, int write,
 {
 	int err;
 
+	get_online_cpus();
 	mutex_lock(&watchdog_proc_mutex);
 
 	if (watchdog_suspended) {
@@ -1015,6 +1020,7 @@ int proc_watchdog_cpumask(struct ctl_table *table, int write,
 	}
 out:
 	mutex_unlock(&watchdog_proc_mutex);
+	put_online_cpus();
 	return err;
 }
--
1.7.11.7
[PATCH 3/4] watchdog: remove {get|put}_online_cpus() from watchdog_{park|unpark}_threads()
watchdog_{park|unpark}_threads() are now called in code paths that protect
themselves against CPU hotplug, so {get|put}_online_cpus() calls are
redundant and can be removed.

Signed-off-by: Ulrich Obergfell
---
 kernel/watchdog.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 13fdda1..84c4744 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -683,33 +683,35 @@ static struct smp_hotplug_thread watchdog_threads = {
  * be parked and the watchdog threads of other CPUs can still be runnable.
  * Callers are expected to handle this special condition as appropriate in
  * their context.
+ *
+ * This function may only be called in a context that is protected against
+ * races with CPU hotplug - for example, via get_online_cpus().
  */
 static int watchdog_park_threads(void)
 {
 	int cpu, ret = 0;
 
-	get_online_cpus();
 	for_each_watchdog_cpu(cpu) {
 		ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
 		if (ret)
 			break;
 	}
-	put_online_cpus();
 
 	return ret;
 }
 
 /*
  * unpark all watchdog threads that are specified in 'watchdog_cpumask'
+ *
+ * This function may only be called in a context that is protected against
+ * races with CPU hotplug - for example, via get_online_cpus().
  */
 static void watchdog_unpark_threads(void)
 {
 	int cpu;
 
-	get_online_cpus();
 	for_each_watchdog_cpu(cpu)
 		kthread_unpark(per_cpu(softlockup_watchdog, cpu));
-	put_online_cpus();
 }
 
 /*
--
1.7.11.7
[tip:irq/core] irqchip/renesas-intc-irqpin: r8a7778 IRLM setup support
Commit-ID:  26c21dd9885a2d8a4f4d539917c4877ffd399286
Gitweb:     http://git.kernel.org/tip/26c21dd9885a2d8a4f4d539917c4877ffd399286
Author:     Ulrich Hecht
AuthorDate: Wed, 30 Sep 2015 12:03:07 +0200
Committer:  Thomas Gleixner
CommitDate: Wed, 30 Sep 2015 21:02:22 +0200

irqchip/renesas-intc-irqpin: r8a7778 IRLM setup support

Works the same as on r8a7779.

Signed-off-by: Ulrich Hecht
Signed-off-by: Geert Uytterhoeven
Cc: Jason Cooper
Cc: Marc Zyngier
Link: http://lkml.kernel.org/r/1443607387-19147-1-git-send-email-geert+brene...@glider.be
Signed-off-by: Thomas Gleixner
---
 drivers/irqchip/irq-renesas-intc-irqpin.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-renesas-intc-irqpin.c b/drivers/irqchip/irq-renesas-intc-irqpin.c
index 9525335..c325806 100644
--- a/drivers/irqchip/irq-renesas-intc-irqpin.c
+++ b/drivers/irqchip/irq-renesas-intc-irqpin.c
@@ -361,14 +361,16 @@ static const struct irq_domain_ops intc_irqpin_irq_domain_ops = {
 	.xlate = irq_domain_xlate_twocell,
 };
 
-static const struct intc_irqpin_irlm_config intc_irqpin_irlm_r8a7779 = {
+static const struct intc_irqpin_irlm_config intc_irqpin_irlm_r8a777x = {
 	.irlm_bit = 23, /* ICR0.IRLM0 */
 };
 
 static const struct of_device_id intc_irqpin_dt_ids[] = {
 	{ .compatible = "renesas,intc-irqpin", },
+	{ .compatible = "renesas,intc-irqpin-r8a7778",
+	  .data = &intc_irqpin_irlm_r8a777x },
 	{ .compatible = "renesas,intc-irqpin-r8a7779",
-	  .data = &intc_irqpin_irlm_r8a7779 },
+	  .data = &intc_irqpin_irlm_r8a777x },
 	{},
 };
 MODULE_DEVICE_TABLE(of, intc_irqpin_dt_ids);
Re: [PATCH 0/5] improve handling of errors returned by kthread_park()
Andrew,

> ... what inspired this patchset?
> Are you experiencing kthread_park() failures in practice?

I did not experience kthread_park() failures in practice. Looking at
watchdog_park_threads() from 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 I
realized that there is a theoretical corner case which would not be handled
well. Let's assume that kthread_park() would return an error in the
following flow of execution (the user changes watchdog_thresh).

  proc_watchdog_thresh
    set_sample_period()
    //
    // The watchdog_thresh and sample_period variables are now set to
    // the new value.
    //
    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads

Let's say the system has eight CPUs and that kthread_park() failed to park
watchdog/4. In this example watchdog/0 .. watchdog/3 are already parked and
watchdog/5 .. watchdog/7 are not parked yet (we don't know exactly what
happened to watchdog/4). watchdog_park_threads() unparks the threads if
kthread_park() of one thread fails.

  for_each_watchdog_cpu(cpu) {
  	ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
  	if (ret)
  		break;
  }

  if (ret) {
  	for_each_watchdog_cpu(cpu)
  		kthread_unpark(per_cpu(softlockup_watchdog, cpu));
  }

watchdog/0 .. watchdog/3 will pick up the new watchdog_thresh value when
they are unparked (please see the watchdog_enable() function), whereas
watchdog/5 .. watchdog/7 will continue to use the old value for the hard
lockup detector and begin using the new value for the soft lockup detector
(kthread_unpark() sees watchdog/5 .. watchdog/7 in the unparked state, so
it skips these threads). The inconsistency which results from using
different watchdog_thresh values can cause unexpected behaviour of the
lockup detectors (e.g. false positives).

The new error handling that is introduced by this patch set aims to handle
the above corner case in a better way (this was my original motivation to
come up with a patch set).
However, I also think that _if_ kthread_park() would ever be changed in the
future so that it could return errors under various (other) conditions, the
patch set should prepare the watchdog code for this possibility.

Since I did not experience kthread_park() failures in practice, I used some
instrumentation to fake error returns from kthread_park() in order to test
the patches.

Regards,

Uli

- Original Message -
From: "Andrew Morton"
To: "Ulrich Obergfell"
Cc: linux-kernel@vger.kernel.org, dzic...@redhat.com, atom...@redhat.com
Sent: Wednesday, September 30, 2015 1:30:36 AM
Subject: Re: [PATCH 0/5] improve handling of errors returned by kthread_park()

On Mon, 28 Sep 2015 22:44:07 +0200 Ulrich Obergfell wrote:

> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:
>
> - watchdog suspend/resume interface
>
>     lockup_detector_suspend
>       watchdog_park_threads
>
> - write to parameters in /proc/sys/kernel
>
>     proc_watchdog_update
>       watchdog_enable_all_cpus
>         update_watchdog_all_cpus
>           watchdog_park_threads
>
> Instead of 'blindly' attempting to unpark the watchdog threads if a
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains. Failure becomes visible to the
> user as follows:
>
> - error messages from lockup_detector_suspend()
>   or watchdog_enable_all_cpus()
>
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
>
> - the 'write' system call in the latter call chain returns an error

hm, you made me look at kthread parking. Why does it exist?
What is a "parked" thread anyway, and how does it differ from, say, a sleeping one? The 2a1d446019f9a5983ec5a335b changelog is pretty useless and the patch added no useful documentation, sigh. Anwyay... what inspired this patchset? Are you experiencing kthread_park() failures in practice? If so, what is causing them? And what is the user-visible effect of these failures? This is all pretty important context for such a patchset. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:irq/core] irqchip/renesas-intc-irqpin: r8a7778 IRLM setup support
Commit-ID: 26c21dd9885a2d8a4f4d539917c4877ffd399286 Gitweb: http://git.kernel.org/tip/26c21dd9885a2d8a4f4d539917c4877ffd399286 Author: Ulrich Hecht <ulrich.hecht+rene...@gmail.com> AuthorDate: Wed, 30 Sep 2015 12:03:07 +0200 Committer: Thomas Gleixner <t...@linutronix.de> CommitDate: Wed, 30 Sep 2015 21:02:22 +0200 irqchip/renesas-intc-irqpin: r8a7778 IRLM setup support Works the same as on r8a7779. Signed-off-by: Ulrich Hecht <ulrich.hecht+rene...@gmail.com> Signed-off-by: Geert Uytterhoeven <geert+rene...@glider.be> Cc: Jason Cooper <ja...@lakedaemon.net> Cc: Marc Zyngier <marc.zyng...@arm.com> Link: http://lkml.kernel.org/r/1443607387-19147-1-git-send-email-geert+brene...@glider.be Signed-off-by: Thomas Gleixner <t...@linutronix.de> --- drivers/irqchip/irq-renesas-intc-irqpin.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/irqchip/irq-renesas-intc-irqpin.c b/drivers/irqchip/irq-renesas-intc-irqpin.c index 9525335..c325806 100644 --- a/drivers/irqchip/irq-renesas-intc-irqpin.c +++ b/drivers/irqchip/irq-renesas-intc-irqpin.c @@ -361,14 +361,16 @@ static const struct irq_domain_ops intc_irqpin_irq_domain_ops = { .xlate = irq_domain_xlate_twocell, }; -static const struct intc_irqpin_irlm_config intc_irqpin_irlm_r8a7779 = { +static const struct intc_irqpin_irlm_config intc_irqpin_irlm_r8a777x = { .irlm_bit = 23, /* ICR0.IRLM0 */ }; static const struct of_device_id intc_irqpin_dt_ids[] = { { .compatible = "renesas,intc-irqpin", }, + { .compatible = "renesas,intc-irqpin-r8a7778", + .data = _irqpin_irlm_r8a777x }, { .compatible = "renesas,intc-irqpin-r8a7779", - .data = _irqpin_irlm_r8a7779 }, + .data = _irqpin_irlm_r8a777x }, {}, }; MODULE_DEVICE_TABLE(of, intc_irqpin_dt_ids); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] improve handling of errors returned by kthread_park()
Andrew,

> ... what inspired this patchset?
> Are you experiencing kthread_park() failures in practice?

I did not experience kthread_park() failures in practice. Looking at
watchdog_park_threads() from commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28,
I realized that there is a theoretical corner case which would not be
handled well. Let's assume that kthread_park() returns an error in the
following flow of execution (the user changes watchdog_thresh):

  proc_watchdog_thresh
    set_sample_period()
    //
    // The watchdog_thresh and sample_period variables are now set to
    // the new value.
    //
    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads

Let's say the system has eight CPUs and that kthread_park() failed to
park watchdog/4. In this example, watchdog/0 .. watchdog/3 are already
parked and watchdog/5 .. watchdog/7 are not parked yet (we don't know
exactly what happened to watchdog/4). watchdog_park_threads() unparks
the threads if kthread_park() of one thread fails:

  for_each_watchdog_cpu(cpu) {
          ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
          if (ret)
                  break;
  }
  if (ret) {
          for_each_watchdog_cpu(cpu)
                  kthread_unpark(per_cpu(softlockup_watchdog, cpu));
  }

watchdog/0 .. watchdog/3 will pick up the new watchdog_thresh value when
they are unparked (please see the watchdog_enable() function), whereas
watchdog/5 .. watchdog/7 will continue to use the old value for the hard
lockup detector and begin using the new value for the soft lockup
detector (kthread_unpark() sees watchdog/5 .. watchdog/7 in the unparked
state, so it skips these threads).

The inconsistency which results from using different watchdog_thresh
values can cause unexpected behaviour of the lockup detectors (e.g.
false positives). The new error handling that is introduced by this
patch set aims to handle the above corner case in a better way (this
was my original motivation to come up with a patch set).
However, I also think that _if_ kthread_park() were ever changed in the
future so that it could return errors under various (other) conditions,
the patch set would prepare the watchdog code for this possibility.

Since I did not experience kthread_park() failures in practice, I used
some instrumentation to fake error returns from kthread_park() in order
to test the patches.

Regards,

Uli

----- Original Message -----
From: "Andrew Morton" <a...@linux-foundation.org>
To: "Ulrich Obergfell" <uober...@redhat.com>
Cc: linux-kernel@vger.kernel.org, dzic...@redhat.com, atom...@redhat.com
Sent: Wednesday, September 30, 2015 1:30:36 AM
Subject: Re: [PATCH 0/5] improve handling of errors returned by kthread_park()

On Mon, 28 Sep 2015 22:44:07 +0200 Ulrich Obergfell <uober...@redhat.com> wrote:

> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:
>
> - watchdog suspend/resume interface
>
>     lockup_detector_suspend
>       watchdog_park_threads
>
> - write to parameters in /proc/sys/kernel
>
>     proc_watchdog_update
>       watchdog_enable_all_cpus
>         update_watchdog_all_cpus
>           watchdog_park_threads
>
> Instead of 'blindly' attempting to unpark the watchdog threads if a
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains.
> Failure becomes visible to the user as follows:
>
> - error messages from lockup_detector_suspend()
>   or watchdog_enable_all_cpus()
>
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
>
> - the 'write' system call in the latter call chain returns an error

hm, you made me look at kthread parking. Why does it exist? What is a
"parked" thread anyway, and how does it differ from, say, a sleeping
one? The 2a1d446019f9a5983ec5a335b changelog is pretty useless and the
patch added no useful documentation, sigh.

Anyway... what inspired this patchset? Are you experiencing
kthread_park() failures in practice? If so, what is causing them? And
what is the user-visible effect of these failures?

This is all pretty important context for such a patchset.
[PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef
It makes sense to place watchdog_{dis|enable}_all_cpus() outside of the
ifdef so that _both_ functions are available even if CONFIG_SYSCTL is
not defined.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index cd9a504..eb9527c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -347,6 +347,9 @@ static void watchdog_interrupt_count(void)
 static int watchdog_nmi_enable(unsigned int cpu);
 static void watchdog_nmi_disable(unsigned int cpu);
 
+static int watchdog_enable_all_cpus(void);
+static void watchdog_disable_all_cpus(void);
+
 /* watchdog kicker functions */
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
@@ -756,9 +759,6 @@ static int watchdog_enable_all_cpus(void)
 	return err;
 }
 
-/* prepare/enable/disable routines */
-/* sysctl functions */
-#ifdef CONFIG_SYSCTL
 static void watchdog_disable_all_cpus(void)
 {
 	if (watchdog_running) {
@@ -767,6 +767,8 @@ static void watchdog_disable_all_cpus(void)
 	}
 }
 
+#ifdef CONFIG_SYSCTL
+
 /*
  * Update the run state of the lockup detectors.
  */
--
1.7.11.7
[PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers
update_watchdog_all_cpus() now passes errors from watchdog_park_threads()
up to functions in the call chain. This allows watchdog_enable_all_cpus()
and proc_watchdog_update() to handle such errors too.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index eb9527c..457113c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -731,10 +731,17 @@ void lockup_detector_resume(void)
 	mutex_unlock(&watchdog_proc_mutex);
 }
 
-static void update_watchdog_all_cpus(void)
+static int update_watchdog_all_cpus(void)
 {
-	watchdog_park_threads();
+	int ret;
+
+	ret = watchdog_park_threads();
+	if (ret)
+		return ret;
+
 	watchdog_unpark_threads();
+
+	return 0;
 }
 
 static int watchdog_enable_all_cpus(void)
@@ -753,9 +760,17 @@ static int watchdog_enable_all_cpus(void)
 		 * Enable/disable the lockup detectors or
 		 * change the sample period 'on the fly'.
		 */
-		update_watchdog_all_cpus();
+		err = update_watchdog_all_cpus();
+
+		if (err) {
+			watchdog_disable_all_cpus();
+			pr_err("Failed to update lockup detectors, disabled\n");
+		}
 	}
 
+	if (err)
+		watchdog_enabled = 0;
+
 	return err;
 }
 
@@ -851,12 +866,13 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
 		} while (cmpxchg(&watchdog_enabled, old, new) != old);
 
 		/*
-		 * Update the run state of the lockup detectors.
-		 * Restore 'watchdog_enabled' on failure.
+		 * Update the run state of the lockup detectors. There is _no_
+		 * need to check the value returned by proc_watchdog_update()
+		 * and to restore the previous value of 'watchdog_enabled' as
+		 * both lockup detectors are disabled if proc_watchdog_update()
+		 * returns an error.
 		 */
 		err = proc_watchdog_update();
-		if (err)
-			watchdog_enabled = old;
 	}
out:
 	mutex_unlock(&watchdog_proc_mutex);
--
1.7.11.7
[PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh()
Restore the previous value of watchdog_thresh _and_ sample_period if
proc_watchdog_update() returns an error. The variables must be consistent
to avoid false positives of the lockup detectors.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 64ed1c3..cd9a504 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -914,13 +914,14 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 		goto out;
 
 	/*
-	 * Update the sample period.
-	 * Restore 'watchdog_thresh' on failure.
+	 * Update the sample period. Restore on failure.
 	 */
 	set_sample_period();
 	err = proc_watchdog_update();
-	if (err)
+	if (err) {
 		watchdog_thresh = old;
+		set_sample_period();
+	}
out:
 	mutex_unlock(&watchdog_proc_mutex);
 	return err;
--
1.7.11.7
[PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error
If kthread_park() returns an error, watchdog_park_threads() should not
blindly 'roll back' the already parked threads to the unparked state.
Instead, leave it up to the callers to handle such errors appropriately
in their context. For example, it is redundant to unpark the threads if
the lockup detectors will soon be disabled by the callers anyway.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 3bc22a9..af70bf2 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -654,6 +654,12 @@ static struct smp_hotplug_thread watchdog_threads = {
 
 /*
  * park all watchdog threads that are specified in 'watchdog_cpumask'
+ *
+ * This function returns an error if kthread_park() of a watchdog thread
+ * fails. In this situation, the watchdog threads of some CPUs can already
+ * be parked and the watchdog threads of other CPUs can still be runnable.
+ * Callers are expected to handle this special condition as appropriate in
+ * their context.
  */
 static int watchdog_park_threads(void)
 {
@@ -665,10 +671,6 @@ static int watchdog_park_threads(void)
 		if (ret)
 			break;
 	}
-	if (ret) {
-		for_each_watchdog_cpu(cpu)
-			kthread_unpark(per_cpu(softlockup_watchdog, cpu));
-	}
 	put_online_cpus();
 
 	return ret;
--
1.7.11.7
[PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend()
lockup_detector_suspend() now handles errors from watchdog_park_threads().

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 457113c..3bc22a9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -707,6 +707,11 @@ int lockup_detector_suspend(void)
 
 	if (ret == 0)
 		watchdog_suspended++;
+	else {
+		watchdog_disable_all_cpus();
+		pr_err("Failed to suspend lockup detectors, disabled\n");
+		watchdog_enabled = 0;
+	}
 
 	mutex_unlock(&watchdog_proc_mutex);
--
1.7.11.7
[PATCH 0/5] improve handling of errors returned by kthread_park()
The original watchdog_park_threads() function that was introduced by
commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
approach to handle errors returned by kthread_park(): It attempts to
roll back all watchdog threads to the unparked state. However, this may
be undesired behaviour from the perspective of the caller which may
want to handle errors as appropriate in its specific context.

Currently, there are two possible call chains:

- watchdog suspend/resume interface

    lockup_detector_suspend
      watchdog_park_threads

- write to parameters in /proc/sys/kernel

    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads

Instead of 'blindly' attempting to unpark the watchdog threads if a
kthread_park() call fails, the new approach is to disable the lockup
detectors in the above call chains. Failure becomes visible to the user
as follows:

- error messages from lockup_detector_suspend()
  or watchdog_enable_all_cpus()

- the state that can be read from /proc/sys/kernel/watchdog_enabled

- the 'write' system call in the latter call chain returns an error

Ulrich Obergfell (5):
  watchdog: fix error handling in proc_watchdog_thresh()
  watchdog: move watchdog_disable_all_cpus() outside of ifdef
  watchdog: implement error handling in update_watchdog_all_cpus() and
    callers
  watchdog: implement error handling in lockup_detector_suspend()
  watchdog: do not unpark threads in watchdog_park_threads() on error

 kernel/watchdog.c | 60 +++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 43 insertions(+), 17 deletions(-)

--
1.7.11.7
[PATCH 2/2] watchdog: use pr_debug() in fixup_ht_bug() failure path
Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0357bf7..abb25c3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3369,7 +3369,7 @@ static __init int fixup_ht_bug(void)
 	}
 
 	if (lockup_detector_suspend() != 0) {
-		pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 workaround\n");
+		pr_debug("failed to disable PMU erratum BJ122, BV98, HSD29 workaround\n");
 		return 0;
 	}
--
1.7.11.7
[PATCH 0/2] watchdog: avoid races in watchdog_nmi_{en|disable} functions - follow up patches
This small series follows up on the patch discussion related to the
$subject patch series which is now in linux-next at:

- introduce watchdog_park_threads() and watchdog_unpark_threads()
  https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=89f536017155c7f76c8a620e6aed03809e53cd4c

- introduce watchdog_suspend() and watchdog_resume()
  https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=e0dd9ee3ecf3394e93343cf9542d8ca1b54eca08

- use park/unpark functions in update_watchdog_all_cpus()
  https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=eef1a759acfb98766fef20afca9e06f014299e3f

- use suspend/resume interface in fixup_ht_bug()
  https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=77592a82363dd508ce3dbe4eaaa2c6c45b966ebb

Changes based on patch discussion:

- Patch 1/2: rename watchdog_{suspend|resume}, add comment blocks
  http://marc.info/?l=linux-kernel&m=143844050610220&w=2
  http://marc.info/?l=linux-kernel&m=143876224114821&w=2

- Patch 2/2: use pr_debug() instead of pr_info()
  http://marc.info/?l=linux-kernel&m=143869949229461&w=2

Ulrich Obergfell (2):
  watchdog: rename watchdog_suspend() and watchdog_resume()
  watchdog: use pr_debug() in fixup_ht_bug() failure path

 arch/x86/kernel/cpu/perf_event_intel.c |  6 +++---
 include/linux/nmi.h                    |  8 ++++----
 kernel/watchdog.c                      | 26 ++++++++++++++++++++++----
 3 files changed, 29 insertions(+), 11 deletions(-)

--
1.7.11.7
[PATCH 1/2] watchdog: rename watchdog_suspend() and watchdog_resume()
Rename watchdog_suspend() to lockup_detector_suspend() and
watchdog_resume() to lockup_detector_resume() to avoid confusion with
the watchdog subsystem and to be consistent with the existing name
lockup_detector_init().

Also provide comment blocks to explain the watchdog_running and
watchdog_suspended variables and their relationship.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c |  4 ++--
 include/linux/nmi.h                    |  8 ++++----
 kernel/watchdog.c                      | 26 ++++++++++++++++++++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index d4e1b0c..0357bf7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3368,7 +3368,7 @@ static __init int fixup_ht_bug(void)
 		return 0;
 	}
 
-	if (watchdog_suspend() != 0) {
+	if (lockup_detector_suspend() != 0) {
 		pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 workaround\n");
 		return 0;
 	}
@@ -3379,7 +3379,7 @@ static __init int fixup_ht_bug(void)
 	x86_pmu.commit_scheduling = NULL;
 	x86_pmu.stop_scheduling = NULL;
 
-	watchdog_resume();
+	lockup_detector_resume();
 
 	get_online_cpus();
 
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 46e28e9e..78488e0 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -84,15 +84,15 @@ extern int proc_watchdog_thresh(struct ctl_table *, int,
 				void __user *, size_t *, loff_t *);
 extern int proc_watchdog_cpumask(struct ctl_table *, int,
 				 void __user *, size_t *, loff_t *);
-extern int watchdog_suspend(void);
-extern void watchdog_resume(void);
+extern int lockup_detector_suspend(void);
+extern void lockup_detector_resume(void);
 #else
-static inline int watchdog_suspend(void)
+static inline int lockup_detector_suspend(void)
 {
 	return 0;
 }
-static inline void watchdog_resume(void)
+static inline void lockup_detector_resume(void)
 {
 }
 #endif
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 69666f4..64ed1c3 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -67,8 +67,26 @@ unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
 #define for_each_watchdog_cpu(cpu) \
 	for_each_cpu_and((cpu), cpu_online_mask, &watchdog_cpumask)
 
-static int __read_mostly watchdog_suspended;
+/*
+ * The 'watchdog_running' variable is set to 1 when the watchdog threads
+ * are registered/started and is set to 0 when the watchdog threads are
+ * unregistered/stopped, so it is an indicator whether the threads exist.
+ */
 static int __read_mostly watchdog_running;
+
+/*
+ * If a subsystem has a need to deactivate the watchdog temporarily, it
+ * can use the suspend/resume interface to achieve this. The content of
+ * the 'watchdog_suspended' variable reflects this state. Existing threads
+ * are parked/unparked by the lockup_detector_{suspend|resume} functions
+ * (see comment blocks pertaining to those functions for further details).
+ *
+ * 'watchdog_suspended' also prevents threads from being registered/started
+ * or unregistered/stopped via parameters in /proc/sys/kernel, so the state
+ * of 'watchdog_running' cannot change while the watchdog is deactivated
+ * temporarily (see related code in 'proc' handlers).
+ */
+static int __read_mostly watchdog_suspended;
+
 static u64 __read_mostly sample_period;
 
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
@@ -669,7 +687,7 @@ static void watchdog_unpark_threads(void)
 /*
  * Suspend the hard and soft lockup detector by parking the watchdog threads.
  */
-int watchdog_suspend(void)
+int lockup_detector_suspend(void)
 {
 	int ret = 0;
 
@@ -679,7 +697,7 @@ int watchdog_suspend(void)
 	 * the 'watchdog_suspended' variable). If the watchdog threads are
 	 * running, the first caller takes care that they will be parked.
 	 * The state of 'watchdog_running' cannot change while a suspend
-	 * request is active (see related changes in 'proc' handlers).
+	 * request is active (see related code in 'proc' handlers).
 	 */
 	if (watchdog_running && !watchdog_suspended)
 		ret = watchdog_park_threads();
@@ -695,7 +713,7 @@ int watchdog_suspend(void)
 /*
  * Resume the hard and soft lockup detector by unparking the watchdog threads.
  */
-void watchdog_resume(void)
+void lockup_detector_resume(void)
 {
 	mutex_lock(&watchdog_proc_mutex);
 
--
1.7.11.7
[PATCH 0/2] watchdog: avoid races in watchdog_nmi_{en|disable} functions - follow up patches
This small series follows up to patch discussion related to $subject patch series which is now in linux-next at: - introduce watchdog_park_threads() and watchdog_unpark_threads() https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=89f536017155c7f76c8a620e6aed03809e53cd4c - introduce watchdog_suspend() and watchdog_resume() https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=e0dd9ee3ecf3394e93343cf9542d8ca1b54eca08 - use park/unpark functions in update_watchdog_all_cpus() https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=eef1a759acfb98766fef20afca9e06f014299e3f - use suspend/resume interface in fixup_ht_bug() https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=77592a82363dd508ce3dbe4eaaa2c6c45b966ebb Changes based on patch discussion: - Patch 1/2: rename watchdog_{suspend|resume}, add comment blocks http://marc.info/?l=linux-kernelm=143844050610220w=2 http://marc.info/?l=linux-kernelm=143876224114821w=2 - Patch 2/2: use pr_debug() instead of pr_info() http://marc.info/?l=linux-kernelm=143869949229461w=2 Ulrich Obergfell (2): watchdog: rename watchdog_suspend() and watchdog_resume() watchdog: use pr_debug() in fixup_ht_bug() failure path arch/x86/kernel/cpu/perf_event_intel.c | 6 +++--- include/linux/nmi.h| 8 kernel/watchdog.c | 26 ++ 3 files changed, 29 insertions(+), 11 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] watchdog: rename watchdog_suspend() and watchdog_resume()
Rename watchdog_suspend() to lockup_detector_suspend() and watchdog_resume()
to lockup_detector_resume() to avoid confusion with the watchdog subsystem
and to be consistent with the existing name lockup_detector_init().

Also provide comment blocks to explain the watchdog_running and
watchdog_suspended variables and their relationship.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c |  4 ++--
 include/linux/nmi.h                    |  8 ++++----
 kernel/watchdog.c                      | 26 ++++++++++++++++++++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index d4e1b0c..0357bf7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3368,7 +3368,7 @@ static __init int fixup_ht_bug(void)
 		return 0;
 	}
 
-	if (watchdog_suspend() != 0) {
+	if (lockup_detector_suspend() != 0) {
 		pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 workaround\n");
 		return 0;
 	}
@@ -3379,7 +3379,7 @@ static __init int fixup_ht_bug(void)
 	x86_pmu.commit_scheduling = NULL;
 	x86_pmu.stop_scheduling = NULL;
 
-	watchdog_resume();
+	lockup_detector_resume();
 
 	get_online_cpus();
 
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 46e28e9e..78488e0 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -84,15 +84,15 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
 				void __user *, size_t *, loff_t *);
 extern int proc_watchdog_cpumask(struct ctl_table *, int,
 				 void __user *, size_t *, loff_t *);
-extern int watchdog_suspend(void);
-extern void watchdog_resume(void);
+extern int lockup_detector_suspend(void);
+extern void lockup_detector_resume(void);
 #else
-static inline int watchdog_suspend(void)
+static inline int lockup_detector_suspend(void)
 {
 	return 0;
 }
-static inline void watchdog_resume(void)
+static inline void lockup_detector_resume(void)
 {
 }
 #endif
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 69666f4..64ed1c3 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -67,8 +67,26 @@ unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
 #define for_each_watchdog_cpu(cpu) \
 	for_each_cpu_and((cpu), cpu_online_mask, &watchdog_cpumask)
 
-static int __read_mostly watchdog_suspended;
+/*
+ * The 'watchdog_running' variable is set to 1 when the watchdog threads
+ * are registered/started and is set to 0 when the watchdog threads are
+ * unregistered/stopped, so it is an indicator whether the threads exist.
+ */
 static int __read_mostly watchdog_running;
+
+/*
+ * If a subsystem has a need to deactivate the watchdog temporarily, it
+ * can use the suspend/resume interface to achieve this. The content of
+ * the 'watchdog_suspended' variable reflects this state. Existing threads
+ * are parked/unparked by the lockup_detector_{suspend|resume} functions
+ * (see comment blocks pertaining to those functions for further details).
+ *
+ * 'watchdog_suspended' also prevents threads from being registered/started
+ * or unregistered/stopped via parameters in /proc/sys/kernel, so the state
+ * of 'watchdog_running' cannot change while the watchdog is deactivated
+ * temporarily (see related code in 'proc' handlers).
+ */
+static int __read_mostly watchdog_suspended;
+
 static u64 __read_mostly sample_period;
 
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
@@ -669,7 +687,7 @@ static void watchdog_unpark_threads(void)
 /*
  * Suspend the hard and soft lockup detector by parking the watchdog threads.
  */
-int watchdog_suspend(void)
+int lockup_detector_suspend(void)
 {
 	int ret = 0;
 
@@ -679,7 +697,7 @@ int watchdog_suspend(void)
 	 * the 'watchdog_suspended' variable). If the watchdog threads are
 	 * running, the first caller takes care that they will be parked.
 	 * The state of 'watchdog_running' cannot change while a suspend
-	 * request is active (see related changes in 'proc' handlers).
+	 * request is active (see related code in 'proc' handlers).
 	 */
 	if (watchdog_running && !watchdog_suspended)
 		ret = watchdog_park_threads();
@@ -695,7 +713,7 @@ int watchdog_suspend(void)
 /*
  * Resume the hard and soft lockup detector by unparking the watchdog threads.
  */
-void watchdog_resume(void)
+void lockup_detector_resume(void)
 {
 	mutex_lock(&watchdog_proc_mutex);
--
1.7.11.7
[PATCH 2/2] watchdog: use pr_debug() in fixup_ht_bug() failure path
Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0357bf7..abb25c3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3369,7 +3369,7 @@ static __init int fixup_ht_bug(void)
 	}
 
 	if (lockup_detector_suspend() != 0) {
-		pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 workaround\n");
+		pr_debug("failed to disable PMU erratum BJ122, BV98, HSD29 workaround\n");
 		return 0;
 	}
--
1.7.11.7
Re: [PATCH 2/4] watchdog: introduce watchdog_suspend() and watchdog_resume()
> ----- Original Message -----
> From: "Andrew Morton"
> ...
> On Sat, 1 Aug 2015 14:49:23 +0200 Ulrich Obergfell wrote:
>
>> This interface can be utilized to deactivate the hard and soft lockup
>> detector temporarily. Callers are expected to minimize the duration of
>> deactivation. Multiple deactivations are allowed to occur in parallel
>> but should be rare in practice.
>>
>> ...
>>
>> --- a/include/linux/nmi.h
>> +++ b/include/linux/nmi.h
>> @@ -80,6 +80,8 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
>>  				void __user *, size_t *, loff_t *);
>>  extern int proc_watchdog_cpumask(struct ctl_table *, int,
>>  				 void __user *, size_t *, loff_t *);
>> +extern int watchdog_suspend(void);
>> +extern void watchdog_resume(void);
>>  #endif
>>
>>  #ifdef CONFIG_HAVE_ACPI_APEI_NMI
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 5571f20..98d44b1 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -67,6 +67,7 @@ unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
>>  #define for_each_watchdog_cpu(cpu) \
>>  	for_each_cpu_and((cpu), cpu_online_mask, &watchdog_cpumask)
>>
>> +static int __read_mostly watchdog_suspended = 0;
>
> With my compiler the "= 0" increases the size of watchdog.o data. For
> some reason by 16 bytes(!).

I see that you already fixed this. Many Thanks.

>>  static int __read_mostly watchdog_running;
>>  static u64 __read_mostly sample_period;
>
> The relationship between watchdog_running and watchdog_suspended hurts
> my brain a bit. It appears that all watchdog_running transitions
> happen under watchdog_proc_mutex so I don't think it's racy, but I
> wonder if things would be simpler if these were folded into a single
> up/down counter.

The 'watchdog_running' variable indicates whether watchdog threads exist.
[Whether they have been launched via smpboot_register_percpu_thread().]
The 'watchdog_suspended' variable indicates whether existing threads are
currently parked, and whether multiple requests to suspend the watchdog
have been made by different callers.

I think folding them into one variable would not improve the readability
of the code, because instead of two variables with distinct semantics we
would then have one variable with multiple semantics (which I think would
also make code more complex).

If we would want to improve the readability I'd prefer to either just add
a comment block to explain the semantics of the variables or to rename
them -for example- like:

  watchdog_running   to  watchdog_threads_exist
  watchdog_suspended to  watchdog_suspend_count

Please let me know if you would want any changes in this regard.

I also received these suggestions:

- rename the watchdog_{suspend|resume} functions as discussed in
  http://marc.info/?l=linux-kernel&m=143844050610220&w=2
- use pr_debug() instead of pr_info() as discussed in
  http://marc.info/?l=linux-kernel&m=143869949229461&w=2

Please let me know if you want me to post a 'version 2' of the patch set
or if you want me to post these changes as separate follow-up patches.

Regards,

Uli
Re: [PATCH 1/4] watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads()
> ----- Original Message -----
> From: "Michal Hocko"
> ...
> On Sat 01-08-15 14:49:22, Ulrich Obergfell wrote:
>> These functions are intended to be used only from inside kernel/watchdog.c
>> to park/unpark all watchdog threads that are specified in watchdog_cpumask.
>
> I would suggest merging this into Patch2. It is usually better to add
> new functions along with their users.

Michal,

watchdog_{park|unpark}_threads are called by watchdog_{suspend|resume} in
Patch 2/4 and by update_watchdog_all_cpus() in Patch 3/4, so I would have
to merge three patches into one, and I would end up with one large patch
and one smaller patch. I think the result would be harder to read, review
and understand. Thus I'd prefer to leave the patches split as they
currently are.

Regards,

Uli
Re: [PATCH 4/4] watchdog: use suspend/resume interface in fixup_ht_bug()
> ----- Original Message -----
> From: "Don Zickus"
> ...
> On Tue, Aug 04, 2015 at 03:31:30PM +0200, Michal Hocko wrote:
>> On Sat 01-08-15 14:49:25, Ulrich Obergfell wrote:
>> [...]
>> > @@ -3368,7 +3368,10 @@ static __init int fixup_ht_bug(void)
>> >  		return 0;
>> >  	}
>> >
>> > -	watchdog_nmi_disable_all();
>> > +	if (watchdog_suspend() != 0) {
>> > +		pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 workaround\n");
>> > +		return 0;
>> > +	}
>>
>> Is this really worth reporting to the log? What is an admin supposed to
>> do about it?
>
> I think it was more for developers to aid in debugging a strange behaviour
> of the performance counters.
>
>> Ok, so kthread_park fails only when the kernel thread has already
>> exited. Can this ever happen during this call path?
>
> It might be overkill, but it is just a harmless informational failure
> message.

Don, Michal,

the module prints a message if the workaround is enabled and if the
workaround is disabled. Hence, I think we should keep the messages
consistent and thus inform the user also if we fail to disable the
workaround. Even though at the moment this seems to be an unlikely
failure case, I agree with Don that the message could be useful in
debugging.

Regards,

Uli