[PATCH] serial: sh-sci: remove obsolete latency workaround

2021-04-15 Thread Ulrich Hecht
Since the transition to hrtimers there is no more need to set a minimum
RX timeout to work around latency issues.

Signed-off-by: Ulrich Hecht 
---
 drivers/tty/serial/sh-sci.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
index e3af97a59856..ef37fdf37612 100644
--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -2609,21 +2609,10 @@ static void sci_set_termios(struct uart_port *port, 
struct ktermios *termios,
udelay(DIV_ROUND_UP(10 * 1000000, baud));
}
 
-   /*
-* Calculate delay for 2 DMA buffers (4 FIFO).
-* See serial_core.c::uart_update_timeout().
-* With 10 bits (CS8), 250Hz, 115200 baud and 64 bytes FIFO, the above
-* function calculates 1 jiffie for the data plus 5 jiffies for the
-* "slop(e)." Then below we calculate 5 jiffies (20ms) for 2 DMA
-* buffers (4 FIFO sizes), but when performing a faster transfer, the
-* value obtained by this formula is too small. Therefore, if the value
-* is smaller than 20ms, use 20ms as the timeout value for DMA.
-*/
+   /* Calculate delay for 2 DMA buffers (4 FIFO). */
s->rx_frame = (10000 * bits) / (baud / 100);
 #ifdef CONFIG_SERIAL_SH_SCI_DMA
s->rx_timeout = s->buf_len_rx * 2 * s->rx_frame;
-   if (s->rx_timeout < 20)
-   s->rx_timeout = 20;
 #endif
 
if ((termios->c_cflag & CREAD) != 0)
-- 
2.20.1



Antw: [EXT] [systemd-devel] [SPECIFICATION RFC] The firmware and bootloader log specification

2020-11-15 Thread Ulrich Windl
>>> Daniel Kiper  wrote on 2020-11-14 at 00:52 in
message <20201113235242.k6fzlwmwm2xqh...@tomti.i.net-space.pl>:
...
> The members of struct bf_log_msg:
>   - size: total size of bf_log_msg struct,
>   - ts_nsec: timestamp expressed in nanoseconds starting from 0,

Who or what defines t == 0?
...

Regards,
Ulrich Windl



FYI: PoC: Running 100000 processes in 5.3.18 (SLES15 SP2)

2020-10-02 Thread Ulrich Windl
Hi!

Just in case someone is interested: As a proof of concept I started 100 
thousand processes on a big machine (72 cores). It worked!
However, starting them took more than 30 minutes, and top needs more than 30 
minutes to refresh its display. Still, interactive input via SSH works nicely, 
but any file-system access seems quite slow (my test processes just use CPU; 
they do not do any I/O).

Kernel messages while the processes were created:
kernel: [65648.247688] perf: interrupt took too long (2516 > 2500), lowering 
kernel.perf_event_max_sample_rate to 79250
kernel: [65997.263218] perf: interrupt took too long (3146 > 3145), lowering 
kernel.perf_event_max_sample_rate to 63500
kernel: [66790.221057] perf: interrupt took too long (3938 > 3932), lowering 
kernel.perf_event_max_sample_rate to 50750
kernel: [69884.371426] perf: interrupt took too long (4925 > 4922), lowering 
kernel.perf_event_max_sample_rate to 40500

Last top output (more than 30 minutes late):
top - 11:16:19 up 19:19,  3 users,  load average: 64164.56, 62997.24, 55597.09
Tasks: 101432 total, 60249 running, 41183 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.0 us,  2.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 772127.6+total, 755924.2+free, 14253.01+used, 1950.363 buff/cache
MiB Swap: 773120.0+total, 772958.1+free,  161.816 used. 754248.8+avail Mem
...

That's a load, isn't it? ;-)

 # cat /proc/uptime
72084.21 9356423.41
# cat /proc/loadavg
64188.31 64188.81 63636.08 64228/102328 134935

Regards,
Ulrich






Re: [PATCH v3] pinctrl: renesas: r8a7790: Add VIN1-B and VIN2-G pins, groups and functions

2020-09-19 Thread Ulrich Hecht
Hi!

Thanks for the patch. Verified it against the data sheet; the magic numbers 
check out.

Reviewed-by: Ulrich Hecht 

CU
Uli

> On 09/17/2020 9:59 PM Lad Prabhakar  
> wrote:
> 
>  
> Add pins, groups and functions for the VIN1-B [data/sync/field/clkenb/clk]
> and VIN2-G8.
> 
> Signed-off-by: Lad Prabhakar 
> Reviewed-by: Biju Das 
> ---
> v2->v3:
> * Included vin1_data4_b, field_b  and clkenb_b
> * Renamed vin2_data8g to vin2_g8
> * Rebased patch on latest changes
> 
> v1->v2:
> * Added complete list of VIN1-B pins
> * Renamed vin2_data8_g to vin2_data8g
> * Sorted vin1_sync_b pins
> 
> v1 - https://patchwork.kernel.org/patch/11761191/
> ---
>  drivers/pinctrl/renesas/pfc-r8a7790.c | 132 +-
>  1 file changed, 131 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pinctrl/renesas/pfc-r8a7790.c 
> b/drivers/pinctrl/renesas/pfc-r8a7790.c
> index 60f973c5dffe..3f48d3d879f7 100644
> --- a/drivers/pinctrl/renesas/pfc-r8a7790.c
> +++ b/drivers/pinctrl/renesas/pfc-r8a7790.c
> @@ -3866,6 +3866,72 @@ static const unsigned int vin1_data18_mux[] = {
>   VI1_R4_MARK, VI1_R5_MARK,
>   VI1_R6_MARK, VI1_R7_MARK,
>  };
> +static const union vin_data vin1_data_b_pins = {
> + .data24 = {
> + /* B */
> + RCAR_GP_PIN(3, 0), RCAR_GP_PIN(3, 1),
> + RCAR_GP_PIN(3, 2), RCAR_GP_PIN(3, 3),
> + RCAR_GP_PIN(3, 4), RCAR_GP_PIN(3, 5),
> + RCAR_GP_PIN(3, 6), RCAR_GP_PIN(3, 7),
> + /* G */
> + RCAR_GP_PIN(1, 14), RCAR_GP_PIN(1, 15),
> + RCAR_GP_PIN(1, 17), RCAR_GP_PIN(1, 20),
> + RCAR_GP_PIN(1, 22), RCAR_GP_PIN(1, 12),
> + RCAR_GP_PIN(1, 9), RCAR_GP_PIN(1, 7),
> + /* R */
> + RCAR_GP_PIN(0, 27), RCAR_GP_PIN(0, 28),
> + RCAR_GP_PIN(0, 29), RCAR_GP_PIN(1, 4),
> + RCAR_GP_PIN(1, 5), RCAR_GP_PIN(1, 6),
> + RCAR_GP_PIN(1, 10), RCAR_GP_PIN(1, 8),
> + },
> +};
> +static const union vin_data vin1_data_b_mux = {
> + .data24 = {
> + /* B */
> + VI1_DATA0_VI1_B0_B_MARK, VI1_DATA1_VI1_B1_B_MARK,
> + VI1_DATA2_VI1_B2_B_MARK, VI1_DATA3_VI1_B3_B_MARK,
> + VI1_DATA4_VI1_B4_B_MARK, VI1_DATA5_VI1_B5_B_MARK,
> + VI1_DATA6_VI1_B6_B_MARK, VI1_DATA7_VI1_B7_B_MARK,
> + /* G */
> + VI1_G0_B_MARK, VI1_G1_B_MARK,
> + VI1_G2_B_MARK, VI1_G3_B_MARK,
> + VI1_G4_B_MARK, VI1_G5_B_MARK,
> + VI1_G6_B_MARK, VI1_G7_B_MARK,
> + /* R */
> + VI1_R0_B_MARK, VI1_R1_B_MARK,
> + VI1_R2_B_MARK, VI1_R3_B_MARK,
> + VI1_R4_B_MARK, VI1_R5_B_MARK,
> + VI1_R6_B_MARK, VI1_R7_B_MARK,
> + },
> +};
> +static const unsigned int vin1_data18_b_pins[] = {
> + /* B */
> + RCAR_GP_PIN(3, 2), RCAR_GP_PIN(3, 3),
> + RCAR_GP_PIN(3, 4), RCAR_GP_PIN(3, 5),
> + RCAR_GP_PIN(3, 6), RCAR_GP_PIN(3, 7),
> + /* G */
> + RCAR_GP_PIN(1, 17), RCAR_GP_PIN(1, 20),
> + RCAR_GP_PIN(1, 22), RCAR_GP_PIN(1, 12),
> + RCAR_GP_PIN(1, 9), RCAR_GP_PIN(1, 7),
> + /* R */
> + RCAR_GP_PIN(0, 29), RCAR_GP_PIN(1, 4),
> + RCAR_GP_PIN(1, 5), RCAR_GP_PIN(1, 6),
> + RCAR_GP_PIN(1, 10), RCAR_GP_PIN(1, 8),
> +};
> +static const unsigned int vin1_data18_b_mux[] = {
> + /* B */
> + VI1_DATA2_VI1_B2_B_MARK, VI1_DATA3_VI1_B3_B_MARK,
> + VI1_DATA4_VI1_B4_B_MARK, VI1_DATA5_VI1_B5_B_MARK,
> + VI1_DATA6_VI1_B6_B_MARK, VI1_DATA7_VI1_B7_B_MARK,
> + /* G */
> + VI1_G2_B_MARK, VI1_G3_B_MARK,
> + VI1_G4_B_MARK, VI1_G5_B_MARK,
> + VI1_G6_B_MARK, VI1_G7_B_MARK,
> + /* R */
> + VI1_R2_B_MARK, VI1_R3_B_MARK,
> + VI1_R4_B_MARK, VI1_R5_B_MARK,
> + VI1_R6_B_MARK, VI1_R7_B_MARK,
> +};
>  static const unsigned int vin1_sync_pins[] = {
>   RCAR_GP_PIN(1, 24), /* HSYNC */
>   RCAR_GP_PIN(1, 25), /* VSYNC */
> @@ -3874,24 +3940,50 @@ static const unsigned int vin1_sync_mux[] = {
>   VI1_HSYNC_N_MARK,
>   VI1_VSYNC_N_MARK,
>  };
> +static const unsigned int vin1_sync_b_pins[] = {
> + RCAR_GP_PIN(1, 24), /* HSYNC */
> + RCAR_GP_PIN(1, 25), /* VSYNC */
> +};
> +static const unsigned int vin1_sync_b_mux[] = {
> + VI1_HSYNC_N_B_MARK,
> + VI1_VSYNC_N_B_MARK,
> +};
>  static const unsigned int vin1_field_pins[] = {
>   RCAR_GP_PIN(1, 13),
>  };
>  static const unsigned int vin1_field_mux[] = {
>   VI1_FIELD_MARK,
>  };
> +static const unsigned int vin1_field_b_pins[] = {
> + RCAR_GP_PIN(1, 13),
> +};
> +static const unsigned int vin1_fie

heads up: clock_gettime(CLOCK_MONOTONIC, ...) is not monotonic with Xen PVM

2020-08-11 Thread Ulrich Windl
.tv_nsec);
printf("%s: largest delta is 0.%09ld\n", __func__,
  deltas[CLOCK_SAMPLES - 2].tv_nsec);
if ( deltas[u].tv_nsec > tsp->tv_nsec )
tsp->tv_nsec = deltas[0].tv_nsec;
leave:  return(result);
}

/* main */
int main(int argc, char *argv[])
{
int result  = 0;
ts_t res;

get_res();
printf("resolution is 0.%09ld\n", res.tv_nsec);
return(result);
}
- it is intentional that the program aborts on error
Output from a newer machine:
get_res: resolution is 0.1
get_res: smallest delta is 0.00030
get_res: largest delta is 0.00050
resolution is 0.00030

Regards,
Ulrich Windl
(Keep me on CC if I should read your replies)



Q: Getting consistent CPU status rates from /proc/stat

2020-07-02 Thread Ulrich Windl
#!/usr/bin/perl
# Check consistency of CPU states in /proc/stat
# written for SLES12 SP5 and PERL 5.18 by Ulrich Windl

use strict;
use warnings;

use Fcntl qw(:flock SEEK_SET);

my ($rdiff, $rcount, $reading, @readings) = (0, 0, 0, [], []);
my @all_cpu;
if (open(my $fh, '<', my $file = '/proc/stat')) {
do {
my $rref = $readings[$reading];

seek($fh, 0, SEEK_SET) or die "seek() failed SEEK_SET to 0: $!\n";
while (<$fh>) {
my @vals;
if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(@vals, $^N) }))+$/) {
$cpu = length($cpu) > 0 ? $cpu + 1 : 0;
$rref->[$cpu] = \@vals;
}
}
if (++$rcount > 1) {
my ($old, $new) = ($readings[($reading + 1) % 2],
   $readings[$reading]);
$rdiff = 0;
for (my $i = $#$rref; $i >= 0; --$i) {
my ($ocr, $ncr) = ($old->[$i], $new->[$i]);

for (my $j = $#$ncr; $j >= 0; --$j) {
if ((my $delta = $ncr->[$j] - $ocr->[$j]) != 0) {
print "reading $rcount: cpu #$i stat #$j has delta " .
"$delta\n";
$rdiff += abs($delta);
}
}
}
}
$reading = ($reading + 1) % 2;
} while ($rcount < 2 || $rdiff > 0);
@all_cpu = @{$readings[$reading]};  # consistent reading
close($fh);
if ((my $n = scalar(@all_cpu)) >= 2) {
my ($sref, $cref);

$sref = $all_cpu[0];
$cref = $all_cpu[$n] = [ map { 0 } (0 .. $#$sref) ];
for (my $i = 1; $i < $n; ++$i) {
my $iref = $all_cpu[$i];

for (my $j = $#$cref; $j >= 0; --$j) {
$cref->[$j] += $iref->[$j];
}
}
for (my $j = $#$sref; $j >= 0; --$j) {
if ((my $delta = $sref->[$j] - $cref->[$j]) != 0) {
print "mismatch for cpu stat #$j: $sref->[$j] - $cref->[$j] " .
    "is $delta\n";
}
}
} else {
print "No CPU?\n";
}
} else {
die "$file: $!\n";
}
--EOF--

Regards,
Ulrich Windl



Re: [BOOTLOADER SPECIFICATION RFC] The bootloader log format for TrenchBoot and others

2020-06-01 Thread Hans Ulrich Niedermann
On Fri, 29 May 2020 13:27:35 +0200
Daniel Kiper  wrote:

> Below you can find my rough idea of the bootloader log format which is
> generic thing but initially will be used for TrenchBoot work. I
> discussed this proposal with Ross and Daniel S. So, the idea went
> through initial sanitization. Now I would like to take feedback from
> other folks too. So, please take a look and complain...
> 
> In general we want to pass the messages produced by the bootloader to
> the OS kernel and finally to the userspace for further processing and
> analysis. Below is the description of the structures which will be
> used for this thing.

This should mention that this about having one contiguous block of
memory which contains a struct bootloader_log.

>   struct bootloader_log_msgs

Why the plural with the trailing letter s in the struct name? This looks
like a single message to me, and should thus be singular, without a
trailing letter s.

>   {
> grub_uint32_t level;
> grub_uint32_t facility;
> char type[];
> char msg[];
>   }

How would this work? How could a compiler know where msg starts? gcc
just reports "error: flexible array member not at end of struct" here
as there is no way of knowing.

Where are the sizes of type[] and msg[] defined?

Only implicitly by just having two NUL terminated strings right after
each other? That would mean you need to change the struct to

struct bootloader_log_msg
{
  grub_uint32_t level;
  grub_uint32_t facility;
  char strings[];
}

and have anyone parsing this structure walk through all characters in
strings[] looking for NULs to eventually find out where the msg string
actually starts. This looks at least a bit ugly to me, unless you
absolutely need to save the last bit of code and data memory in the
bootloader.

To help find where the msg actually starts without needing to look for
a NUL character, you could add a struct member showing where the msg
string actually starts within strings[]:

struct bootloader_log_msg
{
  grub_uint32_t level;
  grub_uint32_t facility;
  grub_uint32_t msg_start;
  char strings[];
}

This msg_start value could be defined as a character offset into
strings[]. Then accessing msg->strings or &msg->strings[0] would access
the type string, and &msg->strings[msg->msg_start] would access the message:

struct bootloader_log_msg *msg = ...;
printf("msg type: %s\n", &msg->strings[0]);
printf("msg:  %s\n", &msg->strings[msg->msg_start]);

For quick parsing of the messages, having a "grub_uint32_t size"
struct member to quickly skip the current struct bootloader_log_msg
completely and jump directly to the next struct would be helpful,
regardless of how the strings are actually stored:

struct bootloader_log_msg
{
  grub_uint32_t level;
  grub_uint32_t facility;
  grub_uint32_t size;
  [ ... string storage ... ]
}

>   struct bootloader_log
>   {
> grub_uint32_t version;
> grub_uint32_t producer;
> grub_uint32_t size;
> grub_uint32_t next_off;
> bootloader_log_msgs msgs[];
>   }
> 
> The members of struct bootloader_log:
>   - version: the bootloader log format version number, 1 for now,
>   - producer: the producer/bootloader type; we can steal some values
> from linux/Documentation/x86/boot.rst:type_of_loader,

>   - size: total size of the log buffer including the bootloader_log
> struct,

"size in bytes"?

>   - next_off: offset in bytes, from start of the bootloader_log
> struct, of the next byte after the last log message in the msgs[];
> i.e. the offset of the next available log message slot,

It appears to me that next_off is only interesting for code which
appends log messages to that structure. For reading such a struct,
next_off is useless and could thus be a private variable inside that
bootloader which is not passed on to a next stage bootloader or an OS
kernel.

So specifying next_off as something other than a "reserved" uint32_t is
for when you have a chain of bootloaders which all append messages to
that buffer, and you want to avoid all the bootloaders having to parse
the messages from the previous bootloader's messages in order to find
out where to append messages. If that is the intention, this procedure
should at least be mentioned somewhere.

I am also missing any mention of memory alignment. With the number of
CPUs in the field which cannot directly read misaligned words
increasing, specifying a 4 byte or 8 byte alignment for these
structures could significantly reduce the code complexity for such
CPUs at the cost of a few bytes of memory.

And while on the topic of memory layout: With all these uint32_t
values, is this only for a 32bit protocol, or will this remain 32bit
even for otherwise 64bit code, or what is the plan here?

>   - msgs: the array of log messages.
> 
> The members of struct bootloader_log_msgs:
>   - level: similar to syslog meaning; can be used to differentiate
> normal messages from debug messages; exact 

Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

2020-05-05 Thread Ulrich Weigand
On Tue, May 05, 2020 at 04:03:00PM +0200, Christian Borntraeger wrote:
> > Just looked at 
> > commit 88b1a17dfc3ed7728316478fae0f5ad508f50397  mm: add 'try_get_page()' 
> > helper function
> > 
> > which says:
> > Also like 'get_page()', you can't use this function unless you already
> > had a reference to the page.  The intent is that you can use this
> > exactly like get_page(), but in situations where you want to limit the
> > maximum reference count.
> > 
> > The code currently does an unconditional WARN_ON_ONCE() if we ever hit
> > the reference count issues (either zero or negative), as a notification
> > that the conditional non-increment actually happened.
> > 
> > If try_get_page must not be called with an existing reference, that means
>   s/not//
> > that when we call it the page reference is already higher and our freeze
> > will never succeed. That would imply that we cannot trigger this. No?

Well, my understanding is that the "existing" reference may be one of the
references that is expected by our freeze code; in particular, in gup the
existing reference is simply the one from the pte.  So in this case our
freeze *would* succeed.

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com


Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

2020-05-05 Thread Ulrich Weigand
On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
> On 5/4/20 6:41 AM, Ulrich Weigand wrote:
> > You're right that there is no mechanism to prevent new references,
> > but that's really never been the goal either.  We're simply trying
> > to ensure that no I/O is ever done on a page that is in the "secure"
> > (or inaccessible) state.  To do so, we rely on the assumption that
> > all code that starts I/O on a page cache page will *first*:
> > - mark the page as pending I/O by either taking an extra page
> >   count, or by setting the Writeback flag; then:
> > - call arch_make_page_accessible(); then:
> > - start I/O; and only after I/O has finished:
> > - remove the "pending I/O" marker (Writeback and/or extra ref)
> 
> Let's ignore writeback for a moment because get_page() is the more
> general case.  The locking sequence is:
> 
> 1. get_page() (or equivalent) "locks out" a page from converting to
>    inaccessible,
> 2. followed by a make_page_accessible() guarantees that the page
>*stays* accessible until
> 3. I/O is safe in this region
> 4. put_page(), removes the "lock out", I/O now unsafe

Yes, exactly.

> They key is, though, the get_page() must happen before
> make_page_accessible() and *every* place that acquires a new reference
> needs a make_page_accessible().

Well, sort of: every place that acquires a new reference *and then
performs I/O* needs a make_page_accessible().  There seem to be a
lot of plain get_page() calls that aren't related to I/O.

> try_get_page() is obviously one of those "new reference sites" and it
> only has one call site outside of the gup code: generic_pipe_buf_get(),
> which is effectively patched by the patch that started this thread.  The
> fact that this one oddball site _and_ gup are patched now is a good sign.
> 
> *But*, I still don't know how that could work nicely:
> 
> > static inline __must_check bool try_get_page(struct page *page)
> > {
> > page = compound_head(page);
> > if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> > return false;
> > page_ref_inc(page);
> > return true;
> > }
> 
> If try_get_page() collides with a freeze_page_refs(), it'll hit the
> WARN_ON_ONCE(), which is surely there for a good reason.  I'm not sure
> that warning is _actually_ valid since freeze_page_refs() isn't truly a
> 0 refcount.  But, the fact that this hasn't been encountered means that
> the testing here is potentially lacking.

This is indeed interesting.  In particular if you compare try_get_page
with try_get_compound_head in gup.c, which does instead:

if (WARN_ON_ONCE(page_ref_count(head) < 0))
return NULL;

which seems more reasonable to me, given the presence of the
page_ref_freeze method.  So I'm not sure why try_get_page has <= 0.

I think I understand why we haven't seen this in testing: all the
places in gup.c where try_get_page is called hold the pte lock;
and in the usual case, the pte lock itself already suffices to
lock out make_secure_pte before it even tries to use page_ref_freeze.
(The intent of holding the pte lock there was really to ensure that
the PTE entry is and remains valid throughout the execution of
the ultravisor call, which will look at the PTE entry.)

However, I guess we could construct cases where the pte lock doesn't
suffice to prevent the try_get_page warning: if we create a shared
mapping of the secure guest backing store file into a second process.
That doesn't ever happen in normal qemu operation, so that's likely
why we haven't seen that case.

> > We thought we had identified all places where we needed to place
> > arch_make_page_accessible so that the above assumption is satisfied.
> > You've found at least two instances where this wasn't true (thanks!);
> > but I still think that this can be fixed by just adding those calls.
> 
> Why do you think that's the extent of the problem?  Because the crashes
> stopped?
> 
> I'd feel a lot more comfortable if you explained the audits that you've
> performed or _why_ you think that.  What I've heard thus far is
> basically that you've been able to boot a guest and you're ready to ship
> this code.

Not sure if you can really call this an "audit", but we were really
coming from identifying places where I/O can happen on a page cache
page, and everything we found (except writeback) went through gup.
We obviously missed the sendfile case here; not sure what the best
way would be to verify nothing else was missed.

> But, with regular RCU, you're right, it _does_ appear that it would hit
> that retry loop, but then it would *succeed* in getting a reference.  In
> the end, this just supports the seq

Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

2020-05-04 Thread Ulrich Weigand
On Fri, May 01, 2020 at 09:32:45AM -0700, Dave Hansen wrote:
> The larger point, though, is that the s390 code ensures no extra
> references exist upon entering make_secure_pte(), but it still has no
> mechanism to prevent future, new references to page cache pages from
> being created.

Hi Dave, I worked with Claudio and Christian on the initial design
of our approach, so let me chime in here as well.

You're right that there is no mechanism to prevent new references,
but that's really never been the goal either.  We're simply trying
to ensure that no I/O is ever done on a page that is in the "secure"
(or inaccessible) state.  To do so, we rely on the assumption that
all code that starts I/O on a page cache page will *first*:
- mark the page as pending I/O by either taking an extra page
  count, or by setting the Writeback flag; then:
- call arch_make_page_accessible(); then:
- start I/O; and only after I/O has finished:
- remove the "pending I/O" marker (Writeback and/or extra ref)

We thought we had identified all places where we needed to place
arch_make_page_accessible so that the above assumption is satisfied.
You've found at least two instances where this wasn't true (thanks!);
but I still think that this can be fixed by just adding those calls.

Now, if the above assumption holds, then I believe we're safe:
- before we make any page secure, we verify that it is not
  "pending I/O" as defined above (neither the Writeback flag nor
  an extra page count)
- *during* the process of making the page secure, we're protected
  against any potential races due to changes in that status, since
  we hold the page lock (and therefore the Writeback flag cannot
  change), and we've frozen page references (so those cannot change).

This implies that before I/O has started, the page was made
accessible; and as long as the page is marked "pending I/O"
it will not be made inaccessible again.

> The one existing user of expected_page_refs() freezes the refs then
> *removes* the page from the page cache (that's what the xas_lock_irq()
> is for).  That stops *new* refs from being acquired.
> 
> The s390 code is missing an equivalent mechanism.
> 
> One example:
> 
>   page_freeze_refs();
>   // page->_count==0 now
>   find_get_page();
>   // ^ sees a "freed" page
>   page_unfreeze_refs();
> 
> find_get_page() will either fail to *find* the page because it will see
> page->_refcount==0 think it is freed (not great), or it will
> VM_BUG_ON_PAGE() in __page_cache_add_speculative().

I don't really see how that could happen; my understanding is that
page_freeze_refs simply causes potential users to spin and wait
until it is no longer frozen.  For example, find_get_page will
in the end call down to find_get_entry, which does:

if (!page_cache_get_speculative(page))
goto repeat;

Am I misunderstanding anything here?

> My bigger point is that this patches doesn't systematically stop finding
> page cache pages that are arch-inaccessible.  This patch hits *one* of
> those sites.

As I said above, that wasn't really the goal for our approach.

In particular, note that we *must* have secure pages present in the
page table of the secure guest (that is a requirement of the architecture;
note that the "secure" status doesn't just apply to the physical page,
but a triple of "*this* host physical page is the secure backing store
of *this* guest physical page in *this* secure guest", which the HW/FW
tracks based on the specific page table entry).

As a consequence, the page really also has to remain present in the
page cache (I don't think Linux mm code would be able to handle the
case where a file-backed page is in the page table but not page cache).

I'm not sure what exactly the requirements for your use case are; if they
are significantly different, maybe we can work together to find an
approach that works for both?

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com


Issue with reading sysfs files

2019-10-17 Thread Ulrich Windl
es/virtual/net/lo/name_assign_type: Invalid argument

/sys/devices/virtual/net/lo/speed

Found in 4.12.14-95.6-default of SLES SP4 (x86_64)

Regards,
Ulrich




Re: [PATCH v2] i2c: i2c-mt65xx: fix NULL ptr dereference

2019-10-02 Thread Ulrich Hecht


> On September 30, 2019 at 5:28 PM Fabien Parent  wrote:
> 
> 
> Since commit abf4923e97c3 ("i2c: mediatek: disable zero-length transfers
> for mt8183"), there is a NULL pointer dereference for all the SoCs
> that don't have any quirk. mtk_i2c_functionality is not checking that
> the quirks pointer is not NULL before starting to use it.
> 
> This commit adds a call to i2c_check_quirks which will check whether
> the quirks pointer is set, and if so will check if the IP has the
> NO_ZERO_LEN quirk.
> 
> Fixes: abf4923e97c3 ("i2c: mediatek: disable zero-length transfers for 
> mt8183")
> Signed-off-by: Fabien Parent 

Thank you! Tested successfully on Acer R13 Chromebook (mt8173).

Tested-by: Ulrich Hecht 

CU
Uli


Re: [PATCH] thermal: rcar_gen3_thermal: Use devm_add_action_or_reset() helper

2019-07-31 Thread Ulrich Hecht


> On July 31, 2019 at 2:50 PM Geert Uytterhoeven  
> wrote:
> 
> 
> Use the devm_add_action_or_reset() helper instead of open-coding the
> same operations.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
>  drivers/thermal/rcar_gen3_thermal.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/thermal/rcar_gen3_thermal.c 
> b/drivers/thermal/rcar_gen3_thermal.c
> index a56463308694e937..2db7e7f8baf939fd 100644
> --- a/drivers/thermal/rcar_gen3_thermal.c
> +++ b/drivers/thermal/rcar_gen3_thermal.c
> @@ -443,11 +443,10 @@ static int rcar_gen3_thermal_probe(struct 
> platform_device *pdev)
>   if (ret)
>   goto error_unregister;
>  
> - ret = devm_add_action(dev, rcar_gen3_hwmon_action, zone);
> - if (ret) {
> - rcar_gen3_hwmon_action(zone);
> + ret = devm_add_action_or_reset(dev, rcar_gen3_hwmon_action,
> +zone);
> + if (ret)
>   goto error_unregister;
> - }
>  
>   ret = of_thermal_get_ntrips(tsc->zone);
>   if (ret < 0)
> -- 
> 2.17.1
>

Reviewed-by: Ulrich Hecht 

CU
Uli


Re: [PATCH] serial: sh-sci: Use DEVICE_ATTR_RW() for rx_fifo_trigger

2019-07-31 Thread Ulrich Hecht


> On July 31, 2019 at 2:45 PM Geert Uytterhoeven  
> wrote:
> 
> 
> While commit b6b996b6cdeecf7e ("treewide: Use DEVICE_ATTR_RW") converted
> the rx_fifo_timeout attribute, it forgot to convert rx_fifo_trigger due
> to a slightly different function naming.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
>  drivers/tty/serial/sh-sci.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
> index d18c680aa64b3427..57638175639e0f3f 100644
> --- a/drivers/tty/serial/sh-sci.c
> +++ b/drivers/tty/serial/sh-sci.c
> @@ -1092,9 +1092,8 @@ static void rx_fifo_timer_fn(struct timer_list *t)
>   scif_set_rtrg(port, 1);
>  }
>  
> -static ssize_t rx_trigger_show(struct device *dev,
> -struct device_attribute *attr,
> -char *buf)
> +static ssize_t rx_fifo_trigger_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
>  {
>   struct uart_port *port = dev_get_drvdata(dev);
>   struct sci_port *sci = to_sci_port(port);
> @@ -1102,10 +1101,9 @@ static ssize_t rx_trigger_show(struct device *dev,
>   return sprintf(buf, "%d\n", sci->rx_trigger);
>  }
>  
> -static ssize_t rx_trigger_store(struct device *dev,
> - struct device_attribute *attr,
> - const char *buf,
> - size_t count)
> +static ssize_t rx_fifo_trigger_store(struct device *dev,
> +  struct device_attribute *attr,
> +  const char *buf, size_t count)
>  {
>   struct uart_port *port = dev_get_drvdata(dev);
>   struct sci_port *sci = to_sci_port(port);
> @@ -1123,7 +1121,7 @@ static ssize_t rx_trigger_store(struct device *dev,
>   return count;
>  }
>  
> -static DEVICE_ATTR(rx_fifo_trigger, 0644, rx_trigger_show, rx_trigger_store);
> +static DEVICE_ATTR_RW(rx_fifo_trigger);
>  
>  static ssize_t rx_fifo_timeout_show(struct device *dev,
>  struct device_attribute *attr,
> -- 
> 2.17.1
>

Reviewed-by: Ulrich Hecht 

CU
Uli


Re: [PATCH v2 17/19] drm: rcar-du: crtc: Enable and disable CMMs

2019-07-16 Thread Ulrich Hecht


> On July 6, 2019 at 4:07 PM Jacopo Mondi  wrote:
> 
> 
> Enable/disable the CMM associated with a CRTC at crtc start and stop
> time and enable the CMM unit through the Display Extensional Functions
> register at group setup time.
> 
> Signed-off-by: Jacopo Mondi 
> ---
>  drivers/gpu/drm/rcar-du/rcar_du_crtc.c  | 7 +++
>  drivers/gpu/drm/rcar-du/rcar_du_group.c | 8 
>  drivers/gpu/drm/rcar-du/rcar_du_regs.h  | 5 +
>  3 files changed, 20 insertions(+)
> 
> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_crtc.c 
> b/drivers/gpu/drm/rcar-du/rcar_du_crtc.c
> index 23f1d6cc1719..3dac605c3a67 100644
> --- a/drivers/gpu/drm/rcar-du/rcar_du_crtc.c
> +++ b/drivers/gpu/drm/rcar-du/rcar_du_crtc.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  
> +#include "rcar_cmm.h"
>  #include "rcar_du_crtc.h"
>  #include "rcar_du_drv.h"
>  #include "rcar_du_encoder.h"
> @@ -619,6 +620,9 @@ static void rcar_du_crtc_stop(struct rcar_du_crtc *rcrtc)
>   if (rcar_du_has(rcrtc->dev, RCAR_DU_FEATURE_VSP1_SOURCE))
>   rcar_du_vsp_disable(rcrtc);
>  
> + if (rcrtc->cmm)
> + rcar_cmm_disable(rcrtc->cmm);
> +
>   /*
>* Select switch sync mode. This stops display operation and configures
>* the HSYNC and VSYNC signals as inputs.
> @@ -686,6 +690,9 @@ static void rcar_du_crtc_atomic_enable(struct drm_crtc 
> *crtc,
>   }
>  
>   rcar_du_crtc_start(rcrtc);
> +
> + if (rcrtc->cmm)
> + rcar_cmm_enable(rcrtc->cmm);
>  }
>  
>  static void rcar_du_crtc_atomic_disable(struct drm_crtc *crtc,
> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_group.c 
> b/drivers/gpu/drm/rcar-du/rcar_du_group.c
> index 9eee47969e77..d252c9bb9809 100644
> --- a/drivers/gpu/drm/rcar-du/rcar_du_group.c
> +++ b/drivers/gpu/drm/rcar-du/rcar_du_group.c
> @@ -147,6 +147,14 @@ static void rcar_du_group_setup(struct rcar_du_group 
> *rgrp)
>  
>   rcar_du_group_setup_pins(rgrp);
>  
> + if (rcar_du_has(rcdu, RCAR_DU_FEATURE_CMM)) {
> + u32 defr7 = DEFR7_CODE |
> + (rgrp->cmms_mask & BIT(1) ? DEFR7_CMME1 : 0) |
> + (rgrp->cmms_mask & BIT(0) ? DEFR7_CMME0 : 0);
> +
> + rcar_du_group_write(rgrp, DEFR7, defr7);
> + }
> +
>   if (rcdu->info->gen >= 2) {
>   rcar_du_group_setup_defr8(rgrp);
>   rcar_du_group_setup_didsr(rgrp);
> diff --git a/drivers/gpu/drm/rcar-du/rcar_du_regs.h 
> b/drivers/gpu/drm/rcar-du/rcar_du_regs.h
> index bc87f080b170..fb9964949368 100644
> --- a/drivers/gpu/drm/rcar-du/rcar_du_regs.h
> +++ b/drivers/gpu/drm/rcar-du/rcar_du_regs.h
> @@ -197,6 +197,11 @@
>  #define DEFR6_MLOS1  (1 << 2)
>  #define DEFR6_DEFAULT(DEFR6_CODE | DEFR6_TCNE1)
>  
> +#define DEFR7   0x000ec
> +#define DEFR7_CODE   (0x7779 << 16)
> +#define DEFR7_CMME1  BIT(6)
> +#define DEFR7_CMME0  BIT(4)
> +

Reviewed-by: Ulrich Hecht 

CU
Uli


Re: [PATCH] serial: sh-sci: Fix HSCIF RX sampling point calculation

2019-04-01 Thread Ulrich Hecht


> On April 1, 2019 at 1:25 PM Geert Uytterhoeven  
> wrote:
> 
> 
> There are several issues with the formula used for calculating the
> deviation from the intended rate:
>   1. While min_err and last_stop are signed, srr and baud are unsigned.
>  Hence the signed values are promoted to unsigned, which will lead
>  to a bogus value of deviation if min_err is negative,
>   2. Srr is the register field value, which is one less than the actual
>  sampling rate factor,
>   3. The divisions do not use rounding.
> 
> Fix this by casting unsigned variables to int, adding one to srr, and
> using a single DIV_ROUND_CLOSEST().
> 
> Fixes: 63ba1e00f178a448 ("serial: sh-sci: Support for HSCIF RX sampling point 
> adjustment")
> Signed-off-by: Geert Uytterhoeven 

Reviewed-by: Ulrich Hecht 

CU
Uli


leap seconds and 61st second in minute

2019-02-13 Thread Ulrich Windl
Hi!

I was currently following some discussion on the topic of leap seconds, and due 
to the basic role of time in the kernel, I'd like to send a "heads up" ("food 
for thought") with some proposal (not to start some useless discussion):

The UNIX timescale running in UTC had (I suppose) the idea that no timezone 
switches will affect it. Unfortunately leap-seconds have not been considered; 
maybe also because at that time everybody would be happy if the system's time 
was correct to a few seconds. I don't know.
However leap seconds are a kind of "time zone switch" conceptually.

Unfortunately the unix time scale does not consider them (as noted in the 
time(2) manual page). That's one part of posix. The other part of POSIX claims 
that during an inserted leap second there should be a 61st second in the 
minute. Unfortunately (AFAIK) there's no interface from kernel's leap second 
handling to glibc allowing it to actually return 60 as the number of seconds. 
(localtime(3) only gets a pointer to time_t)

OTOH in the NTPv4 clock model there is a TAI offset included (which can be 
updated by NTP). AFAIK the kernel also has the timezone offset for some time to 
handle RTCs that run local time. Obviously if the kernel knew the number of 
leap seconds (the correction to the time() timescale) conversion from UNIX 
timescale to TAI would be easy.

So roughly if the kernel exports the time and type of the next_leap second 
scheduled, some future localtime could actually return the 61st second in the 
minute. As it is now applications will all see some magic duplication of the 
60th second. (Maybe that's why Google does "leap smear"). If the kernel API 
would also export the TAI offset, one could even offer a TAI-based time, or, 
maybe even better: The kernel could run on TAI internally, and convert to UNIX 
time scale as needed.

I'll leave exact specification and implementation to the really clever guys.

Regards,
Ulrich
P.S. Not subscribed to the kernel-list so if you want to talk to me keep me on 
CC: preferably




Re: Licensing of include/linux/hash.h

2019-02-11 Thread Ulrich Mueller
>>>>> On Mon, 11 Feb 2019, Ulrich Mueller wrote:

>>>>> On Mon, 11 Feb 2019, Domenico Andreoli wrote:
>> On Mon, Feb 11, 2019 at 12:08:32AM +0100, Kristian Fiskerstrand wrote:
>>> It was [pointed out] by one of our license group that [hash.h]  is the
>>> same that has a GPL-2+ in [fio] which has a signed-off-by.
>>> 
>>> References:
>>> [pointed out]
>>> https://bugs.gentoo.org/677586#c1
>>> 
>>> [hash.h]
>>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/fio.git/commit/hash.h?id=bdc7211e190482f0c17c109a0d90834a6611be1c

>> Yes, the Signed-off-by is from Jens Axboe (in CC) but he's not the
>> original author, I guess he just copied the file as Arnaldo did. The
>> file he committed has not any reference to the license.

>>> [fio]
>>> https://metadata.ftp-master.debian.org/changelogs/main/f/fio/fio_3.12-2_copyright

>> I'm afraid that this entry in wrong. I'll seek confirmation with
>> Martin Steigerwald.

> Not sure if this will help, but hash.h originally appeared in
> Linux 2.5.7. In the following commit one can see that most of its code
> was moved or copied from mm/filemap.c:
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=882ad449046cec136c484dd2b3659fb4c683e0a3

> filemap.c has a copyright line by Linus, but git blame shows that
> the relevant code was added in 2002 with a commit by Rik van Riel:
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e5191c50d12621e04f8bc247dcb6a0f4ad76ae26

> The relevant thread in LKML is here:
> https://lkml.org/lkml/2002/2/18/1970

Sorry, that link should read:
https://lkml.org/lkml/2002/2/18/197


Re: Licensing of include/linux/hash.h

2019-02-11 Thread Ulrich Mueller
>>>>> On Mon, 11 Feb 2019, Domenico Andreoli wrote:

> On Mon, Feb 11, 2019 at 12:08:32AM +0100, Kristian Fiskerstrand wrote:
>> It was [pointed out] by one of our license group that [hash.h]  is the
>> same that has a GPL-2+ in [fio] which has a signed-off-by.
>> 
>> References:
>> [pointed out]
>> https://bugs.gentoo.org/677586#c1
>> 
>> [hash.h]
>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/fio.git/commit/hash.h?id=bdc7211e190482f0c17c109a0d90834a6611be1c

> Yes, the Signed-off-by is from Jens Axboe (in CC) but he's not the
> original author, I guess he just copied the file as Arnaldo did. The
> file he committed has not any reference to the license.

>> [fio]
>> https://metadata.ftp-master.debian.org/changelogs/main/f/fio/fio_3.12-2_copyright

> I'm afraid that this entry in wrong. I'll seek confirmation with
> Martin Steigerwald.

Not sure if this will help, but hash.h originally appeared in
Linux 2.5.7. In the following commit one can see that most of its code
was moved or copied from mm/filemap.c:
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=882ad449046cec136c484dd2b3659fb4c683e0a3

filemap.c has a copyright line by Linus, but git blame shows that
the relevant code was added in 2002 with a commit by Rik van Riel:
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e5191c50d12621e04f8bc247dcb6a0f4ad76ae26

The relevant thread in LKML is here:
https://lkml.org/lkml/2002/2/18/1970

Ulrich


Re: [PATCH] serial: sh-sci: Fix use-after-free on subsequent port startup

2018-07-17 Thread Ulrich Hecht
On Tue, Jul 17, 2018 at 10:58 AM, Geert Uytterhoeven
 wrote:
> sci_request_irq() checks port->irqstr[j] for a NULL pointer, to decide
> if a fallback interrupt name string should be allocated or not.
>
> While this string is freed during port shutdown, the pointer is not
> zeroed.  Hence on a subsequent startup of the port, it will still be
> pointing to the freed memory, leading to e.g.
>
> WARNING: CPU: 0 PID: 404 at fs/proc/generic.c:388 __proc_create+0xbc/0x260
> name len 0
>
> or to a crash (the latter is more likely with CONFIG_DEBUG_SLAB=y, due
> to the poisoning of freed memory).
>
> Instead of zeroeing the pointer at multiple places, preinitialize
> port->irqstr[j] to zero to fix this.
>
> Fixes: 8b0bbd956228ae87 ("serial: sh-sci: Add support for R7S9210")
> Signed-off-by: Geert Uytterhoeven 
> ---
>  drivers/tty/serial/sh-sci.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
> index 5808edfe3f7be404..f8e53ac5c17dfb94 100644
> --- a/drivers/tty/serial/sh-sci.c
> +++ b/drivers/tty/serial/sh-sci.c
> @@ -1872,6 +1872,7 @@ static int sci_request_irq(struct sci_port *port)
> }
>
> desc = sci_irq_desc + i;
> +   port->irqstr[j] = NULL;
> if (SCIx_TEIDRI_IRQ_EXISTS(port)) {
>     /*
>  * ERI and BRI are muxed, just register ERI and
> --
> 2.17.1
>

Reviewed-by: Ulrich Hecht 

CU
Uli


Re: [PATCH/RFT 6/6] i2c: recovery: remove bogus check if SDA GPIO is set to output

2018-07-16 Thread Ulrich Hecht
On Fri, Jul 13, 2018 at 11:09 PM, Wolfram Sang
 wrote:
> This check did not work as intended. I2C is open drain, so this function
> will likely always have presented the GPIO as input because
> gpiod_get_direction doesn't know about open drain states. Remove this
> check for now. We can add it again once we know how to get more precise
> information about the GPIO.
>
> Signed-off-by: Wolfram Sang 
> ---
>  drivers/i2c/i2c-core-base.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/i2c/i2c-core-base.c b/drivers/i2c/i2c-core-base.c
> index 301285c54603..7c5f012f561c 100644
> --- a/drivers/i2c/i2c-core-base.c
> +++ b/drivers/i2c/i2c-core-base.c
> @@ -261,9 +261,7 @@ static void i2c_init_recovery(struct i2c_adapter *adap)
> bri->set_scl = set_scl_gpio_value;
> if (bri->sda_gpiod) {
> bri->get_sda = get_sda_gpio_value;
> -   /* FIXME: add proper flag instead of '0' once 
> available */
> -   if (gpiod_get_direction(bri->sda_gpiod) == 0)
> -   bri->set_sda = set_sda_gpio_value;
> +   bri->set_sda = set_sda_gpio_value;
> }
> return;
> }
> --
> 2.11.0
>

Reviewed-by: Ulrich Hecht 

CU
Uli


Re: [PATCH] mux: include compiler.h from mux/consumer.h

2017-08-02 Thread Ulrich Hecht
On Mon, Jul 31, 2017 at 6:28 PM, Peter Rosin <p...@axentia.se> wrote:
> On 2017-07-31 18:00, Greg Kroah-Hartman wrote:
>> On Mon, Jul 31, 2017 at 12:04:35PM +0200, Peter Rosin wrote:
>>> From: Ulrich Hecht <ulrich.hecht+rene...@gmail.com>
>>>
>>> Required for __must_check.
>>>
>>> Signed-off-by: Ulrich Hecht <ulrich.hecht+rene...@gmail.com>
>>> Signed-off-by: Peter Rosin <p...@axentia.se>
>>> ---
>>>  include/linux/mux/consumer.h | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> Hi Greg,
>>>
>>> Please pick this up for 4.14.
>>
>> What commit is this patch fixing?
>
> I don't think it's needed for any existing user, I at least do not have any
> report about that. So, I don't think it's needed for stable.
>
> Ulrich, the way I understood this, it was needed for your serdev series [1],
> and not for any existing usage. Right?

That is correct.

CU
Uli


Antw: Re: [kernel-hardening] Re: [PATCH v4 06/13] iscsi: ensure RNG is seeded before use

2017-07-05 Thread Ulrich Windl
>>> Jeffrey Walton <noloa...@gmail.com> wrote on 17.06.2017 at 16:23 in 
>>> message
<cah8yc8nhx2r9cfq0gnejaurgsfas8v16dvhv35brnln-ypr...@mail.gmail.com>:

[...]
> But its not clear to me how to ensure uniqueness when its based on
> randomness from the generators.

Even with a perfect random generator non-unique values are possible (that's why 
it's random). It's unlikely, but it can happen. The question is whether the 
probability of non-unique values from /dev/urandom is any higher than that for 
values read from /dev/random. One _might_ be able to predict the values from 
/dev/urandom.

Regards,
Ulrich

> 
> Jeff
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "open-iscsi" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to open-iscsi+unsubscr...@googlegroups.com.
> To post to this group, send email to open-is...@googlegroups.com.
> Visit this group at https://groups.google.com/group/open-iscsi.
> For more options, visit https://groups.google.com/d/optout.






Antw: Re: [kernel-hardening] Re: [PATCH v4 06/13] iscsi: ensure RNG is seeded before use

2017-07-05 Thread Ulrich Windl
>>> Stephan Müller <smuel...@chronox.de> wrote on 26.06.2017 at 19:38 in
message <1678474.gnybdsl...@tauon.chronox.de>:
> On Monday, 26 June 2017, 03:23:09 CEST, Nicholas A. Bellinger wrote:
> 
> Hi Nicholas,
> 
>> Hi Stephan, Lee & Jason,
>> 
>> (Adding target-devel CC')
>> 
>> Apologies for coming late to the discussion.  Comments below.
>> 
>> On Sun, 2017-06-18 at 10:04 +0200, Stephan Müller wrote:
>> > On Saturday, 17 June 2017, 05:45:57 CEST, Lee Duncan wrote:
>> > 
>> > Hi Lee,
>> > 
>> > > In your testing, how long might a process have to wait? Are we talking
>> > > seconds? Longer? What about timeouts?
>> > 
>> > In current kernels (starting with 4.8) this timeout should clear within a
>> > few seconds after boot.
>> > 
>> > In older kernels (pre 4.8), my KVM takes up to 90 seconds to reach that
>> > seeding point. I have heard that on IBM System Z this trigger point
>> > requires minutes to be reached.
>> 
>> I share the same concern as Lee wrt to introducing latency into the
>> existing iscsi-target login sequence.
>> 
>> Namely in the use-cases where a single node is supporting ~1K unique
>> iscsi-target IQNs, and fail-over + re-balancing of IQNs where 100s of
>> login attempts are expected to occur in parallel.
>> 
>> If environments exist that require non trivial amounts of time for RNG
>> seeding to be ready for iscsi-target usage, then enforcing this
>> requirement at iscsi login time can open up problems, especially when
>> iscsi host environments may be sensitive to login timeouts, I/O
>> timeouts, et al.
>> 
>> That said, I'd prefer to simply wait for RNG to be seeded at modprobe
>> iscsi_target_mod time, instead of trying to enforce randomness during
>> login.
>> 
>> This way, those of us who have distributed storage platforms can know
>> to avoid putting a node into a state to accept iscsi-target IQN export
>> migration, before modprobe iscsi_target_mod has successfully loaded and
>> RNG seeding has been confirmed.
>> 
>> WDYT..?
> 
> We may have a chicken and egg problem when the wait is at the modprobe time.
> 
> Assume the modprobe happens during initramfs time to get access to the root
> file system. In this case, your entire boot process will lock up for an
> indefinite amount of time. The reason is that in order to obtain events
> detected by the RNG, devices need to be initialized and working. Such devices
> commonly start working after the root partition is mounted as it contains
> all drivers, all configuration information etc.
> 
> Note, during the development of my /dev/random implementation, I added the
> getrandom-like blocking behavior to /dev/urandom (which is the equivalent to
> Jason's patch except that it applies to user space). The boot process locked

I thought reads from urandom never block by definition. An older manual page
(man urandom) also says: "A  read  from  the  /dev/urandom device will not
block waiting for more entropy."

Regards,
Ulrich

> 
> up since systemd wanted data from /dev/urandom while it processed the 
> initramfs. As it did not get any, the boot process did not commence that 
> could 
> deliver new events to be picked up by the RNG.
> 
> As I do not have such an iscsi system, I cannot test Jason's patch. But 
> maybe 
> the mentioned chicken-and-egg problem I mentioned above is already visible
> with the current patch as it will lead to a blocking of the mounting of the
> root partition in case the root partition is on an iscsi target.
> 
> Ciao
> Stephan
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "open-iscsi" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to open-iscsi+unsubscr...@googlegroups.com.
> To post to this group, send email to open-is...@googlegroups.com.
> Visit this group at https://groups.google.com/group/open-iscsi.
> For more options, visit https://groups.google.com/d/optout.






Antw: Q: si_code for SIGBUS caused by mmap() write error

2017-05-08 Thread Ulrich Windl


I'm sorry for using that address as sender in my previous message; it was an 
oversight! The CC: address was correct, however. You can drop the above address 
from your replies.







Kernel lockup with Intel Iris graphics

2017-04-20 Thread Ulrich Windl
Hi folks,

maybe someone has a idea on this: 
https://bugzilla.opensuse.org/show_bug.cgi?id=1032832

Regards,
Ulrich








Re: linux-next: build warning after merge of the tty tree

2017-02-06 Thread Ulrich Hecht
On Mon, Feb 6, 2017 at 9:50 AM, Greg KH <g...@kroah.com> wrote:
> On Mon, Feb 06, 2017 at 03:16:38PM +1100, Stephen Rothwell wrote:
>> Hi Greg,
>>
>> After merging the tty tree, today's linux-next build
>> (arm_multi_v7_defconfig) produced this warning:
>>
>> drivers/tty/serial/sh-sci.c:977:12: warning: 'scif_set_rtrg' defined but not 
>> used [-Wunused-function]
>>  static int scif_set_rtrg(struct uart_port *port, int rx_trig)
>> ^
>>
>> Introduced by commit
>>
>>   a380ed461f66 ("serial: sh-sci: implement FIFO threshold register setting")
>>
>> Forgot to add a call to this new function?
>
> I think this is fixed by a patch I just took into my tree, which isn't
> in linux-next yet.  Right Ulrich?

That is correct, it's called in "[PATCH v4 2/4] serial: sh-sci:
SCIFA/B RX FIFO software timeout".

CU
Uli




Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

2016-12-07 Thread Ulrich Windl
Hi once more!

I managed to get the call traces of involved processes:

1) The process doing read():
Dec  7 13:51:16 h10 kernel: [183809.594434] SysRq : Show Blocked State
Dec  7 13:51:16 h10 kernel: [183809.594447]   taskPC 
stack   pid father
Dec  7 13:51:16 h10 kernel: [183809.594750] randio  D 8801703a9d68  
   0  2762  53250 0x0004
Dec  7 13:51:16 h10 kernel: [183809.594758]  880100887ad8 0046 
880100886010 00010900
Dec  7 13:51:16 h10 kernel: [183809.594765]  00010900 00010900 
00010900 880100887fd8
Dec  7 13:51:16 h10 kernel: [183809.594772]  880100887fd8 00010900 
88016bb6a280 88017670c300
Dec  7 13:51:16 h10 kernel: [183809.594778] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.594806]  [] 
io_schedule+0x9c/0xf0
Dec  7 13:51:16 h10 kernel: [183809.594820]  [] 
__lock_page+0x93/0xc0
Dec  7 13:51:16 h10 kernel: [183809.594834]  [] 
truncate_inode_pages_range+0x294/0x460
Dec  7 13:51:16 h10 kernel: [183809.594845]  [] 
__blkdev_put+0x1d7/0x210
Dec  7 13:51:16 h10 kernel: [183809.594856]  [] 
__fput+0xb3/0x200
Dec  7 13:51:16 h10 kernel: [183809.594868]  [] 
filp_close+0x5c/0x90
Dec  7 13:51:16 h10 kernel: [183809.594880]  [] 
put_files_struct+0x7a/0xd0
Dec  7 13:51:16 h10 kernel: [183809.594889]  [] 
do_exit+0x1d0/0x470
Dec  7 13:51:16 h10 kernel: [183809.594897]  [] 
do_group_exit+0x3d/0xb0
Dec  7 13:51:16 h10 kernel: [183809.594907]  [] 
get_signal_to_deliver+0x247/0x480
Dec  7 13:51:16 h10 kernel: [183809.594919]  [] 
do_signal+0x71/0x1b0
Dec  7 13:51:16 h10 kernel: [183809.594928]  [] 
do_notify_resume+0x98/0xb0
Dec  7 13:51:16 h10 kernel: [183809.594940]  [] 
int_signal+0x12/0x17
Dec  7 13:51:16 h10 kernel: [183809.594988]  [<7f64f28cbba0>] 0x7f64f28cbb9f

2) The process trying to modify the queue scheduler:
Dec  7 13:51:16 h10 kernel: [183809.594995] blocktune   D 88014e048000  
   0 58867  1 0x
Dec  7 13:51:16 h10 kernel: [183809.595000]  880128defd18 0086 
880128dee010 00010900
Dec  7 13:51:16 h10 kernel: [183809.595007]  00010900 00010900 
00010900 880128deffd8
Dec  7 13:51:16 h10 kernel: [183809.595013]  880128deffd8 00010900 
88012889a3c0 8801767bc1c0
Dec  7 13:51:16 h10 kernel: [183809.595019] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.595026]  [] 
schedule_timeout+0x1b0/0x2a0
Dec  7 13:51:16 h10 kernel: [183809.595040]  [] 
msleep+0x1d/0x30
Dec  7 13:51:16 h10 kernel: [183809.595052]  [] 
__blk_drain_queue+0xbc/0x140
Dec  7 13:51:16 h10 kernel: [183809.595063]  [] 
elv_quiesce_start+0x51/0x90
Dec  7 13:51:16 h10 kernel: [183809.595071]  [] 
elevator_switch+0x4a/0x150
Dec  7 13:51:16 h10 kernel: [183809.595079]  [] 
elevator_change+0x6d/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595086]  [] 
elv_iosched_store+0x27/0x60
Dec  7 13:51:16 h10 kernel: [183809.595096]  [] 
queue_attr_store+0x67/0xc0
Dec  7 13:51:16 h10 kernel: [183809.595106]  [] 
sysfs_write_file+0xcb/0x160
Dec  7 13:51:16 h10 kernel: [183809.595115]  [] 
vfs_write+0xce/0x140
Dec  7 13:51:16 h10 kernel: [183809.595123]  [] 
sys_write+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595131]  [] 
system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595140]  [<7f12b7f70c00>] 0x7f12b7f70bff

3) The process trying to read the queue scheduler:
Dec  7 13:51:16 h10 kernel: [183809.595149] cat D 880147873718  
   0 45053   5957 0x0004
Dec  7 13:51:16 h10 kernel: [183809.595155]  880122f7be00 0082 
880122f7a010 00010900
Dec  7 13:51:16 h10 kernel: [183809.595161]  00010900 00010900 
00010900 880122f7bfd8
Dec  7 13:51:16 h10 kernel: [183809.595167]  880122f7bfd8 00010900 
8801154ea1c0 81a11020
Dec  7 13:51:16 h10 kernel: [183809.595174] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.595181]  [] 
__mutex_lock_slowpath+0x160/0x1f0
Dec  7 13:51:16 h10 kernel: [183809.595189]  [] 
mutex_lock+0x1a/0x40
Dec  7 13:51:16 h10 kernel: [183809.595196]  [] 
queue_attr_show+0x49/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595203]  [] 
sysfs_read_file+0xfe/0x1c0
Dec  7 13:51:16 h10 kernel: [183809.595212]  [] 
vfs_read+0xc7/0x130
Dec  7 13:51:16 h10 kernel: [183809.595219]  [] 
sys_read+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595226]  [] 
system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595235]  [<7fb04560dba0>] 0x7fb04560db9f



>>> Ulrich Windl wrote on 07.12.2016 at 13:23 in message <5847FF5E.7E4 : 161 : 60728>:
> Hi again!
> 
> An addition: Processes doing such I/O seem to be unkillable, and I also 
> cannot change the queue parameters while this problem occurs (the process 
> trying to write (e.g.: to queue/scheduler) is also blocked. The process 
> status of the process doing I/O looks like this:
> # cat /proc/2762/

__blk_drain_queue+0xbc/0x140
Dec  7 13:51:16 h10 kernel: [183809.595063]  [] 
elv_quiesce_start+0x51/0x90
Dec  7 13:51:16 h10 kernel: [183809.595071]  [] 
elevator_switch+0x4a/0x150
Dec  7 13:51:16 h10 kernel: [183809.595079]  [] 
elevator_change+0x6d/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595086]  [] 
elv_iosched_store+0x27/0x60
Dec  7 13:51:16 h10 kernel: [183809.595096]  [] 
queue_attr_store+0x67/0xc0
Dec  7 13:51:16 h10 kernel: [183809.595106]  [] 
sysfs_write_file+0xcb/0x160
Dec  7 13:51:16 h10 kernel: [183809.595115]  [] 
vfs_write+0xce/0x140
Dec  7 13:51:16 h10 kernel: [183809.595123]  [] 
sys_write+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595131]  [] 
system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595140]  [<7f12b7f70c00>] 0x7f12b7f70bff

3) The process trying to read the queue scheduler:
Dec  7 13:51:16 h10 kernel: [183809.595149] cat D 880147873718  
   0 45053   5957 0x0004
Dec  7 13:51:16 h10 kernel: [183809.595155]  880122f7be00 0082 
880122f7a010 00010900
Dec  7 13:51:16 h10 kernel: [183809.595161]  00010900 00010900 
00010900 880122f7bfd8
Dec  7 13:51:16 h10 kernel: [183809.595167]  880122f7bfd8 00010900 
8801154ea1c0 81a11020
Dec  7 13:51:16 h10 kernel: [183809.595174] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.595181]  [] 
__mutex_lock_slowpath+0x160/0x1f0
Dec  7 13:51:16 h10 kernel: [183809.595189]  [] 
mutex_lock+0x1a/0x40
Dec  7 13:51:16 h10 kernel: [183809.595196]  [] 
queue_attr_show+0x49/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595203]  [] 
sysfs_read_file+0xfe/0x1c0
Dec  7 13:51:16 h10 kernel: [183809.595212]  [] 
vfs_read+0xc7/0x130
Dec  7 13:51:16 h10 kernel: [183809.595219]  [] 
sys_read+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595226]  [] 
system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595235]  [<7fb04560dba0>] 0x7fb04560db9f



>>> Ulrich Windl wrote on 07.12.2016 at 13:23 in message <5847FF5E.7E4 : 161 : 60728>:
> Hi again!
> 
> An addition: Processes doing such I/O seem to be unkillable, and I also 
> cannot change the queue parameters while this problem occurs (the process 
> trying to write (e.g.: to queue/scheduler) is also blocked. The process 
> status of the process doing I/O looks like this:
> # cat /proc/2762/

Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

2016-12-07 Thread Ulrich Windl
Hi again!

An addition: processes doing such I/O seem to be unkillable, and I also cannot
change the queue parameters while this problem occurs (the process trying to
write, e.g. to queue/scheduler, is also blocked). The process status of the
process doing I/O looks like this:
# cat /proc/2762/status
Name:   randio
State:  D (disk sleep)
Tgid:   2762
Pid:2762
PPid:   53250
TracerPid:  0
Uid:0   0   0   0
Gid:0   0   0   0
FDSize: 0
Groups: 0 105
Threads:1
SigQ:   5/38340
SigPnd: 
ShdPnd: 4102
SigBlk: 
SigIgn: 
SigCgt: 00018000
CapInh: 
CapPrm: 
CapEff: 
CapBnd: 
Cpus_allowed:   ,
Cpus_allowed_list:  0-63
Mems_allowed:   
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0001
Mems_allowed_list:  0
voluntary_ctxt_switches:5
nonvoluntary_ctxt_switches: 1

Best regards,
Ulrich

>>> Ulrich Windl wrote on 07.12.2016 at 13:19 in message <5847FE66.7E4 : 161 : 60728>:
> Hi again!
> 
> Maybe someone can confirm this:
> If you have a device (e.g. multipath map) that limits max_sectors_kb to 
> maybe 64, and then define an LVM LV using that multipath map as PV, the LV 
> still has a larger max_sectors_kb. If you send a big request (read in my 
> case), the kernel will complain:
> 
> kernel: [173116.098798] blk_rq_check_limits: over max size limit.
> 
> Note that this message does not give any clue to the device being involved, 
> nor the size of the IO attempted, nor the limit of the IO size.
> 
> My expectation would be that the higher layer reports back an I/O error, and 
> the user process receives an I/O error, or, alternatively, the big request is 
> split into acceptable chunks before passing it to the lower layers.
> 
> However none of the above happens; instead the request seems to block the 
> request queue, because later TUR-checks also fail:
> kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
> kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
> multipathd: 66:384: mark as failed
> multipathd: NAP_S11: remaining active paths: 1
> multipathd: 66:352: mark as failed
> multipathd: NAP_S11: Entering recovery mode: max_retries=6
> multipathd: NAP_S11: remaining active paths: 0
> 
> (somewhat later)
>  multipathd: NAP_S11: sdkh - tur checker reports path is up
> multipathd: 66:336: reinstated
> multipathd: NAP_S11: Recovered to normal mode
> kernel: [173117.286712] device-mapper: multipath: Could not failover device 
> 66:368: Handler scsi_dh_alua error 8.
> (I don't know the implications of this)
> 
> Of course this error does not appear as long as all devices use the same 
> maximum request size, but tests have shown that different SAN disk systems 
> prefer different request sizes (as they split large requests internally to 
> handle them in chunks anyway).
> 
> Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 
> 3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch 
> revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)
> 
> Regards,
> Ulrich
> 
> >>> Ulrich Windl wrote on 23.08.2016 at 17:03 in message <57BC65CD.D1A : 161 : 60728>:
> > Hello!
> > 
> > While performance-testing a 3PARdata StorServ 8400 with SLES11SP4, I 
> noticed 
> > that I/Os dropped, until everything stood still more or less. Looking into 
> > the syslog I found that multipath's TUR-checker considered the paths (FC, 
> > BTW) as dead. Amazingly I did not have this problem when I did read-only 
> > tests.
> > 
> > The start looks like this:
> > Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> > Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> > Aug 23 14:44:58 h10 kernel: [  880.159425] blk_rq_check_limits: over max 
> > size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159611] blk_rq_check_limits: over max 
> > size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159615] blk_rq_check_limits: over max 
> > size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159623] device-mapper: multipath: 
> Failing 
> > path 8:32.
> > Aug 23 14:44:58 h10 kernel: [  880.186609] blk_rq_check_limits: over max 
> > size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.186626] blk_rq_check_limits: over max 
> > size limit.

Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

2016-12-07 Thread Ulrich Windl
Hi again!

Maybe someone can confirm this:
If you have a device (e.g. multipath map) that limits max_sectors_kb to maybe 
64, and then define an LVM LV using that multipath map as PV, the LV still has 
a larger max_sectors_kb. If you send a big request (read in my case), the 
kernel will complain:

kernel: [173116.098798] blk_rq_check_limits: over max size limit.

Note that this message does not give any clue to the device being involved, nor 
the size of the IO attempted, nor the limit of the IO size.

My expectation would be that the higher layer reports back an I/O error, and 
the user process receives an I/O error, or, alternatively, the big request is 
split into acceptable chunks before passing it to the lower layers.

However none of the above happens; instead the request seems to block the 
request queue, because later TUR-checks also fail:
kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
multipathd: 66:384: mark as failed
multipathd: NAP_S11: remaining active paths: 1
multipathd: 66:352: mark as failed
multipathd: NAP_S11: Entering recovery mode: max_retries=6
multipathd: NAP_S11: remaining active paths: 0

(somewhat later)
 multipathd: NAP_S11: sdkh - tur checker reports path is up
multipathd: 66:336: reinstated
multipathd: NAP_S11: Recovered to normal mode
kernel: [173117.286712] device-mapper: multipath: Could not failover device 
66:368: Handler scsi_dh_alua error 8.
(I don't know the implications of this)

Of course this error does not appear as long as all devices use the same 
maximum request size, but tests have shown that different SAN disk systems 
prefer different request sizes (as they split large requests internally to 
handle them in chunks anyway).

Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 
3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch 
revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)

Regards,
Ulrich

>>> Ulrich Windl wrote on 23.08.2016 at 17:03 in message <57BC65CD.D1A : 161 : 60728>:
> Hello!
> 
> While performance-testing a 3PARdata StorServ 8400 with SLES11SP4, I noticed 
> that I/Os dropped, until everything stood still more or less. Looking into 
> the syslog I found that multipath's TUR-checker considered the paths (FC, 
> BTW) as dead. Amazingly I did not have this problem when I did read-only 
> tests.
> 
> The start looks like this:
> Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> Aug 23 14:44:58 h10 kernel: [  880.159425] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.159611] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.159615] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.159623] device-mapper: multipath: Failing 
> path 8:32.
> Aug 23 14:44:58 h10 kernel: [  880.186609] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.186626] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.186628] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.186631] device-mapper: multipath: Failing 
> path 129:112.
> [...]
> It seems the TUR-checker does some ping-pong-like game: paths go up and down.
> 
> Now for the Linux part: I found the relevant message in blk-core.c 
> (blk_rq_check_limits()).
> First s/agaist/against/ in " *Such request stacking drivers should check 
> those requests agaist", then there's the problem that the message neither 
> outputs the blk_rq_sectors(), nor the blk_queue_get_max_sectors(), nor the 
> underlying device. That makes debugging somewhat difficult if you customize 
> the block queue settings per device as I did:
> 
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/rotational for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/add_random for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/scheduler for FirstTest-31 (noop)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/max_sectors_kb for FirstTest-31 (128)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/rotational for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/add_random for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/scheduler for FirstTest-32 (noop)
> Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of 
> queue/max_sectors_kb for FirstTest-32 (128)
>


Antw: Re: MBR partitions slow?

2016-09-01 Thread Ulrich Windl
>>> Mark D Rustad <mrus...@gmail.com> wrote on 31.08.2016 at 17:32 in message
<e2d72371-913b-4460-a370-c141835ad...@gmail.com>:
> Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> 
>> So without partition the throughput is about twice as high! Why?
> 
> My first thought is that by starting at block 0 the accesses were aligned  
> with the flash block size of the device. By starting at a partition, the  
> accesses probably were not so aligned.

Hi!

Thanks for answering. Yes, you are right: Usually I use fdisk to create 
partitions, and the tool does proper aligning for the partitions. In my case 
YaST insisted on having a partition before creating a filesystem, so I created 
one within YaST, and that partition turned out to be badly aligned (I think YaST 
uses cfdisk internally). I'm sorry that I didn't think about that earlier!

Stracing fdisk, I also learned about ioctl(BLKIOOPT) and related...

Regards,
Ulrich

> 
> --
> Mark Rustad, mrus...@gmail.com 






Antw: MBR partitions slow?

2016-08-30 Thread Ulrich Windl
Update:

I found out the bad performance was caused by partition alignment, and not by 
the partition per se (YaST created the partition right next to the MBR). I 
compared two partitions, number 1 badly aligned and number 2 properly aligned. 
Then I got these results:

Disk /dev/disk/by-id/dm-name-FirstTest-32: 10.7 GB, 10737418240 bytes
64 heads, 32 sectors/track, 10240 cylinders, total 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 16384 bytes / 16777216 bytes
Disk identifier: 0x00016340

Device                                     Boot    Start       End   Blocks  Id  System
/dev/disk/by-id/dm-name-FirstTest-32-part1             1   5242879 2621439+  83  Linux
Partition 1 does not start on physical sector boundary.
/dev/disk/by-id/dm-name-FirstTest-32-part2       5242880  10485759  2621440  83  Linux
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.21s
time for fstat(): 0.60s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 2684.4MiB) at 
0x7f826a8a1000: 0.38s
time to zap 2684.4MiB: 11.734121s (228.76 MiB/s)
time to sync 2684.4MiB: 3.515991s (763.47 MiB/s)
time to unmap 2684.4MiB at 0x7f826a8a1000: 0.038104s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.673100s
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part2
time to open /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.20s
time for fstat(): 0.69s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part2 (size 2684.4MiB) at 
0x7fe18823e000: 0.44s
time to zap 2684.4MiB: 4.861062s (552.22 MiB/s)
time to sync 2684.4MiB: 0.811360s (3308.47 MiB/s)
time to unmap 2684.4MiB at 0x7fe18823e000: 0.038380s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.265687s

So the correctly aligned partition is two to three times faster than the badly 
aligned partition (write-only case), and it's about the performance of an 
unpartitioned disk.

Regards,
Ulrich

>>> Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote on 30.08.2016 at
>>> 11:32
in message <57C552B6.33D : 161 : 60728>:
> Hello!
> 
> (I'm not subscribed to this list, but I'm hoping to get a reply anyway)
> While testing some SAN storage system, I needed a utility to erase disks 
> quickly. I wrote my own one that mmap()s the block device, memset()s the 
> area, then msync()s the changes, and finally close()s the file descriptor.
> 
> On one disk I had a primary MBR partition spanning the whole disk, like this 
> (output from some of my obscure tools):
> disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 
> (10737418240 bytes)
> partition 1 (1-20971520)
> Total Sectors =   20971519
> 
> When wiping, I started (for no good reason) to wipe partition 1, then I 
> wiped the whole disk. The disk is 4-way multipathed to a 8Gb FC-SAN, and the 
> disk system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11 
> SP4.
> For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM 
> bandwidth is about 9GB/s.
> 
> To my surprise I found out that the partition eats significant performance 
> (not quite 50%, but a lot):
> 
> ### Partition
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
> time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
> time for fstat(): 0.17s
> time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at 0x7fbc86739000: 0.39s
> time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
> time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
> time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
> time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s
> 
> ### Whole disk
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
> time for fstat(): 0.61s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 0x7fa2434cc000: 0.37s
> time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
> time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
> time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s
> 
> Reproducible:
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
> time for fstat(): 0.65s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
> 0x7f1cc17ab000: 0.37s
> time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
> time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
> time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s
> 
> So without partition the throughput is about twice as high! Why?
> 
> Regards
> Ulrich
> 
> 





Antw: MBR partitions slow?

2016-08-30 Thread Ulrich Windl
Update:

I found out the bad performance was caused by partition alignment, and not by 
the pertition per se (YaST created the partition next to the MBR). I compared 
two partitions, number one badly aligned, and number 2 properly aligned. Then I 
got these results:

Disk /dev/disk/by-id/dm-name-FirstTest-32: 10.7 GB, 10737418240 bytes
64 heads, 32 sectors/track, 10240 cylinders, total 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 16384 bytes / 16777216 bytes
Disk identifier: 0x00016340

Device Boot  Start End  Blocks  
 Id  System
/dev/disk/by-id/dm-name-FirstTest-32-part1   1 5242879 
2621439+  83  Linux
Partition 1 does not start on physical sector boundary.
/dev/disk/by-id/dm-name-FirstTest-32-part2 524288010485759 
2621440   83  Linux
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.21s
time for fstat(): 0.60s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 2684.4MiB) at 
0x7f826a8a1000: 0.38s
time to zap 2684.4MiB: 11.734121s (228.76 MiB/s)
time to sync 2684.4MiB: 3.515991s (763.47 MiB/s)
time to unmap 2684.4MiB at 0x7f826a8a1000: 0.038104s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.673100s
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part2
time to open /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.20s
time for fstat(): 0.69s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part2 (size 2684.4MiB) at 
0x7fe18823e000: 0.44s
time to zap 2684.4MiB: 4.861062s (552.22 MiB/s)
time to sync 2684.4MiB: 0.811360s (3308.47 MiB/s)
time to unmap 2684.4MiB at 0x7fe18823e000: 0.038380s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.265687s

So the correctly aligned partition is two to three times faster than the badly 
aligned partition (write-only case), and it's about the performance of an 
unpartitioned disk.

Regards,
Ulrich

>>> Ulrich Windl  schrieb am 30.08.2016 um 
>>> 11:32
in Nachricht <57C552B6.33D : 161 : 60728>:
> Hello!
> 
> (I'm not subscribed to this list, but I'm hoping to get a reply anyway)
> While testing some SAN storage system, I needed a utility to erase disks 
> quickly. I wrote my own one that mmap()s the block device, memset()s the 
> area, then msync()s the changes, and finally close()s the file descriptor.
> 
> On one disk I had a primary MBR partition spanning the whole disk, like this 
> (output from some of my obscure tools):
> disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 
> (10737418240 bytes)
> partition 1 (1-20971520)
> Total Sectors =   20971519
> 
> When wiping, I started (for no good reason) to wipe partition 1, then I 
> wiped the whole disk. The disk is 4-way multipathed to a 8Gb FC-SAN, and the 
> disk system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11 
> SP4.
> For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM 
> bandwidth is about 9GB/s.
> 
> To my surprise I found out that the partition eats significant performance 
> (not quite 50%, but a lot):
> 
> ### Partition
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
> time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
> time for fstat(): 0.17s
> time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at 0x7fbc86739000: 0.39s
> time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
> time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
> time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
> time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s
> 
> ### Whole disk
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
> time for fstat(): 0.61s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 0x7fa2434cc000: 0.37s
> time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
> time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
> time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s
> 
> Reproducible:
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
> time for fstat(): 0.65s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
> 0x7f1cc17ab000: 0.37s
> time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
> time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
> time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s
> 
> So without partition the throughput is about twice as high! Why?
> 
> Regards
> Ulrich
> 
> 





MBR partitions slow?

2016-08-30 Thread Ulrich Windl
Hello!

(I'm not subscribed to this list, but I'm hoping to get a reply anyway)
While testing some SAN storage system, I needed a utility to erase disks 
quickly. I wrote my own one that mmap()s the block device, memset()s the area, 
then msync()s the changes, and finally close()s the file descriptor.
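For reference, the core of such a wiper can be sketched roughly like this (my reconstruction of the described approach, not the actual flashzap source; `zap` is a name I made up, and a real tool would query a block device's size with `ioctl(fd, BLKGETSIZE64, ...)` since `st_size` is 0 for block devices):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Zero out a file (or block device) via mmap()/memset()/msync(),
 * roughly as the post describes.  Returns 0 on success, -1 on error. */
int zap(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return -1; }
    size_t len = (size_t) st.st_size;  /* 0 for block devices: use BLKGETSIZE64 there */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    memset(p, 0, len);        /* the actual "zap" */
    msync(p, len, MS_SYNC);   /* write the dirty pages back to the device */
    munmap(p, len);
    return close(fd);
}
```

A driver program would time each of these steps and pass the device path on the command line, as in the transcripts below.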

On one disk I had a primary MBR partition spanning the whole disk, like this 
(output from some of my obscure tools):
disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 
(10737418240 bytes)
partition 1 (1-20971520)
Total Sectors =   20971519

When wiping, I started (for no good reason) to wipe partition 1, then I wiped 
the whole disk. The disk is 4-way multipathed to a 8Gb FC-SAN, and the disk 
system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11 SP4.
For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM 
bandwidth is about 9GB/s.

To my surprise I found out that the partition eats significant performance (not 
quite 50%, but a lot):

### Partition
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
time for fstat(): 0.17s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at 
0x7fbc86739000: 0.39s
time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s

### Whole disk
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
time for fstat(): 0.61s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
0x7fa2434cc000: 0.37s
time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s

Reproducible:
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
time for fstat(): 0.65s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
0x7f1cc17ab000: 0.37s
time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s

So without partition the throughput is about twice as high! Why?

Regards
Ulrich





3.0.101: "blk_rq_check_limits: over max size limit."

2016-08-23 Thread Ulrich Windl
Hello!

While performance-testing a 3PARdata StorServ 8400 with SLES11SP4, I noticed 
that I/Os dropped, until everything stood still more or less. Looking into the 
syslog I found that multipath's TUR-checker considered the paths (FC, BTW) as 
dead. Amazingly I did not have this problem when I did read-only tests.

The start looks like this:
Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
Aug 23 14:44:58 h10 kernel: [  880.159425] blk_rq_check_limits: over max size 
limit.
Aug 23 14:44:58 h10 kernel: [  880.159611] blk_rq_check_limits: over max size 
limit.
Aug 23 14:44:58 h10 kernel: [  880.159615] blk_rq_check_limits: over max size 
limit.
Aug 23 14:44:58 h10 kernel: [  880.159623] device-mapper: multipath: Failing 
path 8:32.
Aug 23 14:44:58 h10 kernel: [  880.186609] blk_rq_check_limits: over max size 
limit.
Aug 23 14:44:58 h10 kernel: [  880.186626] blk_rq_check_limits: over max size 
limit.
Aug 23 14:44:58 h10 kernel: [  880.186628] blk_rq_check_limits: over max size 
limit.
Aug 23 14:44:58 h10 kernel: [  880.186631] device-mapper: multipath: Failing 
path 129:112.
[...]
It seems the TUR-checker plays some ping-pong-like game: paths go up and down.

Now for the Linux part: I found the relevant message in blk-core.c 
(blk_rq_check_limits()).
First s/agaist/against/ in " * Such request stacking drivers should check
those requests agaist"; then there's the problem that the message outputs
neither the blk_rq_sectors(), nor the blk_queue_get_max_sectors(), nor the
underlying device. That makes debugging somewhat difficult if you customize the
block queue settings per device as I did:

Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
queue/rotational for FirstTest-31 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
queue/add_random for FirstTest-31 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
queue/scheduler for FirstTest-31 (noop)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
queue/max_sectors_kb for FirstTest-31 (128)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
queue/rotational for FirstTest-32 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
queue/add_random for FirstTest-32 (0)
Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
queue/scheduler for FirstTest-32 (noop)
Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of 
queue/max_sectors_kb for FirstTest-32 (128)

I suspect the "queue/max_sectors_kb=128" is the culprit:
# multipath -ll FirstTest-32
FirstTest-32 (360002ac40001b383) dm-7 3PARdata,VV
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 2:0:0:1  sdet 129:80  failed ready running
  |- 2:0:2:1  sdev 129:112 failed ready running
  |- 1:0:0:1  sdb  8:16    failed ready running
  `- 1:0:1:1  sdc  8:32    failed ready running
# cat /sys/block/{dm-7,sd{b,c},sde{t,v}}/queue/max_sectors_kb
128
128
128
128
128

While writing this message, I noticed that I had created a primary partition of 
dm-7:
# dmsetup ls |grep Fi
FirstTest-32_part1  (253:8)
FirstTest-32        (253:7)
# cat /sys/block/dm-8/queue/max_sectors_kb
1024

After "# echo 128 >/sys/block/dm-8/queue/max_sectors_kb" things still did not 
get better.
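One way to avoid such a mismatch in the first place is to apply the same limit to every layer of the stack at once: the multipath map, any partition maps on top of it, and all path devices underneath. A hedged sketch (the helper name and the parameterized sysfs root are my own, chosen so the idea can be exercised against a copy of the tree; on a live system it would be run as root against /sys/block):

```shell
# Apply one max_sectors_kb limit to every layer of a dm/multipath stack.
#   $1 = sysfs block root (normally /sys/block; parameterized for testing)
#   $2 = limit in KiB
#   remaining arguments = device names (dm maps, partitions, SCSI paths)
set_max_sectors() {
    sysroot=$1; limit=$2; shift 2
    for dev in "$@"; do
        echo "$limit" > "$sysroot/$dev/queue/max_sectors_kb"
    done
}

# For the stack shown above this would be, e.g.:
#   set_max_sectors /sys/block 128 dm-7 dm-8 sdb sdc sdet sdev
```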

Can't blk_rq_check_limits() do anything more clever than returning -EIO?

Regards,
Ulrich
P.S: Keep me in CC:, please!










cpuset cleanup race

2016-04-26 Thread Ulrich Drepper
I came across a problem with code which uses a cpuset CG and tries to
be responsible and clean up after itself.  The code attached at the
bottom illustrates the problem.  It's only long because it has no
dependencies aside from the basic runtime and should work on all
machines.  You need to run it with privileges high enough to create a
CG.

The code is really simple:
- a (new) CG in cpuset is created
- one of the cores of the root cpuset is selected
- the thread (and therefore entire process) is switched to the cpuset
- a thread is created which does nothing but terminate immediately
- the parent waits for the thread
- then the parent removes itself from the cpuset
- finally the parent tries to remove the created cpuset

The last part is where things go wrong.  Usually* the rmdir() call
made to remove the cpuset fails because the cpuset is still busy.  The
program prints the members of the cpuset CG: it's the child thread.

* I wrote "usually" because slowing down the parent code will help.
I.e., there is a race.  Successful slowdowns I found:
- compile with -fsanitize=address (seems already enough)
- very short wait, e.g., 1ns (you can see this by starting the program
with the parameter "wait")

You might want to compile the code with optimization.  It is a race, after all.


The pthread_join() call made by the parent won't return until the
kernel signals through the futex set up at clone() time that the
thread has terminated.  From the perspective of the userlevel code the
thread is gone.  But not all bookkeeping related to the terminated
thread seems to have been finished.


I didn't look at the code but I can imagine that the futex
notification happens as soon as all observable aspects of the thread
are gone.  This is of course good to not delay the waiter.  Hopefully
the cgroup bookkeeping can also be moved before the notification.


I tested it with a recent kernel (4.5.0-0.rc7) but I doubt it's a recent issue.


~
#define _GNU_SOURCE
#include <errno.h>
#include <error.h>
#include <mntent.h>
#include <paths.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static void *tf(void *p)
{
  return NULL;
}

int main(int argc, char *argv[])
{
  const char *csname = argc == 1 ? "test" : argv[1];

  struct mntent *me;
  FILE *fp = setmntent(_PATH_MOUNTED, "r");
  if (fp == NULL)
error(1, errno, "cannot read mounted filesystem information");
  while ((me = getmntent(fp)) != NULL) {
if (strcmp(me->mnt_type, "cgroup") == 0
&& hasmntopt(me, "cpuset") != NULL)
  break;
  }
  if (me == NULL)
error(1, 0, "cpuset filesystem not mounted");
  endmntent(fp);

  char *cshier = NULL;
  asprintf(&cshier, "%s/%s", me->mnt_dir, csname);

  if (mkdir(cshier, 0777) == 0)
printf("new cpuset control group: %s\n", cshier);
  else if (errno != EEXIST)
error(1, errno, "cannot create cpuset group %s", cshier);

  char *csrootmems;
  asprintf(&csrootmems, "%s/cpuset.mems", me->mnt_dir);
  fp = fopen(csrootmems, "r");
  if (fp == NULL)
error(1, errno, "cannot read /cpuset.mems");
  char *val = NULL;
  size_t vallen = 0;
  ssize_t n = getline(&val, &vallen, fp);
  fclose(fp);
  free(csrootmems);

  char *testmems;
  asprintf(&testmems, "%s/cpuset.mems", cshier);
  fp = fopen(testmems, "w");
  if (fp == NULL)
error(1, errno, "cannot read /%s/cpuset.mems", csname);
  fwrite(val, n, 1, fp);
  fclose(fp);
  free(testmems);
  free(val);

  cpu_set_t cs;
  int first = 0;
  sched_getaffinity(0, sizeof(cs), &cs);
  while (! CPU_ISSET(first, &cs))
++first;

  char *testcpus;
  asprintf(&testcpus, "%s/cpuset.cpus", cshier);
  fp = fopen(testcpus, "w");
  if (fp == NULL)
error(1, errno, "cannot write /%s/cpuset.cpus", csname);
  fprintf(fp, "%d", first);
  fclose(fp);
  free(testcpus);

  char *testtasks;
  asprintf(&testtasks, "%s/tasks", cshier);
  fp = fopen(testtasks, "w");
  if (fp == NULL)
error(1, errno, "cannot write /%s/tasks", csname);
  fprintf(fp, "%d", (int) getpid());
  fclose(fp);

  pthread_t th;
  pthread_create(&th, NULL, tf, NULL);

  pthread_join(th, NULL);

  char *roottasks;
  asprintf(&roottasks, "%s/tasks", me->mnt_dir);
  fp = fopen(roottasks, "w");
  if (fp == NULL)
error(1, errno, "cannot write /tasks");
  fprintf(fp, "%d", (int) getpid());
  fclose(fp);
  free(roottasks);

  if (strcmp(csname, "wait") == 0) {
struct timespec s = { 0, 1 };
    nanosleep(&s, NULL);
  }

  if (rmdir(cshier) != 0) {
printf("PID = %ld\nremaining = ", (long) getpid());
fp = fopen(testtasks, "r");
char *line = NULL;
size_t linelen = 0;
    while ((n = getline(&line, &linelen, fp)) > 0)
  fputs(line, stdout);
fclose(fp);
free(line);
error(1, errno, "couldn't remove cpuset %s", cshier);
  }

  free(cshier);
  free(testtasks);

  return 0;
}




Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old

2016-03-19 Thread Ulrich Obergfell

Josh,

in https://lkml.org/lkml/2016/3/15/1 you stated that the soft lockup
messages do not occur with kernel v4.5. Hence, I believe this should
not be reproducible with kernel v4.4 either. The relevant changes in
update_watchdog_all_cpus() were introduced in kernel v4.3 by patches
that I mentioned in my previous reply. I see that

  watchdog: use park/unpark functions in update_watchdog_all_cpus()

and related patches are included in this change log

  
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/kernel/watchdog.c?id=refs/tags/v4.4.6

In terms of kernel v4.1, it seems that the issue is mitigated by the
access rights of the watchdog parameters in /proc/sys/kernel as only
a privileged user should be able to write to, for example

   /proc/sys/kernel/nmi_watchdog

Also, based on the analysis in my previous reply, I think these soft
lockup messages are 'false positives' as the repeated cancel/restart
of watchdog_timer_fn() prevents the 'watchdog/N' thread from running
(i.e. I think the thread is not prevented from running by something
actually hogging CPU N).


Regards,

Uli


- Original Message -
From: "Josh Hunt" <joh...@akamai.com>
To: "Ulrich Obergfell" <uober...@redhat.com>
Cc: "Don Zickus" <dzic...@redhat.com>, a...@linux-foundation.org, 
linux-kernel@vger.kernel.org
Sent: Thursday, March 17, 2016 5:08:03 PM
Subject: Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is 
same as old

[...]

As you mention my patch will mask this problem for 4.1 which is why I 
wanted to get it into stable. Do you think there is any way to mitigate 
this issue for the stable kernels (4.1 to 4.4) if the user changes the 
values doing something like:

foo=1; while :; do echo $foo > /proc/sys/kernel/nmi_watchdog; foo=$(( ! 
$foo )); sleep .1; done & sleep 30 && kill %1

?

I realize this isn't a real-world use-case (I hope :)), but shows there 
is still a way to cause the box to soft lockup with this code path.

[...]




Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old

2016-03-16 Thread Ulrich Obergfell
tch and it 
did not help. After that I did a git bisect to figure out when the soft 
lockup was fixed and it appears to be resolved after one of the commits 
in this series:

commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28
Author: Ulrich Obergfell <uober...@redhat.com>
Date:   Fri Sep 4 15:45:15 2015 -0700

 watchdog: introduce watchdog_park_threads() and 
watchdog_unpark_threads()

I didn't identify the exact commit.

It would be nice to resolve the soft lockup for the stable folks since 
4.1 and 4.4 are longterm stable releases and would see this problem.

I did not have time to debug it any more outside of this bisection 
today. If you have something you'd like me to try which may work for the 
stable kernels I'm happy to test it.

For the record I'm able to reproduce the soft lockup on 4.1 doing:

while :; do echo 1 > /proc/sys/kernel/nmi_watchdog; sleep .1; done & 
sleep 30 && kill %1 && sleep 5

Josh






<linux/timex.h> and NTP incompatibility (MOD_NANO)

2016-01-31 Thread Ulrich Windl
Hi folks!

I'd wish ntp_loopfilter.c would compile without problems. The mess is (I had 
fixed it about 15 years ago (keyword "PPSkit")) that Linux uses ADJ_* flags to 
do traditional adjtime() things, as well as NTP kernel-related things (That's 
why the Linux syscall is named adjtimex()).
NTP however expects the flags to have names like MOD_* to modify settings, and 
STA_* for status bits to test.

When <linux/timex.h> lacks a MOD_NANO, but has a STA_NANO, compilation of 
ntpd/ntp_loopfilter.c fails.

I'd wish this to be fixed. I don't know which "Bonehead weakener" is needed to 
get the fix permanently into Linux.

My proposal many years ago was to use ADJ_* style flags only for the Linux 
specialities (the adjtime() part of adjtimex()), and use the MOD_* style flags 
for the NTP part of adjtimex().

Regards,
Ulrich
---
Appendix: Linux has only these:
/* xntp 3.4 compatibility names */
#define MOD_OFFSET ADJ_OFFSET
#define MOD_FREQUENCY  ADJ_FREQUENCY
#define MOD_MAXERROR   ADJ_MAXERROR
#define MOD_ESTERROR   ADJ_ESTERROR
#define MOD_STATUS ADJ_STATUS
#define MOD_TIMECONST  ADJ_TIMECONST
#define MOD_CLKB   ADJ_TICK
#define MOD_CLKA   ADJ_OFFSET_SINGLESHOT /* 0x8000 in original */
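On the ntpd side, a workaround of the kind being asked for could look like this shim (my own sketch, not an official fix; it assumes ADJ_NANO is available from the timex header, which holds for linux/timex.h where it is defined as 0x2000):

```c
#include <sys/timex.h>

/* Hypothetical compatibility shim: some headers export STA_NANO but no
 * matching MOD_NANO.  Map the missing MOD_* name onto the ADJ_* mode
 * bit, which is exactly what the other MOD_* aliases above do. */
#ifndef MOD_NANO
# ifdef ADJ_NANO
#  define MOD_NANO ADJ_NANO
# else
#  define MOD_NANO 0x2000   /* value of ADJ_NANO in linux/timex.h */
# endif
#endif
```

With such a guard in place, ntp_loopfilter.c could use MOD_NANO unconditionally regardless of which flag family the header exports.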














Suggestion for improving documentation (/usr/src/linux/Documentation/sysctl/vm.txt)

2016-01-28 Thread Ulrich Windl
Hi!

I just read the documentation for the "swappiness" sysctl parameter, and I 
realized that
1) the documentation does not talk about the valid range of the parameter 
(0-100?)
2) the documentation does not talk about the units the parameter uses (Percent?)

==

swappiness

This control is used to define how aggressive the kernel will swap
memory pages.  Higher values will increase agressiveness, lower values
decrease the amount of swap.

The default value is 60.

==

My proposal to improve sysctl documentation is obvious, and no, I don't mean 
"Use the source, Luke!" ;-)

(I'm not subscribed to LKML)

Regards,
Ulrich




Re: [PATCH 2/2] workqueue: implement lockup detector

2015-12-04 Thread Ulrich Obergfell

Tejun,

> Sure, separating the knobs out isn't difficult.  I still don't like
> the idea of having multiple set of similar knobs controlling about the
> same thing tho.
>
> For example, let's say there's a user who boots with "nosoftlockup"
> explicitly.  I'm pretty sure the user wouldn't be intending to keep
> workqueue watchdog running.  The same goes for threshold adjustments,
> so here's my question.  What are the reasons for the concern?  What
> are we worrying about?

I'm not sure if it is obvious to a user that a stall of workqueues is
"about the same thing" as a soft lockup, and that one could thus argue
that both should be controlled by the same knob. Looking at this from
perspective of usability, I would still vote for having separate knobs
for each lockup detector. For example

  /proc/sys/kernel/wq_watchdog_thresh

could control the on|off state of the workqueue watchdog and the timeout
at the same time (0 means off, > 0 means on and specifies the timeout).
Separating wq_watchdog_thresh from watchdog_thresh might also be useful
for diagnostic purposes for example, if during the investigation of a
problem one would want to explicitly increase or lower one threshold
without impacting the other.


>> And another question that comes to my mind is: Would the workqueue watchdog
>> participate in the lockup detector suspend/resume mechanism, and if yes, how
>> would it be integrated into this ?
>
> From the usage, I can't quite tell what the purpose of the mechanism
> is.  The only user seems to be fixup_ht_bug() and when it fails it
> says "failed to disable PMU erratum BJ122, BV98, HSD29 workaround" so
> if you could give me a pointer, it'd be great.  But at any rate, if
> shutting down watchdog is all that's necessary, it shouldn't be a
> problem to integrate.

The patch post that introduced the mechanism is here:

  http://marc.info/?l=linux-kernel&m=143843318208917&w=2

The watchdog_{suspend|resume} functions were later renamed:

  http://marc.info/?l=linux-kernel&m=143894132129982&w=2

At the moment I don't see a reason why the workqueue watchdog would have to
participate in that mechanism. However, if the workqueue watchdog would be
connected to the soft lockup detector as you proposed, I think it should be
participating for the 'sake of consistency' (it would seem hard to understand
if the interface would only suspend parts of the lockup detector).


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH 2/2] workqueue: implement lockup detector

2015-12-03 Thread Ulrich Obergfell

Tejun,

I share Don's concern about connecting the soft lockup detector and the
workqueue watchdog to the same kernel parameter in /proc. I would feel
more comfortable if the workqueue watchdog had its dedicated parameter.


I also see a scenario that the proposed patch does not handle well: The
watchdog_thresh parameter can be changed 'on the fly' - i.e. it is not
necessary to disable and re-enable the watchdog. The flow of execution
looks like this.

  proc_watchdog_thresh
proc_watchdog_update
  if (watchdog_enabled && watchdog_thresh)
  watchdog_enable_all_cpus
if (!watchdog_running) {
...
} else {
//
// update 'on the fly'
//
update_watchdog_all_cpus()
}

The patched watchdog_enable_all_cpus() function disables the workqueue watchdog
unconditionally at [1]. However, the workqueue watchdog remains disabled if the
code path [2] is executed (and wq_watchdog_thresh is not updated as well).

static int watchdog_enable_all_cpus(void)
{
int err = 0;

[1] --> disable_workqueue_watchdog();

if (!watchdog_running) {
...
} else {
 .- /*
 |   * Enable/disable the lockup detectors or
 |   * change the sample period 'on the fly'.
 |   */
[2] <-- err = update_watchdog_all_cpus();
 |
 |  if (err) {
 |  watchdog_disable_all_cpus();
 |  pr_err("Failed to update lockup detectors, disabled\n");
 '- }
}

if (err)
watchdog_enabled = 0;

return err;
}


And another question that comes to my mind is: Would the workqueue watchdog
participate in the lockup detector suspend/resume mechanism, and if yes, how
would it be integrated into this ?


Regards,

Uli


- Original Message -
From: "Tejun Heo" 
To: "Don Zickus" 
Cc: "Ulrich Obergfell" , "Ingo Molnar" , 
"Peter Zijlstra" , "Andrew Morton" 
, linux-kernel@vger.kernel.org, kernel-t...@fb.com
Sent: Thursday, December 3, 2015 8:43:58 PM
Subject: Re: [PATCH 2/2] workqueue: implement lockup detector

Hello, Don.

On Thu, Dec 03, 2015 at 12:50:24PM -0500, Don Zickus wrote:
> This sort of looks like the hung task detector..
> 
> I am a little concerned because we just made a big effort to properly
> separate the hardlockup and softlockup paths and yet retain the flexibility
> to enable/disable them separately.  Now it seems the workqueue detector is
> permanently entwined with the softlockup detector.  I am not entirely sure
> that is correct thing to do.

The only area they get entwined is how it's controlled from userland.
While it isn't quite the same as softlockup detection, I think what it
monitors is close enough that it makes sense to put them under the
same interface.

> It also seems awkward for the lockup code to have to jump to the workqueue
> code to function properly. :-/  Though we have made exceptions for the virt
> stuff and the workqueue code is simple..

Softlockup code doesn't depend on workqueue in any way.  Workqueue
tags on touch_softlockup to detect cases which shouldn't be warned and
its enabledness is controlled together with softlockup and that's it.

> Actually, I am curious, it seems if you just added a
> /proc/sys/kernel/wq_watchdog entry, you could elminiate the entire need for
> modifying the watchdog code to begin with.  As you really aren't using any
> of it other than piggybacking on the touch_softlockup_watchdog stuff, which
> could probably be easily added without all the extra enable/disable changes
> in watchdog.c.

Yeah, except for touch signal, it's purely interface thing.  I don't
feel too strong about this but it seems a bit silly to introduce a
whole different set of interface for this.  e.g. if the user wanted to
disable softlockup detection, it'd be weird to leave wq lockup
detection running.  The same goes for threshold.

> Again, this looks like what the hung task detector is doing, which I
> struggled with years ago to integrate with the lockup code because in the
> end I had trouble re-using much of it.

So, it's a stall detector and there are inherent similarities but the
conditions tested are pretty different and it's a lot lighter.  I'm
not really sure what you're meaning to say.

Thanks.

-- 
tejun



[PATCH 1/4] watchdog: avoid race between lockup detector suspend/resume and CPU hotplug

2015-11-03 Thread Ulrich Obergfell
The lockup detector suspend/resume interface that was introduced by
commit 8c073d27d7ad293bf734cc8475689413afadab81 does not protect
itself against races with CPU hotplug. Hence, theoretically it is
possible that a new watchdog thread is started on a hotplugged CPU
while the lockup detector is suspended, and the thread could thus
interfere unexpectedly with the code that requested to suspend the
lockup detector. Avoid the race by calling

  get_online_cpus() in lockup_detector_suspend()
  put_online_cpus() in lockup_detector_resume()

Signed-off-by: Ulrich Obergfell 
---
 kernel/watchdog.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0a23125..7357842 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -719,6 +719,7 @@ int lockup_detector_suspend(void)
 {
int ret = 0;
 
+   get_online_cpus();
mutex_lock(&watchdog_proc_mutex);
/*
 * Multiple suspend requests can be active in parallel (counted by
@@ -759,6 +760,7 @@ void lockup_detector_resume(void)
watchdog_unpark_threads();
 
mutex_unlock(&watchdog_proc_mutex);
+   put_online_cpus();
 }
 
 static int update_watchdog_all_cpus(void)
-- 
1.7.11.7



[PATCH 0/4] watchdog: address various races (CPU hotplug, timer expiry)

2015-11-03 Thread Ulrich Obergfell
This patch set addresses various races in relation to CPU hotplug
and a race in relation to watchdog timer expiry. I discovered the
corner cases during code inspection. I haven't seen any of these
issues occur in practice.

Ulrich Obergfell (4):
  watchdog: avoid race between lockup detector suspend/resume and CPU
hotplug
  watchdog: avoid races between /proc handlers and CPU hotplug
  watchdog: remove {get|put}_online_cpus() from
watchdog_{park|unpark}_threads()
  watchdog: fix race between proc_watchdog_thresh() and
watchdog_timer_fn()

 kernel/watchdog.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

-- 
1.7.11.7



[PATCH 4/4] watchdog: fix race between proc_watchdog_thresh() and watchdog_timer_fn()

2015-11-03 Thread Ulrich Obergfell
Theoretically it is possible that the watchdog timer expires
right at the time when a user sets 'watchdog_thresh' to zero
(note: this disables the lockup detectors). In this scenario,
the is_softlockup() function - which is called by the timer -
could produce a false positive.

Fix this by checking the current value of 'watchdog_thresh'.

Signed-off-by: Ulrich Obergfell 
---
 kernel/watchdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 84c4744..18f34cf 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -289,7 +289,7 @@ static int is_softlockup(unsigned long touch_ts)
 {
unsigned long now = get_timestamp();
 
-   if (watchdog_enabled & SOFT_WATCHDOG_ENABLED) {
+   if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
/* Warn about unreasonable delays. */
if (time_after(now, touch_ts + get_softlockup_thresh()))
return now - touch_ts;
-- 
1.7.11.7



[PATCH 2/4] watchdog: avoid races between /proc handlers and CPU hotplug

2015-11-03 Thread Ulrich Obergfell
The handler functions for watchdog parameters in /proc/sys/kernel
do not protect themselves against races with CPU hotplug. Hence,
theoretically it is possible that a new watchdog thread is started
on a hotplugged CPU while a parameter is being modified, and the
thread could thus use a parameter value that is 'in transition'.

For example, if 'watchdog_thresh' is being set to zero (note: this
disables the lockup detectors) the thread would erroneously use the
value zero as the sample period.

To avoid such races and to keep the /proc handler code consistent,
call
 {get|put}_online_cpus() in proc_watchdog_common()
 {get|put}_online_cpus() in proc_watchdog_thresh()
 {get|put}_online_cpus() in proc_watchdog_cpumask()

Signed-off-by: Ulrich Obergfell 
---
 kernel/watchdog.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7357842..13fdda1 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -857,6 +857,7 @@ static int proc_watchdog_common(int which, struct ctl_table 
*table, int write,
int err, old, new;
int *watchdog_param = (int *)table->data;
 
+   get_online_cpus();
mutex_lock(&watchdog_proc_mutex);
 
if (watchdog_suspended) {
@@ -908,6 +909,7 @@ static int proc_watchdog_common(int which, struct ctl_table 
*table, int write,
}
 out:
mutex_unlock(&watchdog_proc_mutex);
+   put_online_cpus();
return err;
 }
 
@@ -949,6 +951,7 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 {
int err, old;
 
+   get_online_cpus();
mutex_lock(&watchdog_proc_mutex);
 
if (watchdog_suspended) {
@@ -974,6 +977,7 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
}
 out:
mutex_unlock(&watchdog_proc_mutex);
+   put_online_cpus();
return err;
 }
 
@@ -988,6 +992,7 @@ int proc_watchdog_cpumask(struct ctl_table *table, int 
write,
 {
int err;
 
+   get_online_cpus();
mutex_lock(&watchdog_proc_mutex);
 
if (watchdog_suspended) {
@@ -1015,6 +1020,7 @@ int proc_watchdog_cpumask(struct ctl_table *table, int 
write,
}
 out:
mutex_unlock(&watchdog_proc_mutex);
+   put_online_cpus();
return err;
 }
 
-- 
1.7.11.7



[PATCH 3/4] watchdog: remove {get|put}_online_cpus() from watchdog_{park|unpark}_threads()

2015-11-03 Thread Ulrich Obergfell
watchdog_{park|unpark}_threads() are now called in code paths that
protect themselves against CPU hotplug, so {get|put}_online_cpus()
calls are redundant and can be removed.

Signed-off-by: Ulrich Obergfell 
---
 kernel/watchdog.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 13fdda1..84c4744 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -683,33 +683,35 @@ static struct smp_hotplug_thread watchdog_threads = {
  * be parked and the watchdog threads of other CPUs can still be runnable.
  * Callers are expected to handle this special condition as appropriate in
  * their context.
+ *
+ * This function may only be called in a context that is protected against
+ * races with CPU hotplug - for example, via get_online_cpus().
  */
 static int watchdog_park_threads(void)
 {
int cpu, ret = 0;
 
-   get_online_cpus();
for_each_watchdog_cpu(cpu) {
ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
if (ret)
break;
}
-   put_online_cpus();
 
return ret;
 }
 
 /*
  * unpark all watchdog threads that are specified in 'watchdog_cpumask'
+ *
+ * This function may only be called in a context that is protected against
+ * races with CPU hotplug - for example, via get_online_cpus().
  */
 static void watchdog_unpark_threads(void)
 {
int cpu;
 
-   get_online_cpus();
for_each_watchdog_cpu(cpu)
kthread_unpark(per_cpu(softlockup_watchdog, cpu));
-   put_online_cpus();
 }
 
 /*
-- 
1.7.11.7



[PATCH 2/4] watchdog: avoid races between /proc handlers and CPU hotplug

2015-11-03 Thread Ulrich Obergfell
The handler functions for watchdog parameters in /proc/sys/kernel
do not protect themselves against races with CPU hotplug. Hence,
theoretically it is possible that a new watchdog thread is started
on a hotplugged CPU while a parameter is being modified, and the
thread could thus use a parameter value that is 'in transition'.

For example, if 'watchdog_thresh' is being set to zero (note: this
disables the lockup detectors) the thread would erroneously use the
value zero as the sample period.

To avoid such races and to keep the /proc handler code consistent,
call
 {get|put}_online_cpus() in proc_watchdog_common()
 {get|put}_online_cpus() in proc_watchdog_thresh()
 {get|put}_online_cpus() in proc_watchdog_cpumask()

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7357842..13fdda1 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -857,6 +857,7 @@ static int proc_watchdog_common(int which, struct ctl_table 
*table, int write,
int err, old, new;
int *watchdog_param = (int *)table->data;
 
+   get_online_cpus();
mutex_lock(_proc_mutex);
 
if (watchdog_suspended) {
@@ -908,6 +909,7 @@ static int proc_watchdog_common(int which, struct ctl_table 
*table, int write,
}
 out:
mutex_unlock(_proc_mutex);
+   put_online_cpus();
return err;
 }
 
@@ -949,6 +951,7 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 {
int err, old;
 
+   get_online_cpus();
mutex_lock(_proc_mutex);
 
if (watchdog_suspended) {
@@ -974,6 +977,7 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
}
 out:
mutex_unlock(_proc_mutex);
+   put_online_cpus();
return err;
 }
 
@@ -988,6 +992,7 @@ int proc_watchdog_cpumask(struct ctl_table *table, int 
write,
 {
int err;
 
+   get_online_cpus();
mutex_lock(_proc_mutex);
 
if (watchdog_suspended) {
@@ -1015,6 +1020,7 @@ int proc_watchdog_cpumask(struct ctl_table *table, int 
write,
}
 out:
mutex_unlock(_proc_mutex);
+   put_online_cpus();
return err;
 }
 
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/4] watchdog: remove {get|put}_online_cpus() from watchdog_{park|unpark}_threads()

2015-11-03 Thread Ulrich Obergfell
watchdog_{park|unpark}_threads() are now called in code paths that
protect themselves against CPU hotplug, so {get|put}_online_cpus()
calls are redundant and can be removed.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 13fdda1..84c4744 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -683,33 +683,35 @@ static struct smp_hotplug_thread watchdog_threads = {
  * be parked and the watchdog threads of other CPUs can still be runnable.
  * Callers are expected to handle this special condition as appropriate in
  * their context.
+ *
+ * This function may only be called in a context that is protected against
+ * races with CPU hotplug - for example, via get_online_cpus().
  */
 static int watchdog_park_threads(void)
 {
int cpu, ret = 0;
 
-   get_online_cpus();
for_each_watchdog_cpu(cpu) {
ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
if (ret)
break;
}
-   put_online_cpus();
 
return ret;
 }
 
 /*
  * unpark all watchdog threads that are specified in 'watchdog_cpumask'
+ *
+ * This function may only be called in a context that is protected against
+ * races with CPU hotplug - for example, via get_online_cpus().
  */
 static void watchdog_unpark_threads(void)
 {
int cpu;
 
-   get_online_cpus();
for_each_watchdog_cpu(cpu)
kthread_unpark(per_cpu(softlockup_watchdog, cpu));
-   put_online_cpus();
 }
 
 /*
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/4] watchdog: fix race between proc_watchdog_thresh() and watchdog_timer_fn()

2015-11-03 Thread Ulrich Obergfell
Theoretically it is possible that the watchdog timer expires
right at the time when a user sets 'watchdog_thresh' to zero
(note: this disables the lockup detectors). In this scenario,
the is_softlockup() function - which is called by the timer -
could produce a false positive.

Fix this by checking the current value of 'watchdog_thresh'.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 84c4744..18f34cf 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -289,7 +289,7 @@ static int is_softlockup(unsigned long touch_ts)
 {
unsigned long now = get_timestamp();
 
-   if (watchdog_enabled & SOFT_WATCHDOG_ENABLED) {
+   if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
/* Warn about unreasonable delays. */
if (time_after(now, touch_ts + get_softlockup_thresh()))
return now - touch_ts;
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/4] watchdog: address various races (CPU hotplug, timer expiry)

2015-11-03 Thread Ulrich Obergfell
This patch set addresses various races in relation to CPU hotplug
and a race in relation to watchdog timer expiry. I discovered the
corner cases during code inspection. I haven't seen any of these
issues occur in practice.

Ulrich Obergfell (4):
  watchdog: avoid race between lockup detector suspend/resume and CPU
hotplug
  watchdog: avoid races between /proc handlers and CPU hotplug
  watchdog: remove {get|put}_online_cpus() from
watchdog_{park|unpark}_threads()
  watchdog: fix race between proc_watchdog_thresh() and
watchdog_timer_fn()

 kernel/watchdog.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/4] watchdog: avoid race between lockup detector suspend/resume and CPU hotplug

2015-11-03 Thread Ulrich Obergfell
The lockup detector suspend/resume interface that was introduced by
commit 8c073d27d7ad293bf734cc8475689413afadab81 does not protect
itself against races with CPU hotplug. Hence, theoretically it is
possible that a new watchdog thread is started on a hotplugged CPU
while the lockup detector is suspended, and the thread could thus
interfere unexpectedly with the code that requested to suspend the
lockup detector. Avoid the race by calling

  get_online_cpus() in lockup_detector_suspend()
  put_online_cpus() in lockup_detector_resume()

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0a23125..7357842 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -719,6 +719,7 @@ int lockup_detector_suspend(void)
 {
int ret = 0;
 
+   get_online_cpus();
mutex_lock(_proc_mutex);
/*
 * Multiple suspend requests can be active in parallel (counted by
@@ -759,6 +760,7 @@ void lockup_detector_resume(void)
watchdog_unpark_threads();
 
mutex_unlock(_proc_mutex);
+   put_online_cpus();
 }
 
 static int update_watchdog_all_cpus(void)
-- 
1.7.11.7

[tip:irq/core] irqchip/renesas-intc-irqpin: r8a7778 IRLM setup support

2015-09-30 Thread tip-bot for Ulrich Hecht
Commit-ID:  26c21dd9885a2d8a4f4d539917c4877ffd399286
Gitweb: http://git.kernel.org/tip/26c21dd9885a2d8a4f4d539917c4877ffd399286
Author: Ulrich Hecht <ulrich.hecht+rene...@gmail.com>
AuthorDate: Wed, 30 Sep 2015 12:03:07 +0200
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Wed, 30 Sep 2015 21:02:22 +0200

irqchip/renesas-intc-irqpin: r8a7778 IRLM setup support

Works the same as on r8a7779.

Signed-off-by: Ulrich Hecht <ulrich.hecht+rene...@gmail.com>
Signed-off-by: Geert Uytterhoeven <geert+rene...@glider.be>
Cc: Jason Cooper <ja...@lakedaemon.net>
Cc: Marc Zyngier <marc.zyng...@arm.com>
Link: 
http://lkml.kernel.org/r/1443607387-19147-1-git-send-email-geert+brene...@glider.be
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 drivers/irqchip/irq-renesas-intc-irqpin.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-renesas-intc-irqpin.c b/drivers/irqchip/irq-renesas-intc-irqpin.c
index 9525335..c325806 100644
--- a/drivers/irqchip/irq-renesas-intc-irqpin.c
+++ b/drivers/irqchip/irq-renesas-intc-irqpin.c
@@ -361,14 +361,16 @@ static const struct irq_domain_ops intc_irqpin_irq_domain_ops = {
.xlate  = irq_domain_xlate_twocell,
 };
 
-static const struct intc_irqpin_irlm_config intc_irqpin_irlm_r8a7779 = {
+static const struct intc_irqpin_irlm_config intc_irqpin_irlm_r8a777x = {
.irlm_bit = 23, /* ICR0.IRLM0 */
 };
 
 static const struct of_device_id intc_irqpin_dt_ids[] = {
{ .compatible = "renesas,intc-irqpin", },
+   { .compatible = "renesas,intc-irqpin-r8a7778",
+ .data = &intc_irqpin_irlm_r8a777x },
 { .compatible = "renesas,intc-irqpin-r8a7779",
- .data = &intc_irqpin_irlm_r8a7779 },
+ .data = &intc_irqpin_irlm_r8a777x },
{},
 };
 MODULE_DEVICE_TABLE(of, intc_irqpin_dt_ids);


Re: [PATCH 0/5] improve handling of errors returned by kthread_park()

2015-09-30 Thread Ulrich Obergfell

Andrew,

> ... what inspired this patchset?
> Are you experiencing kthread_park() failures in practice?

I did not experience kthread_park() failures in practice.  Looking at
watchdog_park_threads() from 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28
I realized that there is a theoretical corner case which would not be
handled well. Let's assume that kthread_park() would return an error
in the following flow of execution (the user changes watchdog_thresh).

  proc_watchdog_thresh
set_sample_period()
//
// The watchdog_thresh and sample_period variable are now set to
// the new value.
//
proc_watchdog_update
  watchdog_enable_all_cpus
update_watchdog_all_cpus
  watchdog_park_threads

Let's say the system has eight CPUs and that kthread_park() failed to
park watchdog/4. In this example watchdog/0 .. watchdog/3 are already
parked and watchdog/5 .. watchdog/7 are not parked yet (we don't know
exactly what happened to watchdog/4). watchdog_park_threads() unparks
the threads if kthread_park() of one thread fails.

  for_each_watchdog_cpu(cpu) {
  ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
  if (ret)
  break;
  }
  if (ret) {
  for_each_watchdog_cpu(cpu)
  kthread_unpark(per_cpu(softlockup_watchdog, cpu));
  }

watchdog/0 .. watchdog/3 will pick up the new watchdog_thresh value
when they are unparked (please see the watchdog_enable() function),
whereas watchdog/5 .. watchdog/7 will continue to use the old value
for the hard lockup detector and begin using the new value for the
soft lockup detector (kthread_unpark() sees watchdog/5 .. watchdog/7
in the unparked state, so it skips these threads). The inconsistency
which results from using different watchdog_thresh values can cause
unexpected behaviour of the lockup detectors (e.g. false positives).

The new error handling that is introduced by this patch set aims to
handle the above corner case in a better way (this was my original
motivation to come up with a patch set). However, I also think that
_if_ kthread_park() would ever be changed in the future so that it
could return errors under various (other) conditions, the patch set
should prepare the watchdog code for this possibility.

Since I did not experience kthread_park() failures in practice, I
used some instrumentation to fake error returns from kthread_park()
in order to test the patches.


Regards,

Uli


- Original Message -
From: "Andrew Morton" <a...@linux-foundation.org>
To: "Ulrich Obergfell" <uober...@redhat.com>
Cc: linux-kernel@vger.kernel.org, dzic...@redhat.com, atom...@redhat.com
Sent: Wednesday, September 30, 2015 1:30:36 AM
Subject: Re: [PATCH 0/5] improve handling of errors returned by kthread_park()

On Mon, 28 Sep 2015 22:44:07 +0200 Ulrich Obergfell <uober...@redhat.com> wrote:

> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:
> 
> - watchdog suspend/resume interface
> 
> lockup_detector_suspend
>   watchdog_park_threads
> 
> - write to parameters in /proc/sys/kernel
> 
> proc_watchdog_update
>   watchdog_enable_all_cpus
> update_watchdog_all_cpus
>   watchdog_park_threads
> 
> Instead of 'blindly' attempting to unpark the watchdog threads if a 
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains. Failure becomes visible to the
> user as follows:
> 
> - error messages from lockup_detector_suspend()
>or watchdog_enable_all_cpus()
> 
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
> 
> - the 'write' system call in the latter call chain returns an error
> 

hm, you made me look at kthread parking.  Why does it exist?  What is a
"parked" thread anyway, and how does it differ from, say, a sleeping
one?  The 2a1d446019f9a5983ec5a335b changelog is pretty useless and the
patch added no useful documentation, sigh.

Anyway...  what inspired this patchset?  Are you experiencing
kthread_park() failures in practice?  If so, what is causing them?  And
what is the user-visible effect of these failures?  This is all pretty
important context for such a patchset.


[PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers

2015-09-28 Thread Ulrich Obergfell
update_watchdog_all_cpus() now passes errors from watchdog_park_threads()
up to functions in the call chain. This allows watchdog_enable_all_cpus()
and proc_watchdog_update() to handle such errors too.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 30 +++---
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index eb9527c..457113c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -731,10 +731,17 @@ void lockup_detector_resume(void)
	mutex_unlock(&watchdog_proc_mutex);
 }
 
-static void update_watchdog_all_cpus(void)
+static int update_watchdog_all_cpus(void)
 {
-   watchdog_park_threads();
+   int ret;
+
+   ret = watchdog_park_threads();
+   if (ret)
+   return ret;
+
watchdog_unpark_threads();
+
+   return 0;
 }
 
 static int watchdog_enable_all_cpus(void)
@@ -753,9 +760,17 @@ static int watchdog_enable_all_cpus(void)
 * Enable/disable the lockup detectors or
 * change the sample period 'on the fly'.
 */
-   update_watchdog_all_cpus();
+   err = update_watchdog_all_cpus();
+
+   if (err) {
+   watchdog_disable_all_cpus();
+   pr_err("Failed to update lockup detectors, disabled\n");
+   }
}
 
+   if (err)
+   watchdog_enabled = 0;
+
return err;
 }
 
@@ -851,12 +866,13 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
} while (cmpxchg(&watchdog_enabled, old, new) != old);
 
/*
-* Update the run state of the lockup detectors.
-* Restore 'watchdog_enabled' on failure.
+* Update the run state of the lockup detectors. There is _no_
+* need to check the value returned by proc_watchdog_update()
+* and to restore the previous value of 'watchdog_enabled' as
+* both lockup detectors are disabled if proc_watchdog_update()
+* returns an error.
 */
err = proc_watchdog_update();
-   if (err)
-   watchdog_enabled = old;
}
 out:
	mutex_unlock(&watchdog_proc_mutex);
-- 
1.7.11.7



[PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh()

2015-09-28 Thread Ulrich Obergfell
Restore the previous value of watchdog_thresh _and_ sample_period
if proc_watchdog_update() returns an error. The variables must be
consistent to avoid false positives of the lockup detectors.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 64ed1c3..cd9a504 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -914,13 +914,14 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
goto out;
 
/*
-* Update the sample period.
-* Restore 'watchdog_thresh' on failure.
+* Update the sample period. Restore on failure.
 */
set_sample_period();
err = proc_watchdog_update();
-   if (err)
+   if (err) {
watchdog_thresh = old;
+   set_sample_period();
+   }
 out:
	mutex_unlock(&watchdog_proc_mutex);
return err;
-- 
1.7.11.7



[PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error

2015-09-28 Thread Ulrich Obergfell
If kthread_park() returns an error, watchdog_park_threads() should not
blindly 'roll back' the already parked threads to the unparked state.
Instead leave it up to the callers to handle such errors appropriately
in their context. For example, it is redundant to unpark the threads
if the lockup detectors will soon be disabled by the callers anyway.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 3bc22a9..af70bf2 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -654,6 +654,12 @@ static struct smp_hotplug_thread watchdog_threads = {
 
 /*
  * park all watchdog threads that are specified in 'watchdog_cpumask'
+ *
+ * This function returns an error if kthread_park() of a watchdog thread
+ * fails. In this situation, the watchdog threads of some CPUs can already
+ * be parked and the watchdog threads of other CPUs can still be runnable.
+ * Callers are expected to handle this special condition as appropriate in
+ * their context.
  */
 static int watchdog_park_threads(void)
 {
@@ -665,10 +671,6 @@ static int watchdog_park_threads(void)
if (ret)
break;
}
-   if (ret) {
-   for_each_watchdog_cpu(cpu)
-   kthread_unpark(per_cpu(softlockup_watchdog, cpu));
-   }
put_online_cpus();
 
return ret;
-- 
1.7.11.7



[PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend()

2015-09-28 Thread Ulrich Obergfell
lockup_detector_suspend() now handles errors from watchdog_park_threads().

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 457113c..3bc22a9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -707,6 +707,11 @@ int lockup_detector_suspend(void)
 
if (ret == 0)
watchdog_suspended++;
+   else {
+   watchdog_disable_all_cpus();
+   pr_err("Failed to suspend lockup detectors, disabled\n");
+   watchdog_enabled = 0;
+   }
 
	mutex_unlock(&watchdog_proc_mutex);
 
-- 
1.7.11.7



[PATCH 0/5] improve handling of errors returned by kthread_park()

2015-09-28 Thread Ulrich Obergfell
The original watchdog_park_threads() function that was introduced by
commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
approach to handle errors returned by kthread_park(): It attempts to
roll back all watchdog threads to the unparked state. However, this
may be undesired behaviour from the perspective of the caller which
may want to handle errors as appropriate in its specific context.
Currently, there are two possible call chains:

- watchdog suspend/resume interface

lockup_detector_suspend
  watchdog_park_threads

- write to parameters in /proc/sys/kernel

proc_watchdog_update
  watchdog_enable_all_cpus
update_watchdog_all_cpus
  watchdog_park_threads

Instead of 'blindly' attempting to unpark the watchdog threads if a 
kthread_park() call fails, the new approach is to disable the lockup
detectors in the above call chains. Failure becomes visible to the
user as follows:

- error messages from lockup_detector_suspend()
   or watchdog_enable_all_cpus()

- the state that can be read from /proc/sys/kernel/watchdog_enabled

- the 'write' system call in the latter call chain returns an error

Ulrich Obergfell (5):
  watchdog: fix error handling in proc_watchdog_thresh()
  watchdog: move watchdog_disable_all_cpus() outside of ifdef
  watchdog: implement error handling in update_watchdog_all_cpus() and
callers
  watchdog: implement error handling in lockup_detector_suspend()
  watchdog: do not unpark threads in watchdog_park_threads() on error

 kernel/watchdog.c | 60 +++
 1 file changed, 43 insertions(+), 17 deletions(-)

-- 
1.7.11.7



[PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef

2015-09-28 Thread Ulrich Obergfell
It makes sense to place watchdog_{dis|enable}_all_cpus() outside of
the ifdef so that _both_ are available even if CONFIG_SYSCTL is not
defined.

Signed-off-by: Ulrich Obergfell <uober...@redhat.com>
---
 kernel/watchdog.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index cd9a504..eb9527c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -347,6 +347,9 @@ static void watchdog_interrupt_count(void)
 static int watchdog_nmi_enable(unsigned int cpu);
 static void watchdog_nmi_disable(unsigned int cpu);
 
+static int watchdog_enable_all_cpus(void);
+static void watchdog_disable_all_cpus(void);
+
 /* watchdog kicker functions */
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
@@ -756,9 +759,6 @@ static int watchdog_enable_all_cpus(void)
return err;
 }
 
-/* prepare/enable/disable routines */
-/* sysctl functions */
-#ifdef CONFIG_SYSCTL
 static void watchdog_disable_all_cpus(void)
 {
if (watchdog_running) {
@@ -767,6 +767,8 @@ static void watchdog_disable_all_cpus(void)
}
 }
 
+#ifdef CONFIG_SYSCTL
+
 /*
  * Update the run state of the lockup detectors.
  */
-- 
1.7.11.7



[PATCH 2/2] watchdog: use pr_debug() in fixup_ht_bug() failure path

2015-08-07 Thread Ulrich Obergfell

Signed-off-by: Ulrich Obergfell 
---
 arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index 0357bf7..abb25c3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3369,7 +3369,7 @@ static __init int fixup_ht_bug(void)
}
 
if (lockup_detector_suspend() != 0) {
-   pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 
workaround\n");
+   pr_debug("failed to disable PMU erratum BJ122, BV98, HSD29 
workaround\n");
return 0;
}
 
-- 
1.7.11.7



[PATCH 0/2] watchdog: avoid races in watchdog_nmi_{en|disable} functions - follow up patches

2015-08-07 Thread Ulrich Obergfell
This small series follows up to patch discussion related
to $subject patch series which is now in linux-next at:

- introduce watchdog_park_threads() and watchdog_unpark_threads()
  
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=89f536017155c7f76c8a620e6aed03809e53cd4c

- introduce watchdog_suspend() and watchdog_resume()
  
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=e0dd9ee3ecf3394e93343cf9542d8ca1b54eca08

- use park/unpark functions in update_watchdog_all_cpus()
  
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=eef1a759acfb98766fef20afca9e06f014299e3f

- use suspend/resume interface in fixup_ht_bug()
  
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/kernel/watchdog.c?id=77592a82363dd508ce3dbe4eaaa2c6c45b966ebb

Changes based on patch discussion:

- Patch 1/2: rename watchdog_{suspend|resume}, add comment blocks
  http://marc.info/?l=linux-kernel&m=143844050610220&w=2
  http://marc.info/?l=linux-kernel&m=143876224114821&w=2

- Patch 2/2: use pr_debug() instead of pr_info()
  http://marc.info/?l=linux-kernel&m=143869949229461&w=2

Ulrich Obergfell (2):
  watchdog: rename watchdog_suspend() and watchdog_resume()
  watchdog: use pr_debug() in fixup_ht_bug() failure path

 arch/x86/kernel/cpu/perf_event_intel.c |  6 +++---
 include/linux/nmi.h|  8 
 kernel/watchdog.c  | 26 ++
 3 files changed, 29 insertions(+), 11 deletions(-)

-- 
1.7.11.7



[PATCH 1/2] watchdog: rename watchdog_suspend() and watchdog_resume()

2015-08-07 Thread Ulrich Obergfell
Rename watchdog_suspend() to lockup_detector_suspend() and
watchdog_resume() to lockup_detector_resume() to avoid
confusion with the watchdog subsystem and to be consistent
with the existing name lockup_detector_init().

Also provide comment blocks to explain the watchdog_running
and watchdog_suspended variables and their relationship.

Signed-off-by: Ulrich Obergfell 
---
 arch/x86/kernel/cpu/perf_event_intel.c |  4 ++--
 include/linux/nmi.h|  8 
 kernel/watchdog.c  | 26 ++
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index d4e1b0c..0357bf7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3368,7 +3368,7 @@ static __init int fixup_ht_bug(void)
return 0;
}
 
-   if (watchdog_suspend() != 0) {
+   if (lockup_detector_suspend() != 0) {
pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 
workaround\n");
return 0;
}
@@ -3379,7 +3379,7 @@ static __init int fixup_ht_bug(void)
x86_pmu.commit_scheduling = NULL;
x86_pmu.stop_scheduling = NULL;
 
-   watchdog_resume();
+   lockup_detector_resume();
 
get_online_cpus();
 
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 46e28e9e..78488e0 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -84,15 +84,15 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
void __user *, size_t *, loff_t *);
 extern int proc_watchdog_cpumask(struct ctl_table *, int,
 void __user *, size_t *, loff_t *);
-extern int watchdog_suspend(void);
-extern void watchdog_resume(void);
+extern int lockup_detector_suspend(void);
+extern void lockup_detector_resume(void);
 #else
-static inline int watchdog_suspend(void)
+static inline int lockup_detector_suspend(void)
 {
return 0;
 }
 
-static inline void watchdog_resume(void)
+static inline void lockup_detector_resume(void)
 {
 }
 #endif
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 69666f4..64ed1c3 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -67,8 +67,26 @@ unsigned long *watchdog_cpumask_bits = 
cpumask_bits(&watchdog_cpumask);
 #define for_each_watchdog_cpu(cpu) \
for_each_cpu_and((cpu), cpu_online_mask, &watchdog_cpumask)
 
-static int __read_mostly watchdog_suspended;
+/*
+ * The 'watchdog_running' variable is set to 1 when the watchdog threads
+ * are registered/started and is set to 0 when the watchdog threads are
+ * unregistered/stopped, so it is an indicator whether the threads exist.
+ */
 static int __read_mostly watchdog_running;
+/*
+ * If a subsystem has a need to deactivate the watchdog temporarily, it
+ * can use the suspend/resume interface to achieve this. The content of
+ * the 'watchdog_suspended' variable reflects this state. Existing threads
+ * are parked/unparked by the lockup_detector_{suspend|resume} functions
+ * (see comment blocks pertaining to those functions for further details).
+ *
+ * 'watchdog_suspended' also prevents threads from being registered/started
+ * or unregistered/stopped via parameters in /proc/sys/kernel, so the state
+ * of 'watchdog_running' cannot change while the watchdog is deactivated
+ * temporarily (see related code in 'proc' handlers).
+ */
+static int __read_mostly watchdog_suspended;
+
 static u64 __read_mostly sample_period;
 
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
@@ -669,7 +687,7 @@ static void watchdog_unpark_threads(void)
 /*
  * Suspend the hard and soft lockup detector by parking the watchdog threads.
  */
-int watchdog_suspend(void)
+int lockup_detector_suspend(void)
 {
int ret = 0;
 
@@ -679,7 +697,7 @@ int watchdog_suspend(void)
 * the 'watchdog_suspended' variable). If the watchdog threads are
 * running, the first caller takes care that they will be parked.
 * The state of 'watchdog_running' cannot change while a suspend
-* request is active (see related changes in 'proc' handlers).
+* request is active (see related code in 'proc' handlers).
 */
if (watchdog_running && !watchdog_suspended)
ret = watchdog_park_threads();
@@ -695,7 +713,7 @@ int watchdog_suspend(void)
 /*
  * Resume the hard and soft lockup detector by unparking the watchdog threads.
  */
-void watchdog_resume(void)
+void lockup_detector_resume(void)
 {
mutex_lock(&watchdog_proc_mutex);
 
-- 
1.7.11.7



Re: [PATCH 2/4] watchdog: introduce watchdog_suspend() and watchdog_resume()

2015-08-05 Thread Ulrich Obergfell
> - Original Message -
> From: "Andrew Morton" 
> ...
> On Sat,  1 Aug 2015 14:49:23 +0200 Ulrich Obergfell  
> wrote:
>
>> This interface can be utilized to deactivate the hard and soft lockup
>> detector temporarily. Callers are expected to minimize the duration of
>> deactivation. Multiple deactivations are allowed to occur in parallel
>> but should be rare in practice.
>>
>> ...
>>
>> --- a/include/linux/nmi.h
>> +++ b/include/linux/nmi.h
>> @@ -80,6 +80,8 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
>>  void __user *, size_t *, loff_t *);
>>  extern int proc_watchdog_cpumask(struct ctl_table *, int,
>>   void __user *, size_t *, loff_t *);
>> +extern int watchdog_suspend(void);
>> +extern void watchdog_resume(void);
>>  #endif
>>  
>>  #ifdef CONFIG_HAVE_ACPI_APEI_NMI
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 5571f20..98d44b1 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -67,6 +67,7 @@ unsigned long *watchdog_cpumask_bits = 
>> cpumask_bits(&watchdog_cpumask);
>>  #define for_each_watchdog_cpu(cpu) \
>>  for_each_cpu_and((cpu), cpu_online_mask, &watchdog_cpumask)
>>  
>> +static int __read_mostly watchdog_suspended = 0;
>
> With my compiler the "= 0" increases the size of watchdog.o data.  For
> some reason by 16 bytes(!).

I see that you already fixed this. Many Thanks.

>>  static int __read_mostly watchdog_running;
>>  static u64 __read_mostly sample_period;
>
> The relationship between watchdog_running and watchdog_suspended hurts
> my brain a bit.  It appears that all watchdog_running transitions
> happen under watchdog_proc_mutex so I don't think it's racy, but I
> wonder if things would be simpler if these were folded into a single
> up/down counter.

The 'watchdog_running' variable indicates whether watchdog threads exist.
[Whether they have been launched via smpboot_register_percpu_thread().]

The 'watchdog_suspended' variable indicates whether existing threads are
currently parked, and whether multiple requests to suspend the watchdog
have been made by different callers.

I think folding them into one variable would not improve the readability
of the code, because instead of two variables with distinct semantics we
would then have one variable with multiple semantics (which I think would
also make code more complex). If we would want to improve the readability
I'd prefer to either just add a comment block to explain the semantics of
the variables or to rename them -for example- like:

  watchdog_running   to watchdog_threads_exist
  watchdog_suspended to watchdog_suspend_count

Please let me know if you would want any changes in this regard.

I also received these suggestions:

- rename the watchdog_{suspend|resume} functions as discussed in
  http://marc.info/?l=linux-kernel&m=143844050610220&w=2

- use pr_debug() instead of pr_info() as discussed in
  http://marc.info/?l=linux-kernel&m=143869949229461&w=2

Please let me know if you want me to post a 'version 2' of the patch set
or if you want me to post these changes as separate follow-up patches.

Regards,

Uli


Re: [PATCH 1/4] watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads()

2015-08-04 Thread Ulrich Obergfell
> - Original Message -
> From: "Michal Hocko" 
...
> On Sat 01-08-15 14:49:22, Ulrich Obergfell wrote:
>> These functions are intended to be used only from inside kernel/watchdog.c
>> to park/unpark all watchdog threads that are specified in watchdog_cpumask.
>
> I would suggest merging this into Patch2. It is usually better to add
> new functions along with their users.

Michal,

watchdog_{park|unpark}_threads are called by watchdog_{suspend|resume}
in Patch 2/4 and by update_watchdog_all_cpus() in Patch 3/4, so I would
have to merge three patches into one, and I would end up with one large
patch and one smaller patch. I think the result would be harder to read,
review and understand. Thus I'd prefer to leave the patches split as they
currently are.

Regards,

Uli


Re: [PATCH 4/4] watchdog: use suspend/resume interface in fixup_ht_bug()

2015-08-04 Thread Ulrich Obergfell
> - Original Message -
> From: "Don Zickus" 
...
> On Tue, Aug 04, 2015 at 03:31:30PM +0200, Michal Hocko wrote:
>> On Sat 01-08-15 14:49:25, Ulrich Obergfell wrote:
>> [...]
>> > @@ -3368,7 +3368,10 @@ static __init int fixup_ht_bug(void)
>> >return 0;
>> >}
>> >  
>> > -  watchdog_nmi_disable_all();
>> > +  if (watchdog_suspend() != 0) {
>> > +  pr_info("failed to disable PMU erratum BJ122, BV98, HSD29 
>> > workaround\n");
>> > +  return 0;
>> > +  }
>> 
>> Is this really worth reporting to the log? What is an admin supposed to
>> do about it?
>
> I think it was more for developers to aid in debugging a strange behaviour
> of the performance counters.
>
>> 
>> [looking into the code]
>> exited. Can this ever happen during this call path?
>
> It might be overkill, but it is just a harmless informational failure
> message.

Don, Michal,

the module prints a message if the workaround is enabled and if the
workaround is disabled. Hence, I think we should keep the messages
consistent and thus inform the user also if we fail to disable the
workaround. Even though at the moment this seems to be an unlikely
failure case, I agree with Don that the message could be useful in
debugging.

Regards,

Uli

