Re: bisected: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-23 Thread Meelis Roos
> >> Now this seems more relevant:
> >>
> >> mroos@rx100s2:~/linux$ nice git bisect good
> >> 24dea04767e6e5175f4750770281b0c17ac6a2fb is the first bad commit
> >> commit 24dea04767e6e5175f4750770281b0c17ac6a2fb
> >> Author: Daniel Borkmann 
> >> Date:   Fri May 4 01:08:23 2018 +0200
> >>
> >> bpf, x32: remove ld_abs/ld_ind
> >>
> >> Since LD_ABS/LD_IND instructions are now removed from the core and
> >> reimplemented through a combination of inlined BPF instructions and
> >> a slow-path helper, we can get rid of the complexity from x32 JIT.
> > 
> > This does seem much more likely than the previous bisection, given
> > that you ended up in an x86-32 specific commit (the subject says x32,
> > but that is a mistake). I also checked that systemd indeed does
> > call into bpf in a number of places, possibly for the journald socket.
> > 
> > OTOH, it's still hard to tell how that commit can have ended up
> > corrupting the clock read function in systemd. To cross-check,
> > could you try reverting that commit on the latest kernel and see
> > if it still works?
> 
> I would be curious as well about that whether revert would make it
> work. What's the value of sysctl net.core.bpf_jit_enable ? Does it
> change anything if you set it to 0 (only interpreter) or 1 (JIT
> enabled). Seems a bit strange to me that bisect ended at this commit
> given the issue you have. The JIT itself was also new in this window
> fwiw. In any case some more debug info would be great to have.

net.core.bpf_jit_enable is 1.

Since the problem breaks bootup, I cannot easily change the value at 
runtime (by then it would be after the fact). Do you mean changing the 
CONFIG_BPF_JIT_ALWAYS_ON=y option?

Anyway, I have started compiling v4.18-rc5, the latest version I tested, 
with the commit in question reverted. I will see if I can test it 
tomorrow morning. But I leave tomorrow for a week and can only test 
further kernels if they happen to boot fine (no manual reboot is 
possible for a week).
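
For reference, a minimal sketch of how the JIT setting could be inspected on a kernel that does boot. The function names are made up for illustration; the path is the usual procfs location of the sysctl, and the meaning of the values 0, 1 and 2 is taken from the kernel's sysctl documentation as I understand it.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from this thread.

# Translate a net.core.bpf_jit_enable value into words.
jit_mode_name() {
    case "$1" in
        0) echo "interpreter only" ;;
        1) echo "JIT enabled" ;;
        2) echo "JIT enabled, debug output to kernel log" ;;
        *) echo "unknown" ;;
    esac
}

# Print the current setting. The optional argument overrides the
# procfs path, which makes the helper easy to try on a plain file.
show_jit_mode() {
    path=${1:-/proc/sys/net/core/bpf_jit_enable}
    if [ -r "$path" ]; then
        jit_mode_name "$(cat "$path")"
    else
        echo "not available"
    fi
}
```

Calling show_jit_mode without arguments reads the live sysctl. With CONFIG_BPF_JIT_ALWAYS_ON=y the interpreter is compiled out, so I would expect a write of 0 to be rejected; that is an assumption worth verifying on the kernel in question.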

-- 
Meelis Roos (mr...@linux.ee)


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-20 Thread Meelis Roos
> > > Everything below here is 'bad', which can be an indication that you
> > > misclassified one of the commits above as 'good' when it should have
> > > been 'bad'. The most likely explanations are that you either typed the
> > > 'git bisect good' by accident, or that the failure is not 100%
> > > reliable, and it sometimes works fine even on a broken kernel.
> > > 
> > > 0bc5fe857274133ca0 follows directly after 3a443bd6dd7c, "net/9p: correct
> > > the variable name in v9fs_get_trans_by_name() comment", which is marked
> > > "good", and can't really be good if 0bc5fe85727413 is bad and you are
> > > not using the 'qed' driver.
> > > 
> > > I'd retest 3a443bd6dd7c again to see if that should have been 'bad', and
> > > if it was, test v4.17-rc4, which is what the net-next tree was based on.
> > 
> > Yes, the same prebuilt 3a443bd6dd7c appeared to be bad when retesting 
> > it. Building v4.17-rc4 now.
> 
> v4.17-rc4 seems good after 2 reboots.

The new bisect also seems to have led me to a strange commit. This time 
I tried to be careful and tested most kernels on two reboots before 
classifying them as good.

However, f4e3ec0d573e was suspicious: it failed to autoload e1000 but 
had no other errors. On both boots with this kernel, modprobe e1000 and 
ifup -a made the system work, so I assumed it was good, though it might 
not have been. I will try bisecting again with f4e3ec0d573e marked bad.
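
One way to redo a bisection with a changed verdict, sketched under assumptions: save the output of git bisect log to a file, rewrite the verdict for the suspect commit, drop every later step (they were chosen on the basis of the old verdict), and feed the result to git bisect replay. The helper name flip_to_bad and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a saved bisect log in which the
# commit starting with the given hash prefix is marked bad.
# Usage: flip_to_bad <logfile> <hash-prefix>
flip_to_bad() {
    awk -v sha="$2" '
        # Comment lines are informational only, so leave them out.
        /^#/ { next }
        # A "git bisect <verdict> <hash>" line for the commit to flip.
        $1 == "git" && $2 == "bisect" && index($4, sha) == 1 {
            print "git bisect bad " $4
            exit    # later steps depended on the old verdict
        }
        { print }
    ' "$1"
}
```

Possible usage, assuming the log was saved as bisect.log:

    flip_to_bad bisect.log f4e3ec0d573e > replay.log
    git bisect reset
    git bisect replay replay.log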

mroos@rx100s2:~/linux$ nice git bisect bad
9816dd35ececc095f3e3be29d30d3adc755908d9 is the first bad commit
commit 9816dd35ececc095f3e3be29d30d3adc755908d9
Author: Jakub Kicinski 
Date:   Thu May 3 18:37:12 2018 -0700

nfp: bpf: perf event output helpers support

Add support for the perf_event_output family of helpers.

The implementation on the NFP will not match the host code exactly.
The state of the host map and rings is unknown to the device, hence
device can't return errors when rings are not installed.  The device
simply packs the data into a firmware notification message and sends
it over to the host, returning success to the program.

There is no notion of a host CPU on the device when packets are being
processed.  Device will only offload programs which set BPF_F_CURRENT_CPU.
Still, if map index doesn't match CPU no error will be returned (see
above).

Dropped/lost firmware notification messages will not cause "lost
events" event on the perf ring, they are only visible via device
error counters.

Firmware notification messages may also get reordered in respect
to the packets which caused their generation.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Signed-off-by: Daniel Borkmann 

:040000 040000 00caca934fcbf1d5740a46d71e4d08e1f3ab8c7a 606c7bdd23e357f0902219630579c22a0ed0380c M	drivers
mroos@rx100s2:~/linux$ nice git bisect log
git bisect start
# bad: [3a443bd6dd7c43bf5763779309514bf3e7c1c3eb] net/9p: correct the variable name in v9fs_get_trans_by_name() comment
git bisect bad 3a443bd6dd7c43bf5763779309514bf3e7c1c3eb
# good: [75bc37fefc4471e718ba8e651aa74673d4e0a9eb] Linux 4.17-rc4
git bisect good 75bc37fefc4471e718ba8e651aa74673d4e0a9eb
# good: [1504269814263c9676b4605a6a91e14dc6ceac21] Merge tag 'linux-kselftest-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
git bisect good 1504269814263c9676b4605a6a91e14dc6ceac21
# skip: [c7d28c9df292a49904446dca15b2037ee8f874af] net: dsa: b53: Add support for reading PHY statistics
git bisect skip c7d28c9df292a49904446dca15b2037ee8f874af
# good: [173965fbfba596c02fa128966c2a33cb88afcd7f] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
git bisect good 173965fbfba596c02fa128966c2a33cb88afcd7f
# good: [795d8098d32b6bef3d0821588cb6e4b1f369a7a4] liquidio VF: indicate that disabling rx vlan offload is not allowed
git bisect good 795d8098d32b6bef3d0821588cb6e4b1f369a7a4
# good: [90278871d4b0da39c84fc9aa4929b0809dc7cf3c] Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
git bisect good 90278871d4b0da39c84fc9aa4929b0809dc7cf3c
# good: [4e1ec56cdc59746943b2acfab3c171b930187bbe] bpf: add skb_load_bytes_relative helper
git bisect good 4e1ec56cdc59746943b2acfab3c171b930187bbe
# good: [f4e3ec0d573e238f383b3da365127002579a07d6] bpf: replace map pointer loads before calling into offloads
git bisect good f4e3ec0d573e238f383b3da365127002579a07d6
# bad: [e94fa1d93117e7f1eb783dc9cae6c7065099] bpf, xskmap: fix crash in xsk_map_alloc error path handling
git bisect bad e94fa1d93117e7f1eb783dc9cae6c7065099
# bad: [e64d52569f6e847495091db40ab58d2d379748ef] tools: bpftool: move get_possible_cpus() to common code
git bisect bad e64d52569f6e847495091db40ab58d2d379748ef
# bad: [b4264c96b5cbc00c4c07deb9fbab928d43dffcf9] nfp: bpf: rewrite map pointers with NFP TIDs
git bisect bad b4264c96b5cbc00c4c07deb9fbab928d43dffcf9
# bad: 

Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-16 Thread Meelis Roos
> > Everything below here is 'bad', which can be an indication that you
> > misclassified one of the commits above as 'good' when it should have
> > been 'bad'. The most likely explanations are that you either typed the
> > 'git bisect good' by accident, or that the failure is not 100% reliable,
> > and it sometimes works fine even on a broken kernel.
> > 
> > 0bc5fe857274133ca0 follows directly after 3a443bd6dd7c, "net/9p: correct the
> > variable name in v9fs_get_trans_by_name() comment", which is marked "good",
> > and can't really be good if 0bc5fe85727413 is bad and you are not using the
> > 'qed' driver.
> > 
> > I'd retest 3a443bd6dd7c again to see if that should have been 'bad', and
> > if it was, test v4.17-rc4, which is what the net-next tree was based on.
> 
> Yes, the same prebuilt 3a443bd6dd7c appeared to be bad when retesting 
> it. Building v4.17-rc4 now.

v4.17-rc4 seems good after 2 reboots.

-- 
Meelis Roos (mr...@ut.ee)  http://www.cs.ut.ee/~mroos/


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-16 Thread Meelis Roos
> Everything below here is 'bad', which can be an indication that you
> misclassified one of the commits above as 'good' when it should have
> been 'bad'. The most likely explanations are that you either typed the
> 'git bisect good' by accident, or that the failure is not 100% reliable,
> and it sometimes works fine even on a broken kernel.
> 
> 0bc5fe857274133ca0 follows directly after 3a443bd6dd7c, "net/9p: correct the
> variable name in v9fs_get_trans_by_name() comment", which is marked "good",
> and can't really be good if 0bc5fe85727413 is bad and you are not using the
> 'qed' driver.
> 
> I'd retest 3a443bd6dd7c again to see if that should have been 'bad', and
> if it was, test v4.17-rc4, which is what the net-next tree was based on.

Yes, the same prebuilt 3a443bd6dd7c appeared to be bad when retesting 
it. Building v4.17-rc4 now.

-- 
Meelis Roos (mr...@linux.ee)


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-15 Thread Arnd Bergmann
On Sun, Jul 15, 2018 at 5:05 PM, Meelis Roos wrote:

>> > > I then tried multiple other machines. All x86-64 machines seem
>> > > unaffected, some x86-32 machines are affected (Athlon with AMD750
>> > > chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset),
>> > > some very similar x86-32 machines are unaffected. I have different
>> > > customized kernel configuration on them, so far I have not pinpointed
>> > > any configuration option to be at fault.
>> > >
>> > > All machines run Debian unstable.
>> > >
>> > > 4.17.0 was working fine.
>> > >
>> > > Will continue with bisecting between 4.17.0 and
>> > > 4.18.0-rc1-00023-g9ffc59d57228.
>
> Bisection has been finished (I'm usually away from the problematic
> computers in summer), result is strange and seems unrelated:
>
> 0bc5fe857274133ca028ebb15ff2e8549a369916 is the first bad commit
> commit 0bc5fe857274133ca028ebb15ff2e8549a369916
> Author: Sudarsana Reddy Kalluru 
> Date:   Sat May 5 18:42:59 2018 -0700
>
> qed*: Refactor mf_mode to consist of bits.

Agreed, that isn't the one you were looking for.

> `mf_mode' field indicates the multi-partitioning mode the device is
> configured to. This method doesn't scale very well, adding a new MF mode
> requires going over all the existing conditions, and deciding whether
> those are needed for the new mode or not.
> The patch defines a set of bit-fields for modes which are derived
> according to the mode info shared by the MFW and all the configuration
> would be made according to those. To add a new mode, there would be a
> single place where we'll need to go and choose which bits apply and
> which don't.
>
> Signed-off-by: Sudarsana Reddy Kalluru 
> Signed-off-by: Ariel Elior 
> Signed-off-by: David S. Miller 
>
> :040000 040000 a3572846e1afb9ccfa9c4a84b0135a0057ade66f bdb7b28725a4f1bffe79ee384a3603b3127d6fdb M	drivers
> :040000 040000 f90c7f26fd8445afa48c6679ed68fed294b23d7f 52119c547a82b268b5c173d3df94e267cc1297a0 M	include
> mroos@rx100s2:~/linux$ nice git bisect log
> git bisect start
> # good: [29dcea88779c856c7dc92040a0c01233263101d4] Linux 4.17
> git bisect good 29dcea88779c856c7dc92040a0c01233263101d4
> # good: [e27c49291a7fe9dc415c9fcab5bd781ec82dfe04] x86: Convert x86_platform_ops to timespec64
> git bisect good e27c49291a7fe9dc415c9fcab5bd781ec82dfe04
> # bad: [1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
> git bisect bad 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21
> # bad: [1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
> git bisect bad 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21
> # good: [135c5504a600ff9b06e321694fbcac78a9530cd4] Merge tag 'drm-next-2018-06-06-1' of git://anongit.freedesktop.org/drm/drm
> git bisect good 135c5504a600ff9b06e321694fbcac78a9530cd4
> # bad: [ffbc9197b4721634dc6c0fefa9b31e565fa89cee] wcn36xx: improve debug and error messages for SMD
> git bisect bad ffbc9197b4721634dc6c0fefa9b31e565fa89cee
> # good: [3a443bd6dd7c43bf5763779309514bf3e7c1c3eb] net/9p: correct the variable name in v9fs_get_trans_by_name() comment
> git bisect good 3a443bd6dd7c43bf5763779309514bf3e7c1c3eb
> # bad: [93c65d13d8a0b7c272868d4a9779f96fc973df26] vmxnet3: Replace msleep(1) with usleep_range()
> git bisect bad 93c65d13d8a0b7c272868d4a9779f96fc973df26
> # good: [4bc871984f7cb5b2dec3ae64b570cb02f9ce2227] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> git bisect good 4bc871984f7cb5b2dec3ae64b570cb02f9ce2227

Everything below here is 'bad', which can be an indication that you
misclassified one of the commits above as 'good' when it should have
been 'bad'. The most likely explanations are that you either typed the
'git bisect good' by accident, or that the failure is not 100% reliable,
and it sometimes works fine even on a broken kernel.

0bc5fe857274133ca0 follows directly after 3a443bd6dd7c, "net/9p: correct the
variable name in v9fs_get_trans_by_name() comment", which is marked "good",
and can't really be good if 0bc5fe85727413 is bad and you are not using the
'qed' driver.

I'd retest 3a443bd6dd7c again to see if that should have been 'bad', and
if it was, test v4.17-rc4, which is what the net-next tree was based on.
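
To guard against an unreliable failure, the verdict for each kernel could be based on several boots rather than one. A minimal sketch of such a rule follows; the helper name bisect_verdict, the "ok"/"fail" labels and the threshold of two boots are assumptions for illustration, not something git provides.

```shell
#!/bin/sh
# Hypothetical helper: derive a bisect verdict from several boot results.
# Each argument is "ok" or "fail" for one boot of the same kernel.
bisect_verdict() {
    min_boots=2            # assumed threshold; raise it for flakier failures
    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            echo "bad"     # a single failing boot is conclusive
            return
        fi
    done
    if [ "$#" -ge "$min_boots" ]; then
        echo "good"
    else
        echo "retest"      # too few clean boots to trust the result
    fi
}
```

For example, "bisect_verdict ok" asks for a retest, while "bisect_verdict ok ok" accepts the kernel as good. A failure that shows up only rarely would need a higher threshold.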

Arnd


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-15 Thread Meelis Roos
> > > I tried 4.18.0-rc1-00023-g9ffc59d57228 and now
> > > 4.18.0-rc3-00113-gfc36def997cf on a 32-bit server and then some other
> > > 32-bit machines, and got half-failed bootup - kernel and userspace come
> > > up but some services fail to start, including network and
> > > systemd-journald:
> > >
> > > systemd-journald[85]: Assertion 'clock_gettime(map_clock_id(clock_id),
> > > &ts) == 0' failed at ../src/basic/time-util.c:53, function now().
> > > Aborting.
> > >
> > > I then tried multiple other machines. All x86-64 machines seem
> > > unaffected, some x86-32 machines are affected (Athlon with AMD750
> > > chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset),
> > > some very similar x86-32 machines are unaffected. I have different
> > > customized kernel configuration on them, so far I have not pinpointed
> > > any configuration option to be at fault.
> > >
> > > All machines run Debian unstable.
> > >
> > > 4.17.0 was working fine.
> > >
> > > Will continue with bisecting between 4.17.0 and
> > > 4.18.0-rc1-00023-g9ffc59d57228.
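
A per-boot check for the symptom quoted above might look like the sketch below. It assumes the assertion message reaches the kernel log, as the format of the quoted line suggests; the function name boot_result is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: read log text on stdin and print "fail" if it
# contains the systemd-journald clock_gettime assertion, else "ok".
boot_result() {
    # -F matches the pattern as a fixed string, so "(" needs no escaping.
    if grep -qF "Assertion 'clock_gettime("; then
        echo "fail"
    else
        echo "ok"
    fi
}
```

Possible usage after each test boot: dmesg | boot_result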

Bisection has been finished (I'm usually away from the problematic 
computers in summer), result is strange and seems unrelated:

0bc5fe857274133ca028ebb15ff2e8549a369916 is the first bad commit
commit 0bc5fe857274133ca028ebb15ff2e8549a369916
Author: Sudarsana Reddy Kalluru 
Date:   Sat May 5 18:42:59 2018 -0700

qed*: Refactor mf_mode to consist of bits.

`mf_mode' field indicates the multi-partitioning mode the device is
configured to. This method doesn't scale very well, adding a new MF mode
requires going over all the existing conditions, and deciding whether those
are needed for the new mode or not.
The patch defines a set of bit-fields for modes which are derived according
to the mode info shared by the MFW and all the configuration would be made
according to those. To add a new mode, there would be a single place where
we'll need to go and choose which bits apply and which don't.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
Signed-off-by: David S. Miller 

:040000 040000 a3572846e1afb9ccfa9c4a84b0135a0057ade66f bdb7b28725a4f1bffe79ee384a3603b3127d6fdb M	drivers
:040000 040000 f90c7f26fd8445afa48c6679ed68fed294b23d7f 52119c547a82b268b5c173d3df94e267cc1297a0 M	include
mroos@rx100s2:~/linux$ nice git bisect log
git bisect start
# good: [29dcea88779c856c7dc92040a0c01233263101d4] Linux 4.17
git bisect good 29dcea88779c856c7dc92040a0c01233263101d4
# good: [e27c49291a7fe9dc415c9fcab5bd781ec82dfe04] x86: Convert x86_platform_ops to timespec64
git bisect good e27c49291a7fe9dc415c9fcab5bd781ec82dfe04
# bad: [1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21
# bad: [1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21
# good: [135c5504a600ff9b06e321694fbcac78a9530cd4] Merge tag 'drm-next-2018-06-06-1' of git://anongit.freedesktop.org/drm/drm
git bisect good 135c5504a600ff9b06e321694fbcac78a9530cd4
# bad: [ffbc9197b4721634dc6c0fefa9b31e565fa89cee] wcn36xx: improve debug and error messages for SMD
git bisect bad ffbc9197b4721634dc6c0fefa9b31e565fa89cee
# good: [3a443bd6dd7c43bf5763779309514bf3e7c1c3eb] net/9p: correct the variable name in v9fs_get_trans_by_name() comment
git bisect good 3a443bd6dd7c43bf5763779309514bf3e7c1c3eb
# bad: [93c65d13d8a0b7c272868d4a9779f96fc973df26] vmxnet3: Replace msleep(1) with usleep_range()
git bisect bad 93c65d13d8a0b7c272868d4a9779f96fc973df26
# good: [4bc871984f7cb5b2dec3ae64b570cb02f9ce2227] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 4bc871984f7cb5b2dec3ae64b570cb02f9ce2227
# bad: [38aa51c134b56b7ea61bea79b428c5fbcd95f285] net/mlx5e: Support offloaded TC flows with no matches on headers
git bisect bad 38aa51c134b56b7ea61bea79b428c5fbcd95f285
# bad: [00483690552c5fb6aa30bf3acb75b0ee89b4c0fd] tcp: Add mark for TIMEWAIT sockets
git bisect bad 00483690552c5fb6aa30bf3acb75b0ee89b4c0fd
# bad: [3e50d2da5850dd126b3e6a6e4387620d55b71db4] microchip_t1: Add driver for Microchip LAN87XX T1 PHYs
git bisect bad 3e50d2da5850dd126b3e6a6e4387620d55b71db4
# bad: [dac0490718bd17df5e3995ffca14255e5f9ed22d] bnxt_en: Check unsupported speeds in bnxt_update_link() on PF only.
git bisect bad dac0490718bd17df5e3995ffca14255e5f9ed22d
# bad: [9d4927f0d3760d8f10727c3035121d2677108f44] Merge branch 'ipv6-misc'
git bisect bad 9d4927f0d3760d8f10727c3035121d2677108f44
# bad: [cac6f691546b9efd50c31c0db97fe50d0357104a] qed: Add support for Unified Fabric Port.
git bisect bad cac6f691546b9efd50c31c0db97fe50d0357104a
# bad: [27bf96e32c92599dc7523b36d6c761fc8312c8c0] qed: Remove unused data member 'is_mf_default'.
git bisect bad 27bf96e32c92599dc7523b36d6c761fc8312c8c0
# bad: 

Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-10 Thread Pavel Machek
On Wed 2018-07-04 14:41:08, Meelis Roos wrote:
> I tried 4.18.0-rc1-00023-g9ffc59d57228 and now 
> 4.18.0-rc3-00113-gfc36def997cf on a 32-bit server and then some other 
> 32-bit machines, and got half-failed bootup - kernel and userspace come 
> up but some services fail to start, including network and 
> systemd-journald:
> 
> systemd-journald[85]: Assertion 'clock_gettime(map_clock_id(clock_id), &ts)
> == 0' failed at ../src/basic/time-util.c:53, function now(). Aborting.
> 
> I then tried multiple other machines. All x86-64 machines seem 
> unaffected, some x86-32 machines are affected (Athlon with AMD750 
> chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset), 
> some very similar x86-32 machines are unaffected. I have different 
> customized kernel configuration on them, so far I have not pinpointed 
> any configuration option to be at fault.
> 
> All machines run Debian unstable.
> 
> 4.17.0 was working fine.
> 
> Will continue with bisecting between 4.17.0 and 
> 4.18.0-rc1-00023-g9ffc59d57228.

Details of my tests (.config, dmesg, versions) can be found in

https://github.com/pavelmachek/missy/tree/master/db/notebook/lenovo/thinkpad/x60/pavel/2018.3648830947643

(and nearby directories).

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-10 Thread Pavel Machek
On Wed 2018-07-04 14:41:08, Meelis Roos wrote:
> I tried 4.18.0-rc1-00023-g9ffc59d57228 and now 
> 4.18.0-rc3-00113-gfc36def997cf on a 32-bit server and then some other 
> 32-bit machines, and got half-failed bootup - kernel and userspace come 
> up but some services fail to start, including network and 
> systemd-journald:
> 
> systemd-journald[85]: Assertion 'clock_gettime(map_clock_id(clock_id), &ts)
> == 0' failed at ../src/basic/time-util.c:53, function now(). Aborting.
> 
> I then tried multiple other machines. All x86-64 machines seem 
> unaffected, some x86-32 machines are affected (Athlon with AMD750 
> chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset), 
> some very similar x86-32 machines are unaffected. I have different 
> customized kernel configuration on them, so far I have not pinpointed 
> any configuration option to be at fault.
> 
> All machines run Debian unstable.
> 
> 4.17.0 was working fine.
> 
> Will continue with bisecting between 4.17.0 and 
> 4.18.0-rc1-00023-g9ffc59d57228.

I don't know if it helps you, but 4.18-rc4 seems to work okay for me
(and previous versions did, too) on a ThinkPad X60.

But I'm using an older Debian version.

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-05 Thread Arnd Bergmann
On Thu, Jul 5, 2018 at 11:54 AM, Meelis Roos wrote:
>> > I tried 4.18.0-rc1-00023-g9ffc59d57228 and now
>> > 4.18.0-rc3-00113-gfc36def997cf on a 32-bit server and then some other
>> > 32-bit machines, and got half-failed bootup - kernel and userspace come
>> > up but some services fail to start, including network and
>> > systemd-journald:
>> >
>> > systemd-journald[85]: Assertion 'clock_gettime(map_clock_id(clock_id),
>> > &ts) == 0' failed at ../src/basic/time-util.c:53, function now(). Aborting.
>> >
>> > I then tried multiple other machines. All x86-64 machines seem
>> > unaffected, some x86-32 machines are affected (Athlon with AMD750
>> > chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset),
>> > some very similar x86-32 machines are unaffected. I have different
>> > customized kernel configuration on them, so far I have not pinpointed
>> > any configuration option to be at fault.
>> >
>> > All machines run Debian unstable.
>> >
>> > 4.17.0 was working fine.
>> >
>> > Will continue with bisecting between 4.17.0 and
>> > 4.18.0-rc1-00023-g9ffc59d57228.
>>
>> That does sound like it is related to my patches indeed. If you are not
>> yet done bisecting, please checkout commit e27c49291a7f ("x86: Convert
>> x86_platform_ops to timespec64") before you try anything else, that
>> one is the top of the branch with my changes. If that fails, the bisection
>> will be much quicker.
>
> This commit was fine. So it's likely something else.

Ok, at least that's a relief for me, even if it didn't help you ;-)

I looked at the sources a bit and found that the assertion is triggered
in systemd whenever we try to read a clock that the kernel does not
provide. You have CONFIG_POSIX_TIMERS and
CONFIG_RTC_CLASS set, so all the normal clocks should be
operational, and I don't see anything unusual being passed into
clock_gettime() from systemd.

If you are able to find out what clock_id is passed in here, and what
the return code is, that might still lead to a solution more quickly
than continuing the bisection.

  Arnd
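
[Editorial sketch, not part of the original message.] One way to collect
the information requested above (which clock id fails, and with what
return code) would be a small probe run on an affected machine. The
following is only a minimal sketch: the helper names probe_clock() and
probe_all_clocks() are hypothetical, and the table of clock ids is an
assumption, since the thread does not say which id systemd actually
passed. The only real API used is clock_gettime().

```c
#define _GNU_SOURCE             /* expose the Linux-specific CLOCK_* ids */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not taken from systemd: returns 0 on success,
 * otherwise the errno value left behind by clock_gettime(). */
int probe_clock(clockid_t id, struct timespec *ts)
{
    assert(ts != NULL);
    if (clock_gettime(id, ts) == 0)
        return 0;
    return errno;
}

struct named_clock {
    const char *name;
    clockid_t id;
};

/* Print one line per clock in the table; returns how many of them failed.
 * The selection of clocks is an assumption made for illustration. */
int probe_all_clocks(FILE *out)
{
    static const struct named_clock clocks[] = {
        { "CLOCK_REALTIME", CLOCK_REALTIME },
        { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
#ifdef CLOCK_BOOTTIME
        { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
#endif
#ifdef CLOCK_REALTIME_ALARM
        { "CLOCK_REALTIME_ALARM", CLOCK_REALTIME_ALARM },
#endif
#ifdef CLOCK_BOOTTIME_ALARM
        { "CLOCK_BOOTTIME_ALARM", CLOCK_BOOTTIME_ALARM },
#endif
    };
    int failures = 0;

    for (size_t i = 0; i < sizeof(clocks) / sizeof(clocks[0]); i++) {
        struct timespec ts = { 0, 0 };
        int err = probe_clock(clocks[i].id, &ts);

        if (err == 0) {
            fprintf(out, "%-22s id=%d ok   %lld.%09ld\n",
                    clocks[i].name, (int)clocks[i].id,
                    (long long)ts.tv_sec, (long)ts.tv_nsec);
        } else {
            fprintf(out, "%-22s id=%d FAIL errno=%d (%s)\n",
                    clocks[i].name, (int)clocks[i].id, err, strerror(err));
            failures++;
        }
    }
    return failures;
}
```

Calling probe_all_clocks(stdout) from a main() built as a 32-bit binary
would print the clock id and errno for any clock the kernel refuses,
which is the pair of values asked for above. Running the existing
journald binary under strace would be an alternative that needs no new
code.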


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-05 Thread Meelis Roos
> > I tried 4.18.0-rc1-00023-g9ffc59d57228 and now
> > 4.18.0-rc3-00113-gfc36def997cf on a 32-bit server and then some other
> > 32-bit machines, and got half-failed bootup - kernel and userspace come
> > up but some services fail to start, including network and
> > systemd-journald:
> >
> > systemd-journald[85]: Assertion 'clock_gettime(map_clock_id(clock_id), &ts)
> > == 0' failed at ../src/basic/time-util.c:53, function now(). Aborting.
> >
> > I then tried multiple other machines. All x86-64 machines seem
> > unaffected, some x86-32 machines are affected (Athlon with AMD750
> > chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset),
> > some very similar x86-32 machines are unaffected. I have different
> > customized kernel configuration on them, so far I have not pinpointed
> > any configuration option to be at fault.
> >
> > All machines run Debian unstable.
> >
> > 4.17.0 was working fine.
> >
> > Will continue with bisecting between 4.17.0 and
> > 4.18.0-rc1-00023-g9ffc59d57228.
> 
> That does sound like it is related to my patches indeed. If you are not
> yet done bisecting, please checkout commit e27c49291a7f ("x86: Convert
> x86_platform_ops to timespec64") before you try anything else, that
> one is the top of the branch with my changes. If that fails, the bisection
> will be much quicker.

This commit was fine. So it's likely something else.

-- 
Meelis Roos (mr...@linux.ee)


Re: 4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-04 Thread Arnd Bergmann
On Wed, Jul 4, 2018 at 1:41 PM, Meelis Roos wrote:
> I tried 4.18.0-rc1-00023-g9ffc59d57228 and now
> 4.18.0-rc3-00113-gfc36def997cf on a 32-bit server and then some other
> 32-bit machines, and got half-failed bootup - kernel and userspace come
> up but some services fail to start, including network and
> systemd-journald:
>
> systemd-journald[85]: Assertion 'clock_gettime(map_clock_id(clock_id), &ts)
> == 0' failed at ../src/basic/time-util.c:53, function now(). Aborting.
>
> I then tried multiple other machines. All x86-64 machines seem
> unaffected, some x86-32 machines are affected (Athlon with AMD750
> chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset),
> some very similar x86-32 machines are unaffected. I have different
> customized kernel configuration on them, so far I have not pinpointed
> any configuration option to be at fault.
>
> All machines run Debian unstable.
>
> 4.17.0 was working fine.
>
> Will continue with bisecting between 4.17.0 and
> 4.18.0-rc1-00023-g9ffc59d57228.

That does sound like it is related to my patches indeed. If you are not
yet done bisecting, please checkout commit e27c49291a7f ("x86: Convert
x86_platform_ops to timespec64") before you try anything else, that
one is the top of the branch with my changes. If that fails, the bisection
will be much quicker. Unfortunately I don't see anything right away,
and haven't come across that bug in my own testing using Debian Stretch
in an x86-32 qemu.

  Arnd


4.18-rc* regression: x86-32 troubles (with timers?)

2018-07-04 Thread Meelis Roos
I tried 4.18.0-rc1-00023-g9ffc59d57228 and now 
4.18.0-rc3-00113-gfc36def997cf on a 32-bit server and then some other 
32-bit machines, and got half-failed bootup - kernel and userspace come 
up but some services fail to start, including network and 
systemd-journald:

systemd-journald[85]: Assertion 'clock_gettime(map_clock_id(clock_id), &ts) ==
0' failed at ../src/basic/time-util.c:53, function now(). Aborting.

I then tried multiple other machines. All x86-64 machines seem 
unaffected, some x86-32 machines are affected (Athlon with AMD750 
chipset, Fujitsu RX100-S2 with P4-3.4, and P4 with Intel 865 chipset), 
some very similar x86-32 machines are unaffected. I have different 
customized kernel configuration on them, so far I have not pinpointed 
any configuration option to be at fault.

All machines run Debian unstable.

4.17.0 was working fine.

Will continue with bisecting between 4.17.0 and 
4.18.0-rc1-00023-g9ffc59d57228.


[0.00] Linux version 4.18.0-rc3-00113-gfc36def997cf (mroos@rx100s2) 
(gcc version 7.3.0 (Debian 7.3.0-23)) #27 SMP Wed Jul 4 13:06:34 EEST 2018
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009afff] usable
[0.00] BIOS-e820: [mem 0x0009b000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000ca000-0x000cbfff] reserved
[0.00] BIOS-e820: [mem 0x000dc000-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x3ff6] usable
[0.00] BIOS-e820: [mem 0x3ff7-0x3ff79fff] ACPI data
[0.00] BIOS-e820: [mem 0x3ff7a000-0x3ff7] ACPI NVS
[0.00] BIOS-e820: [mem 0x3ff8-0x3fff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec0] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff80-0xffbf] reserved
[0.00] BIOS-e820: [mem 0xfc00-0x] reserved
[0.00] Notice: NX (Execute Disable) protection missing in CPU!
[0.00] SMBIOS 2.3 present.
[0.00] DMI: FUJITSU SIEMENS PRIMERGY RX100S2/D1571/M71IXG, BIOS 6.0 
Rev. C0F2.1571 04/27/2005
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] last_pfn = 0x3ff70 max_arch_pfn = 0x10
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C7FFF write-protect
[0.00]   C8000-D uncachable
[0.00]   E-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 0 mask FC000 write-back
[0.00]   1 base 03FF8 mask 8 uncachable
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] total RAM covered: 1023M
[0.00] Found optimal setting for mtrr clean up
[0.00]  gran_size: 64K  chunk_size: 1M  num_reg: 2  lose cover RAM: 
0G
[0.00] found SMP MP-table at [mem 0x000f6680-0x000f668f] mapped at 
[(ptrval)]
[0.00] initial memory mapped: [mem 0x-0x04ff]
[0.00] Base memory trampoline at [(ptrval)] 97000 size 16384
[0.00] BRK [0x04d97000, 0x04d97fff] PGTABLE
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F66B0 14 (v00 PTLTD )
[0.00] ACPI: RSDT 0x3FF75B79 38 (v01 PTLTDRSDT   
0604  LTP )
[0.00] ACPI: FACP 0x3FF79E69 74 (v01 INTEL  CANTWOOD 
0604 PTL  0003)
[0.00] ACPI: DSDT 0x3FF75BB1 0042B8 (v01 INTEL  CANTWOOD 
0604 MSFT 010B)
[0.00] ACPI: FACS 0x3FF7AFC0 40
[0.00] ACPI: SPCR 0x3FF79EDD 50 (v01 PTLTD  $UCRTBL$ 
0604 PTL  0001)
[0.00] ACPI: APIC 0x3FF79F2D 74 (v01 PTLTD  ? APIC   
0604  LTP )
[0.00] ACPI: BOOT 0x3FF79FA1 28 (v01 PTLTD  $SBFTBL$ 
0604  LTP 0001)
[0.00] ACPI: SSDT 0x3FF79FC9 37 (v01 PTLTD  ACPIHT   
0604  LTP 0001)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] 135MB HIGHMEM available.
[0.00] 887MB LOWMEM available.
[0.00]   mapped low ram: 0 - 377fe000
[0.00]   low ram: 0 - 377fe000
[0.00] tsc: Fast TSC calibration using PIT
[0.00] BRK [0x04d98000, 0x04d98fff] PGTABLE
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   
