Re: [Intel-wired-lan] VRRP not working on i40e X722 S2600WFT

2020-09-02 Thread Lennart Sorensen
On Mon, Aug 31, 2020 at 09:35:19PM -0400, Lennart Sorensen wrote:
> On Mon, Aug 31, 2020 at 10:35:12AM -0700, Jesse Brandeburg wrote:
> > Thanks for the report Lennart, I understand your frustration, as this
> > should probably work without user configuration.
> > 
> > However, please give this command a try:
> > ethtool --set-priv-flags ethX disable-source-pruning on
> 
> Hmm, our 4.9 kernel is just a touch too old to support that.  And yes
> that really should not require a flag to be set, given the card has no
> reason to ever do that pruning.  There is no justification you could
> have for doing it in the first place.

So backporting the patch that enabled that flag does allow it to work.
Of course there isn't a particularly good place to put an ethtool command
in the boot-up sequence to make sure it runs before vrrp is started.  This
has to be the default.  I know I wasted about a week trying things to get
this to work, and clearly lots of other people have wasted a ton of time
on this "feature" too (calling it a feature is clearly wrong; it is a bug).

By default the NIC should work as expected.  Any weird questionable
optimizations have to be turned on by the user explicitly when they
understand the consequences.  I can't find any use case documented
anywhere for this bug, I can only find things it has broken (like
apparently arp monitoring on bonding, and vrrp).

So who should make the patch to change this default?  Clearly the
current behaviour is harming and confusing more people than could
possibly be impacted by changing the current default setting for the
flag.  In fact I would just about be willing to bet there are no people
who want the current behaviour; after all, no other NIC does this, so
clearly there is no need for it to be done.
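For anyone stuck on this in the meantime, here is a sketch of one way to apply the flag before keepalived starts, using a systemd drop-in.  This is a configuration sketch under assumptions: the unit name `keepalived.service`, the interface name `eth0`, and the ethtool path are all illustrative and need adjusting for the actual system.

```shell
# Create a systemd drop-in so the priv flag is set before keepalived's
# own ExecStart runs (unit and interface names are assumptions).
mkdir -p /etc/systemd/system/keepalived.service.d
cat > /etc/systemd/system/keepalived.service.d/i40e-pruning.conf <<'EOF'
[Service]
# Leading '-' ignores failure on kernels/drivers that lack the
# disable-source-pruning private flag.
ExecStartPre=-/sbin/ethtool --set-priv-flags eth0 disable-source-pruning on
EOF
systemctl daemon-reload
```

A udev rule keyed on the interface coming up would work too; the point is just that the flag has to be applied before the vmac interface is created.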

-- 
Len Sorensen


Re: [Intel-wired-lan] VRRP not working on i40e X722 S2600WFT

2020-08-31 Thread Lennart Sorensen
On Mon, Aug 31, 2020 at 10:35:12AM -0700, Jesse Brandeburg wrote:
> Thanks for the report Lennart, I understand your frustration, as this
> should probably work without user configuration.
> 
> However, please give this command a try:
> ethtool --set-priv-flags ethX disable-source-pruning on

Hmm, our 4.9 kernel is just a touch too old to support that.  And yes
that really should not require a flag to be set, given the card has no
reason to ever do that pruning.  There is no justification you could
have for doing it in the first place.

-- 
Len Sorensen


Re: VRRP not working on i40e X722 S2600WFT

2020-08-28 Thread Lennart Sorensen
On Thu, Aug 27, 2020 at 02:30:39PM -0400, Lennart Sorensen wrote:
> I have hit a new problem with the X722 chipset (Intel R1304WFT server).
> VRRP simply does not work.
> 
> When keepalived registers a vmac interface, and starts transmitting
> multicast packets with the vrrp message, it never receives those packets
> from the peers, so all nodes think they are the master.  tcpdump shows
> transmits, but no receives.  If I stop keepalived, which deletes the
> vmac interface, then I start to receive the multicast packets from the
> other nodes.  Even in promisc mode, tcpdump can't see those packets.
> 
> So it seems the hardware is dropping all packets with a source mac that
> matches the source mac of the vmac interface, even when the destination
> is a multicast address that was subscribed to.  This is clearly not
> proper behaviour.
> 
> I tried a stock 5.8 kernel to check if a driver update helped, and updated
> the NVM firmware to the latest 4.10 (which appears to be over a year old),
> and nothing changes the behaviour at all.
> 
> Seems other people have hit this problem too:
> http://mails.dpdk.org/archives/users/2018-May/003128.html
> 
> Unless someone has a way to fix this, we will have to change away from
> this hardware very quickly.  The IPsec NAT RSS defect we could tolerate,
> although we didn't like it; this, however, is just unworkable.
> 
> Quite frustrated by this.  Intel network hardware was always great;
> how did the X722 make it out in this state?

Another case with the same problem on an X710:

https://www.talkend.net/post/13256.html

-- 
Len Sorensen


VRRP not working on i40e X722 S2600WFT

2020-08-27 Thread Lennart Sorensen
I have hit a new problem with the X722 chipset (Intel R1304WFT server).
VRRP simply does not work.

When keepalived registers a vmac interface, and starts transmitting
multicast packets with the vrrp message, it never receives those packets
from the peers, so all nodes think they are the master.  tcpdump shows
transmits, but no receives.  If I stop keepalived, which deletes the
vmac interface, then I start to receive the multicast packets from the
other nodes.  Even in promisc mode, tcpdump can't see those packets.

So it seems the hardware is dropping all packets with a source mac that
matches the source mac of the vmac interface, even when the destination
is a multicast address that was subscribed to.  This is clearly not
proper behaviour.
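For reference, the vmac setup described above corresponds to something like the minimal keepalived configuration below.  The interface name, VRID, priority, and address are made-up illustrations, not my actual config; with use_vmac, keepalived creates a macvlan interface whose MAC is the standard VRRP 00:00:5e:00:01:&lt;VRID&gt; address, which is exactly the source MAC the X722 then prunes on receive.

```
vrrp_instance VI_1 {
    state BACKUP
    interface eth2          # physical NIC (illustrative name)
    virtual_router_id 51    # vmac MAC becomes 00:00:5e:00:01:33 (0x33 = 51)
    priority 100
    use_vmac                # create the vrrp.51 macvlan vmac interface
    virtual_ipaddress {
        192.0.2.10/24
    }
}
```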

I tried a stock 5.8 kernel to check if a driver update helped, and updated
the NVM firmware to the latest 4.10 (which appears to be over a year old),
and nothing changes the behaviour at all.

Seems other people have hit this problem too:
http://mails.dpdk.org/archives/users/2018-May/003128.html

Unless someone has a way to fix this, we will have to change away from
this hardware very quickly.  The IPsec NAT RSS defect we could tolerate,
although we didn't like it; this, however, is just unworkable.

Quite frustrated by this.  Intel network hardware was always great;
how did the X722 make it out in this state?

-- 
Len Sorensen


Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

2019-05-21 Thread Lennart Sorensen
On Tue, May 21, 2019 at 09:51:33AM -0700, Alexander Duyck wrote:
> I think we need to narrow this down a bit more. Let's try forcing the
> lookup table all to one value and see if traffic is still going to
> queue 0.
> 
> Specifically what we need to is run the following command to try and
> force all RSS traffic to queue 8, you can verify the result with
> "ethtool -x":
> ethtool -X ethX weight 0 0 0 0 0 0 0 0 1
> 
> If that works and the IPSec traffic goes to queue 8 then we are likely
> looking at some sort of input issue, either in the parsing or the
> population of things like the input mask that we can then debug
> further.
> 
> If traffic still goes to queue 0 then that tells us the output of the
> RSS hash and lookup table are being ignored, this would imply either
> some other filter is rerouting the traffic or is directing us to limit
> the queue index to 0 bits.

# ethtool -x eth2
RX flow hash indirection table for eth2 with 12 RX ring(s):
    0:  7 7 7 7 7 7 7 7
    8:  7 7 7 7 7 7 7 7
   16:  7 7 7 7 7 7 7 7
   24:  7 7 7 7 7 7 7 7
   32:  7 7 7 7 7 7 7 7
...
  472:  7 7 7 7 7 7 7 7
  480:  7 7 7 7 7 7 7 7
  488:  7 7 7 7 7 7 7 7
  496:  7 7 7 7 7 7 7 7
  504:  7 7 7 7 7 7 7 7
RSS hash key:
0b:1f:ae:ed:60:04:7d:e5:8a:2b:43:3f:1d:ee:6c:99:89:29:94:b0:25:db:c7:4b:fa:da:4d:3f:e8:cc:bc:00:ad:32:01:d6:1c:30:3f:f8:79:3e:f4:48:04:1f:51:d2:5a:39:f0:90
root@ECA:~# ethtool --show-priv-flags eth2
Private flags for eth2:
MFP  : off
LinkPolling  : off
flow-director-atr: off
veb-stats: off
hw-atr-eviction  : on
legacy-rx: off

All ipsec packets are still hitting queue 0.

Seems it is completely ignoring RSS for these packets.  That is
impressively weird.
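For anyone reproducing this, the per-queue distribution can be watched from the per-queue rx packet counters.  The exact stat names vary by driver; the grep pattern below matches the i40e-style `rx-N.packets` naming, which is an assumption to adjust to what `ethtool -S` actually prints on the system.

```shell
# Snapshot per-queue rx packet counters, let traffic flow, snapshot
# again, and diff to see which queues actually received packets.
ethtool -S eth2 | grep -E 'rx-[0-9]+\.(rx_)?packets' > /tmp/q.before
sleep 10   # let some IPsec traffic flow
ethtool -S eth2 | grep -E 'rx-[0-9]+\.(rx_)?packets' > /tmp/q.after
diff /tmp/q.before /tmp/q.after
```

If only the queue 0 counter moves while the indirection table points everywhere else, RSS is being bypassed for that traffic.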

-- 
Len Sorensen


Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

2019-05-21 Thread Lennart Sorensen
On Fri, May 17, 2019 at 03:20:02PM -0700, Alexander Duyck wrote:
> I was hoping it would work too. It seemed like it should have been the
> answer since it definitely didn't seem right. Now it has me wondering
> about some of the other code in the driver.
> 
> By any chance have you run anything like DPDK on any of the X722
> interfaces on this system recently? I ask because it occurs to me that
> if you had and it loaded something like a custom parsing profile it
> could cause issues similar to this.

I have never used DPDK on anything.  I was hoping never to do so. :)

This system has so far booted Debian (with a 4.19 kernel) and our own OS
(which has a 4.9 kernel).

> A debugging step you might try would be to revert back to my earlier
> patch that only displayed the input mask instead of changing it. Once
> you have done that you could look at doing a full power cycle on the
> system by either physically disconnecting the power, or using the
> power switch on the power supply itself if one is available. It is
> necessary to disconnect the motherboard/NIC from power in order to
> fully clear the global state stored in the device as it is retained
> when the system is in standby.
> 
> What I want to verify is whether the input mask that we have run into is
> the natural power-on input mask or if that is something that was
> overridden by something else. The mask change I made should be reset
> if the system loses power, and then it will either default back to the
> value with the 6's if that is its natural state, or it will match
> what I had if it was not.
> 
> Other than that I really can't think up too much else. I suppose there
> is the possibility of the NVM either setting up a DCB setting or
> HREGION register causing an override that is limiting the queues to 1.
> However, the likelihood of that should be really low.

Here is the register dump after a full power off:

i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.7-k
i40e: Copyright (c) 2013 - 2014 Intel Corporation.
i40e 0000:3d:00.0: fw 3.10.52896 api 1.6 nvm 4.00 0x80001577 1.1767.0
i40e 0000:3d:00.0: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.
i40e 0000:3d:00.0: MAC address: a4:bf:01:4e:0c:87
i40e 0000:3d:00.0: flow_type: 63 input_mask:0x4000
i40e 0000:3d:00.0: flow_type: 46 input_mask:0x0007fff8
i40e 0000:3d:00.0: flow_type: 45 input_mask:0x0007fff8
i40e 0000:3d:00.0: flow_type: 44 input_mask:0x00078000
i40e 0000:3d:00.0: flow_type: 43 input_mask:0x0007fffe
i40e 0000:3d:00.0: flow_type: 42 input_mask:0x0007fffe
i40e 0000:3d:00.0: flow_type: 41 input_mask:0x0007fffe
i40e 0000:3d:00.0: flow_type: 40 input_mask:0x0007fffe
i40e 0000:3d:00.0: flow_type: 39 input_mask:0x0007fffe
i40e 0000:3d:00.0: flow_type: 36 input_mask:0x00060600
i40e 0000:3d:00.0: flow_type: 35 input_mask:0x00060600
i40e 0000:3d:00.0: flow_type: 34 input_mask:0x000606078000
i40e 0000:3d:00.0: flow_type: 33 input_mask:0x00060606
i40e 0000:3d:00.0: flow_type: 32 input_mask:0x00060606
i40e 0000:3d:00.0: flow_type: 31 input_mask:0x00060606
i40e 0000:3d:00.0: flow_type: 30 input_mask:0x00060606
i40e 0000:3d:00.0: flow_type: 29 input_mask:0x00060606
i40e 0000:3d:00.0: Features: PF-id[0] VSIs: 34 QP: 12 TXQ: 13 RSS VxLAN Geneve VEPA
i40e 0000:3d:00.1: fw 3.10.52896 api 1.6 nvm 4.00 0x80001577 1.1767.0
i40e 0000:3d:00.1: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.
i40e 0000:3d:00.1: MAC address: a4:bf:01:4e:0c:88
i40e 0000:3d:00.1: flow_type: 63 input_mask:0x4000
i40e 0000:3d:00.1: flow_type: 46 input_mask:0x0007fff8
i40e 0000:3d:00.1: flow_type: 45 input_mask:0x0007fff8
i40e 0000:3d:00.1: flow_type: 44 input_mask:0x00078000
i40e 0000:3d:00.1: flow_type: 43 input_mask:0x0007fffe
i40e 0000:3d:00.1: flow_type: 42 input_mask:0x0007fffe
i40e 0000:3d:00.1: flow_type: 41 input_mask:0x0007fffe
i40e 0000:3d:00.1: flow_type: 40 input_mask:0x0007fffe
i40e 0000:3d:00.1: flow_type: 39 input_mask:0x0007fffe
i40e 0000:3d:00.1: flow_type: 36 input_mask:0x00060600
i40e 0000:3d:00.1: flow_type: 35 input_mask:0x00060600
i40e 0000:3d:00.1: flow_type: 34 input_mask:0x000606078000
i40e 0000:3d:00.1: flow_type: 33 input_mask:0x00060606
i40e 0000:3d:00.1: flow_type: 32 input_mask:0x00060606
i40e 0000:3d:00.1: flow_type: 31 input_mask:0x00060606
i40e 0000:3d:00.1: flow_type: 30 input_mask:0x00060606
i40e 0000:3d:00.1: flow_type: 29 input_mask:0x00060606
i40e 0000:3d:00.1: Features: PF-id[1] VSIs: 34 QP: 12 TXQ: 13 RSS VxLAN Geneve VEPA
i40e 0000:3d:00.1 eth2: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None


Re: [PATCH 3.18 052/144] ARM: dra7xx: Fix counter frequency drift for AM572x errata i856

2018-11-09 Thread Lennart Sorensen
On Thu, Nov 08, 2018 at 01:50:23PM -0800, Greg Kroah-Hartman wrote:
> 3.18-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> [ Upstream commit afc9d590b8a150cfeaac0078ef5de6fb21a5ea6a ]
> 
> Errata i856 for the AM572x (DRA7xx) points out that the 32.768KHz external
> crystal is not enabled at power up.  Instead the CPU falls back to using
> an emulation for the 32KHz clock which is SYSCLK1/610.  SYSCLK1 is usually
> 20MHz on boards so far (which gives an emulated frequency of 32.786KHz),
> but can also be 19.2 or 27MHz which result in much larger drift.
> 
> Since this is used to drive the master counter at 32.768KHz * 375 /
> 2 = 6.144MHz, the emulated speed for 20MHz is off by 570ppm, or about 43
> seconds per day, which is more than the 500ppm NTP is able to tolerate.
> 
> Checking the CTRL_CORE_BOOTSTRAP register can determine if the CPU
> is using the real 32.768KHz crystal or the emulated SYSCLK1/610, and
> by knowing that, the real counter frequency can be determined and used.
> The real speed is then SYSCLK1 / 610 * 375 / 2 or SYSCLK1 * 75 / 244.

I certainly can't see a problem including it.  Of course I doubt there
really are any DRA7 systems running that old a kernel (certainly the
beaglebone X15 seems to require a newer kernel to even have a DTB
included).  The RuggedCom RX1400 might still be running a 3.14 kernel,
but it already has the patch applied (because I did it).

So certainly nothing wrong with adding the patch, but I doubt anyone
will notice.
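As a sanity check of the numbers in the quoted commit message, the drift of the emulated clock can be recomputed with a one-liner (pure arithmetic on the figures quoted above, no hardware involved):

```shell
# 20 MHz SYSCLK1 / 610 vs a real 32.768 kHz crystal: drift in ppm.
awk 'BEGIN {
    emulated = 20000000 / 610
    ppm = (emulated - 32768) / 32768 * 1e6
    printf "emulated = %.2f Hz, drift = %.0f ppm\n", emulated, ppm
}'
```

This lands around the ~570 ppm figure in the commit message (about 576 ppm with these exact numbers), well past the 500 ppm NTP can tolerate.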

-- 
Len Sorensen


Re: 4.9.80 compile failure with X86_VSYSCALL_EMULATION=n due to 9a0be5af

2018-03-21 Thread Lennart Sorensen
On Wed, Mar 21, 2018 at 04:31:03PM +0000, Ben Hutchings wrote:
> On Wed, 2018-03-21 at 17:04 +0100, Greg Kroah-Hartman wrote:
> > On Mon, Feb 12, 2018 at 10:43:25AM -0500, Lennart Sorensen wrote:
> > > Commit 9a0be5af added a reference to vsyscall_pgprot in
> > > arch/x86/mm/kaiser.c but that is undefined if X86_VSYSCALL_EMULATION=n
> > > which on an embedded system where you know how all your software is
> > > compiled is quite likely.
> > > 
> > > Of course the condition is always false with that config so the code
> > > will never be run, but the compiler is unhappy.
> > 
> > What is the compiler error?  I have not had any reports of this with all
> > of the varied builds that we currently run on the stable trees.
> 
> You already applied a fix for this ("kaiser: fix compile error without
> vsyscall").

Yeah the fix certainly works fine.  I was not the only one reporting it.

-- 
Len Sorensen


Re: ext4 ignoring rootfs default mount options

2018-03-15 Thread Lennart Sorensen
On Wed, Mar 07, 2018 at 05:50:43PM -0500, Theodore Y. Ts'o wrote:
> Yes, this is a shortcoming in tune2fs.  You can set extended mount
> options using debugfs:
> 
> debugfs -w -R "set_super_value mount_opts foo,bar" /dev/sda1
> 
> ... but there ought to be some way to support some kind of quoting
> mechanism so that tune2fs can understand when a comma is part of an
> extended option value, as opposed to separating extended options.
> 
> Extended options haven't been used much, so it's not been something
> that has gotten a lot of polish.  Backing up for a bit, is there a
> reason why you need so many mount options when mounting the root file
> system?  Specifically, why do you want to turn off delayed allocation?

I completely don't trust it.  We also don't do much writing, so there is
no real benefit.  Having seen machines get killed by the watchdog and
come back with blank chunks in log files is not acceptable.  Getting all
user space software fixed to actually sync properly in all the right
places takes forever.

Now for my home machine (especially my mythtv box) it is absolutely on,
and a great thing to have.

-- 
Len Sorensen


Re: ext4 ignoring rootfs default mount options

2018-03-07 Thread Lennart Sorensen
On Wed, Mar 07, 2018 at 11:08:56AM -0500, Theodore Y. Ts'o wrote:
> This is where it's critical to understand that the "tune2fs -o" changes
> the *default* mount options.  The key in the comment is that anything
> different from the *filesystem* defaults (that is, the defaults of the
> particular ext4 file system, as opposed to the global defaults).  The
> idea is that /proc/mounts, and /etc/mtab shows the options string that
> if used in /etc/fstab, or in the mount command line, will replicate
> the current mount option state.  Furthermore, that /proc/mounts is the
> minimal set of mount option strings.

One more question about this.

Trying to use tune2fs -E mount_opts to set some default options, and
can't figure out how to enter two options at once.

Doing:

tune2fs -E mount_opts=nodelalloc,nouser_xattr /dev/sda3

gives the error:

tune2fs 1.43.4 (31-Jan-2017)

Bad options specified.

Extended options are separated by commas, and may take an argument which
is set off by an equals ('=') sign.

Valid extended options are:
clear_mmp
hash_alg=
mount_opts=
stride=
stripe_width=
test_fs
^test_fs

Apparently it gets confused by the ',' in the argument to mount_opts.

Unfortunately ',' happens to be the separator required by ext4 for that
field.  So how does one use it?

Sure, in this case I can set one with -o and the other with -E, but in
general there seems to be a small problem here, probably only in user
space though.  It seems tune2fs needs some change in how it deals with
extended options that contain commas.
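Until tune2fs learns some quoting, the debugfs command Ted showed earlier sidesteps the parser, since the whole option string is a single argument there (the device name below is just the one from my example):

```shell
# Store both options in s_mount_opts in one go via debugfs; the embedded
# comma survives because debugfs does not re-split the argument.
debugfs -w -R 'set_super_value mount_opts nodelalloc,nouser_xattr' /dev/sda3
# The stored string should show up in the superblock header dump
# (exact label may vary between e2fsprogs versions).
dumpe2fs -h /dev/sda3 | grep -i 'mount opt'
```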

-- 
Len Sorensen


Re: ext4 ignoring rootfs default mount options

2018-03-07 Thread Lennart Sorensen
On Wed, Mar 07, 2018 at 11:08:56AM -0500, Theodore Y. Ts'o wrote:
> This is where it's critcal to understand that the "tune2fs -o" changes
> the *default* mount options.  The key in the comment is the anything
> different from the *filesystem* defaults (that is, the defaults of the
> particular ext4 file system, as opposed to the global defaults).  The
> idea is that /proc/mounts, and /etc/mtab shows the options string that
> if used in /etc/fstab, or in the mount command line, will replicate
> the current mount option state.  Furthermore, that /proc/mounts is the
> minimal set of mount option strings.
> 
> You may not like the behavior, but it's been this way forever, and the
> reasoning behind it is that the low-level file system code doesn't
> really know what the actual mount option string that would be in
> /etc/fstab or in the /sbin/mount command line.  That's because
> /sbin/mount command actually parses the mount options, and things like
> "ro" is actually translated into bitflag passed to the mount option.
> So for example, it's impossible to know whether "rw" was in the
> user-specified mount string, since we never see it by the time the
> string gets sent to ext4_fill_super (in fact the kernel never sees
> it).  So when we try to make /proc/mounts a replacement for /etc/mtab
> (since some people make /etc/mtab as symlink /proc/mounts), it's
> actually impossible.  Trying to make it be the minimal set of options
> was at least a consitent thing.  That is, if you use "tune2fs -o
> nodelalloc", it's not necessary to put nodelalloc in /etc/fstab or in
> the mount command line.  And hence, it should not be in /proc/mounts.

OK, that makes sense.  Thanks.  I will work on convincing myself this
is how it should be.

> As far as where ext4_seq_options_show() gets called, it's because we
> have a fair amount of macro shortcuts in fs/ext4/sysfs.c (which is
> where we put all of the pseudo file system support for ext4, which
> means it includes procfs).  Search for macros PROC_FILE_SHOW_DEFN and
> PROC_FILE_LIST.

Oh that's where it is.

-- 
Len Sorensen


Re: ext4 ignoring rootfs default mount options

2018-03-07 Thread Lennart Sorensen
On Tue, Mar 06, 2018 at 11:06:08PM -0500, Theodore Y. Ts'o wrote:
> On Tue, Mar 06, 2018 at 02:03:15PM -0500, Lennart Sorensen wrote:
> > While switching a system from using ext3 to ext4 (It's about time) I
> > discovered that setting default options for the filesystem using tune2fs
> > -o doesn't work for the root filesystem when mounted by the kernel itself.
> > Filesystems mounted from userspace with the mount command use the options
> > set just fine.  The extended option set with tune2fs -E mount_opts=
> > works fine however.
> 
> Well it's not that it's being ignored.  It's just a
> misunderstanding of how a few things work.  It's also that how we
> handled mount options has evolved over time, leading to a situation
> which is confusing.
> 
> First, tune2fs changes the default of ext4's mount options.  This is
> stated in the tune2fs man page:
> 
> -o [^]mount-option[,...]
>   Set or clear the indicated default mount options in the filesys‐
>   tem.   Default  mount options can be overridden by mount options
>   specified either in /etc/fstab(5) or on the command  line  argu‐
>   ments  to mount(8).  Older kernels may not support this feature;
>   in particular, kernels which predate  2.4.20  will  almost  cer‐
>   tainly ignore the default mount options field in the superblock.
> 
> Secondly, the message when a file system is mounted, e.g.:
> 
> > EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
> 
> ... is the mount option string that is passed to the mount system
> call.
> 
> The extended mount options field is different.  It was something that we
> added later.  If it is present, the extended mount options are printed
> first, followed by a semi-colon, followed by the string passed to
> the mount system call.
> 
> Hence:
> 
> > tune2fs -E mount_opts=nodelalloc /dev/sda1
> > 
> > at boot we got:
> > EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: 
> > nodelalloc; (null)
> 
> 
> The description of -E option in the tune2fs man page talks about some
> of this, but it's arguably confusing.
> 
> You can see exactly what mount options that are active by looking at
> the file /proc/fs/ext4/<dev>/options.  So this is how you can prove to
> yourself that tune2fs -o works.

OK that does in fact seem to be the case.  That's good.

> root@kvm-xfstests:~# dmesg -n 7
> root@kvm-xfstests:~# tune2fs -o nodelalloc /dev/vdc 
> tune2fs 1.44-WIP (06-Sep-2017)
> root@kvm-xfstests:~# mount /dev/vdc /vdc
> [   27.389192] EXT4-fs (vdc): mounted filesystem with ordered data mode. 
> Opts: (null)
> root@kvm-xfstests:~# cat /proc/fs/ext4/vdc/options 
> rw
> bsddf
> nogrpid
> block_validity
> dioread_lock
> nodiscard
> nodelalloc
> journal_checksum
> barrier
> auto_da_alloc
> user_xattr
> acl
> noquota
> resuid=0
> resgid=0
> errors=continue
> commit=5
> min_batch_time=0
> max_batch_time=15000
> stripe=0
> data=ordered
> inode_readahead_blks=32
> init_itable=10
> max_dir_size_kb=0
> 
> > For filesystems mounted from userspace with the mount command, either
> > method works however.  The first option however is what the comment in
> > fs/ext4/super.c suggests to use.
> > 
> > Of course I also got the messages:
> > EXT4-fs (sda1): Mount option "nodelalloc" incompatible with ext3
> > EXT4-fs (sda1): failed to parse options in superblock: nodelalloc
> > EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities
> 
> So what's happening here is something that has recently started
> getting reported by users.  Most modern distros use an initial
> ramdisk to mount the root file system, and they use blkid to determine
> the right file system type.  Here, though, the kernel itself is
> mounting the root file system.  An indication that this is what's
> happening is the following message in dmesg:
> 
> [2.196149] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
> 
> This message means the kernel fallback code was used to mount the file
> system, not the initial ramdisk code in userspace.
> 
> If you are using the kernel fallback code, it will first try to mount
> the file system as ext3, and if you have "nodelalloc" in the extended
> mount options in the superblock, it will try it first.  The messages
> you have quoted above are harmless.  But they are scaring users, so we
> are looking into ways to suppress them.
> 
> > And of course the last annoying thing I noticed is that /proc/mounts
> > doesn't actually tell you that nodelalloc is active when it is set
> > from the defaul

Re: ext4 ignoring rootfs default mount options

2018-03-07 Thread Lennart Sorensen
On Tue, Mar 06, 2018 at 11:06:08PM -0500, Theodore Y. Ts'o wrote:
> On Tue, Mar 06, 2018 at 02:03:15PM -0500, Lennart Sorensen wrote:
> > While switching a system from using ext3 to ext4 (It's about time) I
> > discovered that setting default options for the filesystem using tune2fs
> > -o doesn't work for the root filesystem when mounted by the kernel itself.
> > Filesystems mounted from userspace with the mount command use the options
> > set just fine.  The extended option set with tune2fs -E mount_opts=
> > works fine however.
> 
> Well it's not that it's being ignored.  It's just a
> misunderstanding of how a few things work.  It's also that the way we
> handle mount options has evolved over time, leading to a situation
> which is confusing.
> 
> First, tune2fs changes the default of ext4's mount options.  This is
> stated in the tune2fs man page:
> 
> -o [^]mount-option[,...]
>   Set or clear the indicated default mount options in the filesys‐
>   tem.   Default  mount options can be overridden by mount options
>   specified either in /etc/fstab(5) or on the command  line  argu‐
>   ments  to mount(8).  Older kernels may not support this feature;
>   in particular, kernels which predate  2.4.20  will  almost  cer‐
>   tainly ignore the default mount options field in the superblock.
> 
> Secondly, the message when a file system is mounted, e.g.:
> 
> > EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
> 
> ... is the mount option string that is passed to the mount system
> call.
> 
> The extended mount options are different.  They were something that we
> added later.  If present, the extended mount options are
> printed first, followed by a semi-colon, followed by the string passed to
> the mount system call.
> 
> Hence:
> 
> > tune2fs -E mount_opts=nodelalloc /dev/sda1
> > 
> > at boot we got:
> > EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: 
> > nodelalloc; (null)
> 
> 
> The description of -E option in the tune2fs man page talks about some
> of this, but it's arguably confusing.
> 
> You can see exactly which mount options are active by looking at
> the file /proc/fs/ext4/<dev>/options.  So this is how you can prove to
> yourself that tune2fs -o works.

OK that does in fact seem to be the case.  That's good.

> root@kvm-xfstests:~# dmesg -n 7
> root@kvm-xfstests:~# tune2fs -o nodelalloc /dev/vdc 
> tune2fs 1.44-WIP (06-Sep-2017)
> root@kvm-xfstests:~# mount /dev/vdc /vdc
> [   27.389192] EXT4-fs (vdc): mounted filesystem with ordered data mode. 
> Opts: (null)
> root@kvm-xfstests:~# cat /proc/fs/ext4/vdc/options 
> rw
> bsddf
> nogrpid
> block_validity
> dioread_lock
> nodiscard
> nodelalloc
> journal_checksum
> barrier
> auto_da_alloc
> user_xattr
> acl
> noquota
> resuid=0
> resgid=0
> errors=continue
> commit=5
> min_batch_time=0
> max_batch_time=15000
> stripe=0
> data=ordered
> inode_readahead_blks=32
> init_itable=10
> max_dir_size_kb=0
> 
> > For filesystems mounted from userspace with the mount command, either
> > method works however.  The first option however is what the comment in
> > fs/ext4/super.c suggests to use.
> > 
> > Of course I also got the messages:
> > EXT4-fs (sda1): Mount option "nodelalloc" incompatible with ext3
> > EXT4-fs (sda1): failed to parse options in superblock: nodelalloc
> > EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities
> 
> So what's happening here is something that has recently started
> getting reported by users.  Most modern distros use an initial
> ramdisk to mount the root file system, and they use blkid to determine
> the file system type so the right type gets used.  These messages
> appear when the kernel itself is mounting the root file system
> instead.  An indication that this is what's happening is the following
> message in dmesg:
> 
> [2.196149] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
> 
> This message means the kernel fallback code was used to mount the file
> system, not the initial ramdisk code in userspace.
> 
> If you are using the kernel fallback code, it will first try to mount
> the file system as ext3, and if you have "nodelalloc" in the extended
> mount options in the superblock, it will try it first.  The messages
> you have quoted above are harmless.  But they are scaring users, so we
> are looking into ways to suppress them.
> 
> > And of course the last annoying thing I noticed is that /proc/mounts
> > doesn't actually tell you that nodelalloc is active when it is set
> > from the defaul

ext4 ignoring rootfs default mount options

2018-03-06 Thread Lennart Sorensen
While switching a system from using ext3 to ext4 (It's about time) I
discovered that setting default options for the filesystem using tune2fs
-o doesn't work for the root filesystem when mounted by the kernel itself.
Filesystems mounted from userspace with the mount command use the options
set just fine.  The extended option set with tune2fs -E mount_opts=
works fine however.  I am sure it works fine for those using an initrd
(and hence why almost no one would ever see this bug) since that uses the
mount command from userspace to mount the rootfs.

Specifically we did:

tune2fs -o nodelalloc /dev/sda1

at boot we got:
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)

Instead using:
tune2fs -E mount_opts=nodelalloc /dev/sda1

at boot we got:
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: nodelalloc; 
(null)

which seems better.

For filesystems mounted from userspace with the mount command, either
method works.  The first option, however, is what the comment in
fs/ext4/super.c suggests using.

Of course I also got the messages:
EXT4-fs (sda1): Mount option "nodelalloc" incompatible with ext3
EXT4-fs (sda1): failed to parse options in superblock: nodelalloc
EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities

Those of course all ought not to be there.  If it is ext4, I don't really
care that ext3 doesn't understand the options; they work fine with ext4.
I should not have to explicitly specify that it is ext4 just to avoid
those messages.  It is not an ext3 filesystem.

And of course the last annoying thing I noticed is that /proc/mounts
doesn't actually tell you that nodelalloc is active when it is set
from the default mount options rather than from the mount command line
(or fstab).  Lots of other non-default options are explicitly handled,
but not delalloc.  The only place you see it is in the dmesg line
telling you what options the filesystem was mounted with.
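Since /proc/mounts may not show it, the /proc/fs/ext4/<dev>/options file discussed earlier in the thread is the reliable place to check.  A minimal sketch of checking it programmatically (parse_ext4_options is a hypothetical helper, not part of any ext4 tooling):

```python
# Parse the newline-separated format of /proc/fs/ext4/<dev>/options into
# boolean flags ("nodelalloc") and key=value settings ("commit=5").
def parse_ext4_options(text):
    flags, values = set(), {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            key, _, val = line.partition("=")
            values[key] = val
        else:
            flags.add(line)
    return flags, values

# Sample content modeled on the listing quoted above; on a real system you
# would read open("/proc/fs/ext4/sda1/options").read() instead.
sample = "rw\nnodelalloc\ndata=ordered\ncommit=5\n"
flags, values = parse_ext4_options(sample)
print("nodelalloc" in flags)   # True
print(values["data"])          # ordered
```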

-- 
Len Sorensen


4.9.80 compile failure with X86_VSYSCALL_EMULATION=n due to 9a0be5af

2018-02-12 Thread Lennart Sorensen
Commit 9a0be5af added a reference to vsyscall_pgprot in
arch/x86/mm/kaiser.c, but that symbol is undefined if
X86_VSYSCALL_EMULATION=n, which is quite likely on an embedded system
where you know how all your software is compiled.

Of course the condition is always false with that config so the code
will never be run, but the compiler is unhappy.

-- 
Len Sorensen


Re: [PATCH v3] Fix loading of module radeonfb on PowerMac

2018-01-03 Thread Lennart Sorensen
On Wed, Jan 03, 2018 at 03:47:35PM +0100, Bartlomiej Zolnierkiewicz wrote:
> On Thursday, December 21, 2017 11:07:56 PM Mathieu Malaterre wrote:
> > When the Linux kernel is built with (the typical kernel shipped with the
> > Debian installer):
> > 
> > CONFIG_FB_OF=y
> > CONFIG_VT_HW_CONSOLE_BINDING=y
> > CONFIG_FB_RADEON=m
> > 
> > The offb driver takes precedence over module radeonfb. It is then
> > impossible to load the module, error reported is:
> > 
> > [   96.551486] radeonfb :00:10.0: enabling device (0006 -> 0007)
> > [   96.551526] radeonfb :00:10.0: BAR 0: can't reserve [mem 
> > 0x9800-0x9fff pref]
> > [   96.551531] radeonfb (:00:10.0): cannot request region 0.
> > [   96.551545] radeonfb: probe of :00:10.0 failed with error -16
> > 
> > This patch reproduces the behavior of the radeon module, so as to make it
> > possible to load radeonfb when offb is first loaded.
> > 
> > It should be noted that `offb_destroy` is never called, which explains the
> > need to skip error detection on the radeon side.
> 
> This still needs to be explained more, from my last mail:
> 
> "The last put_fb_info() on fb_info should call ->fb_destroy
> (offb_destroy in our case) and remove_conflicting_framebuffers()
> is calling put_fb_info() so there is some extra reference on
> fb_info somewhere preventing it from going away.
> 
> Please look into fixing this."

My impression of that problem is that the offb driver is in use because
it is running the console, and until the radeonfb driver is loaded and
ready to take over, you can't stop using the offb driver.  But you can't
currently load radeonfb because offb is holding resources it wants.

Maybe I have misunderstood what the code is doing, but that seems to be
the problem.

On an x86 PC you could drop one fb and go to text mode then start another
fb driver.  PPC doesn't have that option since there is no text mode.

-- 
Len Sorensen


Re: [5/5] e1000e: Avoid receiver overrun interrupt bursts

2017-10-24 Thread Lennart Sorensen
On Tue, Sep 19, 2017 at 09:41:02PM +0200, Benjamin Poirier wrote:
> On 2017/09/19 12:38, Philip Prindeville wrote:
> > Hi.
> > 
> > We’ve been running this patchset (all 5) for about as long as they’ve been 
> > under review… about 2 months.  And in a burn-in lab with heavy traffic.
> > 
> > We’ve not seen a single link-flap in hundreds of hours of saturated traffic.
> > 
> > Would love to see some resolution soon on this as we don’t want to ship a 
> > release with unsanctioned patches.
> > 
> > Is there an estimate on when that might be?
> 
> The patches have been added to Jeff Kirsher's next-queue tree. I guess
> they will be submitted for v4.15 which might be released in early
> 2018...
> http://phb-crystal-ball.org/

And then they will be submitted to linux-stable so this long-standing
regression can be fixed, right?

-- 
Len Sorensen


Re: [Intel-wired-lan] [PATCH 4/5] e1000e: Separate signaling for link check/link up

2017-08-02 Thread Lennart Sorensen
On Wed, Aug 02, 2017 at 02:28:07PM +0300, Neftin, Sasha wrote:
> On 7/21/2017 21:36, Benjamin Poirier wrote:
> > Lennart reported the following race condition:
> > 
> > \ e1000_watchdog_task
> >  \ e1000e_has_link
> >  \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> >  /* link is up */
> >  mac->get_link_status = false;
> > 
> >  /* interrupt */
> >  \ e1000_msix_other
> >  hw->mac.get_link_status = true;
> > 
> >  link_active = !hw->mac.get_link_status
> >  /* link_active is false, wrongly */
> > 
> > This problem arises because the single flag get_link_status is used to
> > signal two different states: link status needs checking and link status is
> > down.
> > 
> > Avoid the problem by using the return value of .check_for_link to signal
> > the link status to e1000e_has_link().
> > 
> > Reported-by: Lennart Sorensen <lsore...@csclub.uwaterloo.ca>
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> > ---
> >   drivers/net/ethernet/intel/e1000e/mac.c| 11 ---
> >   drivers/net/ethernet/intel/e1000e/netdev.c |  2 +-
> >   2 files changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/e1000e/mac.c 
> > b/drivers/net/ethernet/intel/e1000e/mac.c
> > index b322011ec282..f457c5703d0c 100644
> > --- a/drivers/net/ethernet/intel/e1000e/mac.c
> > +++ b/drivers/net/ethernet/intel/e1000e/mac.c
> > @@ -410,6 +410,9 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
> >*  Checks to see of the link status of the hardware has changed.  If a
> >*  change in link status has been detected, then we read the PHY 
> > registers
> >*  to get the current speed/duplex if link exists.
> > + *
> > + *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 
> > (link
> > + *  up).
> >**/
> >   s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> >   {
> > @@ -423,7 +426,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> >  * Change or Rx Sequence Error interrupt.
> >  */
> > if (!mac->get_link_status)
> > -   return 0;
> > +   return 1;
> > /* First we want to see if the MII Status Register reports
> >  * link.  If so, then we want to get the current speed/duplex
> > @@ -461,10 +464,12 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> >  * different link partner.
> >  */
> > ret_val = e1000e_config_fc_after_link_up(hw);
> > -   if (ret_val)
> > +   if (ret_val) {
> > e_dbg("Error configuring flow control\n");
> > +   return ret_val;
> > +   }
> > -   return ret_val;
> > +   return 1;
> >   }
> >   /**
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> > b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index fc6a1db2..5a8ab1136566 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -5081,7 +5081,7 @@ static bool e1000e_has_link(struct e1000_adapter 
> > *adapter)
> > case e1000_media_type_copper:
> > if (hw->mac.get_link_status) {
> > ret_val = hw->mac.ops.check_for_link(hw);
> > -   link_active = !hw->mac.get_link_status;
> > +   link_active = ret_val > 0;
> > } else {
> > link_active = true;
> > }
> 
> Hello Benjamin,
> 
> Will this patch fix any serious problem with link indication? Is it
> necessary? Can we consider your patch series without 4/5 part?

Without this patch, you have the race condition that can make the
watchdog_task mistakenly think the link is down when it isn't, and then
it resets the adapter, which does make the link go down.

So it is rather catastrophic for the interface.

The other patch to the interrupt handling should make it never get hit,
but the issue still exists if not fixed, and I wouldn't rule out that
it could still happen even with the other fix in place.
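The race in the patch description above can be sketched in a few lines.  This is a hypothetical userspace mock (the class and function names only mirror the driver's flag and callback, this is not actual driver code):

```python
# Mock of the get_link_status race: one flag conflates "link status needs
# checking" with "link is down", so an interrupt landing between the check
# and the readback makes the watchdog wrongly conclude the link is down.
class Mac:
    def __init__(self):
        self.get_link_status = True  # "needs checking" AND "down", conflated

def check_for_link(mac, interrupt_fires):
    # Link is up: clear the "needs checking" flag...
    mac.get_link_status = False
    # ...but an interrupt may land right here and set it again
    # (what e1000_msix_other does in the driver).
    if interrupt_fires:
        mac.get_link_status = True
    return 1  # the fix: explicitly report "link up" via the return value

mac = Mac()
ret = check_for_link(mac, interrupt_fires=True)
buggy_link_active = not mac.get_link_status  # old logic: False, wrongly
fixed_link_active = ret > 0                  # new logic: True, correctly
print(buggy_link_active, fixed_link_active)  # False True
```

With the interrupt firing in the window, the flag-based readback reports the link down and the watchdog would reset the adapter; the return-value signaling is immune to the window.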

-- 
Len Sorensen


Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-21 Thread Lennart Sorensen
On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.

By the way, while fixing the e1000e, I just noticed that
if you are blasting the port with traffic when it comes up,
you risk getting a transmit queue timeout, because the queue
is started before the carrier is up.  ixgbe already fixed that in
cdc04dcce0598fead6029a2f95e95a4d2ea419c2.  igb has the same problem (which
goes away by moving the queue start to the watchdog after carrier_on,
I just haven't got around to sending that patch yet).

I am going to try moving the queue start to the watchdog and try it again.

Trace looked like this:

[ cut here ]
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1f9/0x200
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: dpi_drv(PO) ccu_util(PO) ipv4_mb(PO) 
l2bridge_config_util(PO) l2_config_util(PO) route_config_util(PO) 
qos_config_util(PO) sysapp_common(PO) chantry_fwd_eng_2800_config(PO) 
shim_module(PO) sadb_cc(PO) ipsecXformer(PO) libeCrypto(PO) ipmatch_cc(PO) 
l2h_cc(PO) ndproxy_cc(PO) arpint_cc(PO) portinfo_cc(PO) chantryqos_cc(PO) 
redirector_cc(PO) ix_ph(PO) fpm_core_cc(PO) pulse_cc(PO) vnstt_cc(PO) 
vnsap_cc(PO) fm_cc(PO) rutm_cc(PO) mutm_cc(PO) ethernet_tx_cc(PO) stkdrv_cc(PO) 
l2bridge_cc(PO) events_util(PO) sched_cc(PO) qm_cc(PO) ipv4_cc(PO) wred_cc(PO) 
tc_meter_cc(PO) dscp_classifier_cc(PO) classifier_6t_cc(PO) ent586_cc(PO) 
dev_cc_arp(PO) chantry_fwd_eng_2800_tables(PO) ether_arp_lib(PO) rtmv4_lib(PO) 
lkup_lib(PO) l2tm_lib(PO) fragmentation_lib(PO) properties_lib(PO) 
msg_support_lib(PO)
 utilities_lib(PO) cci_lib(PO) rm_lib(PO) libossl(O) vip(O) 
productSpec_x86_dp(PO) e1000e
CPU: 0 PID: 0 Comm: swapper/0 Tainted: P   O4.9.24 #20
Hardware name: Supermicro X7SPA-HF/X7SPA-HF, BIOS 1.2a   06/23/12  
  811cef1b 88007fc03e88 
 81037ade  88007fc03ed8 0001
  0082 0001 81037b4c
Call Trace:
  
 [] ? dump_stack+0x46/0x5b
 [] ? __warn+0xbe/0xe0
 [] ? warn_slowpath_fmt+0x4c/0x50
 [] ? mod_timer+0xf2/0x150
 [] ? dev_watchdog+0x1f9/0x200
 [] ? dev_graft_qdisc+0x70/0x70
 [] ? call_timer_fn.isra.26+0x11/0x80
 [] ? run_timer_softirq+0x128/0x150
 [] ? __do_softirq+0xeb/0x1f0
 [] ? irq_exit+0x55/0x60
 [] ? smp_apic_timer_interrupt+0x39/0x50
 [] ? apic_timer_interrupt+0x7c/0x90
  
 [] ? mwait_idle+0x51/0x80
 [] ? cpu_startup_entry+0xa7/0x130
 [] ? start_kernel+0x306/0x30e
---[ end trace ee759b7a56e1110b ]---

-- 
Len Sorensen


Re: [PATCH 4/5] e1000e: Separate signaling for link check/link up

2017-07-21 Thread Lennart Sorensen
On Fri, Jul 21, 2017 at 11:36:26AM -0700, Benjamin Poirier wrote:
> Lennart reported the following race condition:
> 
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> /* link is up */
> mac->get_link_status = false;
> 
> /* interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true;
> 
> link_active = !hw->mac.get_link_status
> /* link_active is false, wrongly */
> 
> This problem arises because the single flag get_link_status is used to
> signal two different states: link status needs checking and link status is
> down.
> 
> Avoid the problem by using the return value of .check_for_link to signal
> the link status to e1000e_has_link().
> 
> Reported-by: Lennart Sorensen <lsore...@csclub.uwaterloo.ca>
> Signed-off-by: Benjamin Poirier <bpoir...@suse.com>

This too seems potentially -stable worthy, although with patch 5, the
problem becomes much, much less likely to occur.

-- 
Len Sorensen


Re: [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts

2017-07-21 Thread Lennart Sorensen
On Fri, Jul 21, 2017 at 11:36:27AM -0700, Benjamin Poirier wrote:
> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
> adapter (when operating in msix mode) raises the Other interrupt to signal
> Receiver Overrun.
> 
> This is a double problem because 1) at the moment e1000_msix_other()
> assumes that it is only called in case of Link Status Change and 2) if the
> condition persists, the interrupt is repeatedly raised again in quick
> succession.
> 
> Ideally we would configure the Other interrupt to not be raised in case of
> receiver overrun but this doesn't seem possible on this adapter. Instead,
> we handle the first part of the problem by reverting to the practice of
> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
> anymore. We handle the second part of the problem by not re-enabling the
> Other interrupt right away when there is overrun. Instead, we wait until
> traffic subsides, napi polling mode is exited and interrupts are
> re-enabled.
> 
> Reported-by: Lennart Sorensen <lsore...@csclub.uwaterloo.ca>
> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
> Signed-off-by: Benjamin Poirier <bpoir...@suse.com>

Any chance of this fix hitting -stable?  After all adapter reset under
load is not nice.

-- 
Len Sorensen


Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-21 Thread Lennart Sorensen
On Fri, Jul 21, 2017 at 11:27:09AM -0400,  wrote:
> On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> > Could you please test the following patch and let me know if it:
> > 1) reduces the interrupt rate of the Other msi-x vector
> > 2) avoids the link flaps
> > or
> > 3) logs some dmesg warnings of the form "Other interrupt with unhandled 
> > [...]"
> > In this case, please paste icr values printed.
> 
> I will give it a try.

So test looks excellent.  Seems to only get interrupts when link state
actually changes now.

> Another odd behaviour I see is that the driver will hang in
> napi_synchronize on shutdown if there is traffic at the time (at least
> I think that's the trigger, maybe the trigger is if there has been an
> overload of traffic and the backlog in napi was used).
> 
> From doing some searching, this seems to be a problem that has plagued
> some people for years with this driver.
> 
> I am having trouble figuring out exactly what napi_synchronize is waiting
> for and who is supposed to toggle the flag it is waiting on.  The flag
> appears to work backwards from what I would have expected it to do.
> I see lots of places that can set the bit, but only napi_enable seems
> to clear it again, and I don't see how that would get called for all
> the places that potentially set the bit.

I just realized NAPI_STATE_SCHED and NAPIF_STATE_SCHED are the same
thing and I need to look at both of those.

Still something seems odd in some corner case where napi gets stuck and
you can't close the port anymore due to napi_synchronize never being
able to finish.  Some traffic pattern causes that SCHED state bit to
get into the wrong state and nothing ever clears it.  Even managed to
see it get stuck so it never passed traffic again and hung on shutdown.
The napi poll was never called again.

-- 
Len Sorensen


Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-21 Thread Lennart Sorensen
On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.

I will give it a try.

Another odd behaviour I see is that the driver will hang in
napi_synchronize on shutdown if there is traffic at the time (at least
I think that's the trigger, maybe the trigger is if there has been an
overload of traffic and the backlog in napi was used).

From doing some searching, this seems to be a problem that has plagued
some people for years with this driver.

I am having trouble figuring out exactly what napi_synchronize is waiting
for and who is supposed to toggle the flag it is waiting on.  The flag
appears to work backwards from what I would have expected it to do.
I see lots of places that can set the bit, but only napi_enable seems
to clear it again, and I don't see how that would get called for all
the places that potentially set the bit.

-- 
Len Sorensen


Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-20 Thread Lennart Sorensen
On Wed, Jul 19, 2017 at 05:07:47PM -0700, Benjamin Poirier wrote:
> Are you sure about this? In my testing, while triggering the overrun
> with the msleep, I read ICR when entering e1000_msix_other() and RXO is
> consistently set.

I had thousands of calls to e1000_msix_other where the only bit set
was OTHER.

I don't know if the cause is overruns, it just seems plausible.

> I'm working on a patch that uses that fact to handle the situation and
> limit the interrupt.

Excellent.

Running in MSI mode rather than MSI-X seems to not have the problem of
unexpected interrupts, but has other issues (such as losing the IRQ
affinity setting if you do ifconfig down; ifconfig up on the interface,
which does not happen in MSI-X's case).  That's rather annoying, as you
can't set the affinity before bringing up the interface.

-- 
Len Sorensen


Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-19 Thread Lennart Sorensen
On Tue, Jul 18, 2017 at 04:14:35PM -0700, Benjamin Poirier wrote:
> Thanks for the detailed analysis.
> 
> Refering to the original discussion around this patch series, it seemed like
> the IMS bit for a condition had to be set for the Other interrupt to be raised
> for that condition.
> 
> https://lkml.org/lkml/2015/11/4/683
> 
> In this case however, E1000_ICR_RXT0 is not set in IMS so Other shouldn't be
> raised for Receiver Overrun. Apparently something is going on...
> 
> I can reproduce the spurious Other interrupts with a simple mdelay()
> With the debugging patch at the end of the mail I see stuff like this
> while blasting with udp frames:
>   -0 [086] d.h1 15338.742675: e1000_msix_other: got Other 
> interrupt, count 15127
><...>-54504 [086] d.h. 15338.742724: e1000_msix_other: got Other 
> interrupt, count 1
><...>-54504 [086] d.h. 15338.742774: e1000_msix_other: got Other 
> interrupt, count 1
><...>-54504 [086] d.h. 15338.742824: e1000_msix_other: got Other 
> interrupt, count 1
>   -0 [086] d.h1 15340.745123: e1000_msix_other: got Other 
> interrupt, count 27584
><...>-54504 [086] d.h. 15340.745172: e1000_msix_other: got Other 
> interrupt, count 1
><...>-54504 [086] d.h. 15340.745222: e1000_msix_other: got Other 
> interrupt, count 1
><...>-54504 [086] d.h. 15340.745272: e1000_msix_other: got Other 
> interrupt, count 1
> 
> > hence sets the flag that (unfortunately) means both link is down and link
> > state should be checked.  Since this now happens 3000 times per second,
> > the chances of it happening while the watchdog_task is checking the link
> > state become pretty high, and if it does happen to coincide, then the
> > watchdog_task will reset the adapter, which causes a real loss of link.
> 
> Through which path does watchdog_task reset the adapter? I didn't
> reproduce that.

The other interrupt happens and sets get_link_status to true.  At some
point the watchdog_task runs on some core and calls e1000e_has_link,
which then calls check_for_link to find out the current link status.
While e1000e_check_for_copper_link is checking the link state and
after updating get_link_status to false to indicate link is up, another
interrupt occurs and another core handles it and changes get_link_status
to true again.  So by the time e1000e_has_link goes to determine the
return value, get_link_status has changed back again so now it returns
link down, and as a result the watchdog_task calls reset, because we
have packets in the transmit queue (we were busy forwarding over 10
packets per second when it happened).

Running on an Atom D525 which isn't very fast and uses hyperthreading
might have something to do with how the scheduling manages to trigger
this race condition.  On a faster CPU you very likely would be done
checking the link state quickly enough that the interrupt handler rarely
gets a chance to interfere.  Also we have the irq affinity set so the
RX/TX of one port is handled by one CPU, the RX/TX of the other port
by a different CPU and the Other interrupts and other tasks (like the
watchdog) are handled by the last two CPUs.

Either making the current link state its own bool and keeping its meaning
separate from get_link_status, or making the interrupt handler only change
get_link_status when LSC is actually present, makes the problem go away.
Having two meanings for get_link_status (both "link state needs checking"
and "what the link state is") causes issues.  After all, it is using a bool
to store three values: link is up, link needs checking but is up, and link
needs checking but is down.  Of course the last two states are rather
quantum, in that you don't know which it is until you check.

> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index b3679728caac..689ad76d0d12 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -46,6 +46,8 @@
>  
>  #include "e1000.h"
>  
> +DEFINE_RATELIMIT_STATE(e1000e_ratelimit_state, 2 * HZ, 4);
> +
>  #define DRV_EXTRAVERSION "-k"
>  
>  #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
> @@ -937,6 +939,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring 
> *rx_ring, int *work_done,
>   bool cleaned = false;
>   unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>  
> + mdelay(10);
> +
>   i = rx_ring->next_to_clean;
>   rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
>   staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
> @@ -1067,6 +1071,13 @@ static bool e1000_clean_rx_irq(struct e1000_ring 
> *rx_ring, int *work_done,
>  
>   adapter->total_rx_bytes += total_rx_bytes;
>   adapter->total_rx_packets += total_rx_packets;
> +
> + if (__ratelimit(&e1000e_ratelimit_state)) {
> + static unsigned int max;
> + max = max(max, total_rx_packets);
> + trace_printk("received %u max %u\n", 


commit 16ecba59 breaks 82574L under heavy load.

2017-07-18 Thread Lennart Sorensen
Commit 16ecba59bc333d6282ee057fb02339f77a880beb has apparently broken
at least the 82574L under heavy load (as in load heavy enough to cause
packet drops).  In this case, when running in MSI-X mode, the Other
Causes interrupt fires about 3000 times per second, but not due to link
state changes.  Unfortunately this commit changed the driver to assume
that the Other Causes interrupt can only mean link state change and
hence sets the flag that (unfortunately) means both link is down and link
state should be checked.  Since this now happens 3000 times per second,
the chances of it happening while the watchdog_task is checking the link
state become pretty high, and if it does happen to coincide, then the
watchdog_task will reset the adapter, which causes a real loss of link.

Reverting the commit makes everything work fine again (of course packets
are still dropped, but at least the link stays up, the adapter isn't
reset, and most packets make it through).

I tried checking what the bits in the ICR actually were under these
conditions, and it would appear that the only bit set is 24 (the Other
Causes interrupt bit).  So I don't know what the real cause is although
rx buffer overrun would be my guess, and in fact I see nothing in the
datasheet indicating that you can actually disable the rx buffer overrun
from generating an interrupt.

Prior to this commit, the interrupt handler explicitly checked that the
interrupt was caused by a link state change and only then did it trigger
a recheck which worked fine and did not cause incorrect adapter resets,
although it of course still had lots of undesired interrupts to deal with.

Of course ideally there would be a way to make these 3000 pointless
interrupts per second not happen, but unless there is a way to determine
that, I think this commit needs reverting, since it apparently causes
link failures on actual hardware that exists.

The ports are onboard Intel 82574L on a Supermicro X7SPA-HF-D525 with
1.2a BIOS (upgrading to 1.2b to check if it makes a difference is not
an option unfortunately).

-- 
Len Sorensen


Re: Change PAGE_SIZE from minimum 4k to 12k

2017-05-16 Thread Lennart Sorensen
On Tue, May 16, 2017 at 11:27:08AM -0400, Kevin McKinney wrote:
> Hi Everyone,
> 
> Would it be possible to have a custom block device driver read/write
> in increments of 12k instead of reading/writing data in 4k increments?
> In other words, I would like to change the default page size on a
> x86_64 platform (4.4.0 kernel) from 4k to 12k as the minimum page
> size?  I understand I may have negative performance due to
> fragmentation. Any help would be appreciated.
> 
> If this is the wrong mailing list, please let me know the right one to use.

I believe the answer is no.

x86 supports page sizes of 4K, 2M and 1G.  I don't believe anything else
is possible with the hardware.

The page size is not a software invention.

-- 
Len Sorensen


Re: RAID array is gone, please help

2017-03-23 Thread Lennart Sorensen
On Thu, Mar 23, 2017 at 10:02:23PM +0100, Stephen Mueller wrote:
> Looks like it worked! Thanks!

Well at least you could backup the data, just in case.

> I used:
> 
> sudo mdadm --create /dev/md0 --assume-clean --verbose --level=10
> --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf

I wonder about the assume-clean, but I would think any data you wrote
will have been duplicated at the time you wrote it, and only unused
space might not have been synced yet.

Maybe running a forced resync would be worthwhile.

> And I got my instructions for creating the array here, and they
> also don't use partitions...
> 
> https://www.digitalocean.com/community/tutorials/how-to-create-raid-arrays-with-mdadm-on-ubuntu-16-04

Well it works, as long as you don't leave anything around to confuse it,
in this case a GPT partition table that somehow came back to bite you.

I wonder if mdadm could perhaps warn about the existing partition table
when being asked to create a new device.

Does it survive reboots now?

-- 
Len Sorensen


Re: RAID array is gone, please help

2017-03-23 Thread Lennart Sorensen
On Thu, Mar 23, 2017 at 09:09:39PM +0100, Stephen Mueller wrote:
> OK, I used gdisk to remove the GPT and MBR from each disk.
> mdadm --assemble still doesn't work... says it can't find the
> superblock. The mdadm --examine commands also say that no
> superblock is detected.

Yes, the GPT overwrote the superblock.  The GPT uses the first 17KB of
the disk, and the superblock was at 4KB from the start.

> I guess I'll go ahead with --create...

At least given the data should start at 1MB from the start, it should
not have been overwritten.

I do always like disk images before I try anything like that.  I am never
quite sure what create will do although it will likely do the right
thing at least if you get the order right.

-- 
Len Sorensen


Re: RAID array is gone, please help

2017-03-23 Thread Lennart Sorensen
On Thu, Mar 23, 2017 at 08:38:08PM +0100, Stephen Mueller wrote:
> Apologies, I should have started this on linux-raid...
> 
> 
> stephen@fred> sudo gdisk -l /dev/sdc
> GPT fdisk (gdisk) version 1.0.1
> 
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
> 
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 60F1D54C-9D17-4688-BD6E-447F5E7408EB
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 7814037134
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 7814037101 sectors (3.6 TiB)
> 
> Number  Start (sector)End (sector)  Size   Code  Name
> stephen@fred>
> 
> The other disks are all similar.
> 
> How do I remove the MBR/GPT tables? So you would do that, and then the
> mdadm --create?

Well at least that seems to confirm that something restored the GPT from
the second copy at the end of the disk, wiping out the md superblock
at 4k.

If you run gdisk /dev/sdc, it has a 'zap' option using the z key to
delete all traces of GPT.  That ought to do what you want.

-- 
Len Sorensen


Re: RAID array is gone, please help

2017-03-23 Thread Lennart Sorensen
On Thu, Mar 23, 2017 at 07:09:41PM +0100, r...@mueller.org wrote:
> Thank you very much or your reply.
> 
> I naively thought that starting without partitions would be the best
> starting point, given 3 of the disks had been in a RAID5 array
> previously (possibly with partitions, not sure), but that looks like
> a bad choice, based on some other things I've googled. Lesson learned.
> 
> I have an mdadm.conf file, but it could be a remnant of my previous array.
> I've already edited it trying to get things to work, so I'm
> not sure if it was updated when I created the new array or not.
> 
> I see various people online have had success in my situation using
> madadm --create /dev/md0 --assume-clean --verbose --level=10 \
> --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf
> 
> Some people used --assume-clean, and some didn't. Given my array wasn't done
> with its resync, maybe I should leave that out.
> 
> If that would work, I guess then I need to get the data off the array,
> delete it, and recreate it with disk partitions, or risk this happening
> again at the next reboot, for whatever reason.
> 
> Anyone think it's a bad idea to try mdadm --create at this point?
> 
> Sorry, I'm not sure how to write 0's to sector 0...

Well, the other possibility is this: the MBR says the disk has a GPT,
mdadm probably corrupted the primary GPT by writing its superblock 4k
in, but nothing overwrote the backup GPT at the end of the device, so
something may have restored the GPT (even an empty one) to the original
location, in which case your superblock is likely overwritten.

Does 'gdisk -l /dev/sdc' claim that it is a valid GPT table or corrupt?

So if that is what happened (not sure what would have done it, although
it seems badly defined who is responsible for fixing the GPT if it is
detected that the backup is valid but the primary is corrupt), then you
will want to make sure the MBR and GPT tables are removed, before you
do anything else.  Likely the data is still there, since I think the
first data block is offset a bit into the device, and the GPT is only
the first 32KB or so of the device.  If that is the case, doing the
create would probably work.  Of course the safest thing to do would be
to clone your disks before trying to reassemble them (which means you
need another four 4TB drives, but if the data is important, that's
probably worth it).

Of course you would also have to be sure the mdadm command you use to
create it again is exactly the same as before, and that the device names
are exactly the same as before.

-- 
Len Sorensen


Re: RAID array is gone, please help

2017-03-23 Thread Lennart Sorensen
On Thu, Mar 23, 2017 at 05:49:05PM +0100, r...@mueller.org wrote:
> I am hoping someone here will help me. Was reading this site...
> 
> https://raid.wiki.kernel.org/index.php/Linux_Raid
> 
> and it said to email this list if you've tried everything other than mdadm
> --create.
> 
> 
> I am running Ubuntu 16.04. Machine name is fred. I used webmin to create a 4
> disk RAID10 array yesterday. I moved all my data onto the array.
> 
> Today, I had to reboot my PC. The resync was still not done, but I read
> online that it's OK to boot during resync. After boot, my array was gone. I
> checked syslog, and it just has this line:
> 
> DeviceDisappeared event detected on md device /dev/md0
> 
> I did not partition my disks before building the array. So I believe the
> array consisted of /dev/sdc, /dev/sdd, /dev/sde, and /dev/sdf.
> 
> Here's some info...
> 
> stephen@fred> lsblk
> NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda  8:00 117.4G  0 disk
> ├─sda1   8:10 109.7G  0 part /
> ├─sda2   8:20 1K  0 part
> └─sda5   8:50   7.7G  0 part [SWAP]
> sdb  8:16   0 465.8G  0 disk
> └─sdb1   8:17   0 465.8G  0 part
> sdc  8:32   0   3.7T  0 disk
> sdd  8:48   0   3.7T  0 disk
> sde  8:64   0   3.7T  0 disk
> sdf  8:80   0   3.7T  0 disk
> 
> stephen@fred> sudo mdadm --examine /dev/sdc
> [sudo] password for stephen:
> /dev/sdc:
>MBR Magic : aa55
> Partition[0] :   4294967295 sectors at1 (type ee)
> stephen@fred>
> stephen@fred> sudo mdadm --examine /dev/sdc1
> mdadm: cannot open /dev/sdc1: No such file or directory
> stephen@fred>
> stephen@fred> sudo mdadm --examine /dev/sdd
> /dev/sdd:
>MBR Magic : aa55
> Partition[0] :   4294967295 sectors at1 (type ee)
> stephen@fred>
> stephen@fred> sudo mdadm --examine /dev/sde
> /dev/sde:
>MBR Magic : aa55
> Partition[0] :   4294967295 sectors at1 (type ee)
> stephen@fred>
> stephen@fred> sudo mdadm --examine /dev/sdf
> /dev/sdf:
>MBR Magic : aa55
> Partition[0] :   4294967295 sectors at1 (type ee)
> 
> stephen@fred> sudo mdadm --assemble --force /dev/md0 /dev/sdc /dev/sdd
> /dev/sde /dev/sdf
> mdadm: Cannot assemble mbr metadata on /dev/sdc
> mdadm: /dev/sdc has no superblock - assembly aborted
> 
> Thank you for any help you can provide.

Did your disks have partitions previously?  That output looks a lot like
the protective MBR partition table for a disk with GPT partitions.

Could its continued presence in sector 0 be confusing mdadm?

I have never personally done any md raid without partitions.  To me they
just make more sense.

One way to test could be to save a copy of sector 0, then overwrite sector
0 with zeros and then run mdadm --examine again to see if that makes a
difference.  You can always put back the saved copy of sector 0 that way.
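Spelled out, the save / zero / examine / restore dance might look roughly like this (a sketch demonstrated on a scratch file so it is safe to run; on the real machine substitute /dev/sdc, and the mdadm step is left commented out):

```shell
DEV=${DEV:-./scratch.img}   # stand-in for /dev/sdc
[ -e "$DEV" ] || dd if=/dev/urandom of="$DEV" bs=512 count=64 2>/dev/null

dd if="$DEV" of=sector0.bak bs=512 count=1 2>/dev/null              # save sector 0
dd if=/dev/zero of="$DEV" bs=512 count=1 conv=notrunc 2>/dev/null   # zero it
# mdadm --examine "$DEV"                                            # re-test here
dd if=sector0.bak of="$DEV" bs=512 count=1 conv=notrunc 2>/dev/null # put it back
```

conv=notrunc matters on the restore so dd does not truncate the image file; on a real block device it makes no difference.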

My understanding is that the default is to put the raid superblock at
offset 4k, so it would not overwrite an existing MBR partition table.
If it also happens, due to rounding, that the end of the disk wasn't
overwritten (or even just because that part of the filesystem hadn't
been written to yet), then the backup GPT from before would still be
intact, and could perhaps cause even more confusion later if gdisk or
similar is pointed at the disk.  You really want to be sure there is no
trace left of the partition table before using the disk raw for md raid.

Any chance the system saved an mdadm.conf file of your setup?

-- 
Len Sorensen


Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-20 Thread Lennart Sorensen
On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occationally for
> years on some of our systems running 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
> 
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core.  We have a handler to process the core dump
> to name it and compress it and make sure we don't keep to many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
> 
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
> 
> The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading).  Some people think they have seen it on other systems,
> but are not sure.  I have not been able to trigger it on other systems
> yet.
> 
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If the core_pattern is not set to use a pipe, but just save as core.%e.%p
then the problem does not happen.
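For anyone wanting to compare the two configurations, roughly (the handler path is hypothetical, and CORE_PATTERN defaults to a local file here so the sketch can be run dry; on a real system it would be /proc/sys/kernel/core_pattern):

```shell
# Stand-in for /proc/sys/kernel/core_pattern so this is safe to run.
CORE_PATTERN=${CORE_PATTERN:-./core_pattern.txt}

# Variant that avoids the oops: dump straight to a file.
echo 'core.%e.%p' > "$CORE_PATTERN"

# Variant that triggers it on the affected 3.0.101 systems: pipe the
# dump to a handler script (path hypothetical).
# echo '|/usr/local/bin/coremgr.sh %e %p %t' > "$CORE_PATTERN"

cat "$CORE_PATTERN"   # prints core.%e.%p
```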

-- 
Len Sorensen


Re: FPU warning on x86_32 on Skylake

2016-12-19 Thread Lennart Sorensen
On Thu, Nov 24, 2016 at 09:11:41AM -0800, Andy Lutomirski wrote:
> I gett this when booting a 32-bit 4.9-rc6-ish on Skylake:
> 
> [0.564506] [ cut here ]
> [0.564994] WARNING: CPU: 0 PID: 1 at
> ./arch/x86/include/asm/fpu/internal.h:368 fpu__restore+0x203/0x210
> [0.565737] Modules linked in:
> [0.566040] CPU: 0 PID: 1 Comm: sh Not tainted 4.9.0-rc6+ #488
> [0.566502] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.3-1.fc25 04/01/2014
> [0.567174]  c78a9e5c c135a6d0  c1ac4b4c c78a9e8c c10aeb42
> c1ad53f0 
> [0.567896]  0001 c1ac4b4c 0170 c107e753 0170 c78a06c0
>  c78a0700
> [0.568583]  c78a9ea0 c10aec05 0009   c78a9eb8
> c107e753 c78a0700
> [0.569245] Call Trace:
> [0.569440]  [] dump_stack+0x58/0x78
> [0.569783]  [] __warn+0xe2/0x100
> [0.570109]  [] ? fpu__restore+0x203/0x210
> [0.570519]  [] warn_slowpath_null+0x25/0x30
> [0.570943]  [] fpu__restore+0x203/0x210
> [0.571312]  [] __fpu__restore_sig+0x1fc/0x580
> [0.571719]  [] fpu__restore_sig+0x2a/0x50
> [0.572103]  [] restore_sigcontext.isra.10+0xbd/0xd0
> [0.572546]  [] sys_sigreturn+0x81/0x90
> [0.572908]  [] do_int80_syscall_32+0x57/0xc0
> [0.573306]  [] entry_INT80_32+0x2a/0x2a
> [0.573677] ---[ end trace 88038c46b2a9d23a ]---
> 
> Telling KVM to disable XSAVES makes the warning go away.
> 
> I seem to be the only person testing 32-bit kernels on CPUs this new :-/

Well skylake added XRSTORS, which is used in place of XRSTOR if supported
by the CPU, but XRSTORS requires CPL=0, which XRSTOR did not as far as
I can tell.  Older CPUs don't have XRSTORS so this would not be an
issue there.

I don't know, but would not be surprised if running under kvm means the
guest kernel is not running with CPL=0 and hence the XRSTORS feature
ought not to be exposed as supported by the CPU to the guest kernel.

Just a guess.
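A quick way to see whether the host even advertises the feature, plus the QEMU spelling of the workaround Andy mentioned (the qemu line is commented out and from memory, so treat it as a sketch):

```shell
# Does the host CPU advertise XSAVES at all?
grep -m1 -o xsaves /proc/cpuinfo || echo "no xsaves"

# Masking the feature from the guest, roughly (untested here):
# qemu-system-i386 -cpu host,-xsaves ...
```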

Does this happen with a 64 bit kvm guest too?  Does it happen if the 32
bit kernel is booted on bare hardware?  My guess if I am thinking the
right thing is that the answers are yes and no respectively.

Looks like this was hit in kvm a couple of years ago:
https://www.spinics.net/lists/kvm/msg110434.html 

No idea what the resolution was if any.

-- 
Len Sorensen


Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-19 Thread Lennart Sorensen
I am trying to debug a problem that has been happening occasionally for
years on some of our systems running 3.0.101 kernel (yes I know it is
old, we are moving to 4.9 at the moment but I would like older releases
to be fixed too, assuming 4.9 makes this problem disappear).

What is happening is that once in a while a process does something wrong
and segfaults, and dumps core.  We have a handler to process the core dump
to name it and compress it and make sure we don't keep too many around,
so the core_pattern uses the pipe option to pipe the dump to a shell
script that saves it with the pid and current timestamp and gzips it.

Once in a while when this happens, the kernel hits a null pointer
dereference in fpu.state->xsave while doing __switch_to.

The system is x86_64 with dual E5-2620 CPUs (6 cores each with
hyperthreading).  Some people think they have seen it on other systems,
but are not sure.  I have not been able to trigger it on other systems
yet.

It used to take about a week of running tests to trigger it, but I have
now managed to hit it in a few minutes pretty reliably.

The way I trigger it is:

abuse.c:

/* headers reconstructed from the calls used below
   (printf, getpid, M_PI, kill) */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
pid_t pid;

pid = getpid();

printf("%u: pid / PI = %f\n", pid, pid / M_PI);

kill(pid, SIGSEGV);
return 0;
}

I then run:
for i in `seq 1 1 1`; do ./abuse & echo -n; done

That pretty reliably hits the problem.

The crash dump we get is:
Dec 19 12:24:03 HWC-64 kernel: BUG: unable to handle kernel NULL pointer 
dereference at 033f
Dec 19 12:24:03 HWC-64 kernel: IP: [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: PGD 39a6ce067 PUD 39a8cf067 PMD 0
Dec 19 12:24:03 HWC-64 kernel: Oops: 0002 [#1] SMP
Dec 19 12:24:03 HWC-64 kernel: CPU 12
Dec 19 12:24:03 HWC-64 kernel: Modules linked in: dpi_drv(P) ccu_util(P) 
ipv4_mb(P) l2bridge_config_util(P) l2_config_util(P) route_config_util(P) 
qos_config_util(P) sysapp_common(P) chantry_fwd_eng_2800_config(P) 
shim_module(P) sadb_cc(P) ipsecXformer(P) libeCrypto(P) ipmatch_cc(P) l2h_cc(P) 
ndproxy_cc(P) arpint_cc(P) portinfo_cc(P) chantryqos_cc(P) redirector_cc(P) 
ix_ph(P) fpm_core_cc(P) pulse_cc(P) vnstt_cc(P) vnsap_cc(P) fm_cc(P) rutm_cc(P) 
mutm_cc(P) ethernet_tx_cc(P) stkdrv_cc(P) l2bridge_cc(P) events_util(P) 
sched_cc(P) qm_cc(P) ipv4_cc(P) wred_cc(P) tc_meter_cc(P) dscp_classifier_cc(P) 
classifier_6t_cc(P) ent586_cc(P) dev_cc_arp(P) chantry_fwd_eng_2800_tables(P) 
ether_arp_lib(P) rtmv4_lib(P) lkup_lib(P) l2tm_lib(P) fragmentation_lib(P) 
properties_lib(P) msg_support_lib(P) utilities_lib(P) cci_lib(P) rm_lib(P) 
libossl vip productSpec_x86_dp(P) ixgbe igb
Dec 19 12:24:03 HWC-64 kernel:
Dec 19 12:24:03 HWC-64 kernel: Pid: 16440, comm: coremgr.sh Tainted: P  
  3.0.101 #6 Intel Corporation S2600GZ ../S2600GZ
Dec 19 12:24:03 HWC-64 kernel: RIP: 0010:[]  
[] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP: 0018:88042f2c9de8  EFLAGS: 00010002
Dec 19 12:24:03 HWC-64 kernel: RAX:  RBX: 8803d8266ae0 RCX: 
000c
Dec 19 12:24:03 HWC-64 kernel: RDX:  RSI: 88042f29b840 RDI: 

Dec 19 12:24:03 HWC-64 kernel: RBP: 88043f38da40 R08: 8803d8266e08 R09: 
000100018011
Dec 19 12:24:03 HWC-64 kernel: R10: 4000 R11: 0246 R12: 

Dec 19 12:24:03 HWC-64 kernel: R13: 88042ce62080 R14: 000c R15: 
88042f29b840
Dec 19 12:24:03 HWC-64 kernel: FS:  7f72f71a5700() 
GS:88043f38() knlGS:
Dec 19 12:24:03 HWC-64 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f CR3: 0003d8c7 CR4: 
000406e0
Dec 19 12:24:03 HWC-64 kernel: DR0:  DR1:  DR2: 

Dec 19 12:24:03 HWC-64 kernel: DR3:  DR6: 0ff0 DR7: 
0400
Dec 19 12:24:03 HWC-64 kernel: Process coremgr.sh (pid: 16440, threadinfo 
88039a00c000, task 8803d8266ae0)
Dec 19 12:24:03 HWC-64 kernel: Stack:
Dec 19 12:24:03 HWC-64 kernel:  0018 88043f38fe40 
88043f38fe40 88042f29b840
Dec 19 12:24:03 HWC-64 kernel:   81301c94 
88042f29b840 0046
Dec 19 12:24:03 HWC-64 kernel:  88042f2c9fd8 88042f2c9fd8 
a000 
Dec 19 12:24:03 HWC-64 kernel: Call Trace:
Dec 19 12:24:03 HWC-64 kernel: Code: 48 63 c1 48 03 14 c5 40 a0 67 81 8b 87 b8 
03 00 00 48 89 d5 85 c0 74 37 66 66 90 66 90 b8 ff ff ff ff 48 8b bf c0 03 00 
00 89 c2
Dec 19 12:24:03 HWC-64 kernel: <48> 0f ae 37 48 8b 83 c0 03 00 00 f6 80 00 02 
00 00 01 0f 84 88
Dec 19 12:24:03 HWC-64 kernel: RIP  [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel:  RSP 
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f
Dec 19 12:24:03 HWC-64 kernel: ---[ end trace 366732e1020fb678 ]---

That is 

Re: [PATCH 0/5] Support for LEGO MINDSTORTMS EV3

2016-10-21 Thread Lennart Sorensen
On Fri, Oct 21, 2016 at 01:36:52PM -0500, David Lechner wrote:
> This patch series adds support for LEGO MINDSTORTMS EV3[1]. This is a TI 
> AM1808
> based board.
> 
> This patch series has been tested working (along with some hacks to fix the
> GPIOs) on an EV3.
> 
> There are still quite a few additional new drivers that need to be submitted
> to get everything working. This patch series just adds support for the parts
> that already have mainline kernel drivers.
> 
> I have a plan/driver in progress for many of the components[2], but I could 
> use
> some advice on a few particulars.
> 
> Bluetooth: This needs a driver to sequence a GPIO to take the Bluetooth chip
> out of shutdown *after* the Bluetooth clock has been configured and started.
> Is there a generic driver that can do this sort of thing? Or, if not, which
> subsystem should the new driver go in?
> 
> Input and output ports: These ports are capable of hotplugging various 
> devices,
> such as sensors and motors. I have written a driver for these that can detect
> most devices. I created a new subsystem for this called `lego-port`. However,
> I am wondering if the existing phy or extcon subsystems might be a good fit 
> for
> this sort of thing.

Both this cover letter and patch 5/5 have MINDSTORTMS instead of MINDSTORMS in
a few places (including the subject line).

-- 
Len Sorensen


Re: [PATCH 0/5] Support for LEGO MINDSTORTMS EV3

2016-10-21 Thread Lennart Sorensen
On Fri, Oct 21, 2016 at 01:36:52PM -0500, David Lechner wrote:
> This patch series adds support for LEGO MINDSTORTMS EV3[1]. This is a TI 
> AM1808
> based board.
> 
> This patch series has been tested working (along with some hacks to fix the
> GPIOs) on an EV3.
> 
> There are still quite a few additional new drivers that need to be submitted
> to get everything working. This patch series just adds support for the parts
> that already have mainline kernel drivers.
> 
> I have a plan/driver in progress for many of the components[2], but I could 
> use
> some advice on a few particulars.
> 
> Bluetooth: This needs a driver to sequence a GPIO to take the Bluetooth chip
> out of shutdown *after* the Bluetooth clock has been configured and started.
> Is there a generic driver that can do this sort of thing? Or, if not, which
> subsystem should the new driver go in?
> 
> Input and output ports: These ports are capable of hotplugging various 
> devices,
> such as sensors and motors. I have written a driver for these that can detect
> most devices. I created a new subsystem for this called `lego-port`. However,
> I am wondering if the existing phy or extcon subsystems might be a good fit 
> for
> this sort of thing.

Both this cover and patch 5/5 have MINDSTORTMS instead of MINDSTORMS in
a few places (including the subject line).

-- 
Len Sorensen


Re: lspci not showing motherboard ethernet controller after PCIe card firmware update change from 32-bit to 64-bit BAR

2016-09-29 Thread Lennart Sorensen
On Wed, Sep 28, 2016 at 09:05:44AM -0500, Steve Kenton wrote:
> I decided to experiment with /sys/bus/pci and the BIOS settings to try
> and understand things better.
> 
> The BIOS has a setting to enable/disable the on-board LAN. When the
> Blackmagic card firmware is upgraded
> bus[03] and hence the on-board LAN disappears, but the apparently unused
> bus[02] is still visible and
> the BIOS setting is still enabled. When the BIOS LAN setting is
> disabled, busses[02][03] and the on-board
> LAN disappear. In both cases, the "disappeared" resources are not
> restored by echo to /sys/.../rescan
> 
> Why? What can the BIOS be doing to the chipset/busses[02][03] to make
> them invisible to /sys manipulation?

It can activate the reset line to the network chip making it never able
to respond to anything.  Or it could even power it off.

And PCIe is point to point, so they are not really busses anymore.
So at least turning off the network chip means no link on that point to
point PCIe link and hence no "bus" there.

> There is apparently something I do not understand about the H81 chipset
> PCIe configuration. I have
> been reading the Intel Series 8 / Series C220 Chipset documentation but
> so far have come up empty.
> 
> Anybody have cluebat for me?

I wish.

-- 
Len Sorensen


Re: Regression in 4.8 - CPU speed set very low

2016-09-29 Thread Lennart Sorensen
On Wed, Sep 28, 2016 at 09:26:42PM -0500, Larry Finger wrote:
> By the time it gets slow, the CPU's cool, and one cannot see the temp just
> before that event happened.

Hmm, I would not expect the CPU to drop from 80 to 40 degrees in a few
seconds if the fan is not spinning.  I wouldn't even expect it if the
fan was spinning.  I would think at least 30 to 60 seconds if not more.

The only way I would think the temperature could change quickly would
be if the heatsink isn't even touching the CPU anymore so there is very
little material to hold the heat in the CPU.

> The reason I suspect a bug is that it fails with 4.8-rcX, but not with 4.7.
> Of course, it could be something subtle that slightly changes the heat load,
> which causes the CPU temp to be a little higher so that the effect is
> triggered.
> 
> I am reasonably confident that it is not a hardware problem, but we may have
> to wait until 4.8 is released and gets wider usage. If no one else reports a
> problem, then I am certainly wrong.

Well, hard-to-reproduce bugs are always really annoying.

This old bug sounds a lot like what you are seeing:
https://bugzilla.redhat.com/show_bug.cgi?id=924570
and it links to this:
https://www.phoronix.com/scan.php?page=news_item=Linux-4.6-Thermal-Updates

Apparently turning off turbo boost seems to stop the problem for a lot
of people in that case.  Doesn't explain why it started happening
recently.  And of course that may have been a different problem in
the past.

-- 
Len Sorensen


Re: Regression in 4.8 - CPU speed set very low

2016-09-27 Thread Lennart Sorensen
On Mon, Sep 26, 2016 at 04:28:29PM -0500, Larry Finger wrote:
> Mostly I use a KDE applet named "System load" and look at the "average
> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
> When the bug triggers, the system gets very slow, and the cpu fan stops even
> though the cpu is still busy.
> 
> Commit f7816ad, which had run for 7 days without showing the bug, failed
> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
> well, that's the way it goes!

Is it possible there is no bug and instead you have a hardware problem?

What I am thinking:

CPU fan stops, then CPU gets busy, CPU overheats, thermal throttling
kicks in to protect the CPU, and it gets VERY slow.

So maybe you have a bad CPU fan that is getting stuck.  Or perhaps you
have a motherboard that varies the CPU fan speed depending on need, and
the fan doesn't like the lowest speed and sometimes gets stuck when asked
to go slow.

Of course if the CPU fan is the problem that could explain why it takes
varying amounts of time to see the problem.

I suggest checking what the cpu temperature sensors are showing next
time it gets slow.
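
A quick way to capture that state when it happens (a sketch reading the
standard sysfs thermal and cpufreq locations; not every platform exposes
every file, so anything unreadable is simply skipped):

```shell
#!/bin/sh
# Print each thermal zone's type and temperature (sysfs reports
# millidegrees, so divide by 1000), then the current frequency of every
# CPU.  Skips any entry that is missing or unreadable.
for tz in /sys/class/thermal/thermal_zone*; do
    [ -r "$tz/temp" ] || continue
    echo "$(cat "$tz/type"): $(( $(cat "$tz/temp") / 1000 )) C"
done
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; do
    [ -r "$f" ] || continue
    echo "$f: $(cat "$f") kHz"
done
```

Run it once while the machine is slow and compare with a healthy run; a
throttled clock alongside a cool temperature reading would point at a
stuck sensor or a detached heatsink rather than genuine overheating.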

-- 
Len Sorensen


Re: lspci not showing motherboard ethernet controller after PCIe card firmware update change from 32-bit to 64-bit BAR

2016-09-26 Thread Lennart Sorensen
On Sat, Sep 24, 2016 at 07:24:14PM -0500, Steve Kenton wrote:
> I have two different systems with Blackmagic Design Intensity Pro-4K
> cards. One uses an ECS H81H3-I motherboard and the other uses a Gigabyte
> GA-H110N motherboard. Previously both were working properly. When I
> upgraded the Blackmagic Design desktopvideo package from version 10.7 to
> version 10.8 it included an update of the firmware on the cards from
> version 0x85 to 0xbd. After the  upgrade the Gigabyte-motherboard system
> continues to work properly, but the ECS-motherboard system no longer
> shows the motherboard ethernet controller with lspci. Downgrading the
> firmware makes the ethernet controller reappear with lspci. Upgrading
> again makes it disappear again. Attached are the output of lspci -vv
> for each version of the firmware on the ECS-motherboard system.
> Upgrading to the most recent ECS-motherboard bios did not have any effect.
> 
> There are a number of differences in the two lspci outputs but
> based on conversations I've seen on LKM, which I peruse but am not
> subscribed to, I suspect that this is the important one:
> 
> lspci-Intensity-Pro-4K-10.7-firmware.txt:Region 0: Memory at
> f7d0 (32-bit, non-prefetchable) [size=1M]
> lspci-Intensity-Pro-4K-10.8-firmware.txt:Region 0: Memory at
> f7c0 (64-bit, non-prefetchable) [size=1M]
> 
> The Engineers at Blackmagic Design have confirmed that the change to a
> 64-bit BAR with firmware version 0xbd was intentional and correct. And
> to be clear, the Blackmagic cards continue to work properly. The issue
> is the motherboard ethernet controller disappearing after upgrading the
> Blackmagic cards to new firmware with a 64-bit BAR. Note, however, that
> 64-bit BAR is below 4GB with the high four bytes being all zeros. After
> the firmware update it's like there is "no room at the inn" for the
> motherboard ethernet controller so it never gets enabled. I think we can
> all agree that it's likely a problem with the PC motherboard bios. I'm
> hoping to find a workaround to allow me to configure a number of
> existing ECS-motherboard systems in the field for use with the new firmware.
> 
> I normally run Ubuntu 14.04 LTS and tried the most recent 4.4.0-38
> kernel but then I noticed that the Ubuntu 16.10 daily build ISO switched
> to kernel 4.8-RC-something on Sept 21st. So I booted that
> livedvd and tried it to see if 4.8-RC included some recent PCIe
> 32-bit/64-bit resource fix but sadly the results were the same. With
> firmware 0xbd, the ethernet controller was still missing from the lspci
> output. I poked around with various of the kernel pci= parameters such
> as bios and nobios to no avail, but I really am out of my depth here.
> 
> I'm available to run tests if anyone has any suggestions.

Well, pci=bios is 32-bit only, so that doesn't work, and pci=bfs was
probably supposed to be pci=bfsort.  It sure looks like the bios fails
to allocate the network device when the other device is 64-bit.  Sure
looks like a bios bug.  I see nothing indicating that the address space
ran out; it looks fine.

You could try:

pci=realloc
pci=assign-busses
pci=bfsort (the one I think you were trying to do).
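
If one of those helps, it can be made persistent via the bootloader; a
sketch assuming a GRUB-based distro (file locations and the regeneration
command vary by distribution):

```
# /etc/default/grub -- append the chosen parameter to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=realloc"
# then regenerate the config and reboot:
#   update-grub                                (Debian/Ubuntu)
#   grub2-mkconfig -o /boot/grub2/grub.cfg     (Fedora/RHEL-style)
```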

I certainly wouldn't be surprised if the only thing ever tested in that
board was a graphics card.

Of course your network card also has a 64-bit BAR (when seen), so the
system can work with such.  Makes one wonder if the bios has a bug, or
if the H81 has a limit such that only one of the ethernet and the add-in
card can use a 64-bit BAR at a time.  That would seem weird of course.

-- 
Len Sorensen


Re: Runtime failure running sh:qemu in -next due to 'sh: fix copy_from_user()'

2016-09-16 Thread Lennart Sorensen
On Fri, Sep 16, 2016 at 01:32:05PM -0700, Guenter Roeck wrote:
> On Fri, Sep 16, 2016 at 09:03:20PM +0100, Al Viro wrote:
> > On Fri, Sep 16, 2016 at 12:12:18PM -0700, Guenter Roeck wrote:
> > > Hi,
> > > 
> > > I see the following runtime failure when running a 'sh' image with qemu 
> > > in -next.
> > > 
> > > [ ... ]
> > > 
> > > sd 0:0:0:0: [sda] Attached SCSI disk
> > > EXT2-fs (sda): warning: mounting unchecked fs, running e2fsck is 
> > > recommended
> > > VFS: Mounted root (ext2 filesystem) on device 8:0.
> > > Freeing unused kernel memory: 124K (8c48a000 - 8c4a9000)
> > > This architecture does not have kernel memory protection.
> > > random: fast init done
> > > Starting logging: OK
> > > usb 1-1: new full-speed USB device number 2 using sm501-usb
> > > Initializing random number generator... done.
> > > Starting network...
> > > ip: OVERRUN: Invalid argument
> > > ip: OVERRUN: Bad address
> > > ip: OVERRUN: Bad address
> > > ip: OVERRUN: Bad address
> > > ip: OVERRUN: Bad address
> > > [repeats until the test aborts]
> > > 
> > > Bisect points to commit 6e050503a150 ("sh: fix copy_from_user()"). Bisect 
> > > log is
> > > attached.
> > 
> > BTW, could you post your .config and information about your userland?  E.g.
> > is that ip(8) a busybox one, etc.  If it's busybox, this smells like EINVAL
> > and EFAULT resp. coming from recvmsg() on netlink sockets, with nothing
> > extraordinary in iovecs, AFAICS...
> 
> Please see https://github.com/groeck/linux-build-test/tree/master/rootfs/sh.
> The rootfs is some three years old. I don't remember how I created it :-).

Well, it certainly says in it that ip is busybox, and the version is 1.21.1.

-- 
Len Sorensen


Re: e1000e on Thinkpad x60: gigabit not available due to "SmartSpeed"

2016-09-02 Thread Lennart Sorensen
On Thu, Sep 01, 2016 at 02:58:13PM -0700, Greg wrote:
> On Thu, 2016-09-01 at 22:14 +0200, Pavel Machek wrote:
> > Hi!
> > 
> > I have trouble getting 1000mbit out of my ethernet card.
> > 
> > I tried direct connection between two PCs with different cables, and
> > no luck.
> > 
> > Today I tried connection to 1000mbit switch, and no luck, either. (Two
> > cables, one was cat6, both short).
> > 
> > My computer sees 1000mbit being advertised by the other side, but does
> > not advertise 1000mbit, "Link Speed was downgraded by SmartSpeed".
> 
> Check your cables?
> 
> https://vmxp.wordpress.com/2015/01/06/1gbe-intel-nic-throttled-to-100mbit-by-smartspeed/

Of course if it isn't the cable, then it could even be a broken pin in
the port.  As far as I can tell, anything that causes one of the 3rd
or 4th pairs of wires to not work will degrade to 100Mbit on just the
first 2 pairs of wires and give that message.  Some badly implemented
switches can also cause it of course.
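
One way to check which side is holding back (a sketch assuming the
interface is eth0; needs ethtool and typically root):

```
ethtool eth0     # with a good cable, 1000baseT/Full should appear under both
                 # "Advertised link modes" and "Link partner advertised"
ethtool -r eth0  # restart autonegotiation after reseating or swapping the cable
```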

-- 
Len Sorensen


Re: [Regression?] Commit cb4f71c429 deliberately changes order of network interfaces

2016-08-25 Thread Lennart Sorensen
On Thu, Aug 25, 2016 at 09:38:39AM +0200, Ralph Sennhauser wrote:
> I'm also not interested in a never ending thread. It's moot that udev
> can't rename to kernel names sanely and we were sold ep34aj17asz98 as
> the replacement. Or that tearing apart the casing to replace the wifi
> modules with an ethernet one when there are already 5 Gbit ports is not
> a case I care about.
> 
> Relying on the order might in theory have flaws but works just fine in
> practice. Thomas Petazzoni, I, OpenWrt / Lede / etc all do so with
> success.

It has worked so far.  I expect it to break at some point.

> It's also not a side effect from major changes, so it didn't break by
> accident but as the subject says deliberately.

In this case that is certainly true.

> How exceptional is this exception we are talking about? I mean if it's
> the only place you might want to change it even in 2 years but you
> can't because of the very same rule which was broken here.

I did a quick look at the dts files, and there are quite a few cases
that are not ordered, and I can't find anything in Documentation that
says they should be sorted.  It just appears that in most cases people
do sort them, since that just makes sense for finding things and it is
as good an order as any other.

> I think if we were at 4.6-rcX reverting would be an easy call. I know
> it's more difficult to make this call now.
> 
> Ltsi is 4.1, longterm is 4.4, so penetration is probably marginal at
> best at this time. From my view the damage caused by a revert would be
> less than the damage that will be caused by the commit if left in but we
> can't wait much much longer until this changes.

Certainly no longterm kernels have this change yet, so it does seem the
impact would be rather minimal.  Usually the rule seems to be
"Don't break user space".  This did potentially break userspace, so then
the question is whether what user space was doing was considered proper
and something that should expect not to break.

> If you use the ordering by address as main argument for the revert there
> will be nothing to argue about.

If only there was a documented place that had that rule.  I can't find it.

-- 
Len Sorensen


Re: [Regression?] Commit cb4f71c429 deliberately changes order of network interfaces

2016-08-24 Thread Lennart Sorensen
On Wed, Aug 24, 2016 at 09:52:00PM +0200, Thomas Petazzoni wrote:
> I'll let the platform maintainers decide what's the least
> intrusive/problematic option. Both solutions have drawbacks, so it's
> really a "political" decision to make here.

I think the main valid argument for a revert is that it violates the
documented dtb ordering rule.  It also fails to actually do what it was
intended to do, since network device probing order really isn't defined
in Linux, and if you care you should fix it in userspace.  Fixing a
regression would be a side effect, and since the ordering isn't certain
anyhow, anyone that did see a regression was doing it "wrong" already,
although anyone that saw a benefit from the change was also doing it
"wrong".

> Not only async probing, but also PCIe devices, as you mentioned
> earlier :-)

Yeah they better come up with a safer way to determine which network
device is which from user space.

-- 
Len Sorensen


Re: [Regression?] Commit cb4f71c429 deliberately changes order of network interfaces

2016-08-24 Thread Lennart Sorensen
On Wed, Aug 24, 2016 at 01:10:23PM -0400, Lennart Sorensen wrote:
> Well certainly doing udevtrigger -n -v I see no ethernet devices (but
> lots of other things).  Looking in sysfs it is possible to derive which
> ethX belongs to which port based on the directory names, but that's
> probably not the most convenient manner to deal with it.

OK, udev DOES work:

# udevadm info -p /sys/class/net/eth0
P: /devices/platform/soc/soc:internal-regs/f1034000.ethernet/net/eth0
E: DEVPATH=/devices/platform/soc/soc:internal-regs/f1034000.ethernet/net/eth0
E: IFINDEX=2
E: INTERFACE=eth0
E: SUBSYSTEM=net

# udevadm info -p /sys/class/net/eth1
P: /devices/platform/soc/soc:internal-regs/f107.ethernet/net/eth1
E: DEVPATH=/devices/platform/soc/soc:internal-regs/f107.ethernet/net/eth1
E: IFINDEX=3
E: INTERFACE=eth1
E: SUBSYSTEM=net

So it isn't hopeless.
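
Since DEVPATH encodes the controller's fixed address, a udev rule could
pin a stable name to each port; a sketch (the rules file name and the
chosen interface name are arbitrary, and the other port would get a
matching rule keyed on its own DEVPATH):

```
# /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DEVPATH=="*/f1034000.ethernet/*", NAME="lan0"
```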

-- 
Len Sorensen


Re: [Regression?] Commit cb4f71c429 deliberately changes order of network interfaces

2016-08-24 Thread Lennart Sorensen
On Wed, Aug 24, 2016 at 08:14:44PM +0200, Thomas Petazzoni wrote:
> Depends on the network driver I believe. But with an e1000e NIC plugged
> in a PCIe slot, it indeed gets assigned as eth0, and the internal
> mvneta devices get eth1, eth2, etc.

Which of course means the change does not actually ensure the port
ordering matches the marvell documentation or u-boot.  It only handles
the relative order of the ports.  For now.

So since it doesn't actually work, maybe reverting it so it no longer
violates the dtb ordering rule makes sense.

Doesn't mean openwrt/lede/etc don't have to deal with the ordering in
the future if async probing takes off.

-- 
Len Sorensen


Re: [Regression?] Commit cb4f71c429 deliberately changes order of network interfaces

2016-08-24 Thread Lennart Sorensen
On Wed, Aug 24, 2016 at 07:10:04PM +0200, Ralph Sennhauser wrote:
> And in how many places this discrepancy was documented? You won't be
> able to update them all. Mailing lists, blogs, fora posts and what ever
> else. I'd say the damage is done and can't be fixed by simply changing
> the order now.
> 
> Whether strong rule or not, obviously it went in so bending the rule is
> at least accepted. It's not what I have an issue with but with the
> reordering at this point.
> 
> I'm clearly not convinced reverting would do more harm than good,
> otherwise I wouldn't have brought it up.

I would agree, although as I note below, I think it may be something
that user space will have to figure out how to deal with sooner or
later anyhow.

> It's not about being able to fix the code or documentation but about
> having to do so in the first place.
> 
> The nice thing about having the order in the dtb I thought was that it
> wont ever change.

I wonder, if someone was to build a box with this cpu, and add a PCIe
network device, which order would they get probed in?  Any chance the
PCIe could grab eth0 before the mvneta devices get probed?

I also would not count on this not changing in the future, potentially
by accident.  The mmc block devices used to get probed in DTB order, then
that changed at some point and reordering the dtb no longer controlled it;
rather it was whichever device seemed to finish fastest, and then later
that was changed to using the controller's opinion of the device number
(which at least made it predictable again, even if not controllable from
the dtb).

So at the moment the dtb order controls ethernet probe order.  Well if
other things have stopped doing that, why expect ethernet won't some
day stop doing that too?

"Fixing" the port numbering by reordering is hence a hack as far as I
am concerned and not certain to stay working long term.

> Going forward, as we disagree and it's basically a political decision,
> whom do we ask to rule here? Linus?

-- 
Len Sorensen


Re: [Regression?] Commit cb4f71c429 deliberately changes order of network interfaces

2016-08-24 Thread Lennart Sorensen
On Wed, Aug 24, 2016 at 06:43:34PM +0200, Thomas Petazzoni wrote:
> Well, just like the for the documentation aspect, you're seeing this
> from the OpenWRT/LEDE angle only. Other people are using plenty of
> other things.
> 
> We knew it would potentially cause some breakage, so it was a
> trade-off. I still believe the new arrangement is better, and you've so
> far been the only person reporting an issue with this (compared to
> numerous people being confused by the original ordering problem).

I didn't report it.  I just commented.  I will be affected when LEDE
does move their kernel past 4.4 I suppose, but I imagine something will
be done to deal with it.

I have run into issues with things being reordered on other systems
though, and it sure is annoying, especially when the new behaviour
removes the ability to control the order that was previously there
(that is not what is happening here of course).

> This is more problematic, and something to be investigated. I don't
> immediately see why the Marvell network interfaces would not be visible
> by udev, but I haven't tested.

Well certainly doing udevtrigger -n -v I see no ethernet devices (but
lots of other things).  Looking in sysfs it is possible to derive which
ethX belongs to which port based on the directory names, but that's
probably not the most convenient manner to deal with it.

> The solution of adding an alias in the DT, and using that to name
> network interfaces has already been proposed multiple times, but has
> been rejected by the networking maintainer, who suggests to use
> userspace tools (udev or something else) to rename network interfaces.
> See for example https://patchwork.kernel.org/patch/4122441/, which was
> proposed by my colleague Boris Brezillon.

Sure, although it seems many embedded systems would rather avoid udev
(even more so after systemd seems to have taken it over), and in this
case I get the impression so far that udev may not even be able to help
on this system.  (although maybe I just did it wrong).

> You can always try to propose again a solution, but I doubt it will be
> accepted.

Yeah me too, hence why I can't be bothered to try.  I have accepted
there are some things the maintainers won't fix so I have a few patches 
I maintain myself and can keep doing so.  They are tiny after all.

-- 
Len Sorensen


Re: [Regression?] Commit cb4f71c429 deliberately changes order of network interfaces

2016-08-24 Thread Lennart Sorensen
On Wed, Aug 24, 2016 at 04:50:11PM +0200, Thomas Petazzoni wrote:
> We had many many users getting confused by the fact that the order of
> the network interfaces was inverted compared to:
> 
>  * The board documentations
>  * The U-Boot numbering
>  * And to a lesser extent, the vendor kernel
> 
> So having things match the documentation numbering was in our opinion
> the least confusing thing moving forward. We should have done it
> earlier, but we thought that the rule "order by register address" was a
> very strong rule.
> 
> At this point, reverting the patch is I believe cause more harm than
> good. It's going to re-confuse again people.

Well wouldn't it only affect the tiny proportion that have moved to a >4.4
kernel?  Looking at LEDE, OpenWrt, etc., it seems very few have so far.

I don't think that's a good argument.

> Regarding the fact that the "wrong numbering if actually documented" is
> a fairly specious argument. The OpenWRT Wiki has never been an official
> documentation of any sort. I see it as a much more important aspect
> that the numbering of the Ethernet interfaces matches the user manual
> Marvell provides with its development and evaluation boards. The
> OpenWRT Wiki can certainly be fixed accordingly.

That could be a good argument.

Of course this would not be the first linux release to change network
device ordering.  It has happened many times before.

Unfortunately I don't see the eth ports on the marvell in udev, so I
don't know if udev could even help fix the names based on the address
of the port.

On any system I have been part of making in the past, I always added an
ifname attribute to the dtb and made the driver use that.  This problem
is just too silly to put up with.  There needs to be some sane way to
control the names of the interfaces.
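For illustration, the kind of out-of-tree property I mean looks roughly like this. The `ifname` property is not an upstream binding (DT-driven naming was rejected by the networking maintainer), so it only works with a driver patched to read it:

```
/* Out-of-tree sketch only: needs a patched driver that honours the
 * (hypothetical) ifname property; mainline does not support this. */
&eth0 {
	ifname = "wan0";
};
```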

-- 
Len Sorensen


Re: linux kernel 3.19.X does not load when compiled with binutils 2.6, works fine when compiled with binutils 2.22

2016-08-24 Thread Lennart Sorensen
On Wed, Aug 24, 2016 at 04:42:20PM +0200, luigi.gen...@it.telecomitalia.it 
wrote:
> Hi all,
> 
> recentrly I was in need to compile and run linux kernel 3.19.8 on some
> servers, with different bios and cpus.
> 
> so i noticed that If i compile this kernel with binutils 2.6 (and 2.7) it
> does not load the image, both lilo, grub/grub2 show the message:

Did you by any chance mean 2.26 and 2.27 or are you really trying to
use binutils from 1996?

> loading kernel
> 
> loading ramdisk and nothing happens.
> 
> i tried the debug option, at boot, but it simply does not load the kernel.
> 
> 
> if i compile this kernel with:
> 
> binutils 2.22
> 
> everything works fine,
> 
> I tried this on a xeon with startdard bios, a xeon with efi, an amd
> Athlon-mp with bios and an AMD A8 with EFI.
> 
> I tried gcc 4.3 and gcc 4.8, with both the kernel loads if compiled with
> binutils 2.22, and doesn't load with binutils 2.6.
> 
> 
> What really bugs me is that on all the 4 servers i user i could compile and
> run without problems linux 4.7 compiled with binutils 2.6 and 2.7 with gcc
> 4.8 5.9 and 6.1

You must mean 2.26 and 2.27.

> i cannot afford to downgrate binutils (and glibc to 2.15) every time i need
> to compiled and e 3.19.X kernel.

Well you are allowed to install an older binutils in another location
and set your path.  You can even set your LD_LIBRARY_PATH if you really
also need an older glibc to do it (no idea why that would be needed).
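For instance, something like the following keeps the system binutils intact while letting a single kernel build pick up the old tools first. The prefix path is just an example, not a convention:

```shell
# Build and install the older binutils into its own prefix beforehand, e.g.
#   ../binutils-2.22/configure --prefix="$HOME/opt/binutils-2.22" && make && make install
PREFIX="$HOME/opt/binutils-2.22"
export PATH="$PREFIX/bin:$PATH"
# A kernel "make" run from this shell now finds the old ld/as first.
echo "${PATH%%:*}"
```

Since only PATH in the current shell changes, nothing else on the system is downgraded.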

You may have hit this binutils bug: 
https://sourceware.org/bugzilla/show_bug.cgi?id=19807
Also maybe related: https://bugzilla.kernel.org/show_bug.cgi?id=114671

-- 
Len Sorensen


Re: CVE-2014-9900 fix is not upstream

2016-08-24 Thread Lennart Sorensen
On Tue, Aug 23, 2016 at 10:25:45PM +0100, Al Viro wrote:
> Sadly, sizeof is what we use when copying that sucker to userland.  So these
> padding bits in the end would've leaked, true enough, and the case is somewhat
> weaker.  And any normal architecture will have those, but then any such
> architecture will have no more trouble zeroing a 32bit value than 16bit one.

Hmm, good point.  Too bad I don't see a compiler option of "zero all
padding in structs".  Certainly generating the code should not really
be that different.

I see someone did request it 2 years ago:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63479

-- 
Len Sorensen


Re: CVE-2014-9900 fix is not upstream

2016-08-23 Thread Lennart Sorensen
On Tue, Aug 23, 2016 at 01:34:05PM -0700, Joe Perches wrote:
> On Tue, 2016-08-23 at 21:09 +0100, Al Viro wrote:
> > On Tue, Aug 23, 2016 at 11:24:06AM -0700, David Miller wrote:
> > ... and then we can file a bug report against the sodding compiler.  Note
> > that
> > struct ethtool_wolinfo {
> > __u32   cmd;
> > __u32   supported;
> > __u32   wolopts;
> > __u8sopass[SOPASS_MAX]; // 6, actually
> > };
> > is not going to *have* padding.  Not on anything even remotely sane.
> > If array of 6 char as member of a struct requires 64bit alignment on some
> > architecture, I would really like some of what the designers of that ABI
> > must have been smoking.
> 
> try this on x86-64
> 
> $ pahole -C ethtool_wolinfo vmlinux
> struct ethtool_wolinfo {
>   __u32  cmd;  /* 0 4 */
>   __u32  supported;/* 4 4 */
>   __u32  wolopts;  /* 8 4 */
>   __u8   sopass[6];/*12 6 */
> 
>   /* size: 20, cachelines: 1, members: 4 */
>   /* padding: 2 */
>   /* last cacheline: 20 bytes */
> };

That would be padding after the structure elements.

I think what was meant is that it won't add padding in the middle of the
structure due to alignment, ie it isn't doing:

struct ethtool_wolinfo {
__u32  cmd;  /* 0 4 */
__u32  supported;/* 4 4 */
__u32  wolopts;  /* 8 4 */
<4 bytes padding here>
__u8   sopass[6];/*16 6 */
};

which would have 4 bytes of padding in the middle between wolopts
and sopass.

I would not think it is the compilers job to worry about what is after
your structure elements, since you shouldn't be going there.
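Absent such a compiler option, the usual defensive fix is to zero the whole object before filling it, so the trailing padding cannot carry stale stack bytes. A minimal sketch (the helper name and its use are mine, not the kernel's actual code):

```c
#include <string.h>
#include <stdint.h>

#define SOPASS_MAX 6

struct ethtool_wolinfo {
	uint32_t cmd;
	uint32_t supported;
	uint32_t wolopts;
	uint8_t  sopass[SOPASS_MAX];
};

/* Zeroing the whole object before filling it in means the trailing
 * padding (2 bytes on x86-64, per the pahole output above) can never
 * leak stale stack data when sizeof() bytes are later copied out. */
static void fill_wolinfo(struct ethtool_wolinfo *w, uint32_t cmd)
{
	memset(w, 0, sizeof(*w));
	w->cmd = cmd;
}
```

This costs one memset of 20 bytes, which is why "zero the lot" is usually preferred over reasoning about which bytes are padding on which ABI.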

-- 
Len Sorensen


Re: [PATCH] net: ethernet: ti: cpdma: switch to use genalloc

2016-06-24 Thread Lennart Sorensen
On Fri, Jun 24, 2016 at 07:58:32PM +0300, Grygorii Strashko wrote:
> Oh. nice :( So, seems, I'd need to send v3. Right?
> By the way, this code hasn't been introduced by this patch - I've
> just moved whole function from one place to another.

Well since it is moving I would think that was a handy time to fix the
coding style violation too, since it got noticed.

That leaves just one place in that file violating that part of the coding
style (the other is in cpdma_chan_dump).

Somehow it wasn't spotted when the code was put in back in 2010, and since
they were wrapped lines, they don't stand out quite as much visually.

-- 
Len Sorensen


Re: [PATCH] net: ethernet: ti: cpdma: switch to use genalloc

2016-06-24 Thread Lennart Sorensen
On Fri, Jun 24, 2016 at 11:35:15AM +0530, Mugunthan V N wrote:
> >> +static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
> >> +{
> >> +if (!pool)
> >> +return;
> >> +
> >> +WARN_ON(pool->used_desc);
> >> +if (pool->cpumap) {
> >> +dma_free_coherent(pool->dev, pool->mem_size, pool->cpumap,
> >> +  pool->phys);
> >> +} else {
> >> +iounmap(pool->iomap);
> >> +}
> >> +}
> >> +
> > single if, brackets?
> 
> if() has multiple line statement, so brackets are must.

It is line wrapped, it is still one statement.  And you can't argue the
else being multiple lines, although the style does require using brackets
for the else if the if required them.

Style says "Do not unnecessarily use braces where a single statement will do."
It says statement, not line.  A multiline wrapped statement is still
one statement.

I may personally hate the lack of brackets, but style-wise it seems very
clear that the Linux kernel only uses brackets when required, which is
only when there is more than one statement.  I prefer what you did,
but not as much as I prefer consistency.
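Under that reading, the function would be written without braces, since each branch is a single (wrapped) statement. A self-contained sketch, with hypothetical stand-ins for the kernel helpers so it compiles on its own:

```c
#include <stddef.h>

struct cpdma_desc_pool {
	void  *cpumap;
	void  *iomap;
	size_t mem_size;
};

/* Hypothetical stubs standing in for dma_free_coherent()/iounmap(),
 * counting calls so the behaviour is observable. */
static int dma_free_calls;
static int iounmap_calls;

static void dma_free_coherent_stub(size_t size, void *cpu, void *phys)
{
	(void)size; (void)cpu; (void)phys;
	dma_free_calls++;
}

static void iounmap_stub(void *io)
{
	(void)io;
	iounmap_calls++;
}

/* The brace-free form the style rule asks for: each branch is one
 * statement, even though the dma_free call wraps onto two lines. */
static void pool_destroy(struct cpdma_desc_pool *pool)
{
	if (!pool)
		return;

	if (pool->cpumap)
		dma_free_coherent_stub(pool->mem_size, pool->cpumap,
				       NULL);
	else
		iounmap_stub(pool->iomap);
}
```

The wrapped call still parses as a single statement, which is exactly why the style document's "single statement" wording, not "single line", decides the question.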

-- 
Len Sorensen


Re: [PATCH] net: ethernet: ti: cpdma: switch to use genalloc

2016-06-24 Thread Lennart Sorensen
On Fri, Jun 24, 2016 at 11:35:15AM +0530, Mugunthan V N wrote:
> >> +static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
> >> +{
> >> +if (!pool)
> >> +return;
> >> +
> >> +WARN_ON(pool->used_desc);
> >> +if (pool->cpumap) {
> >> +dma_free_coherent(pool->dev, pool->mem_size, pool->cpumap,
> >> +  pool->phys);
> >> +} else {
> >> +iounmap(pool->iomap);
> >> +}
> >> +}
> >> +
> > single if, brackets?
> 
> if() has multiple line statement, so brackets are must.

It is line wrapped, it is still one statement.  And you can't argue the
else being multiple lines, although the style does require using brackets
for the else if the if required them.

Style says "Do not unnecessarily use braces where a single statement will do."
It says statement, not line.  A multiline wrapped statement is still
one statement.

I may personally hate the lack of brackets, but style wise it seems very
clear that the linux kernel only uses brakcets when required, which is
only when there is more than one statement.  I prefer what you did,
but not as much as I prefer consistency.

-- 
Len Sorensen

