Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-31 Thread Borislav Petkov
On Mon, Oct 31, 2016 at 11:54:44PM +0200, sonofa...@openmailbox.org wrote:
> No, it will simply crash without running something special! If you get no
> issues then that is bad news. I get frequent and repeatable crashes. I
> forgot to mention that all those crashes occur at program launch. If the
> program launches, it does not crash. Unfortunately with Ubuntu it is not
> possible to keep oopses from the error report program.

Can't you enable crash dumps?

$ ulimit -c unlimited

That should reenable core dumping which can then be examined with gdb.
You could put the program executable and the core somewhere on the web
so that I can take a look...
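
On Ubuntu the core often does not land in the working directory even with the limit raised, because the kernel's core_pattern setting can pipe it to the crash reporter. A minimal sketch for checking that, assuming a POSIX shell; the helper name core_pattern_is_piped is made up for illustration:

```shell
# Hypothetical helper: succeeds when a core_pattern value starts with '|',
# i.e. cores are piped to a program instead of being written to a file.
core_pattern_is_piped() {
    case "$1" in
        '|'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Inspect the live setting only if the file is readable on this system.
if [ -r /proc/sys/kernel/core_pattern ]; then
    pattern=$(cat /proc/sys/kernel/core_pattern)
    if core_pattern_is_piped "$pattern"; then
        echo "cores are piped to: ${pattern#|}"
    else
        echo "cores are written as: $pattern"
    fi
fi
```

Once a core file exists, running gdb with the program and the core file as arguments loads it, and "bt" at the gdb prompt prints the backtrace.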

In any case, I'd like to see what those crashes look like. Can you send
dmesg? Does it even say anything related to those crashes?
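
To narrow a long dmesg down to the interesting lines, something like the following could be used. It is only a sketch: the function name crash_lines and the pattern list are assumptions, and the list is certainly not exhaustive.

```shell
# Hypothetical filter: keep kernel log lines that typically accompany
# userspace crashes or kernel faults. Reads the log from stdin.
crash_lines() {
    grep -i -E 'segfault|general protection|oops|machine check'
}

# Typical use (may need root, depending on kernel.dmesg_restrict):
#   dmesg | crash_lines
```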

> On the laptop we have a Debian installation, I will switch to it and
> get crash information there so that we figure out why it behaves that
> way. Besides that, I have some more tests to do but I am running out
> of ideas so I might not be able to help you more on it as my laptop
> appears to be really broken!

Maybe a hw issue? RAM broken, cooling failing...

> Poor performance might be the result here and not the cause of my
> issues. The 688 fix just makes the system respond better. As far as I
> am concerned, I do not wish any module options for turning on the fix.
> I would prefer to use DMI matching for this specific machine and thus
> have the fix applied automatically.

The problem with DMI strings is that then we have to always go and
update them. And that's always a PITA.
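
To illustrate the maintenance burden: a DMI-based quirk boils down to comparing the machine's product name against a hand-maintained list, and every newly reported machine means another entry. The sketch below shows the idea in shell; the kernel would do this in C, and the list entries and the function name dmi_needs_e688 are placeholders, not real machine names or kernel symbols.

```shell
# Hypothetical quirk list, one product name per line (placeholder entries).
E688_DMI_QUIRKS='Example Laptop A
Example Laptop B'

# Succeeds when the given product name is on the quirk list (exact match).
dmi_needs_e688() {
    printf '%s\n' "$E688_DMI_QUIRKS" | grep -q -x -F -- "$1"
}

# The running machine's name is exposed through the DMI sysfs interface.
if [ -r /sys/class/dmi/id/product_name ]; then
    name=$(cat /sys/class/dmi/id/product_name)
    if dmi_needs_e688 "$name"; then
        echo "$name: on the quirk list"
    else
        echo "$name: not listed, no workaround"
    fi
fi
```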

I'm just trying to avoid an unnecessary performance penalty to users
with the erratum workaround where the erratum itself didn't even occur
in the first place. Like in my case, for example. I've never had any
issues with that machine for the time I've been using it.

> All subsystems of the laptop appear to be good (RAM has been tested and
> the HDD has passed our test). There is a sound issue on the HDA but
> such issues are common on most laptops and will be dealt with soon.

If by "tested" you mean you ran memtest on it: memtest is notorious for
not always catching faulty DIMMs.

> > Is it a desktop system or a laptop?
> I got no reply on this question so I suppose you have a desktop.

Oh sorry, I must've missed that question. No, the Ontario I have is a
laptop, something like this one:

https://support.lenovo.com/de/en/documents/pd015763

x121e with an AMD CPU, I *think* it is E-350.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-31 Thread sonofagun



> Ok, Ubuntu 16.04.1 is running on the box now, no issues so far. Any
> special workload I should run?

No, it will simply crash without running something special! If you get no
issues then that is bad news. I get frequent and repeatable crashes. I
forgot to mention that all those crashes occur at program launch. If the
program launches, it does not crash. Unfortunately with Ubuntu it is not
possible to keep oopses from the error report program.

On the laptop we also have a Debian installation; I will switch to it and
get crash information there so that we can figure out why it behaves that
way. Besides that, I have some more tests to do but I am running out of
ideas, so I might not be able to help you more on it as my laptop appears
to be really broken!

Poor performance might be the result here and not the cause of my
issues. The 688 fix just makes the system respond better. As far as I am
concerned, I do not wish for any module options for turning on the fix. I
would prefer to use DMI matching for this specific machine and thus have
the fix applied automatically.


All subsystems of the laptop appear to be good (RAM has been tested and
the HDD has passed our test). There is a sound issue on the HDA but such
issues are common on most laptops and will be dealt with soon.



> Is it a desktop system or a laptop?

I got no reply to this question so I suppose you have a desktop. Since
my APU is installed in a laptop, I expect different behaviour.
There were many Intel-based laptops that had fewer lanes on the DMI
interconnect bridging northbridge and southbridge. Maybe my laptop has
the A-Link in reduced mode. That could explain my performance issues. It
should be easy to verify that, as all documents are available for its
northbridge and southbridge. I will check the settings of both chips
thoroughly.


My brother has spotted a C70 board. Normally I would not buy it, but it
has a better but slower (without CPB) F14 CPU and I am curious whether it
will behave better, like your board does. I might order it if it is still
available.




Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-28 Thread Borislav Petkov
On Mon, Oct 24, 2016 at 07:14:50PM +0200, Borislav Petkov wrote:
> > Yes, using Ubuntu 16.04 will just crash everything! For example I had
> > crashes with the software updater program. Moreover firefox would become
> > unresponsive even with one tab.
> 
> Ok, lemme install 16.04 on that box and see if I can reproduce.

Ok, Ubuntu 16.04.1 is running on the box now, no issues so far. Any
special workload I should run?

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-25 Thread sonofagun



> Why not? It all depends on the load type, working set and the access
> patterns. There's no strong correlation between the load of a machine
> and the amount of branch misses...

Yes, I did not say that there is a linear correlation, but that does not
mean that those two numbers move opposite to each other. On all our
systems, running more tasks that consume more CPU and memory results in
increased branch misses.
It is normal, as one thread might block another and a third thread might
wait for the first thread to finish in order to resume.
It is not normal to have increased misses when the OS has only just
loaded and is idling without doing anything. Unless you are talking about
AMD F14.


I wonder if we should just flush the L2 and disable it completely on AMD 
F14. Since this is an APU I have no idea if the onboard graphics can 
operate properly without L2.



> setpci -s 0x18.4 0x164.l
>
> and looking at bit 2. If it is set, the erratum is fixed.

I will, but there is no point: as I already told you in the first mail,
D18F4x164 is 0003h. It will not change.



> No, I don't mean that - I'm talking about *not* applying it by default
> and when people start seeing issues like that, they can boot their
> machines with something like "enable_e688_workaround" or so and it will
> get applied then. I.e., an "opt-in" deal.

Yes, I got it. I have no problem with that; you are free to do what you
think is the best solution. Just ensure that it will not be possible to
apply the fix to F16.


Even if you decide to not include the fix at all in the kernel, I still 
have the patch for my system and it works.


Did you get any crashes on your B0 box with Ubuntu? Is it a desktop 
system or a laptop?


The irony is that this laptop was bought without USB3 on purpose to 
achieve maximum stability... Luckily we didn't stick to the original 
plan to buy two laptops :)




Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-25 Thread Borislav Petkov
On Mon, Oct 24, 2016 at 11:39:47PM +0300, sonofa...@openmailbox.org wrote:
> It does to me! That cpu family is "broken" both on B0 and C0. I think
> that a CPU at 30% load should not have >31% branch misses. For example
> with 5% CPU usage you can't expect to get 10% branch-misses...

Why not? It all depends on the load type, working set and the access
patterns. There's no strong correlation between the load of a machine
and the amount of branch misses...

> Yes but on C0 I got better results. Maybe the BIOS vendor got similar
> results and did not apply the fix.

Well, there's a C0 stepping which doesn't need the fix because it was
fixed in the silicon.

You can check that by doing:

setpci -s 0x18.4 0x164.l

and looking at bit 2. If it is set, the erratum is fixed.
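
The bit test itself can be sketched in shell as below. The function name e688_fixed_in_silicon is made up, and the sample value is the 0003h that was reported for D18F4x164 elsewhere in this thread; on real hardware the value would come from the setpci command above, prefixed with 0x so the shell parses it as hex.

```shell
# Bit 2 (mask 0x4) of D18F4x164 set means the erratum is fixed in silicon.
e688_fixed_in_silicon() {
    [ $(( $1 & 0x4 )) -ne 0 ]
}

# Sample register value (0003h as reported in this thread).
val=0x00000003
if e688_fixed_in_silicon "$val"; then
    echo "$val: fixed in silicon"
else
    echo "$val: not fixed, workaround still relevant"
fi
```

With 0003h only bits 0 and 1 are set, so the check reports that the silicon fix is absent.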

> They use the same BIOS for all machines B0, C0 and that could be the
> reason for not applying the 688 workaround. I think we are going to
> the wrong place here but I will not try to influence you at all. I
> only apply the fix once per boot and I think that we are not supposed
> to apply, remove and then reapply workarounds on the fly.

No, I don't mean that - I'm talking about *not* applying it by default
and when people start seeing issues like that, they can boot their
machines with something like "enable_e688_workaround" or so and it will
get applied then. I.e., an "opt-in" deal.
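
From userspace, whether a machine was booted with such an option can be checked roughly like this. It is a sketch only: "enable_e688_workaround" is just the tentative name floated above, not an existing kernel parameter, and cmdline_has_flag is a hypothetical helper.

```shell
# Succeeds when the flag ($1) appears as a whole word in the command line ($2).
cmdline_has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check the running kernel's command line, if available.
if [ -r /proc/cmdline ]; then
    if cmdline_has_flag enable_e688_workaround "$(cat /proc/cmdline)"; then
        echo "opt-in flag present"
    else
        echo "opt-in flag absent"
    fi
fi
```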

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-24 Thread sonofagun



> so that doesn't tell me a whole lot.

It does to me! That CPU family is "broken" on both B0 and C0. I think
that a CPU at 30% load should not have >31% branch misses. For example,
with 5% CPU usage you can't expect to get 10% branch-misses...



> Well, Ontario is a small core and with the erratum workaround in place,
> it does get a bit worse too, apparently.

Yes, but on C0 I got better results. Maybe the BIOS vendor got similar
results and did not apply the fix. They use the same BIOS for all
machines, B0 and C0, and that could be the reason for not applying the
688 workaround.

I think we are going in the wrong direction here but I will not try to
influence you at all. I only apply the fix once per boot and I think
that we are not supposed to apply, remove and then reapply workarounds
on the fly. Be careful, you might hang your machine, brick your board
or destroy your APU!


The truth is that my system behaves better with the patch.

The problem is that there is no way to get what I need, which is the
E-300 datasheet... They give everything for the northbridge and the
southbridge but we have poor documentation for the APU itself... I will
contact AMD to see if I can get the APU datasheet so that we have a clue
what those bits actually do.



> Hohumm, yeah, the workaround impacts the number of branch misses. It
> probably disables some branch predictor optimization or so, which is
> "problematic" in certain scenarios.

That is obvious. You can't say what it does, it might disable an
internal buffer or force a CPU subsystem to run at a lower frequency,
who knows?


> I guess we still want it because first we should not explode and then go
> fast :)

Exactly. I agree with that as I want to eliminate the crashes. Keep in
mind that speed is something that all those APUs do not have and will
never have, stability is what we are trying to improve.



> I'm thinking currently that if it is not easily triggerable, I could
> make the erratum workaround off by default and have a command line
> option which people can enable in case they experience any of the
> issues...

No problem, it is up to you. As I said above, I will not try to change
your mind.




Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-24 Thread Borislav Petkov
On Mon, Oct 24, 2016 at 04:13:25PM +0300, sonofa...@openmailbox.org wrote:
> No command needed, just type: sudo perf stat -a and immediately exit
> with ctrl+C. That will give you a glimpse. See "% of all branches"

$ ./perf stat -a --repeat 10 sleep 1s 

 Performance counter stats for 'system wide' (10 runs):

       2013.974964      cpu-clock (msec)          #    1.999 CPUs utilized          ( +-  0.02% )
                88      context-switches          #    0.044 K/sec                  ( +-  2.05% )
                 2      cpu-migrations            #    0.001 K/sec                  ( +-  8.55% )
                75      page-faults               #    0.037 K/sec                  ( +-  0.42% )
        81,177,296      cycles                    #    0.040 GHz                    ( +-  0.76% )  (66.62%)
                 0      stalled-cycles-frontend                                     (66.63%)
                 0      stalled-cycles-backend    #    0.00% backend cycles idle    (66.64%)
         9,602,846      instructions              #    0.12  insn per cycle         ( +-  2.08% )  (66.65%)
         1,698,414      branches                  #    0.843 M/sec                  ( +-  4.26% )  (66.75%)
           327,945      branch-misses             #   19.31% of all branches        ( +-  1.76% )  (66.72%)

       1.007545371 seconds time elapsed                                             ( +-  0.02% )

Now disable erratum workaround:

$ wrmsr --all 0xc0011021 0x10008000
$ rdmsr --all 0xc0011021
10008000
10008000
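
For reference, the value written above is what remains after clearing the workaround bits. A small sketch of that arithmetic, assuming the workaround corresponds to bits 3 and 14 of this MSR (as mentioned elsewhere in this thread); the starting value 0x1000c008 and the function name e688_clear are assumptions for illustration, not something read from the machine:

```shell
# Bits 3 and 14 of MSR 0xc0011021, combined into one mask (0x4008).
E688_MASK=$(( (1 << 3) | (1 << 14) ))

# Print the given MSR value with the workaround bits cleared.
e688_clear() {
    printf '0x%08x\n' $(( $1 & ~E688_MASK ))
}

# Assumed value with both bits set; prints 0x10008000.
e688_clear 0x1000c008
```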

$ ./perf stat -a --repeat 10 sleep 1s

 Performance counter stats for 'system wide' (10 runs):

       2012.521775      cpu-clock (msec)          #    1.999 CPUs utilized          ( +-  0.02% )
                91      context-switches          #    0.045 K/sec                  ( +-  2.62% )
                 3      cpu-migrations            #    0.001 K/sec                  ( +- 13.07% )
                75      page-faults               #    0.037 K/sec                  ( +-  0.66% )
        82,215,531      cycles                    #    0.041 GHz                    ( +-  1.08% )  (66.60%)
                 0      stalled-cycles-frontend                                     (66.60%)
                 0      stalled-cycles-backend    #    0.00% backend cycles idle    (66.62%)
         9,444,884      instructions              #    0.11  insn per cycle         ( +-  2.11% )  (66.70%)
         1,484,480      branches                  #    0.738 M/sec                  ( +-  5.16% )  (66.78%)
           303,382      branch-misses             #   20.44% of all branches        ( +-  1.44% )  (66.70%)

       1.006812225 seconds time elapsed                                             ( +-  0.02% )

so that doesn't tell me a whole lot.
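
For anyone wanting to redo the comparison from the raw counters: the "% of all branches" figure is simply branch-misses divided by branches. A minimal sketch using awk; the function name branch_miss_pct is made up, and the inputs are the counts from the two idle runs above.

```shell
# Print branch misses as a percentage of all branches, two decimals.
# $1 = branch-misses count, $2 = branches count
branch_miss_pct() {
    awk -v m="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * m / b }'
}

echo "with workaround:    $(branch_miss_pct 327945 1698414)%"
echo "without workaround: $(branch_miss_pct 303382 1484480)%"
```

This reproduces the 19.31% and 20.44% that perf printed for the two runs.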

> next open firefox, rerun the same command after firefox launches and
> immediately exit with ctrl+C On that piece of crap I get branch-misses
> above 10% from boot without executing anything and perf does not like
> it so it displays it with red colour. On my quad core kabini APU,
> in order to get 9% branch-misses I have to open 50 tabs on firefox.
> Something is terribly wrong here.

Well, Ontario is a small core and with the erratum workaround in place,
it does get a bit worse too, apparently.

Let's see how many branch misses we get when starting firefox:

* with workaround:

$ echo 3 > /proc/sys/vm/drop_caches && ./perf stat ./firefox.sh

 Performance counter stats for './firefox.sh':

        257.037242      task-clock (msec)         #    0.103 CPUs utilized
               332      context-switches          #    0.001 M/sec
                 6      cpu-migrations            #    0.023 K/sec
             1,022      page-faults               #    0.004 M/sec
       213,464,893      cycles                    #    0.830 GHz                    (63.29%)
                 0      stalled-cycles-frontend                                     (62.76%)
                 0      stalled-cycles-backend    #    0.00% backend cycles idle    (66.88%)
       106,763,405      instructions              #    0.50  insn per cycle         (73.54%)
        23,794,511      branches                  #   92.572 M/sec                  (73.32%)
         2,629,193      branch-misses             #   11.05% of all branches        (66.16%)

       2.501140816 seconds time elapsed


* without it:

$ echo 3 > /proc/sys/vm/drop_caches && ./perf stat ./firefox.sh

 Performance counter stats for './firefox.sh':

        196.561165      task-clock (msec)         #    0.082 CPUs utilized
               276      context-switches          #    0.001 M/sec
                 9      cpu-migrations            #    0.046 K/sec
               932      page-faults               #    0.005 M/sec
       162,697,731      cycles                    #    0.828 GHz                    (70.27%)
                 0      stalled-cycles-frontend

Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-24 Thread sonofagun


> Sure, give me the exact command you're executing so that I can do it
> here.

No command needed, just type:

$ sudo perf stat -a

and immediately exit with ctrl+C. That will give you a glimpse. See "%
of all branches".
Next, open firefox, rerun the same command after firefox launches and
immediately exit with ctrl+C.
On that piece of crap I get branch-misses above 10% from boot without
executing anything and perf does not like it so it displays it in red.
On my quad core kabini APU, in order to get 9% branch-misses I have to
open 50 tabs in firefox. Something is terribly wrong here.


> Out of pure interest: do you remember how exactly you did reproduce this
> issue?

Yes, using Ubuntu 16.04 will just crash everything! For example, I had
crashes with the software updater program. Moreover, firefox would become
unresponsive even with one tab.
Luckily, initial tests of 16.10 seem promising as it is lighter and
consumes 3~5% less RAM! Debian, which was lighter, was more responsive and
had no crashes except an oops from adobe flash.
I believe that the bug is triggered by the unusually high branch-misses
specific to this machine. After the fix, I got better OS and program
responsiveness.




Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-24 Thread Borislav Petkov
On Mon, Oct 24, 2016 at 02:38:06PM +0300, sonofa...@openmailbox.org wrote:
> The patch is not equivalent to the original. As a result it behaves
> differently. To be specific, using dmesg I get the expected value from the
> affected MSR with the original patch. With the latest patch, patching of the
> MSR occurs after dmesg prints the MSR information. That is why I thought it
> did nothing.

Gah, that "show_msr" is crap - it gets issued too early and we can -
and we do - set MSRs later too. Oh and it prints only the BSP. I should
probably rip it out - there's msr-tools for that which is much better.

> rdmsr --all 0xc0011021 returns the expected results on all CPUs with both
> patches. I have the impression that the system boots slower because the fix
> is applied later compared to the original patch.

Could be - setting those bits 3 and 14 in that MSR is probably disabling
some hw features which may impact performance.
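
To make the check concrete: a value read from MSR 0xc0011021 has the workaround applied when both bit 3 and bit 14 are set. The sketch below tests that on a raw value such as the 1020c008 that rdmsr printed on the reporter's laptop. The macro and function names are illustrative only and do not come from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits 3 and 14 of MSR 0xc0011021, i.e. 0x4008 (illustrative name). */
#define E688_MASK ((UINT64_C(1) << 3) | (UINT64_C(1) << 14))

/* True when both workaround bits are set in a value read from the MSR. */
bool e688_bits_set(uint64_t ic_cfg)
{
	return (ic_cfg & E688_MASK) == E688_MASK;
}
```

The low 16 bits of 1020c008 are c008, which has bits 15, 14 and 3 set, so the check passes for that readout.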

> Could you please use perf and tell me what values do you get at perf
> branch-misses right after boot on your ON-B0 box? Launching firefox with
> only one tab gives you similar numbers?

Sure, give me the exact command you're executing so that I can do it here.

> If you need anything more, feel free to ask.

Out of pure interest: do you remember how exactly you did reproduce this
issue?

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-24 Thread sonofagun



> Hmm, so did you apply the patch correctly?

Yes

The patch is not equivalent to the original. As a result it behaves 
differently. To be specific, using dmesg I get the expected value from 
the affected MSR with the original patch. With the latest patch, 
patching of the MSR occurs after dmesg prints the MSR information. That 
is why I thought it did nothing.


rdmsr --all 0xc0011021 returns the expected results on all CPUs with 
both patches. I have the impression that the system boots slower because 
the fix is applied later compared to the original patch.


Since the code works there is no need to attach the compile config, so I 
attach only the dmesg and the output of rdmsr --all 0xc0011021.


Could you please use perf and tell me what values do you get at perf 
branch-misses right after boot on your ON-B0 box? Launching firefox with 
only one tab gives you similar numbers?


If you need anything more, feel free to ask.


$ rdmsr --all 0xc0011021
1020c008
1020c008

$ dmesg
[0.00] Linux version 4.8.4-vnl-14h-688-amd64 (root@FXLSI) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2) ) #1 SMP Sun Oct 23 16:19:41 EEST 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.8.4-vnl-14h-688-amd64 root=UUID=124d207f-6ec4-4270-a1a3-2878e0756f25 ro quiet show_msr=1 clocksource=hpet hpet=verbose acpi_sleep=s3_beep mce=bootlog pcie_aspm.policy=powersave debug=y splash vt.handoff=7
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] x86/fpu: Legacy x87 FPU detected.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009f7ff] usable
[0.00] BIOS-e820: [mem 0x0009f800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xdfb3efff] usable
[0.00] BIOS-e820: [mem 0xdfb3f000-0xdfbbefff] reserved
[0.00] BIOS-e820: [mem 0xdfbbf000-0xdfebefff] ACPI NVS
[0.00] BIOS-e820: [mem 0xdfebf000-0xdfef4fff] ACPI data
[0.00] BIOS-e820: [mem 0xdfef5000-0xdfef] usable
[0.00] BIOS-e820: [mem 0xdff0-0xdfff] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfec1-0xfec10fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xffe0-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x000206ff] usable
[0.00] BIOS-e820: [mem 0x00020700-0x00021eff] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.7 present.
[0.00] DMI: Hewlett-Packard Presario CQ57 Notebook PC/3577, BIOS F.47 12/17/2011
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x207000 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-F write-through
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 0 mask F8000 write-back
[0.00]   1 base 08000 mask FC000 write-back
[0.00]   2 base 0C000 mask FE000 write-back
[0.00]   3 base 0DFEBD000 mask FF000 uncachable
[0.00]   4 base 0FFE0 mask FFFE0 write-protect
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] TOM2: 00021f00 aka 8688M
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- WT
[0.00] e820: last_pfn = 0xdff00 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fe1b0-0x000fe1bf] mapped at [a0e7400fe1b0]
[0.00] Scanning 1 areas for low memory corruption
[0.00] Base memory trampoline at [a0e740099000] 99000 size 24576
[0.00] Using GB pages for direct mapping
[0.00] BRK [0x8222a000, 0x8222afff] PGTABLE
[0.00] BRK [0x8222b000, 0x8222bfff] PGTABLE
[0.00] BRK [0x8222c000, 0x8222cfff] PGTABLE
[0.00] BRK [0x8222d000, 0x8222dfff] PGTABLE
[0.00] BRK [0x8222e000, 0x8222efff] PGTABLE
[0.00] BRK [0x8222f000, 0x8222] PGTABLE
[0.00] RAMDISK: [mem 0x33a42000-0x35d18fff]
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000FE020 24 (v02 HPQOEM)
[

Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-23 Thread Borislav Petkov
On Mon, Oct 24, 2016 at 12:02:39AM +0300, sonofa...@openmailbox.org wrote:
> Good to hear but something is still wrong on my laptop as nothing worked as
> expected :(

Hmm, so did you apply the patch correctly?

Send me arch/x86/kernel/amd_nb.c after you've applied the patch.

Then, boot the kernel with my patch applied, send me full dmesg,
the .config used and do as root:

$ rdmsr --all 0xc0011021

and paste the output here please.

For that you'd need the msr-tools package and you'd need to modprobe
msr.ko if you haven't done so.
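
In case it helps to see what rdmsr does underneath: with the msr module loaded, each CPU gets a /dev/cpu/N/msr device, and reading 8 bytes at a file offset equal to the MSR number returns that register. The following is only a rough sketch of that mechanism, assuming root privileges; msr_path() and read_msr() are made-up helper names, not msr-tools code.

```c
#define _FILE_OFFSET_BITS 64	/* the MSR number is used as a file offset */

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MSR_AMD64_IC_CFG 0xC0011021u

/* Build the per-CPU device path exposed by the msr driver. */
int msr_path(char *buf, size_t len, unsigned int cpu)
{
	return snprintf(buf, len, "/dev/cpu/%u/msr", cpu);
}

/*
 * Read one 64-bit MSR on the given CPU.
 * Returns 0 on success, -1 if the device cannot be opened or read
 * (for example when msr.ko is not loaded or the caller is not root).
 */
int read_msr(unsigned int cpu, uint32_t reg, uint64_t *val)
{
	char path[64];
	ssize_t n;
	int fd;

	msr_path(path, sizeof(path), cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	n = pread(fd, val, sizeof(*val), (off_t)reg);
	close(fd);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

Calling read_msr(cpu, MSR_AMD64_IC_CFG, &val) for each online CPU and printing val in hex should match what rdmsr --all 0xc0011021 shows.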

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-23 Thread sonofagun



> In any case, I tested it on my ON-B0 box and it looked good.

Good to hear but something is still wrong on my laptop as nothing worked 
as expected :(


Since I have a working custom kernel including the fix from my original 
patch, it was clear from boot that the last patched kernel did not touch 
the MSR we want to modify at all. The machine was slower than with 
my kernel using the original patch. As I use the show_msr option, a 
quick look at dmesg proved that easily. Now that processors 
have many cores, I wonder if the kernel should report which CPU's MSRs are 
displayed in dmesg.


Take your time to see what is wrong, we already have one working kernel 
for our machine :)


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-23 Thread Borislav Petkov
On Sun, Oct 23, 2016 at 08:06:44PM +0300, sonofa...@openmailbox.org wrote:
> I use the patchwork site and my brother uses an LKML mirror site. He gets
> patches from there. This worked with the first two patches but the last
> one was a big one and that site truncated some bytes from a line...Sorry for
> the trouble.

You can simply save the email text if your mail client doesn't mangle
white space. Alternatively, there's

https://patchwork.kernel.org/project/LKML/list/

which people do use.

> Kernel is now ready and moved to USB stick. Testing is about to begin. If
> everything works as expected I shall send V2 late at night! Thanks!!

Good.

But you don't need to send v2 - you just need to say whether my version
fixes it for you or not. If not, then I need to stare at it more. :)

In any case, I tested it on my ON-B0 box and it looked good.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-23 Thread sonofagun



> Are you sure you did it right?

Yes and no.

I use the patchwork site and my brother uses an LKML mirror site. He 
gets patches from there. This worked with the first two patches but 
the last one was a big one and that site truncated some bytes from a 
line...Sorry for the trouble.


For reasons I cannot explain I haven't used git till now even though I 
have downloaded it with its source files from the very first versions 
that got released years ago.


Kernel is now ready and moved to USB stick. Testing is about to begin. 
If everything works as expected I shall send V2 late at night! Thanks!!


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-23 Thread Borislav Petkov
On Sun, Oct 23, 2016 at 12:39:37PM +0300, sonofa...@openmailbox.org wrote:
> Last night's attempt failed as the patch does not apply to 4.8, 4.8.1 or
> 4.8.4. Did you switch to 4.9? Please use 4.8 as we prefer to avoid rc
> kernels as we had casualties in the past. Do you want to add changes by
> hand?

Are you sure you did it right?

I saved the mail I sent you before in /tmp/e688.mail.

$ git checkout v4.8.1
Previous HEAD position was e58b634ca001... Merge branch 'tip-microcode-rc1+' into rc1+1
HEAD is now at a7fac751ddba... Linux 4.8.1
$ patch -p1 --dry-run -i /tmp/e688.mail
checking file arch/x86/kernel/amd_nb.c
$ git checkout v4.8.3
Previous HEAD position was a7fac751ddba... Linux 4.8.1
HEAD is now at 1888926ea8d2... Linux 4.8.3
$ patch -p1 --dry-run -i /tmp/e688.mail
checking file arch/x86/kernel/amd_nb.c
$ git checkout v4.8
Previous HEAD position was 1888926ea8d2... Linux 4.8.3
HEAD is now at c8d2bc9bc39e... Linux 4.8
$ patch -p1 --dry-run -i /tmp/e688.mail
checking file arch/x86/kernel/amd_nb.c

Now you only have to remove "--dry-run"

Paste here the error messages when trying to apply it.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-23 Thread sonofagun
Last night's attempt failed as the patch does not apply to 4.8, 4.8.1 
or 4.8.4. Did you switch to 4.9? Please use 4.8 as we prefer to avoid 
rc kernels as we had casualties in the past. Do you want to add changes 
by hand?




Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-22 Thread Borislav Petkov
On Sat, Oct 22, 2016 at 02:16:41PM +0300, sonofa...@openmailbox.org wrote:
> Patch does not compile.

Yeah, it needs more work.

Try the version below. It needs to be done differently because we need
PCI extended config space access to be enabled in order to check bit 2.

> To be honest I can't say. My brother's machine(s) has random crashes from
> time to time. We suspect that this erratum is to blame. He must have kept
> some information at least from one of those crashes but there was no time to
> analyze them till now. Finding those logs on our disks needs a big effort
> but it will be done! We are willing to discover the trouble maker no matter
> what it takes. To do that the machine must be stripped of all cards
> and then put in quarantine. Then we can connect it to another machine
> with an RS-232 cable to see what is wrong. After that we must test it on a
> different motherboard we have. I think we have one with a BIOS from a
> different BIOS vendor. We will surely inform you on this one as we can't do
> such a patch. So we will focus on triggering this bug. One thing is sure,
> its BIOS has no workaround for that erratum.

Ok, good. Let me know how it goes.

Thanks.

---
From ddce976ba7fc44922a6c4e9e58bbdf65c65c4ae4 Mon Sep 17 00:00:00 2001
From: Borislav Petkov 
Date: Sat, 22 Oct 2016 15:23:54 +0200
Subject: [PATCH] E688, v1

Signed-off-by: Borislav Petkov 
---
diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
index 4fdf6230d93c..bfde06b1a587 100644
--- a/arch/x86/kernel/amd_nb.c
+++ b/arch/x86/kernel/amd_nb.c
@@ -15,6 +15,8 @@
 
 static u32 *flush_words;
 
+#define PCI_DEVICE_ID_AMD_CNB17H_F4 0x1704
+
 const struct pci_device_id amd_nb_misc_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB_MISC) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
@@ -24,6 +26,7 @@ const struct pci_device_id amd_nb_misc_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_M60H_NB_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_NB_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_M30H_NB_F3) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CNB17H_F3) },
 	{}
 };
 EXPORT_SYMBOL(amd_nb_misc_ids);
@@ -34,6 +37,7 @@ static const struct pci_device_id amd_nb_link_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_M60H_NB_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_NB_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_M30H_NB_F4) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CNB17H_F4) },
 	{}
 };
 
@@ -274,11 +278,46 @@ void amd_flush_garts(void)
 }
 EXPORT_SYMBOL_GPL(amd_flush_garts);
 
+static void __fix_erratum_688(void *info)
+{
+#define MSR_AMD64_IC_CFG 0xC0011021
+
+	msr_set_bit(MSR_AMD64_IC_CFG, 3);
+	msr_set_bit(MSR_AMD64_IC_CFG, 14);
+}
+
+/* Apply erratum 688 fix so machines without a BIOS fix work. */
+static __init void fix_erratum_688(void)
+{
+	struct pci_dev *F4;
+	u32 val;
+
+	if (boot_cpu_data.x86 != 0x14)
+		return;
+
+	if (!amd_northbridges.num)
+		return;
+
+	F4 = node_to_amd_nb(0)->link;
+	if (!F4)
+		return;
+
+	if (pci_read_config_dword(F4, 0x164, &val))
+		return;
+
+	if (val & BIT(2))
+		return;
+
+	on_each_cpu(__fix_erratum_688, NULL, 0);
+}
+
 static __init int init_amd_nbs(void)
 {
 	amd_cache_northbridges();
 	amd_cache_gart();
 
+	fix_erratum_688();
+
 	return 0;
 }
 

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
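
For anyone who wants to reason about the patch without booting a kernel, the decision it makes can be modelled in plain user-space C. This is only a sketch of the logic as posted in the diff above: e688_fix_needed() and e688_apply() are hypothetical names, and the real code reads D18F4x164 over PCI and writes the MSR on every CPU, which this model does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (UINT64_C(1) << (n))

/*
 * Mirrors the early returns in fix_erratum_688(): only family 0x14 is
 * considered, and the workaround is skipped when bit 2 of D18F4x164 is set.
 */
bool e688_fix_needed(unsigned int family, uint32_t f4x164)
{
	if (family != 0x14)
		return false;
	return !(f4x164 & BIT(2));
}

/* Mirrors __fix_erratum_688(): set bits 3 and 14 of the IC_CFG value. */
uint64_t e688_apply(uint64_t ic_cfg)
{
	return ic_cfg | BIT(3) | BIT(14);
}
```

Applying e688_apply() to a value that already has both bits set leaves it unchanged, which matches the expectation that running the fix on an already-patched machine is harmless.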


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-22 Thread sonofagun


Patch does not compile.

I tried adding pci.h but it did nothing.
I converted pci_read_config to pci_read_config_dword but again nothing.


On 2016-10-22 02:01, Borislav Petkov wrote:
> Do you have a way to trigger that one?

To be honest I can't say. My brother's machine(s) has random crashes 
from time to time. We suspect that this erratum is to blame. He must 
have kept some information at least from one of those crashes but there 
was no time to analyze them till now. Finding those logs on our disks 
needs a big effort but it will be done! We are willing to discover the 
trouble maker no matter what it takes. To do that the machine must be 
stripped of all cards and then put in quarantine. Then we can 
connect it to another machine with an RS-232 cable to see what is 
wrong. After that we must test it on a different motherboard we have. I 
think we have one with a BIOS from a different BIOS vendor. We will 
surely inform you on this one as we can't do such a patch. So we will 
focus on triggering this bug. One thing is sure, its BIOS has no 
workaround for that erratum.





Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-21 Thread Borislav Petkov
On Sat, Oct 22, 2016 at 12:51:32AM +0300, sonofa...@openmailbox.org wrote:
> Thank you for your time! I have chosen reply to list and all recipients, it
> must work now.

Yes, exactly what I had in mind.

> My brother rejected the proposed patch because it does not provide
> equivalent functionality with the original.
> 
> Our initial patch would fix 3 broken models and 1 working model. Your patch
> will only work for 1 model. Only machines having our APU will be fixed. All
> B0 APUs will be unpatched. This is not right. Check the revision guide to
> verify that.

Right you are: I read too much into the description of bit 2 of
D18F4x164. Of course we want to apply that fix to ON-Bs too.

> To avoid unneeded complexity we propose this patch as V2, do you agree?
> 
> +#define MSR_AMD64_IC_CFG 0xC0011021
> +
> +static void init_amd_on(struct cpuinfo_x86 *c)
> +{
> + /*
> +  * Apply erratum 688 fix so machines without a BIOS
> +  * fix work.
> +  */
> +
> + u32 val = pci_read_config(0, 0x18, 0x4, 0x164);
> +
> + if (!(val & BIT(2))) {
> + msr_set_bit(MSR_AMD64_IC_CFG, 3);
> + msr_set_bit(MSR_AMD64_IC_CFG, 14);

Yes, that should work fine.

Btw, there's a closing } missing for the if-test here.

> +}
>  static void init_amd_bd(struct cpuinfo_x86 *c)
>  {
>   u64 value;
> @@ -738,6 +750,7 @@ static void init_amd(struct cpuinfo_x86
>   case 0xf:  init_amd_k8(c); break;
>   case 0x10: init_amd_gh(c); break;
>   case 0x12: init_amd_ln(c); break;
> + case 0x14: init_amd_on(c); break;
>   case 0x15: init_amd_bd(c); break;
>   }
> 
> Please advise on how to proceed!

Right, please send a tested version of the above with the explanation
text from your initial submission.

Thanks.

> erratum 721 :-(

Hmm, interesting.

Do you have a way to trigger that one?

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-21 Thread sonofagun


Thank you for your time! I have chosen reply to list and all recipients, 
it must work now.


My brother rejected the proposed patch because it does not provide 
equivalent functionality with the original.


Our initial patch would fix 3 broken models and 1 working model. Your 
patch will only work for 1 model. Only machines having our APU will be 
fixed. All B0 APUs will be unpatched. This is not right. Check the 
revision guide to verify that.


To avoid unneeded complexity we propose this patch as V2, do you agree?

+#define MSR_AMD64_IC_CFG	0xC0011021
+
+static void init_amd_on(struct cpuinfo_x86 *c)
+{
+	/*
+	 * Apply erratum 688 fix so machines without a BIOS
+	 * fix work.
+	 */
+
+	u32 val = pci_read_config(0, 0x18, 0x4, 0x164);
+
+	if (!(val & BIT(2))) {
+		msr_set_bit(MSR_AMD64_IC_CFG, 3);
+		msr_set_bit(MSR_AMD64_IC_CFG, 14);
+}
 static void init_amd_bd(struct cpuinfo_x86 *c)
 {
 	u64 value;
@@ -738,6 +750,7 @@ static void init_amd(struct cpuinfo_x86
 	case 0xf:  init_amd_k8(c); break;
 	case 0x10: init_amd_gh(c); break;
 	case 0x12: init_amd_ln(c); break;
+	case 0x14: init_amd_on(c); break;
 	case 0x15: init_amd_bd(c); break;
 	}

Please advise on how to proceed!



> Why, what's wrong with that one? That one should be all fixed! :-)
>
> I have such a box too and it runs fine.

erratum 721 :-(



Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-21 Thread Borislav Petkov
Hi Ioannis,

first of all, when you reply to a mail on lkml, please use the "reply-to-all"
functionality of your mail client - otherwise replies might get missed on such a
high volume mailing list.

On Fri, Oct 21, 2016 at 07:19:07PM +0300, sonofa...@openmailbox.org wrote:
> Sorry for the late reply! This machine has caused nothing but trouble. HP
> will not fix it and we will not choose their laptops anymore...

You're not the only one who has had this experience.

> My brother told me that we apply a quirk to the last Ontario APUs that do
> not need it but I did not think it would be an issue since they have fixed
> the error.

No, you need to apply the fix only on the models which need it.

> It seems better this way so that only affected APUs are patched. Be patient,
> we are compiling the new patch right now but compiling is run on a different
> high end AMD machine of my brother. Tomorrow I will have access to the
> laptop and I will update the kernel and send you the V2 patch. Compiling on
> that laptop would possibly need a whole day even with AC power!

You can build somewhere else and copy the kernel to the laptop. That's
how I do it.

> Do you want /proc/cpuinfo on the V2 patch e-mail? Both CPUs needed?

No, I just wanted to see them and you've pasted them here. Thanks.

> Here is a dump from an older installation some months ago I kept on my
> disk(tomorrow I will dump it again if you want):

No need, one is enough :)

> > Then, keep that *whole* changelog above when sending v2 of the patch
> What do you mean? It is not clear to me, Do you mean all the info we wrote
> on the e-mail, your comments or both?

All the info you wrote in the first mail.

> We have many AMD machines and we will need your help next week to patch our
> Phenom(tm) II X6.

Why, what's wrong with that one? That one should be all fixed! :-)

I have such a box too and it runs fine.

> Let's finish this patch first and we will fix that too but it appears
> to be much more difficult...

Don't hesitate to ask if you need help...

HTH.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-21 Thread sonofagun


Sorry for the late reply! This machine has caused nothing but trouble. 
HP will not fix it and we will not choose their laptops anymore...


My brother told me that we apply a quirk to the last Ontario APUs that 
do not need it but I did not think it would be an issue since they have 
fixed the error.
It seems better this way so that only affected APUs are patched. Be 
patient, we are compiling the new patch right now, but the build runs 
on a different high-end AMD machine of my brother's. Tomorrow I will have 
access to the laptop and I will update the kernel and send you the V2 
patch. Compiling on that laptop would possibly need a whole day even 
with AC power!



Do you want /proc/cpuinfo on the V2 patch e-mail? Both CPUs needed?
Here is a dump from an older installation some months ago I kept on my 
disk(tomorrow I will dump it again if you want):

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 20
model           : 2
model name      : AMD E-300 APU with Radeon(tm) HD Graphics
stepping        : 0
microcode       : 0x5000119
cpu MHz         : 1300.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid 
aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate 
vmmcall arat npt lbrv svm_lock nrip_save pausefilter
bugs            : fxsave_leak sysret_ss_attrs
bogomips        : 2594.69
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 20
model           : 2
model name      : AMD E-300 APU with Radeon(tm) HD Graphics
stepping        : 0
microcode       : 0x5000119
cpu MHz         : 1300.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid 
aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic 
cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate 
vmmcall arat npt lbrv svm_lock nrip_save pausefilter
bugs            : fxsave_leak sysret_ss_attrs
bogomips        : 2594.69
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate



> Then, keep that *whole* changelog above when sending v2 of the patch
What do you mean? It is not clear to me. Do you mean all the info we 
wrote in the e-mail, your comments, or both?


We have many AMD machines and we will need your help next week to patch 
our Phenom(tm) II X6. Let's finish this patch first and we will fix that 
too but it appears to be much more difficult...


Re: [PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-19 Thread Borislav Petkov
On Wed, Oct 19, 2016 at 04:58:08PM +0300, sonofa...@openmailbox.org wrote:
> 
> AMD F14h machines have an erratum which can cause unpredictable program
> behaviour under specific branch conditions. The workaround is to set
> MSRC001_1021[14] and MSRC001_1021[3]. Both bits are reserved for this MSR,
> so we trust AMD suggestions. Since there is no BIOS update containing that
> workaround for some machines, we do it ourselves unconditionally on this
> family too. Our Compaq CQ57 laptop, which has broken firmware in various
> areas, contains neither workaround (MSRc0011021: 10208000)...

...

> +#define MSR_AMD64_IC_CFG 0xC0011021
> +
> +static void init_amd_on(struct cpuinfo_x86 *c)
> +{
> + /*
> +  * Apply erratum 688 fix unconditionally so machines without a BIOS
> +  * fix work.
> +  */
> + msr_set_bit(MSR_AMD64_IC_CFG, 3);
> + msr_set_bit(MSR_AMD64_IC_CFG, 14);
> +}

You can't force this unconditionally. Look at the suggested workaround:

"BIOS should set MSRC001_1021[14] = 1b and MSRC001_1021[3] = 1b. This
workaround is required only when bit 2 of Fixed Errata Status Register
(D18F4x164[2]) = 0b."

So you need to do something like this:

	if (c->x86_model == 2 && c->x86_mask == 0) {
		u32 val = pci_read_config(0, 0x18, 0x4, 0x164);

		if (!(val & BIT(2))) {
			msr_set_bit(MSR_AMD64_IC_CFG, 3);
			msr_set_bit(MSR_AMD64_IC_CFG, 14);
		}
	}

Also, please paste /proc/cpuinfo from that machine.

Then, keep that *whole* changelog above when sending v2 of the patch - I like
the level of detail of your explanation! ;-)

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


[PATCH] x86/AMD: Apply erratum 688 on machines without a BIOS fix

2016-10-19 Thread sonofagun


AMD F14h machines have an erratum which can cause unpredictable program 
behaviour under specific branch conditions. The workaround is to set 
MSRC001_1021[14] and MSRC001_1021[3]. Both bits are reserved for this 
MSR, so we trust AMD suggestions. Since there is no BIOS update 
containing that workaround for some machines, we do it ourselves 
unconditionally on this family too. Our Compaq CQ57 laptop, which has 
broken firmware in various areas, contains neither 
workaround (MSRc0011021: 10208000)...


HP does not release a proper BIOS even though we have contacted them and 
requested an updated BIOS that will fix all errors we spotted. As it is 
not currently covered by any warranty, they do not support it. HP does 
not care, but Linux kernel cares to patch out-of-warranty hardware with 
crappy firmware!


Thanks to the author of commit d1992996753132e2dafe955cccb2fb0714d3cfc4 
(x86/AMD: Apply erratum 665 on machines without a BIOS fix) as he paved 
the way to this fix. That patch was not applicable on our machine but it 
brought back to surface a long standing bug of our E-300 laptop. Poor 
performance under Debian was observed and things got worse after 
switching to Ubuntu as crashes became more frequent! As a result the 
laptop got replaced with a desktop.


After some time, we decided to dig deeper and see what is wrong with our 
laptop. perf showed that something was terribly wrong, as 
branch-misses reached 40% within a minute after booting the E-300 
Ontario C0 APU! Disabling the second CPU did not help either. CPU 
Revision Guide erratum 688 seemed promising as it described our issues, 
so we prepared a fix. Now the laptop works and has both 
workarounds (MSRc0011021: 1020c008)! Since this erratum affects 
many laptops and some tablets, we request a backport to stable 
kernels.


Tested on Compaq CQ57-499 laptop.


Signed-off-by: Ioannis Barkas 
Signed-off-by: Nikos Barkas 
Cc: Borislav Petkov 
Cc: 

---

Hello we are Ioannis Barkas (sonofa...@openmailbox.org) and Nikos Barkas 
(level...@gmail.com).


This patch was sent from my yahoo e-mail in the morning and got 
rejected! Why?

Resending...

We have had poor performance on our AMD laptop with Debian for some 
years. Initial value of MSRc0011021 is 10208000h and D18F4x164 
is 0003h. Our laptop was not usable even with Ubuntu 16.04 using the 
radeon driver. What is worse, opening firefox with 
https://planefinder.net/ after booting Ubuntu, resulted in firefox 
crashes again and again. After this patch we have not met any problem 
with that webpage and firefox. Unfortunately linux-tools were not 
present for our custom kernel and perf could not be launched:( When the 
patch arrives on Ubuntu 16.10 kernel, we shall recheck it. If 
branch-misses remain above 10%, we will open a bug for it.


--- a/arch/x86/kernel/cpu/amd.c 2016-10-07 16:03:33.0 +0300
+++ b/arch/x86/kernel/cpu/amd.c 2016-10-12 13:25:34.791720549 +0300
@@ -680,6 +680,18 @@ static void init_amd_ln(struct cpuinfo_x
 	msr_set_bit(MSR_AMD64_DE_CFG, 31);
 }

+#define MSR_AMD64_IC_CFG	0xC0011021
+
+static void init_amd_on(struct cpuinfo_x86 *c)
+{
+	/*
+	 * Apply erratum 688 fix unconditionally so machines without a BIOS
+	 * fix work.
+	 */
+	msr_set_bit(MSR_AMD64_IC_CFG, 3);
+	msr_set_bit(MSR_AMD64_IC_CFG, 14);
+}
+
 static void init_amd_bd(struct cpuinfo_x86 *c)
 {
 	u64 value;
@@ -738,6 +750,7 @@ static void init_amd(struct cpuinfo_x86
 	case 0xf:  init_amd_k8(c); break;
 	case 0x10: init_amd_gh(c); break;
 	case 0x12: init_amd_ln(c); break;
+	case 0x14: init_amd_on(c); break;
 	case 0x15: init_amd_bd(c); break;
 	}

