Re: [Beowulf] How to debug slow compute node?

2017-08-16 Thread Bill Broadley via Beowulf
On 08/10/2017 07:39 AM, Faraz Hussain wrote:
> One of our compute nodes runs ~30% slower than others. It has the exact same
> image, so I am baffled why it is running slow. I have tested OMP and MPI
> benchmarks. Everything runs slower. The cpu usage goes to 2000%, so all looks
> normal there.

We got some Supermicro dual-socket nodes without the little plastic air guides.
They thermally throttled very quickly.

I've also seen nodes fall back to a single memory channel because the DIMMs
were in the wrong slots.

I suggest comparing the physical nodes: double-check fans (which should be
spinning), air conduits, DIMM placement, etc.  Then check dmesg, syslog, and
temperatures, and compare a fast node against the slow node.
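
For example, something along these lines on both nodes (a sketch; exact
sensor names vary by platform and BMC):

  # throttling and machine-check noise in the kernel log
  dmesg | grep -i -e throttl -e mce -e 'hardware error'
  # temperatures and fan speeds as the BMC sees them
  ipmitool sdr type Temperature
  ipmitool sdr type Fan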




Re: [Beowulf] How to debug slow compute node?

2017-08-14 Thread mathog

(Sorry for the duplicate; the earlier copy of this went out with the wrong subject.)

On 12-Aug-2017 Chris Samuel wrote:

> Just to add to the excellent suggestions from others: have you compared
> BIOS/UEFI settings & versions across these nodes to ensure they're
> identical?


Also verify that the CR2032 battery on the motherboard is in working order.
Usually this can be checked with "sensors" or "ipmitool", which show it as
a voltage near 3.3V.  If the motherboard battery fails, the settings can
change, and not always back to the default BIOS values.  Unfortunately Dell
servers rarely expose this information with either of those tools, but
other manufacturers often do.
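
For example, on boards that expose it (a sketch; the sensor label varies,
VBAT and "Battery" are common guesses):

  sensors | grep -i -e vbat -e batt
  ipmitool sensor list | grep -i -e vbat -e battery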

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Christopher Samuel
On 12/08/17 17:35, William Johnson wrote:

> This may be a long shot, especially in a server room where everything
> else is working as expected.

Oh agreed! But given people have covered a lot of other bases I thought
I'd throw in something from my own experience.  If all nodes boot the
same OS image then you'd not expect the kernel command lines etc. to
differ, but the UEFI settings might (depending on how they are usually
configured).

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Christopher Samuel
On 14/08/17 08:17, Lachlan Musicman wrote:

> Can you point to some good documentation on this?

There is some on Mellanox's website:

http://www.mellanox.com/related-docs/prod_software/Mellanox_EN_for_Linux_User_Manual_v2_0-3_0_0.pdf

But it took weeks for $VENDOR to figure out what was
going on and why performance was so bad. It wasn't until
they got Mellanox onto the calls that Mellanox pointed
this out to them.

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Lachlan Musicman
On 12 August 2017 at 13:35, Chris Samuel  wrote:

> Also remember that the kernel can enable C states that hurt performance even
> if they are disabled in the BIOS/UEFI.  This was painfully apparent on our
> first SandyBridge cluster, which almost failed the performance part of
> acceptance testing until the cause was found.
>
> Now we boot all nodes with this in the kernel cmdline:
>
> intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable

Chris,

Can you point to some good documentation on this?

cheers
L.



--
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together. "

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857


Re: [Beowulf] How to debug slow compute node?

2017-08-11 Thread Chris Samuel
On Friday, 11 August 2017 12:39:07 AM AEST Faraz Hussain wrote:

> I thought it may have to do with cpu scaling, i.e. when the kernel
> changes the cpu speed depending on the workload. But we do not have
> that enabled on these machines.

Just to add to the excellent suggestions from others: have you compared
BIOS/UEFI settings & versions across these nodes to ensure they're identical?

Also remember that the kernel can enable C states that hurt performance even
if they are disabled in the BIOS/UEFI.  This was painfully apparent on our
first SandyBridge cluster, which almost failed the performance part of
acceptance testing until the cause was found.

Now we boot all nodes with this in the kernel cmdline:

intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable
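
To confirm after boot that the settings actually took, something like this
should work (cpupower ships in kernel-tools on many distros; the paths are
the standard sysfs ones):

  cat /proc/cmdline
  cpupower idle-info
  cat /sys/module/intel_idle/parameters/max_cstate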

Best of luck!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545



Re: [Beowulf] How to debug slow compute node?

2017-08-11 Thread mathog

Rushat Rai wrote:

> I don't know if this has been mentioned, but ECC could be slowing down
> that specific node if it has a faulty stick.


To find the bad stick one often must disable ECC; at least that was the
case many years ago, the last time I ran into this.  If ECC is enabled,
even a somewhat defective stick may still pass memtest86+.  That utility
shows whether ECC is enabled, and the ECC disable switch, if there is
one, lives in the motherboard BIOS.


I'm late to this thread: does this node have a local disk?  Failing
disks can really slow things down if the device has to read the same
block many times before it succeeds.  That usually shows up in smartctl.
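
For example (assuming a SATA disk at /dev/sda - the device name is a
guess, adjust to taste):

  # remapped or pending sectors usually mean a disk on its way out
  smartctl -a /dev/sda | grep -i -e reallocated -e pending -e crc
  smartctl -t short /dev/sda   # then check: smartctl -l selftest /dev/sda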


What sort of network connection?  Try swapping those cables.  Also run the
network throughput test of your choice; if the problem is there, those
tests will reveal it.
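
For example with iperf3, if that is your tool of choice (the hostname is a
placeholder):

  iperf3 -s                # on a known-good node
  iperf3 -c goodnode01     # on the suspect node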


"sensors" should show roughly the same values as the other nodes, if 
not, figure out why.  As others have suggested that could be blocked 
ventilation,  but more often in my experience it is a fan on the way 
out.


Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: [Beowulf] How to debug slow compute node?

2017-08-11 Thread Rushat Rai
Hi, my first post here.

Anyway, I agree with John; I've seen debris caught up in intakes causing some
performance drop. 30% does seem a little excessive, but you should check first.

I don't know if this has been mentioned, but ECC could be slowing down that
specific node if it has a faulty stick.

I would also like to know whether it is in the exact same environment as the
rest. Is it close to an air conditioner exhaust, or something similar? Have you
checked the thermals for that specific node compared to the others?

Let me know

On Thursday 10 August 2017 08:47 PM, John Hearns via Beowulf wrote:
Another thing to perhaps look at: are you seeing messages about thermal
throttling events in the system logs?
Could that node have a piece of debris caught in its air intake?

I don't think that will produce a 30% drop in performance. But I have caught
compute nodes with pieces of packaging sucked onto the front,
following careless people unpacking kit in machine rooms.
(Firm rule - no packaging in the machine room. This means you.)






Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Skylar Thompson
We ran into something similar, though it turned out to be a microcode bug
in the CPU that caused it to remain stuck in its lowest power state.
Fortunately it was easy to test with "perf stat", so it was pretty clear
which nodes were impacted; they also happened to have been bought as a
batch with a unique CPU version. By the time we had done our legwork, the
vendor had independently announced a fix for the problem, so I guess we
could have just saved ourselves some work and waited...
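
As a sketch of that test: load a core and look at the effective clock rate
that perf reports next to the cycle count:

  # perf annotates the cycles line with the achieved GHz
  perf stat timeout 5 sh -c 'while :; do :; done'

A node stuck in its lowest power state shows a GHz figure well below the
nominal clock of its healthy siblings.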

Skylar


Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Lance Wilson
Hi Faraz,
Another one that we have seen was a difference in the power profile of the
node. In certain situations it kept the CPU speed low, so top looked fine
and everything looked fine: just slow. It was a Dell box as well. It was
interesting how many power settings could cause slowdowns with CentOS 7.
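
On the Linux side, a quick sanity check of what the node is actually doing
(standard cpufreq sysfs paths; the BIOS-side power profile itself has to
be checked in the firmware or via the BMC):

  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  grep MHz /proc/cpuinfo | sort | uniq -c   # per-core clocks under load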

Cheers,

Lance
--
Dr Lance Wilson
Senior HPC Consultant
Ph: 03 99055942 (+61 3 9905 5942)
Mobile: 0437414123 (+61 4 3741 4123)
Multi-modal Australian ScienceS Imaging and Visualisation Environment
(www.massive.org.au)
Monash University


Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
Ten euros for me on a faulty DIMM


Sent from Mail for Windows 10

From: Andrew Holway
Sent: Thursday, 10 August 2017 20:05
To: Gus Correa
Cc: Beowulf Mailing List
Subject: Re: [Beowulf] How to debug slow compute node?

I put €10 on the nose for a faulty power supply.


Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Faraz Hussain
Thanks for the tips! Unfortunately, I am not seeing anything of interest
in /var/log. The mcelog service is not enabled. I do not see anything in
/proc/interrupts either.

I will look into a full power down, memtester and a firmware update. It
is a blade. We do not have Intel Cluster Checker, but we have DRAC (Dell
Remote Access Controller). I just logged in there and everything checks
out, i.e. memory, power etc.





Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Andrew Latham
In general if you have a snowflake you need to take some steps:
1. Unrack it and remove it from the population
2. Image and document the system
3. Sniff test, visual test, and power-on fans-spinning test in a lab
4. Understand that it is OK for one system out of X (where X could be 1000)
to fail
5. Return the system to the rack if drive/image replacement resolves the issue
6. Return the system to the supplier if the above fails
7. Keep moving; don't spend hours that equate to the cost of the node
troubleshooting it, unless the capital budget is super tricky
8. Keep up a dialog with the supplier the whole time, saying that everything
is awesome, so they stay interested in the change of status
9. Don't troubleshoot in production, ever

On Thu, Aug 10, 2017 at 9:39 AM, Faraz Hussain  wrote:

> One of our compute nodes runs ~30% slower than others. It has the exact
> same image, so I am baffled why it is running slow. I have tested OMP and
> MPI benchmarks. Everything runs slower. The cpu usage goes to 2000%, so all
> looks normal there.



-- 
- Andrew "lathama" Latham lath...@gmail.com http://lathama.com -


Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Andrew Holway
I put €10 on the nose for a faulty power supply.

On 10 August 2017 at 19:45, Gus Correa  wrote:

> + Leftover processes from previous jobs hogging resources.
> That's relatively common.
> That can trigger swapping, the ultimate performance killer.
> "top" or "htop" on the node should show something.
> (Will go away with a reboot, of course.)

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Gus Correa

+ Leftover processes from previous jobs hogging resources.
That's relatively common.
That can trigger swapping, the ultimate performance killer.
"top" or "htop" on the node should show something.
(Will go away with a reboot, of course.)
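
For a quick non-interactive look (one way to do it):

  ps -eo user,pid,pcpu,pmem,stat,comm --sort=-pcpu | head   # stray hogs
  vmstat 1 5   # non-zero si/so columns mean the node is actively swapping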

Less likely, but possible:

+ Different BIOS configuration w.r.t. the other nodes.

+ Poorly seated memory, IB card, etc., or cable connections.

+ IPMI may need a hard reset.
Power down, remove the power cable, wait several minutes,
put the cable back, power on.

Gus Correa


Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
Another thing to perhaps look at: are you seeing messages about thermal
throttling events in the system logs?
Could that node have a piece of debris caught in its air intake?

I don't think that will produce a 30% drop in performance. But I have caught
compute nodes with pieces of packaging sucked onto the front,
following careless people unpacking kit in machine rooms.
(Firm rule - no packaging in the machine room. This means you.)






Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Robert Horton
As John says, I'd start by checking the health of things like memory,
power supplies etc.

I've seen things like this go away after a firmware update, so
I'd suggest updating the BIOS etc. if you can.
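
To check whether the slow node is even running the same firmware as its
siblings, something like this (needs root; pdsh and the node list are
placeholders for whatever parallel shell you use):

  pdsh -w node[01-16] 'dmidecode -s bios-version' | sort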

Have you tried completely removing the power for a few minutes then
booting up again?

Any idea when the problem started? I presume from the CPU that it's not a
new system. What physical form is it (1U server / blade etc.)?

Rob



Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
PS: also look at "watch cat /proc/interrupts".
You might get a qualitative idea of a huge rate of interrupts.
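
e.g., with changes highlighted between refreshes:

  watch -d -n1 cat /proc/interrupts   # -d marks counters that changed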




Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
Faraz,
   I think you might have to buy me a virtual coffee. Or a beer!
Please look at the hardware health of that machine, specifically the
DIMMs.  I have seen this before!
If you have some DIMMs which are faulty and are generating ECC errors,
and the mcelog service is enabled, then an interrupt is generated for
every ECC event. So the system is spending time servicing these interrupts.

So: look in your /var/log/mcelog for hardware errors.
Look in your /var/log/messages for hardware errors also.
Look in the IPMI event logs for ECC errors:  ipmitool sel elist
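
For example (a sketch; the log paths assume a RHEL/CentOS-style syslog):

  grep -i -e 'hardware error' -e ecc /var/log/mcelog /var/log/messages
  ipmitool sel elist | grep -i -e memory -e ecc   # memory events in the SEL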

I would also bring that node down and boot it with memtester.
If there is a DIMM which is that badly faulty then memtester will discover
it within minutes.
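
For example (memtester runs from userspace, so drain the node of jobs
first; size the test below the node's free RAM):

  memtester 4G 3   # test 4 GiB of memory, three passes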

Or it could be something else - in which case I get no coffee.

Also, Intel Cluster Checker is intended to deal with exactly these situations.
What is your cluster manager, and is Intel Cluster Checker available to you?
I would seriously look at getting this installed.







On 10 August 2017 at 16:39, Faraz Hussain  wrote:

> One of our compute nodes runs ~30% slower than others. It has the exact
> same image, so I am baffled why it is running slow. I have tested OMP and
> MPI benchmarks. Everything runs slower. The cpu usage goes to 2000%, so all
> looks normal there.
>
> I thought it may have to do with cpu scaling, i.e. when the kernel changes
> the cpu speed depending on the workload. But we do not have that enabled on
> these machines.
>
> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to our
> other nodes. Any suggestions on what else to check? I have tried rebooting
> it.
>
> processor   : 19
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 62
> model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> stepping: 4
> cpu MHz : 2500.098
> cache size  : 25600 KB
> physical id : 1
> siblings: 10
> core id : 12
> cpu cores   : 10
> apicid  : 56
> initial apicid  : 56
> fpu : yes
> fpu_exception   : yes
> cpuid level : 13
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
> ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln
> pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
> bogomips: 5004.97
> clflush size: 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf