On 08/10/2017 07:39 AM, Faraz Hussain wrote:
> One of our compute nodes runs ~30% slower than the others. It has the exact same
> image, so I am baffled as to why it is running slow. I have tested OMP and MPI
> benchmarks; everything runs slower. The CPU usage goes to 2000%, so all looks
> normal there.
(Sorry for the duplicate; this went out with the wrong subject.)
On 12-Aug-2017 Chris Samuel wrote:
Just to add to the excellent suggestions from others: have you compared
BIOS/UEFI settings & versions across these nodes to ensure they're identical?
Also verify
On 12/08/17 17:35, William Johnson wrote:
> This may be a long shot, especially in a server room where everything
> else is working as expected.
Oh agreed! But given people have covered a lot of other bases, I thought
I'd throw something in from my own experience. If all nodes boot the
same OS
On 14/08/17 08:17, Lachlan Musicman wrote:
> Can you point to some good documentation on this?
There is some on Mellanox's website:
http://www.mellanox.com/related-docs/prod_software/Mellanox_EN_for_Linux_User_Manual_v2_0-3_0_0.pdf
But it took weeks for $VENDOR to figure out what was
going
On 12 August 2017 at 13:35, Chris Samuel wrote:
> Also remember that the kernel can enable C states that hurt performance even
> if they are disabled in the BIOS/UEFI. This was painfully apparent on our
> first SandyBridge cluster that almost failed the performance
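On Linux you can see which C-states the kernel's cpuidle driver actually
exposes, regardless of what the BIOS claims. A guarded sketch (the sysfs
layout is standard Linux cpuidle, but the tree is absent in some VMs):

```shell
# List each C-state on CPU 0 with its disable flag and exit latency.
echo "cpuidle states on cpu0:"
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
    [ -d "$d" ] || continue
    printf '%-10s disabled=%s latency=%sus\n' \
        "$(cat "$d/name")" "$(cat "$d/disable")" "$(cat "$d/latency")"
done
```

If deep states show up enabled where you expected them off, pinning the kernel
to shallow states usually means a boot parameter such as intel_idle.max_cstate=1.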
On Friday, 11 August 2017 12:39:07 AM AEST Faraz Hussain wrote:
> I thought it may have to do with cpu scaling, i.e when the kernel
> changes the cpu speed depending on the workload. But we do not have
> that enabled on these machines.
Just to add to the excellent suggestions from others: have you compared
BIOS/UEFI settings & versions across these nodes to ensure they're identical?
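On the scaling question: the governor and current frequency are visible in
sysfs, so you can confirm it really is off. A guarded sketch (cpufreq may be
absent when scaling is genuinely disabled, or inside a VM):

```shell
# One governor entry per core; a fixed-frequency HPC node should show
# "performance" on every one of them.
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null \
    | sort | uniq -c
# Current frequency of core 0, in kHz.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq 2>/dev/null \
    || echo "no cpufreq interface (scaling not active?)"
```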
Rushat Rai wrote:
I don't know if this has been mentioned, but ECC could be slowing down
that specific node if it has a faulty stick.
To find the bad stick one often must disable ECC; at least that was the
case many years ago, the last time I ran into that. If ECC is enabled,
even if the
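On reasonably modern kernels you can often locate the bad stick without
disabling ECC, via the EDAC corrected-error counters. A sketch (the sysfs
layout varies by kernel version and platform driver, hence the fallbacks):

```shell
# Corrected-error count per DIMM; a non-zero, growing counter on one DIMM
# points at the faulty stick. Older kernels expose csrow*/ce_count instead
# of dimm*/dimm_ce_count.
grep -H . /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count 2>/dev/null \
    || grep -H . /sys/devices/system/edac/mc/mc*/csrow*/ce_count 2>/dev/null \
    || echo "no EDAC counters (driver not loaded?); check mcelog instead"
```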
Hi, my first post here.
Anyway, I agree with John. I've seen debris caught up in intakes causing some
performance drop. 30% does seem a little excessive, but you should check first.
I don't know if this has been mentioned, but ECC could be slowing down that
specific node if it has a faulty stick.
Ten euros for me on a faulty DIMM
From: Andrew Holway
Sent: Thursday, 10 August 2017 20:05
To: Gus Correa
Cc: Beowulf Mailing List
Subject: Re: [Beowulf] How to debug slow compute node?
I put €10 on the nose for a faulty power supply.
On 10 August 2017 at 19:45, Gus Correa wrote:
Thanks for the tips! Unfortunately, I am not seeing anything in
/var/log of interest. The mcelog service is not enabled. I do not see
anything /proc/interrupts either.
I will look into full power down, memtester and firmware update. It is
a blade. We do not have Intel cluster checker, but
In general if you have a snowflake you need to take some steps.
1. Unrack and remove it from the population
2. Image, document the system
3. Sniff test, visual test, power on fans spinning test in a lab
4. Understand that it is OK for one system out of X (where X could be 1000)
to fail
5. Return
I put €10 on the nose for a faulty power supply.
On 10 August 2017 at 19:45, Gus Correa wrote:
> + Leftover processes from previous jobs hogging resources.
> That's relatively common.
> That can trigger swapping, the ultimate performance killer.
> "top" or "htop" on the
+ Leftover processes from previous jobs hogging resources.
That's relatively common.
That can trigger swapping, the ultimate performance killer.
"top" or "htop" on the node should show something.
(Will go away with a reboot, of course.)
Less likely, but possible:
+ Different BIOS configuration
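The leftover-process / swapping case takes two commands to check. A sketch:

```shell
# Biggest CPU consumers first; stray processes left over from an earlier
# job will stand out at the top.
ps -eo pid,user,pcpu,pmem,stat,comm --sort=-pcpu | head -15
# If SwapFree is well below SwapTotal, something has been pushed to swap.
grep -E 'Swap(Total|Free)' /proc/meminfo
```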
Another thing to perhaps look at. Are you seeing messages about thermal
throttling events in the system logs?
Could that node have a piece of debris caught in its air intake?
I don't think that will produce a 30% drop in performance. But I have caught
compute nodes with pieces of packaging sucked into the intakes.
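The throttling question can be answered from the kernel log and, on x86, from
per-core counters. A guarded sketch (message wording varies across kernel
versions, and dmesg may be root-only on some boxes):

```shell
# Kernel messages about thermal events.
dmesg 2>/dev/null | grep -iE 'thrott|thermal' | tail -n 5
# x86 keeps per-core throttle counts in sysfs; non-zero means it happened.
grep -H . /sys/devices/system/cpu/cpu0/thermal_throttle/* 2>/dev/null \
    || echo "no thermal_throttle counters on this machine"
```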
As John says, I'd start by checking the health of things like memory,
power supplies etc.
I've seen things like this which go away after a firmware update, so
I'd suggest updating the bios etc if you can.
Have you tried completely removing the power for a few minutes then
booting up again?
Any
PS. Look at "watch cat /proc/interrupts" also.
You might get a qualitative idea of a huge rate of interrupts.
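To turn that into a number you can compare across nodes, sum the counters
twice, a second apart. A small sketch (the awk simply adds every numeric
column, so treat the result as approximate):

```shell
# Total interrupt count across all CPUs; a huge per-second rate on one
# node relative to its siblings is the smoking gun.
total_irqs() {
    awk '{for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i} END {print s}' \
        /proc/interrupts
}
t0=$(total_irqs); sleep 1; t1=$(total_irqs)
echo "interrupts/sec: $((t1 - t0))"
```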
On 10 August 2017 at 16:59, John Hearns wrote:
> Faraz,
> I think you might have to buy me a virtual coffee. Or a beer!
> Please look at the hardware
Faraz,
I think you might have to buy me a virtual coffee. Or a beer!
Please look at the hardware health of that machine. Specifically the
DIMMs. I have seen this before!
If you have some DIMMs which are faulty and are generating ECC errors, then
if the mcelog service is enabled
an interrupt is raised for every error, which can steal CPU time.