Re: [Beowulf] How to debug slow compute node?

2017-08-16 Thread Bill Broadley via Beowulf
On 08/10/2017 07:39 AM, Faraz Hussain wrote: > One of our compute nodes runs ~30% slower than others. It has the exact same > image so I am baffled why it is running slow. I have tested OMP and MPI > benchmarks. Everything runs slower. The CPU usage goes to 2000%, so all looks > normal there. We

Re: [Beowulf] How to debug slow compute node?

2017-08-14 Thread mathog
(Sorry for the duplicate; the first copy went out with the wrong subject.) On 12-Aug-2017 Chris Samuel wrote: Just to add to the excellent suggestions from others: have you compared BIOS/UEFI settings & versions across these nodes to ensure they're identical? Also verify

Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Christopher Samuel
On 12/08/17 17:35, William Johnson wrote: > This may be a long shot, especially in a server room where everything > else is working as expected. Oh agreed! But given people have covered a lot of other bases I thought I'd throw something in from my own experience. If all nodes boot the same OS

Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Christopher Samuel
On 14/08/17 08:17, Lachlan Musicman wrote: > Can you point to some good documentation on this? There is some on Mellanox's website: http://www.mellanox.com/related-docs/prod_software/Mellanox_EN_for_Linux_User_Manual_v2_0-3_0_0.pdf But it took weeks for $VENDOR to figure out what was going

Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Lachlan Musicman
On 12 August 2017 at 13:35, Chris Samuel wrote: > Also remember that the kernel can enable C states that hurt performance even if they are disabled in the BIOS/UEFI. This was painfully apparent on our first SandyBridge cluster that almost failed the performance
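To see which C states the kernel actually exposes on a node, regardless of the BIOS/UEFI setting, one option is to read the Linux cpuidle entries in sysfs and compare the result between the slow node and a healthy one. The following is only a minimal sketch: it assumes the cpuidle sysfs layout (state directories containing name, latency, disable and usage files), and the function names and the latency threshold are made up for illustration.

```python
import glob
import os


def read_cstates(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return one dict per idle state that the kernel exposes for a CPU."""
    state_dirs = glob.glob(os.path.join(cpuidle_dir, "state[0-9]*"))
    # Sort numerically so that state10 comes after state2.
    state_dirs.sort(key=lambda p: int(os.path.basename(p)[len("state"):]))
    states = []
    for state_dir in state_dirs:
        def read(name):
            with open(os.path.join(state_dir, name)) as fh:
                return fh.read().strip()
        states.append({
            "name": read("name"),
            "latency_us": int(read("latency")),   # exit latency in microseconds
            "disabled": read("disable") != "0",
            "usage": int(read("usage")),          # times the state was entered
        })
    return states


def deep_states_in_use(states, max_latency_us):
    """Names of enabled states that were entered and exceed the latency limit.

    max_latency_us is an arbitrary threshold chosen by the caller.
    """
    return [s["name"] for s in states
            if not s["disabled"]
            and s["usage"] > 0
            and s["latency_us"] > max_latency_us]
```

Running read_cstates() on two nodes and diffing the output should show whether the slow node enters deeper states than its peers.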

Re: [Beowulf] How to debug slow compute node?

2017-08-11 Thread Chris Samuel
On Friday, 11 August 2017 12:39:07 AM AEST Faraz Hussain wrote: > I thought it may have to do with cpu scaling, i.e. when the kernel > changes the cpu speed depending on the workload. But we do not have > that enabled on these machines. Just to add to the excellent suggestions from others: have

Re: [Beowulf] How to debug slow compute node?

2017-08-11 Thread mathog
Rushat Rai wrote: I don't know if this has been mentioned, but ECC could be slowing down that specific node if it has a faulty stick. To find the bad stick one often must disable ECC, at least that was the case many years ago the last time I ran into that. If ECC is enabled, even if the

Re: [Beowulf] How to debug slow compute node?

2017-08-11 Thread Rushat Rai
Hi, my first post here. Anyways, I agree with John, I've seen debris caught up in intakes causing some performance drop. 30% does seem a little excessive, but you should check first. I don't know if this has been mentioned, but ECC could be slowing down that specific node if it has a faulty

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Skylar Thompson
>> From: Andrew Holway <andrew.hol...@gmail.com> >> Sent: Thursday, 10 August 2017 20:05 >> To: Gus Correa <g...@ldeo.columbia.edu> >> Cc: Beowulf Mailing

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Lance Wilson
> Sent: Thursday, 10 August 2017 20:05 > To: Gus Correa <g...@ldeo.columbia.edu> > Cc: Beowulf Mailing List <beowulf@beowulf.org> > Subject: Re: [Beowulf] How to debug slow compute node? > > I put €10 on the nose for a faulty power supply.

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
Ten euros for me on a faulty DIMM. From: Andrew Holway Sent: Thursday, 10 August 2017 20:05 To: Gus Correa Cc: Beowulf Mailing List Subject: Re: [Beowulf] How to debug slow compute node? I put €10 on the nose for a faulty power supply. On 10 August 2017 at 19:45

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Faraz Hussain
Thanks for the tips! Unfortunately, I am not seeing anything in /var/log of interest. The mcelog service is not enabled. I do not see anything in /proc/interrupts either. I will look into full power down, memtester and firmware update. It is a blade. We do not have Intel cluster checker, but

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Andrew Latham
In general if you have a snowflake you need to take some steps. 1. Unrack and remove it from the population 2. Image, document the system 3. Sniff test, visual test, power on fans spinning test in a lab 4. Understand that it is OK for one system out of X (where X could be 1000) to fail 5. Return

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Andrew Holway
I put €10 on the nose for a faulty power supply. On 10 August 2017 at 19:45, Gus Correa wrote: > + Leftover processes from previous jobs hogging resources. > That's relatively common. > That can trigger swapping, the ultimate performance killer. > "top" or "htop" on the

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Gus Correa
+ Leftover processes from previous jobs hogging resources. That's relatively common. That can trigger swapping, the ultimate performance killer. "top" or "htop" on the node should show something. (Will go away with a reboot, of course.) Less likely, but possible: + Different BIOS configuration
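Besides "top" or "htop", swapping caused by leftover processes can be checked from /proc directly. The following is only a minimal sketch, assuming the usual Linux /proc/meminfo fields (SwapTotal, SwapFree) and the VmSwap line in /proc/<pid>/status; the function names are made up for illustration.

```python
import os


def swap_used_kb(meminfo_text):
    """Swap in use (kB), computed from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def swapped_processes(proc_root="/proc"):
    """Return [(swap_kb, pid, name)] for processes with pages in swap,
    largest first."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                status = dict(line.split(":", 1) for line in fh if ":" in line)
        except OSError:
            continue  # process exited while we were scanning
        # Kernel threads have no VmSwap line; treat that as zero.
        swap_kb = int(status.get("VmSwap", "0 kB").split()[0])
        if swap_kb > 0:
            name = status.get("Name", "?").strip()
            found.append((swap_kb, int(entry), name))
    return sorted(found, reverse=True)
```

On the suspect node, swap_used_kb(open("/proc/meminfo").read()) gives the total, and swapped_processes() lists candidates that may belong to jobs that have already finished.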

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
Another thing to perhaps look at. Are you seeing messages about thermal throttling events in the system logs? Could that node have a piece of debris caught in its air intake? I don't think that will produce a 30% drop in performance. But I have caught compute nodes with pieces of packaging sucked
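If the logs have rotated, the kernel's per-CPU throttle counters are another place to look. The following is only a minimal sketch: it assumes an x86 node where the kernel exposes thermal_throttle/core_throttle_count under each CPU directory in sysfs (not every platform has these files), and the function names are made up for illustration.

```python
import glob
import os


def throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: core throttle event count} for CPUs that
    expose the counter."""
    counts = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*",
                           "thermal_throttle", "core_throttle_count")
    for path in glob.glob(pattern):
        # .../cpuN/thermal_throttle/core_throttle_count -> "cpuN"
        cpu = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as fh:
            counts[cpu] = int(fh.read().strip())
    return counts


def throttled_cpus(counts):
    """Names of CPUs that have recorded at least one throttle event."""
    return sorted(cpu for cpu, n in counts.items() if n > 0)
```

The counters accumulate since boot, so it is probably most useful to read them before and after a benchmark run and compare with a healthy node.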

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Robert Horton
As John says, I'd start by checking the health of things like memory, power supplies etc. I've seen things like this which go away after a firmware update, so I'd suggest updating the BIOS etc. if you can. Have you tried completely removing the power for a few minutes then booting up again? Any

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
PS. Also look at "watch cat /proc/interrupts". You might get a qualitative idea of a huge rate of interrupts. On 10 August 2017 at 16:59, John Hearns wrote: > Faraz, > I think you might have to buy me a virtual coffee. Or a beer! > Please look at the hardware
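To turn the qualitative view from "watch" into numbers, two snapshots of /proc/interrupts can be diffed. The following is only a minimal sketch, assuming the usual layout (a header row of CPU columns, then one row per interrupt source with per-CPU counts); the function names and the five-second interval are made up for illustration.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: count summed over all CPUs} from the text
    of /proc/interrupts."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        total = 0
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break  # reached the description columns
            total += int(field)
        totals[label.strip()] = total
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each source between two snapshots."""
    return {irq: (after[irq] - before.get(irq, 0)) / seconds
            for irq in after}


def sample_rates(path="/proc/interrupts", seconds=5.0):
    """Take two snapshots `seconds` apart and return the rates."""
    with open(path) as fh:
        before = parse_interrupts(fh.read())
    time.sleep(seconds)
    with open(path) as fh:
        after = parse_interrupts(fh.read())
    return interrupt_rates(before, after, seconds)
```

Sorting the result of sample_rates() by value on the slow node and on a healthy one should make any source with an unusually high rate stand out.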

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread John Hearns via Beowulf
Faraz, I think you might have to buy me a virtual coffee. Or a beer! Please look at the hardware health of that machine. Specifically the DIMMs. I have seen this before! If you have some DIMMs which are faulty and are generating ECC errors, then if the mcelog service is enabled an interrupt is
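Where mcelog is not running, ECC error counts may still be available from the kernel's EDAC counters. The following is only a minimal sketch: it assumes an EDAC driver is loaded for the node's memory controller and exposes ce_count and ue_count files under /sys/devices/system/edac/mc/mc*, and the function names are made up for illustration.

```python
import glob
import os


def edac_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}} from EDAC."""
    results = {}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc[0-9]*"))):
        def read_count(name):
            with open(os.path.join(mc_dir, name)) as fh:
                return int(fh.read().strip())
        results[os.path.basename(mc_dir)] = {
            "corrected": read_count("ce_count"),
            "uncorrected": read_count("ue_count"),
        }
    return results


def controllers_with_errors(results):
    """Names of memory controllers that have logged any ECC error."""
    return sorted(mc for mc, c in results.items()
                  if c["corrected"] > 0 or c["uncorrected"] > 0)
```

A corrected count that keeps climbing on one controller of the slow node, and stays at zero on its peers, would point towards the faulty-DIMM theory.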