Re: [Beowulf] Varying performance across identical cluster nodes.

2018-02-19 Thread Prentice Bisbal
I know this is an old topic. I'm catching up on months' worth of mailing list mail right now. On 09/17/2017 09:09 PM, Christopher Samuel wrote: On 15/09/17 04:45, Prentice Bisbal wrote: I'm happy to announce that I finally found the cause of this problem: numad. Very interesting, it sounds

Re: [Beowulf] Varying performance across identical cluster nodes.

2018-02-19 Thread Prentice Bisbal
Finally catching up on months and months of Beowulf e-mails. On 09/18/2017 05:20 AM, Håkon Bugge wrote: On 18 Sep 2017, at 03:09, Christopher Samuel wrote: On 15/09/17 04:45, Prentice Bisbal wrote: I'm happy to announce that I finally found the cause of this problem: numad.

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-18 Thread Håkon Bugge
> On 18 Sep 2017, at 03:09, Christopher Samuel wrote: > > On 15/09/17 04:45, Prentice Bisbal wrote: > >> I'm happy to announce that I finally found the cause of this problem: numad. > > Very interesting, it sounds like it was migrating processes onto a > single core over

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-17 Thread Christopher Samuel
On 15/09/17 04:45, Prentice Bisbal wrote: > I'm happy to announce that I finally found the cause of this problem: numad. Very interesting, it sounds like it was migrating processes onto a single core over time! Anything diagnostic in its log? -- Christopher Samuel Senior Systems

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-14 Thread Prentice Bisbal
Beowulfers, I'm happy to announce that I finally found the cause of this problem: numad. On these particular systems, numad was having a catastrophic effect on performance. As the jobs ran, GFLOPS would steadily decrease in a monotonic fashion, watching the output of turbostat and 'cpupower

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-14 Thread Joe Landman
On 09/14/2017 09:25 AM, John Hearns via Beowulf wrote: Prentice, as I understand it the problem here is that with the same OS and IB drivers, there is a big difference in performance between stateful and NFS root nodes. Throwing my hat into the ring, try looking to see if there is an

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-14 Thread John Hearns via Beowulf
Prentice, as I understand it the problem here is that with the same OS and IB drivers, there is a big difference in performance between stateful and NFS root nodes. Throwing my hat into the ring, try looking to see if there is an excessive rate of interrupts in the nfsroot case, coming from
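
One way to compare interrupt rates between the nfsroot and local-disk cases is to take two snapshots of /proc/interrupts a known number of seconds apart and diff the counters. The following is only a minimal sketch, assuming the usual layout (a header row of CPU names, then one row per IRQ with one counter per CPU); the function names are hypothetical and not from the thread.

```python
def parse_interrupts(text):
    """Return {irq_label: count summed over all CPUs} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break  # reached the controller/device description
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for every IRQ present in both snapshots."""
    first, second = parse_interrupts(before), parse_interrupts(after)
    return {irq: (second[irq] - first[irq]) / seconds
            for irq in second if irq in first}
```

Running this on both kinds of node while xhpl is active, and sorting the result by rate, would show whether the network or NFS-related IRQs stand out in the nfsroot case.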

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-14 Thread Prentice Bisbal
Switching away from NFS root is not something I can do right now. Prentice On 09/13/2017 02:45 PM, Joe Landman wrote: FWIW: I gave up on NFS boot a while ago, due in part to problems with performance that were hard to track down. The environment I created to do completely ramboot boots

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-14 Thread Prentice Bisbal
Another good question. The systems with the nfsroot OS still have a local disk. That local disk has a /var partition where logs are written. Both systems do send some logs to a remote log server. While the /etc/rsyslog.conf files were almost identical, I copied the one from the nfsroot system to

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-14 Thread Prentice Bisbal
Good question. I just checked using vmstat. When running xhpl on both systems, vmstat shows only zeros for si and so, even long after the performance degrades on the nfsroot instance. Just to be sure, I double-checked with top, which shows 0k of swap being used. Prentice On 09/13/2017 02:15
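
The vmstat check described above can be automated. This is a minimal sketch, assuming the default procps vmstat layout in which the column header row contains `si` and `so`; the function names are hypothetical.

```python
def swap_activity(vmstat_text):
    """Return a list of (si, so) pairs, one per sample row of vmstat output."""
    columns = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            # Column header row; vmstat repeats it periodically.
            columns = (fields.index("si"), fields.index("so"))
        elif columns and fields and all(f.isdigit() for f in fields):
            samples.append((int(fields[columns[0]]), int(fields[columns[1]])))
    return samples


def is_swapping(vmstat_text):
    """True if any sample shows pages swapped in or out."""
    return any(si or so for si, so in swap_activity(vmstat_text))
```

Feeding it the captured output of something like `vmstat 5` taken during an xhpl run should return False on a node that never touches swap, which matches the all-zero si and so columns reported here.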

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-14 Thread Bogdan Costescu
On Fri, Sep 8, 2017 at 8:41 PM, Prentice Bisbal wrote: > I have a dozen servers that are all identical hardware: SuperMicro servers > with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the > users have been complaining of wildly inconsistent performance

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Christopher Samuel
On 14/09/17 03:48, Prentice Bisbal wrote: > What software configuration, either a kernel parameter, configuration > of numad or cpuspeed, or some other setting, could affect this? Hmm, how about diff'ing "sysctl -a" between the systems too? Does one load new CPU microcode in whereas another
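
The "sysctl -a" comparison suggested above can be reduced to just the keys that differ. This is a minimal sketch, assuming each line of the saved output has the form `key = value`; the function names are hypothetical.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by sysctl -a, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '=' (blank lines, error messages)
            settings[key.strip()] = value.strip()
    return settings


def diff_sysctl(text_a, text_b):
    """Return {key: (value_a, value_b)} for every key whose values differ.

    A key present on only one system is reported with None for the other.
    """
    a, b = parse_sysctl(text_a), parse_sysctl(text_b)
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}
```

Some keys, such as the hostname, differ between any two nodes and can be ignored; the interesting entries are tunables that differ between a good node and a slow one.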

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Michael Di Domenico
On Wed, Sep 13, 2017 at 2:45 PM, Joe Landman wrote: > FWIW: I gave up on NFS boot a while ago, due in part to problems with > performance that were hard to track down. The environment I created to do > completely ramboot boots at scale allows me to pivot to NFS if

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Joe Landman
FWIW: I gave up on NFS boot a while ago, due in part to problems with performance that were hard to track down. The environment I created to do completely ramboot boots at scale allows me to pivot to NFS if desired (boot time switch). But I rarely use that. Pure ramboot has been a joy to

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Scott Atchley
Are you logging something that goes to the disk in the local case, but that competes for network bandwidth when NFS mounting? On Wed, Sep 13, 2017 at 2:15 PM, Scott Atchley wrote: > Are you swapping? > > On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Scott Atchley
Are you swapping? On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham wrote: > ack, so maybe validate you can reproduce with another nfs root. Maybe a > lab setup where a single server is serving nfs root to the node. If you > could reproduce in that way then it would give some

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Andrew Latham
ack, so maybe validate you can reproduce with another nfs root. Maybe a lab setup where a single server is serving nfs root to the node. If you could reproduce in that way then it would give some direction. Beyond that it sounds like an interesting problem. On Wed, Sep 13, 2017 at 12:48 PM,

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Prentice Bisbal
Okay, based on the various responses I've gotten here and on other lists, I feel I need to clarify things: This problem only occurs when I'm running our NFSroot-based version of the OS (CentOS 6). When I run the same OS installed on a local disk, I do not have this problem, using the same

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-10 Thread Christopher Samuel
On 09/09/17 04:41, Prentice Bisbal wrote: > Any ideas where to look or what to tweak to fix this? Any idea why this > is only occurring with RHEL 6 w/ NFS root OS? No ideas, but in addition to what others have suggested: 1) diff the output of dmidecode between 4 nodes, 2 OK and 2 slow to see

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Joe Landman
On 09/08/2017 02:41 PM, Prentice Bisbal wrote: But here's the thing: this wasn't a problem until we upgraded to CentOS 6. Where I work, we use a read-only NFSroot filesystem for our cluster nodes, so all nodes are mounting and using the same exact read-only image of the operating system.

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Bill Broadley
The last time I saw this problem, it was because the chassis was missing the air redirection guides, and not enough air was getting to the CPUs. The OS upgrade might actually be enabling better throttling to keep the CPU cooler.

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Lux, Jim (337C)
From: Beowulf [mailto:beowulf-boun...@beowulf.org] On Behalf Of Andrew Latham Sent: Friday, September 08, 2017 11:56 AM To: Prentice Bisbal <pbis...@pppl.gov> Cc: Beowulf List <beowulf@beowulf.org> Subject: Re: [Beowulf] Varying performance across identical cluster nodes. Shooting from hip 1. BI

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Skylar Thompson
I would also suspect a thermal issue, though it could also be firmware. To verify a temperature problem, you might try setting up lm_sensors or scraping "ipmitool sdr" output (whichever is easier) regularly and try to make a performance-vs-temperature plot for each node. As Andrew mentioned, it
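
For the temperature logging suggested above, the scraping step might look like the following. This is a minimal sketch, assuming "ipmitool sdr" prints pipe-separated rows of sensor name, reading and status, with temperature readings ending in "degrees C". The function name is hypothetical, and sensor names vary by board.

```python
def parse_sdr_temperatures(sdr_text):
    """Return {sensor_name: degrees_celsius} from 'ipmitool sdr' output.

    Rows that are not temperatures (fans, voltages) or that have no
    reading are skipped.
    """
    temps = {}
    for line in sdr_text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) >= 2 and parts[1].endswith("degrees C"):
            temps[parts[0]] = float(parts[1].split()[0])
    return temps
```

Recording these values with a timestamp on each node, next to the GFLOPS reported by the benchmark at the same moment, gives the data needed for a per-node performance-vs-temperature plot.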

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Andrew Latham
Shooting from the hip:
1. BIOS identical version and settings
2. Firmware on device (I assume nothing, just thinking out loud)
3. Re-seat fans/replace (oxidized contacts - silly but why not)
4. Verify the power supplies are identical (various watts etc... maybe swap out and test)
5. Memory cooling

[Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Prentice Bisbal
Beowulfers, I need your assistance debugging a problem: I have a dozen servers that are all identical hardware: SuperMicro servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the users have been complaining of wildly inconsistent performance across these 12 nodes.