I know this is an old topic. I'm catching up on months' worth of mailing
list mail right now.
On 09/17/2017 09:09 PM, Christopher Samuel wrote:
On 15/09/17 04:45, Prentice Bisbal wrote:
I'm happy to announce that I finally found the cause of this problem: numad.
Very interesting, it sounds like it was migrating processes onto a
single core over time! Anything diagnostic in its log?
Finally catching up on months and months of beowulf e-mails.
On 09/18/2017 05:20 AM, Håkon Bugge wrote:
On 18 Sep 2017, at 03:09, Christopher Samuel wrote:
> On 15/09/17 04:45, Prentice Bisbal wrote:
>
>> I'm happy to announce that I finally found the cause of this problem: numad.
>
> Very interesting, it sounds like it was migrating processes onto a
> single core over time!
On 15/09/17 04:45, Prentice Bisbal wrote:
> I'm happy to announce that I finally found the cause of this problem: numad.
Very interesting, it sounds like it was migrating processes onto a
single core over time! Anything diagnostic in its log?
--
Christopher Samuel, Senior Systems
Beowulfers,
I'm happy to announce that I finally found the cause of this problem:
numad. On these particular systems, numad was having a catastrophic
effect on the performance. As the jobs ran, GFLOPS would steadily
decrease in a monotonic fashion; watching the output of turbostat and
'cpupower
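Prentice's symptom (clock speed and GFLOPS sinking together) can be spotted from frequency samples alone. A minimal sketch, using canned /proc/cpuinfo-style snapshots in place of a live node; all numbers here are invented for illustration:

```shell
# Two snapshots of per-core clock speed, one taken early in the job and
# one late. On a live node you would grep 'cpu MHz' from /proc/cpuinfo
# (or use turbostat) at intervals instead of these canned files.
cat > snap_early.txt <<'EOF'
cpu MHz         : 2800.000
cpu MHz         : 2800.000
cpu MHz         : 2793.000
EOF
cat > snap_late.txt <<'EOF'
cpu MHz         : 1400.000
cpu MHz         : 1400.000
cpu MHz         : 1407.000
EOF
# Average the per-core clocks in a snapshot file.
avg() { awk -F: '/cpu MHz/ {s+=$2; n++} END {printf "%.0f\n", s/n}' "$1"; }
early=$(avg snap_early.txt)
late=$(avg snap_late.txt)
echo "early avg ${early} MHz, late avg ${late} MHz"
```

A steady fall in the average across samples, with no matching rise in temperature, points at a process/memory migrator like numad rather than thermal throttling.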
On 09/14/2017 09:25 AM, John Hearns via Beowulf wrote:
Prentice, as I understand it the problem here is that with the same OS
and IB drivers, there is a big difference in performance between stateful
and NFS root nodes.
Throwing my hat into the ring, try looking to see if there is an
excessive rate of interrupts in the nfsroot case, coming from
Switching away from NFS root is not something I can change right now.
Prentice
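For anyone who does stay on NFS root, John's interrupt-rate check can be scripted. A rough sketch below, using two canned /proc/interrupts samples standing in for real ones taken a few seconds apart on the node; the IRQ numbers and counts are invented:

```shell
# Canned /proc/interrupts samples; live, you would do:
#   cat /proc/interrupts > irq1.txt; sleep 5; cat /proc/interrupts > irq2.txt
cat > irq1.txt <<'EOF'
           CPU0       CPU1
  0:       1000        900   IO-APIC-edge      timer
 24:       5000          0   PCI-MSI-edge      eth0
EOF
cat > irq2.txt <<'EOF'
           CPU0       CPU1
  0:       1500       1400   IO-APIC-edge      timer
 24:      95000          0   PCI-MSI-edge      eth0
EOF
# Sum the per-CPU counts for each IRQ (assumes two trailing label
# columns per line) and print the delta between samples, largest first.
delta=$(awk 'FNR>1 {s=0; for(i=2;i<NF-1;i++) s+=$i;
                    c[$1]=(FNR==NR)? -s : c[$1]+s}
             END {for(k in c) print c[k], k}' irq1.txt irq2.txt | sort -rn)
echo "$delta"
```

A huge delta on the NIC serving the NFS root, absent on the diskfull nodes, would support the interrupt theory.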
On 09/13/2017 02:45 PM, Joe Landman wrote:
FWIW: I gave up on NFS boot a while ago, due in part to problems with
performance that were hard to track down. The environment I created
to do completely ramboot boots
Another good question. The systems with the nfsroot OS still have a
local disk. That local disk has a /var partition where logs are written.
Both systems do send some logs to a remote log server. While the
/etc/rsyslog.conf files were almost identical, I copied the one from the
nfsroot system to
Good question. I just checked using vmstat. When running xhpl on both
systems, vmstat shows only zeros for si and so, even long after the
performance degrades on the nfsroot instance. Just to be sure, I
double-checked with top, which shows 0k of swap being used.
Prentice
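For the record, the si/so check can be automated rather than eyeballed. A small sketch with canned vmstat output (live, you would pipe in `vmstat 5 10`); the column positions assume the standard vmstat layout, where si and so are the 7th and 8th columns:

```shell
# Canned vmstat output; live:  vmstat 5 10 > vmstat.out
cat > vmstat.out <<'EOF'
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  0      0 963000 211000 550000    0    0     1     2  100  200 99  1  0  0  0
 8  0      0 962500 211000 550000    0    0     0     0  105  210 99  1  0  0  0
EOF
# Count samples with any swap-in/swap-out activity (skip the two header lines).
swapping=$(awk 'NR>2 && ($7+$8) > 0 {n++} END {print n+0}' vmstat.out)
echo "samples with swap activity: $swapping"
```

A persistent zero here, as Prentice saw, rules swapping out.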
On 09/13/2017 02:15
On Fri, Sep 8, 2017 at 8:41 PM, Prentice Bisbal wrote:
> I have a dozen servers that are all identical hardware: SuperMicro servers
> with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the
> users have been complaining of wildly inconsistent performance
On 14/09/17 03:48, Prentice Bisbal wrote:
> What software configuration, either a kernel a parameter, configuration
> of numad or cpuspeed, or some other setting, could affect this?
Hmm, how about diff'ing "sysctl -a" between the systems too?
Does one load new CPU microcode in whereas another
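Chris's sysctl comparison might look something like the sketch below; the node names (diskfull01, nfsroot01) and the canned dumps are hypothetical stand-ins for `ssh node 'sysctl -a'`:

```shell
# Canned sysctl dumps; live, you would collect them with something like:
#   ssh diskfull01 'sysctl -a' | sort > sysctl.diskfull
#   ssh nfsroot01  'sysctl -a' | sort > sysctl.nfsroot
cat > sysctl.diskfull <<'EOF'
kernel.sched_migration_cost = 500000
vm.swappiness = 60
vm.zone_reclaim_mode = 0
EOF
cat > sysctl.nfsroot <<'EOF'
kernel.sched_migration_cost = 500000
vm.swappiness = 60
vm.zone_reclaim_mode = 1
EOF
# diff exits nonzero when the files differ, hence the || true.
diff sysctl.diskfull sysctl.nfsroot > sysctl.diff || true
cat sysctl.diff
```

Any NUMA- or scheduler-related setting that differs between the two flavours is a prime suspect for this kind of slow decay.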
FWIW: I gave up on NFS boot a while ago, due in part to problems with
performance that were hard to track down. The environment I created to
do completely ramboot boots at scale, allows me to pivot to NFS if
desired (boot time switch). But I rarely use that. Pure ramboot has
been a joy to
Are you logging something that goes to the disk in the local case, but is
competing for network bandwidth when NFS mounting?
On Wed, Sep 13, 2017 at 2:15 PM, Scott Atchley wrote:
Are you swapping?
On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham wrote:
ack, so maybe validate you can reproduce with another nfs root. Maybe a lab
setup where a single server is serving nfs root to the node. If you could
reproduce in that way then it would give some direction. Beyond that it
sounds like an interesting problem.
On Wed, Sep 13, 2017 at 12:48 PM,
Okay, based on the various responses I've gotten here and on other
lists, I feel I need to clarify things:
This problem only occurs when I'm running our NFSroot based version of
the OS (CentOS 6). When I run the same OS installed on a local disk, I
do not have this problem, using the same
On 09/09/17 04:41, Prentice Bisbal wrote:
> Any ideas where to look or what to tweak to fix this? Any idea why this
> is only occurring with RHEL 6 w/ NFS root OS?
No ideas, but in addition to what others have suggested:
1) diff the output of dmidecode between 4 nodes, 2 OK and 2 slow to see
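Suggestion 1) can be scripted the same way as the other per-node diffs; a sketch with invented dmidecode fragments standing in for `dmidecode` run as root on two nodes:

```shell
# Canned dmidecode fragments; live, something like:
#   ssh node01 'dmidecode -t bios' > node01.dmi   (run as root)
cat > node01.dmi <<'EOF'
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.0
    Release Date: 08/12/2014
EOF
cat > node02.dmi <<'EOF'
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.5
    Release Date: 01/20/2016
EOF
# diff exits nonzero when the files differ, hence the || true.
diff node01.dmi node02.dmi > bios.diff || true
cat bios.diff
```

With two OK and two slow nodes collected this way, a BIOS or firmware version that lines up with the slow pair stands out immediately.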
On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
But here's the thing: this wasn't a problem until we upgraded to
CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
cluster nodes, so all nodes are mounting and using the same exact
read-only image of the operating system.
Last time I saw this problem was because the chassis was missing the air
redirection guides, and not enough air was getting to the CPUs.
The OS upgrade might actually be enabling better throttling to keep the CPU
cooler.
From: Beowulf [mailto:beowulf-boun...@beowulf.org] On Behalf Of Andrew Latham
Sent: Friday, September 08, 2017 11:56 AM
To: Prentice Bisbal <pbis...@pppl.gov>
Cc: Beowulf List <beowulf@beowulf.org>
Subject: Re: [Beowulf] Varying performance across identical cluster nodes.
I would also suspect a thermal issue, though it could also be firmware. To
verify a temperature problem, you might try setting up lm_sensors or
scraping "ipmitool sdr" output (whichever is easier) regularly and try to
make a performance-vs-temperature plot for each node. As Andrew mentioned,
it
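The scraping half of that idea might look like the sketch below; the sensor names and readings are canned stand-ins for real `ipmitool sdr` output:

```shell
# Canned sensor output; live:  ipmitool sdr type Temperature > sdr.out
cat > sdr.out <<'EOF'
CPU1 Temp        | 52 degrees C      | ok
CPU2 Temp        | 67 degrees C      | ok
System Temp      | 31 degrees C      | ok
FAN1             | 4800 RPM          | ok
EOF
stamp="2017-09-08T12:00:00"   # on a live node: stamp=$(date -Is)
# Keep only the CPU temperature rows, strip the reading down to the
# bare number, and log it with a timestamp for later plotting.
awk -F'|' -v t="$stamp" '/CPU[0-9]+ Temp/ {gsub(/[^0-9]/,"",$2); print t, $1, $2}' \
    sdr.out > temps.log
cat temps.log
```

Run from cron on every node, this gives the time series needed for the performance-vs-temperature plot.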
Shooting from hip
1. BIOS identical version and settings
2. Firmware on device (I assume nothing just thinking out loud)
3. Re-seat fans/replace (oxidized contacts - silly but why not)
4. Verify the power supplies are identical (various watts etc... maybe swap
out and test)
5. Memory cooling
Beowulfers,
I need your assistance debugging a problem:
I have a dozen servers that are all identical hardware: SuperMicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to
CentOS 6, the users have been complaining of wildly inconsistent
performance across these 12 nodes.