Hmm... have you dumped waiters across the entire cluster or just on the NSD
servers/FS managers? Maybe there's a slow node out there participating in the
suspend effort? Might be worth running some quick tracing on the FS manager to
see what it’s up to.
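In case it's useful, this is roughly what I mean (a sketch; <fsmgr-node> is a
placeholder, and the commands live in /usr/lpp/mmfs/bin):

  # dump waiters from every node, not just the NSD servers / FS managers
  mmdsh -N all "mmdiag --waiters" > /tmp/waiters.all
  grep -ci waiting /tmp/waiters.all        # quick sense of how many are stuck

  # find the FS manager, then run a short trace on it while things are wedged
  mmlsmgr
  mmtracectl --start -N <fsmgr-node>       # <fsmgr-node> = whatever mmlsmgr reports
  # ...reproduce for a minute or two...
  mmtracectl --stop -N <fsmgr-node>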
On July 15, 2018 at 13:27:54 EDT,
While I don’t own a DeLorean I work with someone who once fixed one up, which I
*think* effectively means I can jump back in time to before the deadline to
submit. (And let’s be honest, with the way HPC is going it feels like we have
the requisite 1.21GW of power...) However, since I can’t
I hate admitting this but I’ve found something that’s got me stumped.
We have a user running an MPI job on the system. Each rank opens up several
output files to which it writes ASCII debug information. The net result across
several hundred ranks is an absolute smattering of teeny tiny I/O
Hi Lukas,
Check out FPO mode. That mimics Hadoop's data placement features. You can have
up to three replicas of both data and metadata, but the downside, as you say,
is that the wrong combination of node failures will take your cluster down.
You might want to check out something like Excelero’s NVMesh
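If you do end up playing with FPO, the relevant knobs live in the storage pool
stanza at file system creation time. This is from memory, so treat it as a
sketch and check the docs; the pool name, block size, and factors are only
illustrative:

  # stanza file fragment for an FPO-style data pool
  %pool: pool=fpodata blockSize=1M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128
  # 3-way replication of both data and metadata at file system creation
  mmcrfs fpofs -F stanzas.txt -m 3 -M 3 -r 3 -R 3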
Apologies for that. My mobile Exchange email client has a like button you can
tap in the email action menu. I accidentally discovered it this morning
and thought “wonder what this does. Better push it to find out.” Nothing
happened, or so I thought. Apparently all it does is make you look
We've done a fair amount of VPI work but admittedly not with ConnectX-4. Is it
possible the cards are trying to talk IB rather than Ethernet? I figured you're
Ethernet-based because of the mention of Juniper.
Are you attempting to do RoCE or just plain TCP/IP?
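If it helps, this is the quick triage I'd do on the ConnectX-4 side (standard
OFED/Mellanox tooling; the MST device name below is a guess, yours may differ):

  # is the port link layer actually Ethernet, or did it come up as InfiniBand?
  ibv_devinfo | grep -E "hca_id|port:|link_layer"

  # VPI cards can be pinned to Ethernet if autosensing picked wrong (needs MFT)
  mst start
  mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep LINK_TYPE
  mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2   # 2 = ETH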
On December 20, 2017 at 14:40:48
It’s not supported on SLES11 either.
IBM didn’t (that I saw) talk much about this publicly or give customers a
chance to provide feedback about the decision. I know it was raised at the UG
in NY and I recall a number of people saying it would be a significant issue
for them (myself included).
Thanks Sven! That makes sense to me and is what I thought was the case, which is
why I was confused when I saw the reply to the thread that said the >32
subblocks code had no performance impact.
A couple more questions for you: in your presentation there's a benchmark that
shows the file create
Looks like 13 is EACCES (permission denied), which suggests mmfsd lacked
permission to create the QP of the desired type, which is odd since it runs as
root. Is there any remote chance SELinux is enabled (e.g. check sestatus)? I'd
think mmfsd would run unconfined in the default policy, but maybe it
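If you want to rule SELinux in or out quickly, nothing GPFS-specific is needed,
just the stock tools (and only flip to permissive as a temporary test):

  sestatus                      # or: getenforce
  ps -eZ | grep mmfsd           # what context is mmfsd actually running in?
  ausearch -m avc -ts recent    # any recent AVC denials?
  setenforce 0                  # permissive; re-test QP creation, then setenforce 1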
It's my understanding and experience that all member nodes of two clusters
that are multi-clustered must be able to (and, given enough time/activity,
eventually will) make connections to any and all nodes in both clusters. Even
if you don't designate the two protocol nodes as contact nodes I would
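One way to see this for yourself is to watch the daemon connections directly;
GPFS daemon traffic uses TCP port 1191 unless you've changed tscTcpPort:

  ss -tn state established '( sport = :1191 or dport = :1191 )'
  mmremotecluster show all      # which remote nodes are listed as contact nodes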
Somehow this nugget of joy (that’s most definitely sarcasm, this really sucks)
slipped past my radar:
http://www-01.ibm.com/support/docview.wss?uid=isg1IV96475
Anyone know if there’s a fix in the 4.1 stream?
In my opinion this is 100% a tar bug, as the APAR suggests, but GPFS has
implemented
Hi Everyone,
What’s the latest recommended efix release for 4.2.3.4?
I’m working on testing a 4.1 to 4.2 migration and was reminded today of some
fun bugs in 4.2.3.4 for which I think there are efixes. Alternatively, any word
on a 4.2.3.5 release date?
-Aaron
Hi Eric,
I shot you an email directly with contact info.
-Aaron
On August 28, 2017 at 08:26:56 EDT, J. Eric Wonderley
wrote:
We have several Avago/LSI 9305-16e HBAs that I believe came from Advanced HPC.
Can someone recommend another reseller of these HBAs or a contact
Hi Everyone,
In the world of GPFS 4.2, is there a particular advantage to having a large
amount of memory (e.g. >64G) allocated to the pagepool on combination NSD
server/FS manager nodes? We currently have half of physical memory allocated to
pagepool on these nodes.
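For context, this is roughly how we look at it and how we'd change it; the node
class name below is made up, substitute your own:

  mmlsconfig pagepool                       # what's configured, and where
  mmdiag --memory                           # on a given node, how much is in use
  mmchconfig pagepool=64G -N nsdNodes -i    # example: set per node class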
For some historical
Hi Andreas,
I often start with an lsof to see who has files open on the troubled filesystem
and then start stracing the various processes to see which is responsible. It
ought to be a process blocked in uninterruptible sleep, and ideally it would be
obvious, but on a shared machine it might not be.
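Concretely, something along these lines (the filesystem path is a placeholder):

  lsof /gpfs/fs1                                    # who has files open there
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'    # anything in uninterruptible sleep
  strace -f -p <pid>                                # then watch what it's actually doing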
Hi Tushar,
For me the issue was an underlying performance bottleneck (some CPU frequency
scaling problems causing cores to throttle back when it wasn't appropriate).
I noticed you have verbsRdmaSend set to yes. I've seen suggestions in the past
to turn this off under certain conditions
do this.
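To check both of the things mentioned above, i.e. frequency scaling and the
current verbsRdmaSend setting, something like this works (just a sketch):

  # are cores throttling back under load?
  cpupower frequency-info | grep -E "governor|current CPU frequency"
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

  # current GPFS setting, and how you'd turn it off for a test
  mmlsconfig verbsRdmaSend
  mmchconfig verbsRdmaSend=no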
From: gpfsug-discuss-boun...@spectrumscale.org On Behalf Of Knister,
Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Sent: Thursday, 30 March 2017 9:45 AM
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Subject: Re: [gpfsug-discuss] question on viewing
The RDMA errors, I think, are secondary to whatever is going on with either
your IPoIB or Ethernet fabric, which I assume is causing the IPoIB
communication breakdowns and expulsions. We've had entire IB fabrics go
offline, and if the nodes weren't depending on it for daemon communication
nobody got expelled.
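For what it's worth, when expels start flying these are the first things I look
at (default log location; adjust if yours lives elsewhere):

  grep -iE "expel|lease" /var/adm/ras/mmfs.log.latest
  mmlscluster                                    # which interface carries daemon traffic
  mmlsconfig | grep -iE "verbsports|verbsrdma"   # is verbs in play for daemon comms at all
  ping -c 3 <peer-daemon-interface>              # basic reachability on that network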
> Wed Dec 28 16:17:09.035 2016: [D] 12:7FF160039172 Thread::callBody(Thread*) + 1E2 at ??:0
> Wed Dec 28 16:17:09.036 2016: [D] 13:00007FF160027302 Thread::callBodyWrapper(Thread*) + A2 at ??:0
> Wed Dec 28 16:17:09.037 2016: [D] 14:7FF15F73FDC5 start_thread + C5 at ??
I believe ARCAStream has a product that could facilitate this also. I also
believe their engineers are on the list.
From: Andrew Beattie
Sent: 11/17/16, 3:56 PM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Is anyone performing any kind of Charge back / Show
back on Scale today
Understood. Thank you for your help.
By the way, I was able to figure out by poking mmpmon gfis that the job is
performing 20k per second each of inode creations, updates, and deletions across
64 nodes. There's my 60k IOPS on the back end. While I'm impressed and not
surprised GPFS can keep up
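For anyone wanting to reproduce that kind of measurement: gfis is one of the
less-documented mmpmon requests, so here's the same idea with the documented
fs_io_s request instead (the math is simply 20k creates + 20k updates + 20k
deletes ≈ 60k metadata ops/sec):

  echo fs_io_s > /tmp/pmon.cmd
  mmpmon -p -i /tmp/pmon.cmd -r 10 -d 5000   # 10 parseable samples, 5 seconds apart
  # diff the _iu_ (inode update) counter between samples for a rough ops rate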
> After some googling around, I wonder if perhaps what I'm thinking of was
> an I/O forwarding layer that I understood was being developed for x86_64
> ty
Just want to add on to one of the points Sven touched on regarding metadata HW.
We have a modest SSD infrastructure for our metadata disks and we can scan 500M
inodes in parallel in about 5 hours if my memory serves me right (and I believe
we could go faster if we really wanted to). I think
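For a feel of what such a scan looks like, this is the general shape of an
mmapplypolicy list run; the node class, paths, and file names are invented:

  printf "RULE EXTERNAL LIST 'all' EXEC ''\nRULE 'listall' LIST 'all'\n" > /tmp/pol.txt
  mmapplypolicy /gpfs/fs1 -P /tmp/pol.txt -I defer -f /tmp/scan -N nsdNodes -g /gpfs/fs1/.ptmp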
Hi Everyone,
We ran into a rather interesting situation over the past week. We had a job
that was pounding the ever loving crap out of one of our filesystems (called
dnb02) doing about 15GB/s of reads. We had other jobs experience a slowdown on
a different filesystem (called dnb41) that uses
Hi Everyone,
I've noticed that many GPFS commands (mm*acl, mm*attr) and API calls (in
particular the putacl and getacl functions) have no support for not following
symlinks. Is there some hidden support for gpfs_putacl that will cause it to
not dereference symbolic links? Something like the
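A crude interim guard one could script around mmputacl, obviously racy and with
example paths only:

  f=/gpfs/fs1/some/file                      # example path
  if [ -L "$f" ]; then
      echo "refusing to operate on a symlink: $f" >&2
  else
      mmputacl -i /tmp/acl.txt "$f"          # ACL file prepared beforehand
  fi
  # note the TOCTOU window between the -L test and the mmputacl call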