Re: [gpfsug-discuss] mmchdisk hung / proceeding at a glacial pace?

2018-07-15 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hmm... have you dumped waiters across the entire cluster, or just on the NSD servers/FS managers? Maybe there’s a slow node out there participating in the suspend effort. Might be worth running some quick tracing on the FS manager to see what it’s up to.
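
A rough sketch of the checks being suggested here, assuming mmdsh is available and using a placeholder node name for the FS manager:

    # dump waiters everywhere, not just on the NSD servers / FS manager
    mmdsh -N all "/usr/lpp/mmfs/bin/mmdiag --waiters"

    # short trace window on the FS manager (node name is a placeholder)
    mmtracectl --start -N fsmgr01
    sleep 60
    mmtracectl --stop -N fsmgr01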

Re: [gpfsug-discuss] RFE Process ... Burning Issues

2018-04-18 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
While I don’t own a DeLorean I work with someone who once fixed one up, which I *think* effectively means I can jump back in time to before the deadline to submit. (And let’s be honest, with the way HPC is going it feels like we have the requisite 1.21GW of power...) However, since I can’t

[gpfsug-discuss] Confusing I/O Behavior

2018-04-10 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
I hate admitting this, but I’ve found something that’s got me stumped. We have a user running an MPI job on the system. Each rank opens several output files to which it writes ASCII debug information. The net result across several hundred ranks is an absolute smattering of teeny tiny I/O

Re: [gpfsug-discuss] Preferred NSD

2018-03-12 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Lukas, check out FPO mode. That mimics Hadoop’s data placement features. You can have up to 3 replicas of both data and metadata, but the downside, as you say, is that the wrong combination of node failures will take your cluster down. You might want to check out something like Excelero’s NVMesh
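
For comparison, a minimal sketch of what FPO-style placement looks like in an NSD/pool stanza file; the pool, disk, and node names are made up and the parameter values are only illustrative:

    %pool:
      pool=fpodata
      blockSize=2M
      layoutMap=cluster
      allowWriteAffinity=yes
      writeAffinityDepth=1
      blockGroupFactor=128

    %nsd: nsd=n01_sdb device=/dev/sdb servers=node01 usage=dataOnly pool=fpodata

The filesystem would then be created against that stanza file with mmcrfs, using -m 3 -r 3 (and -M 3 -R 3) if you want the three-way replication mentioned above.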

Re: [gpfsug-discuss] GPFS best practises : end user standpoint

2018-01-16 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Apologies for that. My mobile exchange email client has a like button you can tap in the email action menu. I just discovered it this morning accidentally and thought “wonder what this does. Better push it to find out.” Nothing happened or so I thought. Apparently all it does is make you look

Re: [gpfsug-discuss] more than one mlx connectx-4 adapter in same host

2017-12-20 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
We’ve done a fair amount of VPI work, but admittedly not with ConnectX-4. Is it possible the cards are trying to talk IB rather than Ethernet? I figured you’re Ethernet-based because of the mention of Juniper. Are you attempting to do RoCE or just plain TCP/IP?
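
If it helps, a quick way to check which personality the ports actually came up in; the mst device path is a placeholder and the mlxconfig step assumes the Mellanox firmware tools are installed:

    # link_layer should say Ethernet, not InfiniBand, for each port you expect to be Eth
    ibv_devinfo | grep -E "hca_id|link_layer"

    # the VPI port type is also visible (and settable) in firmware
    mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep LINK_TYPE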

Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 71, Issue 35

2017-12-19 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
It’s not supported on SLES11 either. IBM didn’t (that I saw) talk much about this publicly or give customers a chance to provide feedback about the decision. I know it was raised at the UG in NY and I recall a number of people saying it would be a significant issue for them (myself included)

Re: [gpfsug-discuss] Online data migration tool

2017-12-18 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Thanks Sven! That makes sense to me and is what I thought was the case, which is why I was confused when I saw the reply to the thread saying the >32 subblocks code had no performance impact. A couple more questions for you: in your presentation there’s a benchmark that shows the file create

Re: [gpfsug-discuss] Infiniband connection rejected, ibv_create_qp err 13

2017-12-05 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Looks like 13 is EPERM, which suggests the permission to create a QP of the desired type wasn’t there, which is odd since mmfsd runs as root. Is there any remote chance SELinux is enabled (e.g. check sestatus)? I’d think mmfsd would run unconfined in the default policy, but maybe it
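
The SELinux check is quick to run on the affected nodes, something like:

    sestatus        # overall status and current mode
    getenforce      # Enforcing, Permissive, or Disabled
    # if enforcing and auditd is running, look for denials involving mmfsd
    ausearch -m avc -ts recent | grep mmfsd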

Re: [gpfsug-discuss] mmauth/mmremotecluster wonkyness?

2017-11-30 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
It’s my understanding and experience that all member nodes of two clusters that are multi-clustered must be able to (and, given enough time/activity, eventually will) make connections to any and all nodes in both clusters. Even if you don’t designate the 2 protocol nodes as contact nodes I would
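
A read-only sanity check of what each side currently believes about the other is just:

    # on the cluster mounting the remote filesystem
    mmremotecluster show all
    # on the cluster that owns and serves the filesystem
    mmauth show all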

[gpfsug-discuss] tar sparse file data loss

2017-11-22 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Somehow this nugget of joy (that’s most definitely sarcasm, this really sucks) slipped past my radar: http://www-01.ibm.com/support/docview.wss?uid=isg1IV96475 Anyone know if there’s a fix in the 4.1 stream? In my opinion this is 100% a tar bug as the APAR suggests but GPFS has implemented

[gpfsug-discuss] Latest recommended 4.2 efix?

2017-09-28 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Everyone, What’s the latest recommended efix release for 4.2.3.4? I’m working on testing a 4.1 to 4.2 migration and was reminded today of some fun bugs in 4.2.3.4 for which I think there are efixes. Alternatively, any word on a 4.2.3.5 release date? -Aaron

Re: [gpfsug-discuss] sas avago/lsi hba reseller recommendation

2017-08-28 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Eric, I shot you an email directly with contact info. -Aaron On August 28, 2017 at 08:26:56 EDT, J. Eric Wonderley wrote: We have several avago/lsi 9305-16e that I believe came from Advanced HPC. Can someone recommend a another reseller of these hbas or a contact

[gpfsug-discuss] NSD Server/FS Manager Memory Requirements

2017-08-17 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Everyone, In the world of GPFS 4.2 is there a particular advantage to having a large amount of memory (e.g. > 64G) allocated to the pagepool on combination NSD Server/FS manager nodes? We currently have half of physical memory allocated to pagepool on these nodes. For some historical
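
For reference, pagepool is a per-node setting, so experimenting on just the NSD server/FS manager nodes is straightforward; a sketch using a hypothetical node class, with the change taking effect at the next mmfsd restart:

    mmlsconfig pagepool                  # current setting(s)
    mmchconfig pagepool=64G -N nsdNodes  # "nsdNodes" is a hypothetical node class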

Re: [gpfsug-discuss] Associating I/O operations with files/processes

2017-05-30 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Andreas, I often start with an lsof to see who has files open on the troubled filesystem and then start stracing the various processes to see which is responsible. It ought to be a process blocked in uninterruptible sleep and ideally would be obvious but on a shared machine it might not be.
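
Roughly the sequence described, with a placeholder mount point and PID:

    # who has files open on the troubled filesystem
    lsof /gpfs/fs1

    # processes in uninterruptible sleep (state D) are the usual suspects
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

    # then watch what a suspect process is actually doing
    strace -tt -f -p 12345 -e trace=read,write,openat,fsync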

Re: [gpfsug-discuss] VERBS RDMA issue

2017-05-21 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Tushar, For me the issue was an underlying performance bottleneck (some CPU frequency scaling problems causing cores to throttle back when it wasn't appropriate). I noticed you have verbsRdmaSend set to yes. I've seen suggestions in the past to turn this off under certain conditions
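
If you want to test that, it is a single tunable; a sketch with a hypothetical node class (my understanding is the change takes effect when mmfsd restarts on those nodes):

    mmlsconfig verbsRdmaSend
    mmchconfig verbsRdmaSend=no -N nsdNodes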

Re: [gpfsug-discuss] question on viewing block distribution across NSDs

2017-03-29 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
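
The usual starting point here is mmdf, which reports capacity and free space per NSD; the filesystem and pool names below are placeholders:

    mmdf fs1                 # per-NSD usage for the whole filesystem
    mmdf fs1 -P datapool     # restricted to a single storage pool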

Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
The RDMA errors are, I think, secondary to what's going on with either your IPoIB or Ethernet fabric, which I assume is causing the IPoIB communication breakdowns and expulsions. We've had entire IB fabrics go offline, and as long as the nodes weren't depending on them for daemon communication nobody got expelled.

Re: [gpfsug-discuss] LROC

2016-12-28 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
> Wed Dec 28 16:17:09.035 2016: [D] 12:7FF160039172 Thread::callBody(Thread*) + 1E2 at ??:0
> Wed Dec 28 16:17:09.036 2016: [D] 13:00007FF160027302 Thread::callBodyWrapper(Thread*) + A2 at ??:0
> Wed Dec 28 16:17:09.037 2016: [D] 14:7FF15F73FDC5 start_thread + C5 at ??

Re: [gpfsug-discuss] Is anyone performing any kind of Charge back / Show back on Scale today and how do you collect the data

2016-11-18 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
I believe ARCAStream has a product that could facilitate this also. I also believe their engineers are on the list.

Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter

2016-10-15 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Understood. Thank you for your help. By the way, I was able to figure out by poking mmpmon gfis that the job is performing 20k a second each of inode creations, updates and deletions across 64 nodes. There's my 60k iops on the backend. While I'm impressed and not surprised GPFS can keep up
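
The supported way to pull similar per-filesystem counters out of mmpmon is roughly the line below (the gfis request mentioned above is an undocumented variant of fs_io_s):

    # parseable per-filesystem I/O stats, repeated forever at 10-second intervals
    echo fs_io_s | /usr/lpp/mmfs/bin/mmpmon -p -r 0 -d 10000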

Re: [gpfsug-discuss] GPFS Routers

2016-09-20 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
> After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being developed for x86_64 ty

Re: [gpfsug-discuss] GPFS Routers

2016-09-20 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
> After some googling around, I wonder if perhaps what I'm thinking of was an I/O forwarding layer that I understood was being develo

Re: [gpfsug-discuss] *New* IBM Spectrum Protect Whitepaper "Petascale Data Protection"

2016-08-30 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Just want to add on to one of the points Sven touched on regarding metadata HW. We have a modest SSD infrastructure for our metadata disks and we can scan 500M inodes in parallel in about 5 hours if my memory serves me right (and I believe we could go faster if we really wanted to). I think

[gpfsug-discuss] Monitor NSD server queue?

2016-08-16 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Everyone, We ran into a rather interesting situation over the past week. We had a job that was pounding the ever loving crap out of one of our filesystems (called dnb02) doing about 15GB/s of reads. We had other jobs experience a slowdown on a different filesystem (called dnb41) that uses
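
One way people look at the NSD server queues relies on unsupported mmfsadm internals (the output format can change between releases), along the lines of:

    # on an NSD server; saferdump is the less intrusive form of dump
    /usr/lpp/mmfs/bin/mmfsadm saferdump nsd | less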

[gpfsug-discuss] GPFS API O_NOFOLLOW support

2016-07-21 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Hi Everyone, I've noticed that many GPFS commands (mm*acl, mm*attr) and API calls (in particular the putacl and getacl functions) have no support for not following symlinks. Is there some hidden support for gpfs_putacl that will cause it to not dereference symbolic links? Something like the
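
Absent real O_NOFOLLOW support, a shell-level workaround is simply to skip symlinks before handing paths to mmputacl; a sketch with hypothetical paths (note it is inherently racy, since the check and the ACL call are separate steps):

    for f in /gpfs/fs1/some/dir/*; do
        if [ -L "$f" ]; then
            echo "skipping symlink: $f"
            continue
        fi
        mmputacl -i /tmp/acl.txt "$f"    # apply a previously prepared ACL file
    done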