Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Dilger, Andreas
Thanks for the follow-up. Often, once the problem is found, there is no 
update or conclusion to the email thread, and then there is no way for others 
to solve a similar problem in the same way. 

Cheers, Andreas

> On May 1, 2017, at 02:46, Strikwerda, Ger  wrote:
> 
> Hi all,
> 
> Our clients-failed-to-mount/lctl ping horror turned out to be a failing 
> subnet manager. We did not see an issue running 'sminfo', but on the IB 
> switch we could see that the subnet manager was unstable. This caused mayhem 
> on the IB/Lustre setup.
> 
> Thanks everybody for their help/advice/hints. Good to see how this active 
> community works! 
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs_migrate

2017-05-01 Thread Dilger, Andreas
If your filesystem was created with Lustre 2.1 or later then you can use:

   FID=$(lfs path2fid "/path/to/file")
   lfs fid2path "$FID"

to find all the pathnames that are hard links to that file. There is a patch to 
add a "lfs path2links" option that does this in a single step, but it is not in 
any release yet.

The number of pathnames should match the hard link count returned by "stat 
-c %h", provided the file doesn't have too many hard links (i.e. below 140 or 
so). You can then manually migrate the file and re-link the other pathnames to 
the new file with "ln -f".

That is something that has been on the todo list for lfs_migrate for a while, 
so if you wanted to implement that in the script and submit a patch to Gerrit 
it would be appreciated.

Cheers, Andreas

On May 1, 2017, at 06:59, E.S. Rosenberg wrote:

Now that we are close to the end of the migration process, we have a 
lot of files that are being skipped due to "multiple hard links", and I am not 
sure what my strategy should be concerning such files.
Is there any migration automation possible on these? Or is my only route 
contacting the owners (who may just not have known how to use 'ln -s')?

Any advice would be very welcome.
Thanks,
Eliyahu - אליהו

On Wed, Apr 12, 2017 at 6:55 PM, Todd, Allen wrote:
Thanks Andreas -- good to know there is yet another reason to upgrade.  We are 
on 2.7.0.  I was trying to hold out for progressive file layout to land.

Allen

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Dilger, Andreas
Sent: Wednesday, April 12, 2017 8:19 AM
To: Todd, Allen
Cc: E.S. Rosenberg; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lfs_migrate

On Apr 10, 2017, at 14:53, Todd, Allen wrote:
>
> While everyone is talking about lfs migrate, I would like to point out that 
> it appears to be missing an option to preserve file modification and access 
> times, which makes it less useful for behind the scenes data management tasks.

This should actually be the default, though there was a bug in older versions 
of Lustre that didn't preserve the timestamps.  That was fixed in Lustre 2.8.
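
A quick way to check this on any given version (placeholder file name; compare
the output before and after a migration):

    stat -c '%x %y %n' /mnt/lustre/dir/data.bin   # atime, mtime, pathname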

Cheers, Andreas

> Allen
>
> -Original Message-
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Henri Doreau
> Sent: Tuesday, April 04, 2017 3:18 AM
> To: E.S. Rosenberg
> Cc: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] lfs_migrate
>
> Hello,
>
> the manpage for lfs(1) lists the available options in 2.8:
> """
> lfs migrate -m  directory
> lfs migrate [-c | --stripe-count ]
>   [-i | --stripe-index ]
>   [-S | --stripe-size ]
>   [-p | --pool ]
>   [-o | --ost-list ]
>   [-b | --block]
>   [-n | --non-block] file|directory """
>
> Although this is certainly terse, I guess that most parameters are intuitive.
>
> The command will open the file to restripe (blocking concurrent accesses or 
> not, depending on -b/-n), create a special "volatile" (=unlinked) one with 
> the requested striping parameters and copy the source into the destination.
>
> If the copy succeeds, the two files are atomically swapped and concurrent 
> access protection is released.
>
> In non-blocking mode, the process will detect if the source file was already 
> opened or if there's an open during the copy process and abort safely. It is 
> then up to the admin to reschedule the migration later, maybe with -b.
>
> HTH
>
> Henri
>
> On 02/avril - 14:43 E.S. Rosenberg wrote:
>> Thanks for all the great replies!
>>
>> I may be wrong on this but 'lfs migrate' does not seem to be
>> documented in the manpage (my local one is 2.8 so I expect that but
>> even manpages that I find online).
>>
>> Any pointers would be very welcome.
>>
>> On Thu, Mar 23, 2017 at 12:31 PM, Henri Doreau wrote:
>>
>>> On 20/mars - 22:50 E.S. Rosenberg wrote:
 On Mon, Mar 20, 2017 at 10:19 PM, Dilger, Andreas <
>>> andreas.dil...@intel.com>
 wrote:

> The underlying "lfs migrate" command (not the "lfs_migrate"
> script) in newer Lustre versions (2.9) is capable of migrating
> files that are in
>>> use
> by using the "--block" option, which prevents other processes from
> accessing or modifying the file during migration.
>
> Unfortunately, "lfs_migrate" doesn't pass that 

Re: [lustre-discuss] Clients lose IB connection to OSS.

2017-05-01 Thread Thomas Stibor
Hi,

see JIRA: https://jira.hpdd.intel.com/browse/LU-5718

What seems to work as a quick fix (for older versions) is to set the
value of the parameter max_pages_per_rpc=64.

As written in https://jira.hpdd.intel.com/browse/LU-5718,
the issue is resolved, but only in the upcoming version 2.10.0.
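
A hedged example of applying that on a client (the osc wildcard is an
assumption; the value is in units of pages and is not persistent across a
remount unless it is also set permanently on the MGS):

    lctl set_param osc.*.max_pages_per_rpc=64
    lctl get_param osc.*.max_pages_per_rpc   # verify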

Cheers
 Thomas

On Mon, May 01, 2017 at 04:47:32PM +0200, Hans Henrik Happe wrote:
> Hi,
> 
> We have experienced problems with losing connection to an OSS. It starts with:
> 
> May  1 03:35:46 node872 kernel: LNetError:
> 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
> fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst
> idx/frags: 128/236
> May  1 03:35:46 node872 kernel: LNetError:
> 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
> 10.21.10.116@o2ib: -90
> 
> The rest of the log is attached.
> 
> After this, Lustre access is very slow; e.g. a 'df' can take minutes.
> Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
> makes ping work again until file I/O starts. Then I/O errors again.
> 
> We use both IB and TCP on servers, so no routers.
> 
> In the attached log astro-OST0001 has been moved to the other server in
> the HA pair. This is because 'lctl dl -t' showed strange output when on
> the right server:
> 
> # lctl dl -t
>   0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
>   1 UP lov astro-clilov-88107412e800
> 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
>   2 UP lmv astro-clilmv-88107412e800
> 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
>   3 UP mdc astro-MDT-mdc-88107412e800
> 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
>   4 UP osc astro-OST0002-osc-88107412e800
> 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
>   5 UP osc astro-OST0001-osc-88107412e800
> 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
>   6 UP osc astro-OST0003-osc-88107412e800
> 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
>   7 UP osc astro-OST-osc-88107412e800
> 53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib
> 
> So astro-OST0001 seems to be connected through 172.20.10.115@tcp1, even
> though it uses 10.21.10.115@o2ib (verified by performance test and
> disabling tcp1 on IB nodes).
> 
> Please ask for more details if needed.
> 
> Cheers,
> Hans Henrik
> 

> May  1 03:35:46 node872 kernel: LNetError: 
> 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for 
> peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
> May  1 03:35:46 node872 kernel: LNetError: 
> 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 
> 10.21.10.116@o2ib: -90
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:46 node872 kernel: Lustre: 
> 5606:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has 
> failed due to network error: [sent 1493602541/real 1493602541]  
> req@880e99cea080 x1565604440535580/t0(0) 
> o4->astro-OST0002-osc-881070c95c00@10.21.10.116@o2ib:6/4 lens 608/448 e 0 
> to 1 dl 1493602585 ref 2 fl Rpc:X/0/ rc 0/-1
> May  1 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-881070c95c00: 
> Connection to astro-OST0002 (at 10.21.10.116@o2ib) was lost; in progress 
> operations using this service will wait for recovery to complete
> May  1 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-881070c95c00: 
> Connection restored to 10.21.10.116@o2ib (at 10.21.10.116@o2ib)
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:46 node872 kernel: LustreError: 
> 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 
> 88103dd63000
> May  1 03:35:52 node872 kernel: Lustre: 
> 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed 
> out for slow reply: [sent 1493602546/real 1493602546]  req@88103e0f10c0 
> x1565604440535684/t0(0) 
> o8->astro-OST0002-osc-881070c95c00@10.21.10.116@o2ib:28/4 lens 520/544 e 
> 0 to 1 dl 1493602552 ref 1 fl Rpc:XN/0/ rc 0/-1
> May  1 03:35:52 node872 kernel: Lustre: 
> 

Re: [lustre-discuss] Clients lose IB connection to OSS.

2017-05-01 Thread Oucharek, Doug S
For the “RDMA has too many fragments” issue, you need the newly landed patch 
http://review.whamcloud.com/12451.  For the slow access, I am not sure whether 
that is related to the too-many-fragments error.  Once you hit the 
too-many-fragments error, that node usually needs to unload/reload the LNet 
module to recover.
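
A minimal sketch of that recovery step on an affected client (it assumes the
filesystem can be unmounted first; the mount point, MGS NID and fsname are
placeholders):

    umount /mnt/lustre
    lustre_rmmod            # removes the Lustre and LNet modules
    modprobe lustre         # reloads them with the configured LNet options
    mount -t lustre 10.0.0.1@o2ib:/fsname /mnt/lustre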

Doug

On May 1, 2017, at 7:47 AM, Hans Henrik Happe wrote:

Hi,

We have experienced problems with losing connection to an OSS. It starts with:

May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst
idx/frags: 128/236
May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
10.21.10.116@o2ib: -90

The rest of the log is attached.

After this, Lustre access is very slow; e.g. a 'df' can take minutes.
Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
makes ping work again until file I/O starts. Then I/O errors again.

We use both IB and TCP on servers, so no routers.

In the attached log astro-OST0001 has been moved to the other server in
the HA pair. This is because 'lctl dl -t' showed strange output when on
the right server:

# lctl dl -t
 0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
 1 UP lov astro-clilov-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 2 UP lmv astro-clilmv-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 3 UP mdc astro-MDT-mdc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
 4 UP osc astro-OST0002-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
 5 UP osc astro-OST0001-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
 6 UP osc astro-OST0003-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
 7 UP osc astro-OST-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib

So astro-OST0001 seems to be connected through 172.20.10.115@tcp1, even
though it uses 10.21.10.115@o2ib (verified by performance test and
disabling tcp1 on IB nodes).

Please ask for more details if needed.

Cheers,
Hans Henrik

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



[lustre-discuss] Clients lose IB connection to OSS.

2017-05-01 Thread Hans Henrik Happe
Hi,

We have experienced problems with losing connection to an OSS. It starts with:

May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst
idx/frags: 128/236
May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
10.21.10.116@o2ib: -90

The rest of the log is attached.

After this, Lustre access is very slow; e.g. a 'df' can take minutes.
Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
makes ping work again until file I/O starts. Then I/O errors again.
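
If 'lnet net del/add' here refers to the lnetctl interface, a hedged sketch of
that workaround would be (the network name and interface are placeholders):

    lnetctl net del --net o2ib
    lnetctl net add --net o2ib --if ib0
    lctl ping 10.21.10.116@o2ib    # works again until file I/O starts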

We use both IB and TCP on servers, so no routers.

In the attached log astro-OST0001 has been moved to the other server in
the HA pair. This is because 'lctl dl -t' showed strange output when on
the right server:

# lctl dl -t
  0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
  1 UP lov astro-clilov-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
  2 UP lmv astro-clilmv-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
  3 UP mdc astro-MDT-mdc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
  4 UP osc astro-OST0002-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
  5 UP osc astro-OST0001-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
  6 UP osc astro-OST0003-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
  7 UP osc astro-OST-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib

So astro-OST0001 seems to be connected through 172.20.10.115@tcp1, even
though it uses 10.21.10.115@o2ib (verified by performance test and
disabling tcp1 on IB nodes).

Please ask for more details if needed.

Cheers,
Hans Henrik

May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst idx/frags: 128/236
May  1 03:35:46 node872 kernel: LNetError: 5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from 10.21.10.116@o2ib: -90
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:46 node872 kernel: Lustre: 5606:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1493602541/real 1493602541]  req@880e99cea080 x1565604440535580/t0(0) o4->astro-OST0002-osc-881070c95c00@10.21.10.116@o2ib:6/4 lens 608/448 e 0 to 1 dl 1493602585 ref 2 fl Rpc:X/0/ rc 0/-1
May  1 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-881070c95c00: Connection to astro-OST0002 (at 10.21.10.116@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May  1 03:35:46 node872 kernel: Lustre: astro-OST0002-osc-881070c95c00: Connection restored to 10.21.10.116@o2ib (at 10.21.10.116@o2ib)
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:46 node872 kernel: LustreError: 5545:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc 88103dd63000
May  1 03:35:52 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602546/real 1493602546]  req@88103e0f10c0 x1565604440535684/t0(0) o8->astro-OST0002-osc-881070c95c00@10.21.10.116@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602552 ref 1 fl Rpc:XN/0/ rc 0/-1
May  1 03:35:52 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
May  1 03:36:17 node872 kernel: Lustre: 5579:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493602571/real 1493602571]  req@881056dd39c0 x1565604440535728/t0(0) o8->astro-OST0002-osc-881070c95c00@10.21.10.115@o2ib:28/4 lens 520/544 e 0 to 1 dl 1493602577 ref 1 fl Rpc:XN/0/ rc 0/-1
May  1 03:36:18 node872 kernel: Lustre: astro-OST0001-osc-881070c95c00: Connection to astro-OST0001 (at 10.21.10.116@o2ib) was lost; in progress operations using 

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Strikwerda, Ger
The second option. We did not trust 'sminfo', so why not double-check on the
IB switch, or at least look at the logs of the IB switch to see what happens
over there.
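
For reference, the host-side checks that can be compared against the switch
view (both commands come from the standard infiniband-diags package):

    sminfo                      # master SM lid, guid, priority and state
    ibstat | grep -i 'sm lid'   # which SM lid the local port currently sees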



On Mon, May 1, 2017 at 3:15 PM, E.S. Rosenberg 
wrote:

>
>
> On Mon, May 1, 2017 at 3:45 PM, Strikwerda, Ger 
> wrote:
>
>> Hi Eli,
>>
>> We have a 180+ node compute cluster, IB/10Gb connected, with Lustre storage
>> also IB/10Gb connected. We have multiple IB switches, with the master/core
>> switch manageable via web management. This switch is a Mellanox SX6036
>> FDR switch. One subnet manager is supposed to be running on this switch, and
>> using 'sminfo' on the clients we got info that the subnet manager was
>> alive. But when we looked via the web management, the subnet manager was
>> unstable. The reason why is unknown; it could be faulty firmware. During the
>> weekend the system was running fine.
>>
> Did anything specific make you look at the switch, or did you only check
> there after everything else had been checked?
>
>>
>>
>>
>>
>>
>>
>> On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg <
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>>>
>>>
>>> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>>
 Hi all,

 Our clients-failed-to-mount/lctl ping horror turned out to be a
 failing subnet manager. We did not see an issue running 'sminfo', but
 on the IB switch we could see that the subnet manager was unstable. This
 caused mayhem on the IB/Lustre setup.

>>> Can you describe a bit more of how you found this?
>>> You are running an SM on the switches?
>>> Like this if someone else runs into this they will be able to check this
>>> too
>>>

 Thanks everybody for their help/advice/hints. Good to see how this
 active community works!

>>> Indeed.
>>> Eli
>>>




 On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
 esr+lus...@mail.hebrew.edu> wrote:

>
>
> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
> doug.s.oucha...@intel.com> wrote:
>
>> That specific message happens when the “magic” u32 field at the start
>> of a message does not match what we are expecting.  We do check if the
>> message was transmitted as a different endian from us so when you see 
>> this
>> error, we assume that message has been corrupted or the sender is using 
>> an
>> invalid magic value.  I don’t believe this value has changed in the 
>> history
>> of the LND so this is more likely corruption of some sort.
>>
>
> OT: this information should probably be added to LU-2977 which
> specifically includes the question: What does "consumer defined fatal
> error" mean and why is this connection rejected?
>
>
>
>> Doug
>>
>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <
>> andreas.dil...@intel.com> wrote:
>> >
>> > I'm not an LNet expert, but I think the critical issue to focus on
>> is:
>> >
>> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
>> .el6.x86_64
>> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>> >
>> > This means that the LND didn't connect at startup time, but I don't
>> know what the cause is.
>> > The error that generates this message is
>> IB_CM_REJ_CONSUMER_DEFINED, but I don't know enough about IB to tell you
>> what that means.  Some of the later code is checking for mismatched 
>> Lustre
>> versions, but it doesn't even get that far.
>> >
>> > Cheers, Andreas
>> >
>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger <
>> g.j.c.strikwe...@rug.nl> wrote:
>> >>
>> >> Hi Raj,
>> >>
>> >> [root@pg-gpu01 ~]# lustre_rmmod
>> >>
>> >> [root@pg-gpu01 ~]# modprobe -v lustre
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el
>> 6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread E.S. Rosenberg
On Mon, May 1, 2017 at 3:45 PM, Strikwerda, Ger 
wrote:

> Hi Eli,
>
> We have a 180+ node compute cluster, IB/10Gb connected, with Lustre storage
> also IB/10Gb connected. We have multiple IB switches, with the master/core
> switch manageable via web management. This switch is a Mellanox SX6036 FDR
> switch. One subnet manager is supposed to be running on this switch, and
> using 'sminfo' on the clients we got info that the subnet manager was
> alive. But when we looked via the web management, the subnet manager was
> unstable. The reason why is unknown; it could be faulty firmware. During the
> weekend the system was running fine.
>
Did anything specific make you look at the switch, or did you only check
there after everything else had been checked?

>
>
>
>
>
>
> On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg wrote:
>
>>
>>
>> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger wrote:
>>
>>> Hi all,
>>>
>>> Our clients-failed-to-mount/lctl ping horror turned out to be a failing
>>> subnet manager. We did not see an issue running 'sminfo', but on the
>>> IB switch we could see that the subnet manager was unstable. This caused
>>> mayhem on the IB/Lustre setup.
>>>
>> Can you describe a bit more of how you found this?
>> You are running an SM on the switches?
>> Like this if someone else runs into this they will be able to check this
>> too
>>
>>>
>>> Thanks everybody for their help/advice/hints. Good to see how this
>>> active community works!
>>>
>> Indeed.
>> Eli
>>
>>>
>>>
>>>
>>>
>>> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
>>> esr+lus...@mail.hebrew.edu> wrote:
>>>


 On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
 doug.s.oucha...@intel.com> wrote:

> That specific message happens when the “magic” u32 field at the start
> of a message does not match what we are expecting.  We do check if the
> message was transmitted as a different endian from us so when you see this
> error, we assume that message has been corrupted or the sender is using an
> invalid magic value.  I don’t believe this value has changed in the 
> history
> of the LND so this is more likely corruption of some sort.
>

 OT: this information should probably be added to LU-2977 which
 specifically includes the question: What does "consumer defined fatal
 error" mean and why is this connection rejected?



> Doug
>
> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <
> andreas.dil...@intel.com> wrote:
> >
> > I'm not an LNet expert, but I think the critical issue to focus on
> is:
> >
> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
> .el6.x86_64
> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
> 172.23.55.211@o2ib rejected: consumer defined fatal error
> >
> > This means that the LND didn't connect at startup time, but I don't
> know what the cause is.
> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
> but I don't know enough about IB to tell you what that means.  Some of the
> later code is checking for mismatched Lustre versions, but it doesn't even
> get that far.
> >
> > Cheers, Andreas
> >
> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
> wrote:
> >>
> >> Hi Raj,
> >>
> >> [root@pg-gpu01 ~]# lustre_rmmod
> >>
> >> [root@pg-gpu01 ~]# modprobe -v lustre
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/n
> et/lustre/libcfs.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/lvfs.ko
> >> insmod 
> >> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko
> networks=o2ib(ib0)
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/obdclass.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/ptlrpc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/fid.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/mdc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/osc.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/lov.ko
> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
> s/lustre/lustre.ko
> >>
> >> dmesg:
> >>
> >> LNet: HW CPU cores: 24, npartitions: 4
> >> alg: No test for crc32 (crc32-table)
> >> alg: No test for adler32 (adler32-zlib)
> >> alg: No test for crc32 (crc32-pclmul)
> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
> .el6.x86_64
> >> LNet: Added LNI 

Re: [lustre-discuss] lfs_migrate

2017-05-01 Thread E.S. Rosenberg
Now that we are close to the end of the migration process, we have
a lot of files that are being skipped due to "multiple hard links", and I am
not sure what my strategy should be concerning such files.
Is there any migration automation possible on these? Or is my only route
contacting the owners (who may just not have known how to use 'ln -s')?

Any advice would be very welcome.
Thanks,
Eliyahu - אליהו

On Wed, Apr 12, 2017 at 6:55 PM, Todd, Allen  wrote:

> Thanks Andreas -- good to know there is yet another reason to upgrade.  We
> are on 2.7.0.  I was trying to hold out for progressive file layout to land.
>
> Allen
>
> -Original Message-
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On
> Behalf Of Dilger, Andreas
> Sent: Wednesday, April 12, 2017 8:19 AM
> To: Todd, Allen 
> Cc: E.S. Rosenberg ; lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] lfs_migrate
>
> On Apr 10, 2017, at 14:53, Todd, Allen  wrote:
> >
> > While everyone is talking about lfs migrate, I would like to point out
> that it appears to be missing an option to preserve file modification and
> access times, which makes it less useful for behind the scenes data
> management tasks.
>
> This should actually be the default, though there was a bug in older
> versions of Lustre that didn't preserve the timestamps.  That was fixed in
> Lustre 2.8.
>
> Cheers, Andreas
>
> > Allen
> >
> > -Original Message-
> > From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org]
> > On Behalf Of Henri Doreau
> > Sent: Tuesday, April 04, 2017 3:18 AM
> > To: E.S. Rosenberg 
> > Cc: lustre-discuss@lists.lustre.org
> > Subject: Re: [lustre-discuss] lfs_migrate
> >
> > Hello,
> >
> > the manpage for lfs(1) lists the available options in 2.8:
> > """
> > lfs migrate -m  directory
> > lfs migrate [-c | --stripe-count ]
> >   [-i | --stripe-index ]
> >   [-S | --stripe-size ]
> >   [-p | --pool ]
> >   [-o | --ost-list ]
> >   [-b | --block]
> >   [-n | --non-block] file|directory """
> >
> > Although this is certainly terse, I guess that most parameters are
> intuitive.
> >
> > The command will open the file to restripe (blocking concurrent accesses
> or not, depending on -b/-n), create a special "volatile" (=unlinked) one
> with the requested striping parameters and copy the source into the
> destination.
> >
> > If the copy succeeds, the two files are atomically swapped and
> concurrent access protection is released.
> >
> > In non-blocking mode, the process will detect if the source file was
> already opened or if there's an open during the copy process and abort
> safely. It is then up to the admin to reschedule the migration later, maybe
> with -b.
> >
> > HTH
> >
> > Henri
> >
> > On 02/avril - 14:43 E.S. Rosenberg wrote:
> >> Thanks for all the great replies!
> >>
> >> I may be wrong on this but 'lfs migrate' does not seem to be
> >> documented in the manpage (my local one is 2.8 so I expect that but
> >> even manpages that I find online).
> >>
> >> Any pointers would be very welcome.
> >>
> >> On Thu, Mar 23, 2017 at 12:31 PM, Henri Doreau 
> wrote:
> >>
> >>> On 20/mars - 22:50 E.S. Rosenberg wrote:
>  On Mon, Mar 20, 2017 at 10:19 PM, Dilger, Andreas <
> >>> andreas.dil...@intel.com>
>  wrote:
> 
> > The underlying "lfs migrate" command (not the "lfs_migrate"
> > script) in newer Lustre versions (2.9) is capable of migrating
> > files that are in
> >>> use
> > by using the "--block" option, which prevents other processes from
> > accessing or modifying the file during migration.
> >
> > Unfortunately, "lfs_migrate" doesn't pass that argument on, though
> > it wouldn't be hard to change the script. Ideally, the "lfs_migrate"
> >>> script
> > would pass all unknown options to "lfs migrate".
> >
> >
> > The other item of note is that setting the OST inactive on the MDS
> > will prevent the MDS from deleting objects on the OST (see
> > https://jira.hpdd.intel.com/browse/LU-4825 for details).  In
> > Lustre
> >>> 2.9
> > and later it is possible to set on the MDS:
> >
> >   mds# lctl set_param osp..create_count=0
> >
> > to stop MDS allocation of new objects on that OST. On older
> > versions
> >>> it is
> > possible to set on the OSS:
> >
> >  oss# lctl set_param obdfilter..degraded=1
> >
> > so that it tells the MDS to avoid it if possible, but this isn't a
> > hard exclusion.
> >
> > It is also possible to use a testing hack to mark an OST as out of
> >>> inodes,
> > but that only works for one OST per OSS and it sounds like that
> > won't
> >>> be
> > useful in this case.
> >
> > Cheers, Andreas
> >
>  You're making me want 

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Strikwerda, Ger
Hi Eli,

We have a 180+ node compute cluster, IB/10Gb connected, with Lustre storage
also IB/10Gb connected. We have multiple IB switches, with the master/core
switch manageable via web management. This switch is a Mellanox SX6036 FDR
switch. One subnet manager is supposed to be running on this switch, and
using 'sminfo' on the clients we got info that the subnet manager was
alive. But when we looked via the web management, the subnet manager was
unstable. The reason why is unknown; it could be faulty firmware. During the
weekend the system was running fine.






On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg 
wrote:

>
>
> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger 
> wrote:
>
>> Hi all,
>>
>> Our clients-failed-to-mount/lctl ping horror turned out to be a failing
>> subnet manager. We did not see an issue running 'sminfo', but on the
>> IB switch we could see that the subnet manager was unstable. This caused
>> mayhem on the IB/Lustre setup.
>>
> Can you describe a bit more of how you found this?
> You are running an SM on the switches?
> Like this if someone else runs into this they will be able to check this
> too
>
>>
>> Thanks everybody for their help/advice/hints. Good to see how this active
>> community works!
>>
> Indeed.
> Eli
>
>>
>>
>>
>>
>> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>>>
>>>
>>> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
>>> doug.s.oucha...@intel.com> wrote:
>>>
 That specific message happens when the “magic” u32 field at the start
 of a message does not match what we are expecting.  We do check if the
 message was transmitted as a different endian from us so when you see this
 error, we assume that message has been corrupted or the sender is using an
 invalid magic value.  I don’t believe this value has changed in the history
 of the LND so this is more likely corruption of some sort.

>>>
>>> OT: this information should probably be added to LU-2977 which
>>> specifically includes the question: What does "consumer defined fatal
>>> error" mean and why is this connection rejected?
>>>
>>>
>>>
 Doug

 > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <
 andreas.dil...@intel.com> wrote:
 >
 > I'm not an LNet expert, but I think the critical issue to focus on is:
 >
 >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
 .el6.x86_64
 >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
 >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
 172.23.55.211@o2ib rejected: consumer defined fatal error
 >
 > This means that the LND didn't connect at startup time, but I don't
 know what the cause is.
 > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
 but I don't know enough about IB to tell you what that means.  Some of the
 later code is checking for mismatched Lustre versions, but it doesn't even
 get that far.
 >
 > Cheers, Andreas
 >
 >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
 wrote:
 >>
 >> Hi Raj,
 >>
 >> [root@pg-gpu01 ~]# lustre_rmmod
 >>
 >> [root@pg-gpu01 ~]# modprobe -v lustre
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/n
 et/lustre/libcfs.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/lvfs.ko
 >> insmod 
 >> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko
 networks=o2ib(ib0)
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/obdclass.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/ptlrpc.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/fid.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/mdc.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/osc.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/lov.ko
 >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
 s/lustre/lustre.ko
 >>
 >> dmesg:
 >>
 >> LNet: HW CPU cores: 24, npartitions: 4
 >> alg: No test for crc32 (crc32-table)
 >> alg: No test for adler32 (adler32-zlib)
 >> alg: No test for crc32 (crc32-pclmul)
 >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
 .el6.x86_64
 >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
 >>
 >> But no luck,
 >>
 >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
 >> failed to ping 172.23.55.211@o2ib: Input/output error
 >>
 >> [root@pg-gpu01 ~]# mount /home
 >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01
 at /home failed: Input/output error
 >> Is the 

Re: [lustre-discuss] client fails to mount

2017-05-01 Thread E.S. Rosenberg
On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger 
wrote:

> Hi all,
>
> Our clients-failed-to-mount/lctl ping horror turned out to be a failing
> subnet manager. We did not see an issue running 'sminfo', but on the
> IB switch we could see that the subnet manager was unstable. This caused
> mayhem on the IB/Lustre setup.
>
Can you describe a bit more of how you found this?
You are running an SM on the switches?
Like this if someone else runs into this they will be able to check this
too

>
> Thanks everybody for their help/advice/hints. Good to see how this active
> community works!
>
Indeed.
Eli

>
>
>
>
> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
> esr+lus...@mail.hebrew.edu> wrote:
>
>>
>>
>> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
>> doug.s.oucha...@intel.com> wrote:
>>
>>> That specific message happens when the “magic” u32 field at the start of
>>> a message does not match what we are expecting.  We do check if the message
>>> was transmitted as a different endian from us so when you see this error,
>>> we assume that message has been corrupted or the sender is using an invalid
>>> magic value.  I don’t believe this value has changed in the history of the
>>> LND so this is more likely corruption of some sort.
>>>
>>
>> OT: this information should probably be added to LU-2977 which
>> specifically includes the question: What does "consumer defined fatal
>> error" mean and why is this connection rejected?
>>
>>
>>
>>> Doug
>>>
>>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas 
>>> wrote:
>>> >
>>> > I'm not an LNet expert, but I think the critical issue to focus on is:
>>> >
>>> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
>>> .el6.x86_64
>>> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>>> >
>>> > This means that the LND didn't connect at startup time, but I don't
>>> know what the cause is.
>>> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
>>> but I don't know enough about IB to tell you what that means.  Some of the
>>> later code is checking for mismatched Lustre versions, but it doesn't even
>>> get that far.
>>> >
>>> > Cheers, Andreas
>>> >
>>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
>>> wrote:
>>> >>
>>> >> Hi Raj,
>>> >>
>>> >> [root@pg-gpu01 ~]# lustre_rmmod
>>> >>
>>> >> [root@pg-gpu01 ~]# modprobe -v lustre
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/n
>>> et/lustre/libcfs.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/lvfs.ko
>>> >> insmod 
>>> >> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko
>>> networks=o2ib(ib0)
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/obdclass.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/ptlrpc.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/fid.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/mdc.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/osc.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/lov.ko
>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/f
>>> s/lustre/lustre.ko
>>> >>
>>> >> dmesg:
>>> >>
>>> >> LNet: HW CPU cores: 24, npartitions: 4
>>> >> alg: No test for crc32 (crc32-table)
>>> >> alg: No test for adler32 (adler32-zlib)
>>> >> alg: No test for crc32 (crc32-pclmul)
>>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
>>> .el6.x86_64
>>> >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>> >>
>>> >> But no luck,
>>> >>
>>> >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>>> >> failed to ping 172.23.55.211@o2ib: Input/output error
>>> >>
>>> >> [root@pg-gpu01 ~]# mount /home
>>> >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01
>>> at /home failed: Input/output error
>>> >> Is the MGS running?
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
>>> >> Yes, this is strange. Normally, I have seen that credits mismatch
>>> results in this scenario, but it doesn't look like this is the case.
>>> >>
>>> >> You wouldn't want to put mgs into capture debug messages as there
>>> will be a lot of data.
>>> >>
>>> >> I guess you already tried removing the lustre drivers and adding it
>>> again ?
>>> >> lustre_rmmod
>>> >> modprobe -v lustre
>>> >>
>>> >> And check dmesg for any errors...
>>> >>
>>> >>
>>> >> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger <
>>> g.j.c.strikwe...@rug.nl> wrote:
>>> >> Hi Raj,
>>> >>
>>> >> When I do an lctl ping on an MGS server I do not see any logs at all.
>>> Also 

Re: [lustre-discuss] kerberised lustre performance

2017-05-01 Thread E.S. Rosenberg
On Fri, Apr 28, 2017 at 11:19 AM, Sebastien Buisson 
wrote:

> Hi,
>
> I have not specifically measured the performance impact of getting the
> Kerberos ticket before any Lustre request can be sent by the client. It
> happens at the first connection (so when mounting) and then when the ticket
> expires. Otherwise the ticket is cached.
> So unless the ticket has a very short lifetime of a few seconds,
> contacting the Kerberos server to renew the ticket should have very little
> impact on a standard production workflow.
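
A quick way to inspect the cached ticket and its remaining lifetime on a
client (standard MIT Kerberos tooling, nothing Lustre-specific):

    klist        # lists the cached credentials and their expiry times
    klist -s && echo "valid ticket in cache"   # scriptable check, exit status only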
>
Ah, so it is only to authenticate/authorize the client system and not the
user that is trying to access files?
Thanks,
Eli

>
> Cheers,
> Sebastien.
>
> > On Apr 27, 2017, at 13:10, E.S. Rosenberg wrote:
> >
> > Hi everyone,
> > I just saw Sebatians' talk at LUG 2016 (Yes I know I'm a bit behind
> times) and I was wondering if and how much a performance impact there is
> from the need to get kerberos tickets before file actions (or is it only
> mounting?)...
> >
> > https://www.youtube.com/watch?v=zo6b03zxrIs
> >
> > Thanks,
> > Eli
> >
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] client fails to mount

2017-05-01 Thread Strikwerda, Ger
Hi all,

Our clients-failed-to-mount/lctl ping horror turned out to be a failing
subnet manager. We did not see an issue running 'sminfo', but on the
IB switch we could see that the subnet manager was unstable. This caused
mayhem on the IB/Lustre setup.

Thanks everybody for their help/advice/hints. Good to see how this active
community works!




On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg 
wrote:

>
>
> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
> doug.s.oucha...@intel.com> wrote:
>
>> That specific message happens when the “magic” u32 field at the start of
>> a message does not match what we are expecting.  We do check if the message
>> was transmitted as a different endian from us so when you see this error,
>> we assume that message has been corrupted or the sender is using an invalid
>> magic value.  I don’t believe this value has changed in the history of the
>> LND so this is more likely corruption of some sort.
>>
>
> OT: this information should probably be added to LU-2977 which
> specifically includes the question: What does "consumer defined fatal
> error" mean and why is this connection rejected?
>
>
>
>> Doug
>>
>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas 
>> wrote:
>> >
>> > I'm not an LNet expert, but I think the critical issue to focus on is:
>> >
>> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
>> .el6.x86_64
>> >  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> 172.23.55.211@o2ib rejected: consumer defined fatal error
>> >
>> > This means that the LND didn't connect at startup time, but I don't
>> know what the cause is.
>> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
>> but I don't know enough about IB to tell you what that means.  Some of the
>> later code is checking for mismatched Lustre versions, but it doesn't even
>> get that far.
>> >
>> > Cheers, Andreas
>> >
>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger 
>> wrote:
>> >>
>> >> Hi Raj,
>> >>
>> >> [root@pg-gpu01 ~]# lustre_rmmod
>> >>
>> >> [root@pg-gpu01 ~]# modprobe -v lustre
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> net/lustre/libcfs.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/lvfs.ko
>> >> insmod 
>> >> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko
>> networks=o2ib(ib0)
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/obdclass.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/ptlrpc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/fid.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/mdc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/osc.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/lov.ko
>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/
>> fs/lustre/lustre.ko
>> >>
>> >> dmesg:
>> >>
>> >> LNet: HW CPU cores: 24, npartitions: 4
>> >> alg: No test for crc32 (crc32-table)
>> >> alg: No test for adler32 (adler32-zlib)
>> >> alg: No test for crc32 (crc32-pclmul)
>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573
>> .el6.x86_64
>> >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> >>
>> >> But no luck,
>> >>
>> >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>> >> failed to ping 172.23.55.211@o2ib: Input/output error
>> >>
>> >> [root@pg-gpu01 ~]# mount /home
>> >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01
>> at /home failed: Input/output error
>> >> Is the MGS running?
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
>> >> Yes, this is strange. Normally, I have seen that credits mismatch
>> results in this scenario, but it doesn't look like this is the case.
>> >>
>> >> You wouldn't want to put mgs into capture debug messages as there will
>> be a lot of data.
>> >>
>> >> I guess you already tried removing the lustre drivers and adding it
>> again ?
>> >> lustre_rmmod
>> >> modprobe -v lustre
>> >>
>> >> And check dmesg for any errors...
>> >>
>> >>
>> >> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger <
>> g.j.c.strikwe...@rug.nl> wrote:
>> >> Hi Raj,
>> >>
>> >> When I do an lctl ping on an MGS server I do not see any logs at all,
>> not even when I do a successful ping from a working node. Is there a way to
>> make the Lustre logging more verbose to see more detail on the LNET level?
>> >>
>> >> It is very strange that a rebooted node is able to lctl ping compute
>> nodes, but fails to lctl ping metadata and storage nodes.
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>> >> Ger,
>> >> It looks like 

[lustre-discuss] DIsk Mount Permission to Multiple Users

2017-05-01 Thread yasir
Hi Lustre Admin

I have successfully mounted the Lustre filesystem on all clients at the
/lustre directory. Now I have to give multiple users or groups permission
on /lustre.

So please help with how to allocate /lustre permissions to different users.

 

 

 

Thanks & Regards

Yasir

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org