Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-19 Thread Jessica Otey

Hi Megan (et al.),

I don't understand the behavior, either... I've worked successfully with 
changelogs in the past, and indeed changelog consumption is very lightweight. 
(Since robinhood has not been running anywhere, I'd already removed all the 
changelog readers from the various MDTs for the reasons you noted.)
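
For reference, here is roughly the sequence for listing and dropping a stale 
reader by hand; the device and reader names below are placeholders, not our 
real ones:

mds# lctl get_param mdd.*.changelog_users
mds# lctl --device <fsname>-MDT0000 changelog_deregister cl1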


Whatever my problem is, it does not manifest as a load issue on either the 
client or the MDT side. It manifests rather as some sort of connection 
failure. Here's the most recent example, which may generate more ideas as 
to the cause.


On our third lustre fs (one we use for backups), I was able to complete 
a file system scan to populate the database, but then when I activated 
changelogs, the client almost immediately experienced the disconnections 
we've seen on the other two systems.
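
By "activated changelogs" I mean registering a new changelog user on that 
MDT, roughly as below (the device name is a placeholder, and the reader id 
in the output will of course differ):

mds# lctl --device <fsname>-MDT0000 changelog_register
<fsname>-MDT0000: Registered changelog userid 'cl1'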


Here's the log from the MDT (heinlein, 10.7.17.126). The robinhood 
client is akebono (10.7.17.122):

May 16 16:05:51 heinlein kernel: Lustre: lard-MDD: changelog on
May 16 16:05:51 heinlein kernel: Lustre: Modifying parameter 
general.mdd.lard-MDT*.changelog_mask in log params
May 16 16:13:16 heinlein kernel: Lustre: lard-MDT: Client 
2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: lard-MDT: Client 
2d1aedc0-1f5e-2741-689a-169922a2593b (at 10.7.17.122@o2ib) reconnecting
May 16 16:13:17 heinlein kernel: Lustre: Skipped 7458 previous similar 
messages


Here's what akebono (10.7.17.122) reported:

May 16 16:13:16 akebono kernel: LustreError: 11-0: 
lard-MDT-mdc-880fd68d7000: Communicating with 10.7.17.126@o2ib, 
operation llog_origin_handle_destroy failed with -19.
May 16 16:13:16 akebono kernel: Lustre: 
lard-MDT-mdc-880fd68d7000: Connection to lard-MDT (at 
10.7.17.126@o2ib) was lost; in progress operations using this service 
will wait for recovery to complete
May 16 16:13:16 akebono kernel: Lustre: 
lard-MDT-mdc-880fd68d7000: Connection restored to lard-MDT 
(at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: LustreError: 11-0: 
lard-MDT-mdc-880fd68d7000: Communicating with 10.7.17.126@o2ib, 
operation llog_origin_handle_destroy failed with -19.
May 16 16:13:17 akebono kernel: LustreError: Skipped 7458 previous 
similar messages
May 16 16:13:17 akebono kernel: Lustre: 
lard-MDT-mdc-880fd68d7000: Connection to lard-MDT (at 
10.7.17.126@o2ib) was lost; in progress operations using this service 
will wait for recovery to complete
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar 
messages
May 16 16:13:17 akebono kernel: Lustre: 
lard-MDT-mdc-880fd68d7000: Connection restored to lard-MDT 
(at 10.7.17.126@o2ib)
May 16 16:13:17 akebono kernel: Lustre: Skipped 7458 previous similar 
messages
May 16 16:13:18 akebono kernel: LustreError: 11-0: 
lard-MDT-mdc-880fd68d7000: Communicating with 10.7.17.126@o2ib, 
operation llog_origin_handle_destroy failed with -19.
May 16 16:13:18 akebono kernel: LustreError: Skipped 14924 previous 
similar messages


Jessica

On 5/19/17 8:58 AM, Ms. Megan Larko wrote:

Greetings Jessica,

I'm not sure I am correctly understanding the behavior "robinhood 
activity floods the MDT". The robinhood program, as you (and I) are 
using it, consumes the MDT CHANGELOG via a reader_id that was 
assigned when the CHANGELOG was enabled on the MDT. You can check 
the MDS for these readers via "lctl get_param mdd.*.changelog_users". 
Each CHANGELOG reader must either be consumed by a process or 
destroyed; otherwise the CHANGELOG will grow until it consumes 
sufficient space to stop the MDT from functioning correctly. So 
robinhood should consume and then clear the CHANGELOG via this 
reader_id. This implementation of robinhood is actually a rather 
light-weight process as far as the MDS is concerned. The load issues 
I encountered were on the robinhood server itself, which is a 
separate server from the Lustre MGS/MDS server.


Just curious, have you checked for multiple reader_id's on your MDS 
for this Lustre file system?


P.S. My robinhood configuration file is using nb_threads = 8, just for 
a data point.


Cheers,
megan





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-19 Thread Jessica Otey
I think that may be a red herring related to rsyslog?  When we most 
recently rebooted the MDT, this is the log (still on the box, not on the 
log server):


May  3 14:24:22 asimov kernel: LNet: HW CPU cores: 12, npartitions: 4
May  3 14:24:30 asimov kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]

And lctl list_nids gives it once:

[root@asimov ~]# lctl list_nids
10.7.17.8@o2ib

Jessica

On 5/19/17 10:13 AM, Jeff Johnson wrote:

Jessica,

You are getting a NID registering twice. Doug noticed and pointed it 
out. I'd look to see if that is one machine doing something twice or 
two machines with the same NID.


--Jeff

On Fri, May 19, 2017 at 05:58, Ms. Megan Larko wrote:


Greetings Jessica,

I'm not sure I am correctly understanding the behavior "robinhood
activity floods the MDT". The robinhood program, as you (and I)
are using it, consumes the MDT CHANGELOG via a reader_id that was
assigned when the CHANGELOG was enabled on the MDT. You can check
the MDS for these readers via "lctl get_param
mdd.*.changelog_users". Each CHANGELOG reader must either be
consumed by a process or destroyed; otherwise the CHANGELOG will
grow until it consumes sufficient space to stop the MDT from
functioning correctly. So robinhood should consume and then clear
the CHANGELOG via this reader_id. This implementation of
robinhood is actually a rather light-weight process as far as the
MDS is concerned. The load issues I encountered were on the
robinhood server itself, which is a separate server from the Lustre
MGS/MDS server.

Just curious, have you checked for multiple reader_id's on your
MDS for this Lustre file system?

P.S. My robinhood configuration file is using nb_threads = 8, just
for a data point.

Cheers,
megan

On Thu, May 18, 2017 at 2:36 PM, Jessica Otey wrote:

Hi Megan,

Thanks for your input. We use percona, a drop-in replacement
for mysql... The robinhood activity floods the MDT, but it
does not seem to produce any excessive load on the robinhood
box...

Anyway, FWIW...

~]# mysql --version
mysql  Ver 14.14 Distrib 5.5.54-38.6, for Linux (x86_64) using
readline 5.1

Product: robinhood
Version: 3.0-1
Build:   2017-03-13 10:29:26

Compilation switches:
Lustre filesystems
Lustre Version: 2.5
Address entries by FID
MDT Changelogs supported

Database binding: MySQL

RPM: robinhood-lustre-3.0-1.lustre2.5.el6.x86_64

Lustre rpms:

lustre-client-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
lustre-client-modules-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64


On 5/18/17 11:55 AM, Ms. Megan Larko wrote:

With regards to (WRT) Subject "Robinhood exhausting RPC
resources against 2.5.5  lustre file systems", what version
of robinhood and what version of MySQL database?   I mention
this because I have been working with robinhood-3.0-0.rc1 and
initially MySQL-5.5.32 and Lustre 2.5.42.1 on
kernel-2.6.32-573 and had issues in which the robinhood
server consumed more than the total amount of 32 CPU cores on
the robinhood server (with 128 G RAM) and would functionally
hang the robinhood server.   The issue was solved for me by
changing to MySQL-5.6.35.   It was the "sort" command in
robinhood that was not working well with the MySQL-5.5.32.

Cheers,
megan




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com 
www.aeoncomputing.com 
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-19 Thread Jeff Johnson
Jessica,

You are getting a NID registering twice. Doug noticed and pointed it out.
I'd look to see if that is one machine doing something twice or two
machines with the same NID.

--Jeff

On Fri, May 19, 2017 at 05:58 Ms. Megan Larko  wrote:

> Greetings Jessica,
>
> I'm not sure I am correctly understanding the behavior "robinhood activity
> floods the MDT". The robinhood program, as you (and I) are using it,
> consumes the MDT CHANGELOG via a reader_id that was assigned when the
> CHANGELOG was enabled on the MDT. You can check the MDS for these readers
> via "lctl get_param mdd.*.changelog_users". Each CHANGELOG reader must
> either be consumed by a process or destroyed; otherwise the CHANGELOG will
> grow until it consumes sufficient space to stop the MDT from functioning
> correctly. So robinhood should consume and then clear the CHANGELOG via
> this reader_id. This implementation of robinhood is actually a rather
> light-weight process as far as the MDS is concerned. The load issues I
> encountered were on the robinhood server itself, which is a separate server
> from the Lustre MGS/MDS server.
>
> Just curious, have you checked for multiple reader_id's on your MDS for
> this Lustre file system?
>
> P.S. My robinhood configuration file is using nb_threads = 8, just for a
> data point.
>
> Cheers,
> megan
>
> On Thu, May 18, 2017 at 2:36 PM, Jessica Otey  wrote:
>
>> Hi Megan,
>>
>> Thanks for your input. We use percona, a drop-in replacement for mysql...
>> The robinhood activity floods the MDT, but it does not seem to produce any
>> excessive load on the robinhood box...
>>
>> Anyway, FWIW...
>>
>> ~]# mysql --version
>> mysql  Ver 14.14 Distrib 5.5.54-38.6, for Linux (x86_64) using readline
>> 5.1
>>
>> Product: robinhood
>> Version: 3.0-1
>> Build:   2017-03-13 10:29:26
>>
>> Compilation switches:
>> Lustre filesystems
>> Lustre Version: 2.5
>> Address entries by FID
>> MDT Changelogs supported
>>
>> Database binding: MySQL
>>
>> RPM: robinhood-lustre-3.0-1.lustre2.5.el6.x86_64
>> Lustre rpms:
>>
>> lustre-client-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
>> lustre-client-modules-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
>>
>> On 5/18/17 11:55 AM, Ms. Megan Larko wrote:
>>
>> With regards to (WRT) Subject "Robinhood exhausting RPC resources against
>> 2.5.5   lustre file systems", what version of robinhood and what version of
>> MySQL database?   I mention this because I have been working with
>> robinhood-3.0-0.rc1 and initially MySQL-5.5.32 and Lustre 2.5.42.1 on
>> kernel-2.6.32-573 and had issues in which the robinhood server consumed
>> more than the total amount of 32 CPU cores on the robinhood server (with
>> 128 G RAM) and would functionally hang the robinhood server.   The issue
>> was solved for me by changing to MySQL-5.6.35.   It was the "sort" command
>> in robinhood that was not working well with the MySQL-5.5.32.
>>
>> Cheers,
>> megan
>>
>>
>>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-19 Thread Ms. Megan Larko
Greetings Jessica,

I'm not sure I am correctly understanding the behavior "robinhood activity
floods the MDT". The robinhood program, as you (and I) are using it,
consumes the MDT CHANGELOG via a reader_id that was assigned when the
CHANGELOG was enabled on the MDT. You can check the MDS for these readers
via "lctl get_param mdd.*.changelog_users". Each CHANGELOG reader must
either be consumed by a process or destroyed; otherwise the CHANGELOG will
grow until it consumes sufficient space to stop the MDT from functioning
correctly. So robinhood should consume and then clear the CHANGELOG via
this reader_id. This implementation of robinhood is actually a rather
light-weight process as far as the MDS is concerned. The load issues I
encountered were on the robinhood server itself, which is a separate server
from the Lustre MGS/MDS server.
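
For anyone following along, the consume-and-clear cycle looks roughly like
this when driven by hand (the MDT name and reader id are placeholders;
robinhood does the equivalent internally):

client# lfs changelog <fsname>-MDT0000
client# lfs changelog_clear <fsname>-MDT0000 cl1 0

If I remember the semantics correctly, an end record of 0 means "up to the
current last record", which is what keeps the log from growing without bound.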

Just curious, have you checked for multiple reader_id's on your MDS for
this Lustre file system?

P.S. My robinhood configuration file is using nb_threads = 8, just for a
data point.

Cheers,
megan

On Thu, May 18, 2017 at 2:36 PM, Jessica Otey  wrote:

> Hi Megan,
>
> Thanks for your input. We use percona, a drop-in replacement for mysql...
> The robinhood activity floods the MDT, but it does not seem to produce any
> excessive load on the robinhood box...
>
> Anyway, FWIW...
>
> ~]# mysql --version
> mysql  Ver 14.14 Distrib 5.5.54-38.6, for Linux (x86_64) using readline 5.1
>
> Product: robinhood
> Version: 3.0-1
> Build:   2017-03-13 10:29:26
>
> Compilation switches:
> Lustre filesystems
> Lustre Version: 2.5
> Address entries by FID
> MDT Changelogs supported
>
> Database binding: MySQL
>
> RPM: robinhood-lustre-3.0-1.lustre2.5.el6.x86_64
> Lustre rpms:
>
> lustre-client-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
> lustre-client-modules-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
>
> On 5/18/17 11:55 AM, Ms. Megan Larko wrote:
>
> With regards to (WRT) Subject "Robinhood exhausting RPC resources against
> 2.5.5   lustre file systems", what version of robinhood and what version of
> MySQL database?   I mention this because I have been working with
> robinhood-3.0-0.rc1 and initially MySQL-5.5.32 and Lustre 2.5.42.1 on
> kernel-2.6.32-573 and had issues in which the robinhood server consumed
> more than the total amount of 32 CPU cores on the robinhood server (with
> 128 G RAM) and would functionally hang the robinhood server.   The issue
> was solved for me by changing to MySQL-5.6.35.   It was the "sort" command
> in robinhood that was not working well with the MySQL-5.5.32.
>
> Cheers,
> megan
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-18 Thread Colin Faber
Hi Jessica,

What STAGE_GET_FID_threads_max and STAGE_GET_INFO_FS_threads_max values are
you using?

Also what nb_threads values are you using for your robinhood configuration?
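
For clarity, the knobs I'm asking about live in robinhood's EntryProcessor
block, something like the following (I'm writing this from memory, so treat
the block and parameter names as approximate and check them against your own
config):

EntryProcessor
{
    nb_threads = 8;
    STAGE_GET_FID_threads_max = 4;
    STAGE_GET_INFO_FS_threads_max = 4;
}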

Also, when you see the disconnects between the clients and the MDS, what
does the MDS load actually look like? Is the system overloaded?

Did you happen to ever benchmark your MDS to find out what kind of
performance you can expect out of it? (mdtest?)
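
If you haven't run mdtest before, it is MPI-driven, so an invocation looks
roughly like the following (mount point and counts are made up; -d is the
test directory, -n the number of items per task, -i the iterations):

mpirun -np 16 mdtest -d /mnt/lustre/mdtest -n 5000 -i 3

That should give you a rough creates/stats/unlinks-per-second baseline for
the MDS.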

-cf


On Thu, May 18, 2017 at 12:36 PM, Jessica Otey  wrote:

> Hi Megan,
>
> Thanks for your input. We use percona, a drop-in replacement for mysql...
> The robinhood activity floods the MDT, but it does not seem to produce any
> excessive load on the robinhood box...
>
> Anyway, FWIW...
>
> ~]# mysql --version
> mysql  Ver 14.14 Distrib 5.5.54-38.6, for Linux (x86_64) using readline 5.1
>
> Product: robinhood
> Version: 3.0-1
> Build:   2017-03-13 10:29:26
>
> Compilation switches:
> Lustre filesystems
> Lustre Version: 2.5
> Address entries by FID
> MDT Changelogs supported
>
> Database binding: MySQL
>
> RPM: robinhood-lustre-3.0-1.lustre2.5.el6.x86_64
> Lustre rpms:
>
> lustre-client-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
> lustre-client-modules-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
>
> On 5/18/17 11:55 AM, Ms. Megan Larko wrote:
>
> With regards to (WRT) Subject "Robinhood exhausting RPC resources against
> 2.5.5   lustre file systems", what version of robinhood and what version of
> MySQL database?   I mention this because I have been working with
> robinhood-3.0-0.rc1 and initially MySQL-5.5.32 and Lustre 2.5.42.1 on
> kernel-2.6.32-573 and had issues in which the robinhood server consumed
> more than the total amount of 32 CPU cores on the robinhood server (with
> 128 G RAM) and would functionally hang the robinhood server.   The issue
> was solved for me by changing to MySQL-5.6.35.   It was the "sort" command
> in robinhood that was not working well with the MySQL-5.5.32.
>
> Cheers,
> megan
>
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-18 Thread Jessica Otey

Hi Megan,

Thanks for your input. We use percona, a drop-in replacement for 
mysql... The robinhood activity floods the MDT, but it does not seem to 
produce any excessive load on the robinhood box...


Anyway, FWIW...

~]# mysql --version
mysql  Ver 14.14 Distrib 5.5.54-38.6, for Linux (x86_64) using readline 5.1

Product: robinhood
Version: 3.0-1
Build:   2017-03-13 10:29:26

Compilation switches:
Lustre filesystems
Lustre Version: 2.5
Address entries by FID
MDT Changelogs supported

Database binding: MySQL

RPM: robinhood-lustre-3.0-1.lustre2.5.el6.x86_64

Lustre rpms:

lustre-client-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64
lustre-client-modules-2.5.5-2.6.32_642.15.1.el6.x86_64_g22a210f.x86_64


On 5/18/17 11:55 AM, Ms. Megan Larko wrote:
With regards to (WRT) Subject "Robinhood exhausting RPC resources 
against 2.5.5   lustre file systems", what version of robinhood and 
what version of MySQL database?   I mention this because I have been 
working with robinhood-3.0-0.rc1 and initially MySQL-5.5.32 and Lustre 
2.5.42.1 on kernel-2.6.32-573 and had issues in which the robinhood 
server consumed more than the total amount of 32 CPU cores on the 
robinhood server (with 128 G RAM) and would functionally hang the 
robinhood server.   The issue was solved for me by changing to 
MySQL-5.6.35.   It was the "sort" command in robinhood that was not 
working well with the MySQL-5.5.32.


Cheers,
megan



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-17 Thread Oucharek, Doug S
How is it you are getting the same NID registering twice in the log file:

Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]

Doug

On May 17, 2017, at 11:04 AM, Jessica Otey wrote:


All,

We have observed an unfortunate interaction between Robinhood and two Lustre 
2.5.5 file systems (both of which originated as 1.8.9 file systems).

Robinhood was used successfully against these file systems when they were both 
1.8.9, 2.4.3, and then 2.5.3 (a total time span of about 11 months).

We also have a third Lustre file system that originated as 2.4.3, and has since 
been upgraded to 2.5.5, against which Robinhood is currently operating as 
expected. This leads me to suppose that the issue may have to do with the 
interaction between Robinhood and a legacy-1.8.x-now-lustre-2.5.5 system. But I 
don't know.

The problem manifests itself as follows: Either a Robinhood file scan or the 
initiation of the consumption of changelogs results in the consumption all the 
available RPC resources on the MDT. This in turn leads to the MDT not being 
able to satisfy any other requests from clients, which in turn leads to client 
disconnections (the MDT thinks they are dead and evicts them). Meanwhile, 
Robinhood itself is unable to traverse the file system to gather the 
information it seeks, and so its scans either hang (due to the client 
disconnect) or run at a rate such that they would never complete (less than 1 
file per second).

If we don't run robinhood at all, the file system performs (after a remount of 
the MDT) as expected.

Initially, we thought that the difficulty might be that we neglected to 
activate the FID-in-dirent feature when we upgraded to 2.4.3. We did so on one 
of these systems, and ran an lfsck oi_scrub, but that did not ameliorate the 
problem.

Any thoughts on this matter would be appreciated. (We miss using Robinhood!)

Thanks,

Jessica



More data for those who cannot help themselves:

April 2016 - Robinhood comes into production use against both our 1.8.9 file 
systems.

July 2016 - Upgrade to 2.4.3 (on both production lustre file systems) -- 
Robinhood rebuilt against 2.4.3 client; changelog consumption now included.

Lustre "reconnects" (from /var/log/messages on one of the MDTs):

July 2016: 4

Aug 2016: 20

Sept 2016: 8

Oct 2016: 8

Nov 4-6, 2016 - Upgrade to 2.5.3 (on both production lustre file systems) -- 
Robinhood rebuilt against 2.5.3 client.

Lustre "reconnects":

Nov. 2016: 180

Dec. 2016: 62

Jan. 2017: 96

Feb 1-24, 2017: 2

Feb 24, 2017 - Upgrade to 2.5.5 (on both production lustre file systems)

 NAASC-Lustre MDT coming back 

Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 
2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 
2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with 
ordered data mode. quota=off. Opts:
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with 
ordered data mode. quota=off. Opts:
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection restored 
to MGS (at 0@lo)
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection restored 
to MGS (at 0@lo)
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk, loading
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk, loading

The night after this upgrade, a regular rsync to the backup Lustre system 
provokes a failure/client disconnect. (Unfortunately, I don't have the logs to 
look at Robinhood activity from this time, but I believe I restarted the 
service after the system came back.)

Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 
25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out 
request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 
25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out 
request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050 x1560271381909936/t0(0) 
o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0
 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050 x1560271381909936/t0(0) 
o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0
 lens 3584/0 e 0 

Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-17 Thread Colin Faber
Very likely then the test isn't stressing the system enough. What tunings
are you using on your robinhood installation? What tunings on the lustre
client itself?

-cf


On Wed, May 17, 2017 at 2:52 PM, Jessica Otey  wrote:

> Update #1. Robinhood change log consumption is also producing the same
> effect against a native 2.x file system instance. So the 'legacy' aspect of
> our two production instances does not seem to be a factor...
>
> Update #2. Currently running, per Colin Faber's suggestion: find
> /mnt/lustre -exec lfs path2fid {} \;
>
> This does not (so far) provoke a disconnection.
>
> Jessica
> On 5/17/17 2:04 PM, Jessica Otey wrote:
>
> We also have a third Lustre file system that originated as 2.4.3, and has
> since been upgraded to 2.5.5, against which Robinhood is currently
> operating as expected. This leads me to suppose that the issue may have to
> do with the interaction between Robinhood and a legacy-1.8.x-now-lustre-2.5.5
> system. But I don't know.
>
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-17 Thread Jessica Otey
Update #1. Robinhood change log consumption is also producing the same 
effect against a native 2.x file system instance. So the 'legacy' aspect 
of our two production instances does not seem to be a factor...


Update #2. Currently running, per Colin Faber's suggestion: find 
/mnt/lustre -exec lfs path2fid {} \;


This does not (so far) provoke a disconnection.

Jessica

On 5/17/17 2:04 PM, Jessica Otey wrote:


We also have a third Lustre file system that originated as 2.4.3, and 
has since been upgraded to 2.5.5, against which Robinhood is currently 
operating as expected. This leads me to suppose that the issue may 
have to do with the interaction between Robinhood and a 
legacy-1.8.x-now-lustre-2.5.5 system. But I don't know.





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-17 Thread Colin Faber
Have you been able to replicate the issue with a simple test?


find /mnt/lustre -exec lfs path2fid {} \;

?

On May 17, 2017 12:04 PM, "Jessica Otey"  wrote:

> All,
>
> We have observed an unfortunate interaction between Robinhood and two
> Lustre 2.5.5 file systems (both of which originated as 1.8.9 file systems).
>
> Robinhood was used successfully against these file systems when they were
> both 1.8.9, 2.4.3, and then 2.5.3 (a total time span of about 11 months).
>
> We also have a third Lustre file system that originated as 2.4.3, and has
> since been upgraded to 2.5.5, against which Robinhood is currently
> operating as expected. This leads me to suppose that the issue may have to
> do the interaction between Robinhood and a legacy-1.8.x-now-lustre-2.5.5
> system. But I don't know.
>
> The problem manifests itself as follows: Either a Robinhood file scan or
> the initiation of the consumption of changelogs results in the consumption
> of all the available RPC resources on the MDT. This in turn leads to the MDT
> not being able to satisfy any other requests from clients, which in turn
> leads to client disconnections (the MDT thinks they are dead and evicts
> them). Meanwhile, Robinhood itself is unable to traverse the file system to
> gather the information it seeks, and so its scans either hang (due to the
> client disconnect) or run at a rate such that they would never complete
> (less than 1 file per second).
>
> If we don't run robinhood at all, the file system performs (after a
> remount of the MDT) as expected.
>
> Initially, we thought that the difficulty might be that we neglected to
> activate the FID-in-dirent feature when we upgraded to 2.4.3. We did so on
> one of these systems, and ran an lfsck oi_scrub, but that did not
> ameliorate the problem.
>
> Any thoughts on this matter would be appreciated. (We miss using
> Robinhood!)
>
> Thanks,
>
> Jessica
>
> 
>
> More data for those who cannot help themselves:
> April 2016 - Robinhood comes into production use against both our 1.8.9
> file systems.
>
> July 2016 - Upgrade to 2.4.3 (on both production lustre file systems) --
> Robinhood rebuilt against 2.4.3 client; changelog consumption now included.
>
> Lustre "reconnects" (from /var/log/messages on one of the MDTs):
>
> July 2016: 4
>
> Aug 2016: 20
>
> Sept 2016: 8
>
> Oct 2016: 8
>
> Nov 4-6, 2016 - Upgrade to 2.5.3 (on both production lustre file systems)
> -- Robinhood rebuilt against 2.5.3 client.
>
> Lustre "reconnects":
>
> Nov. 2016: 180
>
> Dec. 2016: 62
>
> Jan. 2017: 96
>
> Feb 1-24, 2017: 2
>
> Feb 24, 2017 - Upgrade to 2.5.5 (on both production lustre file
> systems)
>
>  NAASC-Lustre MDT coming back 
> Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version:
> 2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
> Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version:
> 2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
> Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib
> [8/256/0/180]
> Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib
> [8/256/0/180]
> Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem
> with ordered data mode. quota=off. Opts:
> Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem
> with ordered data mode. quota=off. Opts:
> Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection
> restored to MGS (at 0@lo)
> Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection
> restored to MGS (at 0@lo)
> Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk,
> loading
> Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk,
> loading
>
> The night after this upgrade, a regular rsync to the backup Lustre system
> provokes a failure/client disconnect. (Unfortunately, I don't have the logs
> to look at Robinhood activity from this time, but I believe I restarted the
> service after the system came back.)
>
> Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:
> ptlrpc_server_handle_request()) @@@ Dropping timed-out request from
> 12345-10.7.17.123@o2ib: deadline 6:11s ago
> Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:
> ptlrpc_server_handle_request()) @@@ Dropping timed-out request from
> 12345-10.7.17.123@o2ib: deadline 6:11s ago
> Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050
> x1560271381909936/t0(0) o103->bb228923-4216-cc59-d847-
> 38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref
> 1 fl Interpret:/0/ rc 0/-1
> Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050
> x1560271381909936/t0(0) o103->bb228923-4216-cc59-d847-
> 38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref
> 1 fl Interpret:/0/ rc 0/-1
> Feb 25 02:14:24 10.7.7.8 kernel: Lustre: 

[lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-17 Thread Jessica Otey

All,

We have observed an unfortunate interaction between Robinhood and two 
Lustre 2.5.5 file systems (both of which originated as 1.8.9 file systems).


Robinhood was used successfully against these file systems when they 
were both 1.8.9, 2.4.3, and then 2.5.3 (a total time span of about 11 
months).


We also have a third Lustre file system that originated as 2.4.3, and 
has since been upgraded to 2.5.5, against which Robinhood is currently 
operating as expected. This leads me to suppose that the issue may have 
to do the interaction between Robinhood and a 
legacy-1.8.x-now-lustre-2.5.5 system. But I don't know.


The problem manifests itself as follows: Either a Robinhood file scan or 
the initiation of the consumption of changelogs results in the 
consumption of all the available RPC resources on the MDT. This in turn 
leads to the MDT not being able to satisfy any other requests from 
clients, which in turn leads to client disconnections (the MDT thinks 
they are dead and evicts them). Meanwhile, Robinhood itself is unable to 
traverse the file system to gather the information it seeks, and so its 
scans either hang (due to the client disconnect) or run at a rate such 
that they would never complete (less than 1 file per second).
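
For anyone who wants to watch the same thing on their own MDS, the MDT
service-thread counters are exposed via lctl; the parameter names below are
as I understand them for 2.5 and may need adjusting:

mds# lctl get_param mds.MDS.mdt.threads_max
mds# lctl get_param mds.MDS.mdt.threads_started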


If we don't run robinhood at all, the file system performs (after a 
remount of the MDT) as expected.


Initially, we thought that the difficulty might be that we neglected to 
activate the FID-in-dirent feature when we upgraded to 2.4.3. We did so 
on one of these systems, and ran an lfsck oi_scrub, but that did not 
ameliorate the problem.
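
For the record, the sequence we used there was approximately the following
(device names are placeholders, the dirdata step requires the MDT to be
unmounted, and corrections are welcome if I have misremembered the exact
commands):

mds# tune2fs -O dirdata /dev/<mdt-device>
mds# lctl lfsck_start -M <fsname>-MDT0000
mds# lctl get_param osd-ldiskfs.<fsname>-MDT0000.oi_scrub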


Any thoughts on this matter would be appreciated. (We miss using Robinhood!)

Thanks,

Jessica



More data for those who cannot help themselves:

April 2016 - Robinhood comes into production use against both our 1.8.9 
file systems.


July 2016 - Upgrade to 2.4.3 (on both production lustre file systems) -- 
Robinhood rebuilt against 2.4.3 client; changelog consumption now included.


Lustre "reconnects" (from /var/log/messages on one of the MDTs):

July 2016: 4

Aug 2016: 20

Sept 2016: 8

Oct 2016: 8

Nov 4-6, 2016 - Upgrade to 2.5.3 (on both production lustre file 
systems) -- Robinhood rebuilt against 2.5.3 client.


Lustre "reconnects":

Nov. 2016: 180

Dec. 2016: 62

Jan. 2017: 96

Feb 1-24, 2017: 2

Feb 24, 2017 - Upgrade to 2.5.5 (on both production lustre file systems)

 NAASC-Lustre MDT coming back 

Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 
2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 
2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib 
[8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib 
[8/256/0/180]
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem 
with ordered data mode. quota=off. Opts:
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem 
with ordered data mode. quota=off. Opts:
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection 
restored to MGS (at 0@lo)
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection 
restored to MGS (at 0@lo)
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk, 
loading
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk, 
loading


The night after this upgrade, a regular rsync to the backup Lustre 
system provokes a failure/client disconnect. (Unfortunately, I don't 
have the logs to look at Robinhood activity from this time, but I 
believe I restarted the service after the system came back.)


Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 
25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping 
timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 
25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping 
timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050 
x1560271381909936/t0(0) 
o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 
3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050 
x1560271381909936/t0(0) 
o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 
3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel: Lustre: 
25111:0:(service.c:2052:ptlrpc_server_handle_request()) @@@ Request took 
longer than estimated (6:11s); client may timeout. req@88045b3a2850 
x1560271381909940/t0(0) 
o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 
3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel: