[lustre-discuss] Recover from broken lustre updates (Haoyang Liu)

2021-07-26 Thread via lustre-discuss
Hi, Haoyang

Maybe you should rebuild MOFED against the new kernel first, then rebuild the
Lustre server packages.
1) about restore
I think you can try switching back to the old kernel first. As you said, you have
already rebuilt MOFED under the new kernel, so once you go back to the old
kernel you will need to rebuild MOFED again (make sure the versions match); a
rough sketch of the steps is given after point 2).
If that does not work, you can reinstall the IO servers the same way you did
at the very beginning; I recommend using a new drive to install the OS.

2) about data loss
There should be no data loss: the file system contents live on the Lustre
targets (MGT/MDT/OSTs), not on the OS drive.
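For reference, a rough sketch of the rollback steps on one IO server might look like
the following (package names are taken from the RPM list quoted below; the MOFED
installer path and options are assumptions, not verified commands):

  # reinstall and boot back into the old Lustre-patched kernel
  rpm -ivh kernel-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64.rpm
  grubby --set-default /boot/vmlinuz-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64
  reboot
  # rebuild MOFED against that kernel (installer location/version are assumptions)
  ./mlnxofedinstall --add-kernel-support
  # reinstall the matching Lustre 2.7 server packages from the RPMs listed below
  rpm -ivh lustre-modules-2.7.19.8-*.rpm lustre-osd-ldiskfs-mount-*.rpm \
      lustre-osd-ldiskfs-*.rpm lustre-2.7.19.8-*.rpm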

Thanks
Regards,

 wrote on Tue, Jul 27, 2021 at 4:28 AM:

> Send lustre-discuss mailing list submissions to
> lustre-discuss@lists.lustre.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> or, via email, send a message with subject or body 'help' to
> lustre-discuss-requ...@lists.lustre.org
>
> You can reach the person managing the list at
> lustre-discuss-ow...@lists.lustre.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lustre-discuss digest..."
>
>
> Today's Topics:
>
>1. Recover from broken lustre updates (Haoyang Liu)
>
>
> --
>
> Message: 1
> Date: Mon, 26 Jul 2021 16:28:26 +0800 (GMT+08:00)
> From: "Haoyang Liu" 
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Recover from broken lustre updates
> Message-ID: <5e70f6a4.db93.17ae1edf43b.coremail.liuhaoy...@pku.edu.cn>
> Content-Type: text/plain; charset=UTF-8
>
> Hi all,
>
> I am using Lustre 2.7 along with Mellanox InfiniBand. Recently I performed a
> system update by mistake, and after the update the Lustre modules won't load.
>
> System configuration before the update:
> centos-7.3, kernel version: 3.10.0-514.2.2.el7_lustre.gba8983e.x86_64
> lustre version:
> 2.7.19.8-3.10.0_514.2.2.el7_lustre.gba8983e.x86_64_gba8983e.x86_64
> mlnx-ofed version:
> 4.2.1.2.0.1.gf8de107.kver.3.10.0_514.2.2.el7_lustre.gba8983e.x86_64.x86_64
>
> System configuration after the update:
> centos-7.3, kernel version: 3.10.0-514.2.2.el7_lustre.x86_64
> lustre version: 2.7.19.8-3.10.0_514.2.2.el7_lustre.x86_64.x86_64
> mlnx-ofed version:
> 4.2.1.2.0.1.gf8de107.kver.3.10.0_514.2.2.el7_lustre.gba8983e.x86_64.x86_64
>
> The update seems to have just replaced the Linux kernel with a different patch
> version (without gba8983e) and rebuilt the Lustre modules (Lustre itself was
> not upgraded). However, the Lustre modules are built against the wrong version
> of mlnx-ofed. dmesg shows the following errors:
>
>
> [17509.744301] ko2iblnd: disagrees about version of symbol
> ib_fmr_pool_unmap
> [17509.744307] ko2iblnd: Unknown symbol ib_fmr_pool_unmap (err -22)
> [17509.744317] ko2iblnd: disagrees about version of symbol ib_create_cq
> [17509.744319] ko2iblnd: Unknown symbol ib_create_cq (err -22)
> [17509.744332] ko2iblnd: disagrees about version of symbol
> rdma_resolve_addr
> [17509.744334] ko2iblnd: Unknown symbol rdma_resolve_addr (err -22)
> [17509.744345] ko2iblnd: disagrees about version of symbol
> ib_create_fmr_pool
> ...
>
> I've tried to build mlnx-ofed under the updated kernel, but the problem
> still exists.
>
> My questions:
> 1) How do I restore the Lustre system to its state before the update? The
> following RPMs are already present on my server:
> 
> kernel-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64.rpm
> kernel-devel-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64.rpm
> kernel-headers-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64.rpm
> kernel-tools-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64.rpm
> kernel-tools-libs-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64.rpm
> kernel-tools-libs-devel-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64.rpm
> kmod-spl-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64-0.6.5.7-1.el7.x86_64.rpm
> kmod-spl-devel-0.6.5.7-1.el7.x86_64.rpm
>
> kmod-spl-devel-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64-0.6.5.7-1.el7.x86_64.rpm
> kmod-zfs-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64-0.6.5.7-1.el7.x86_64.rpm
> kmod-zfs-devel-0.6.5.7-1.el7.x86_64.rpm
>
> kmod-zfs-devel-3.10.0-514.2.2.el7_lustre.gba8983e.x86_64-0.6.5.7-1.el7.x86_64.rpm
> libnvpair1-0.6.5.7-1.el7.x86_64.rpm
> libuutil1-0.6.5.7-1.el7.x86_64.rpm
> libzfs2-0.6.5.7-1.el7.x86_64.rpm
> libzfs2-devel-0.6.5.7-1.el7.x86_64.rpm
> libzpool2-0.6.5.7-1.el7.x86_64.rpm
>
> lustre-2.7.19.8-3.10.0_514.2.2.el7_lustre.gba8983e.x86_64_gba8983e.x86_64.rpm
> lustre-dkms-2.7.19.8-1.el7.noarch.rpm
>
> lustre-iokit-2.7.19.8-3.10.0_514.2.2.el7_lustre.gba8983e.x86_64_gba8983e.x86_64.rpm
>
> lustre-modules-2.7.19.8-3.10.0_514.2.2.el7_lustre.gba8983e.x86_64_gba8983e.x86_64.rpm
>
> lustre-osd-ldiskfs-2.7.19.8-3.10.0_514.2.2.el7_lustre.gba8983e.x86_64_gba8983e.x86_64.rpm
>
> lustre-osd-ldiskfs-mount-2.7.19.8-3.10.0_514.2.2.el7_lustre.gba8983e.x86_64_gba8983e.x86_64.rpm
>
> lustre-osd-zfs-2.7.19.8-3.10.0_514.2.2.el7_lustre.gba8983e.x86_64_gba8983e.x86_64.rpm
>
> 

Re: [lustre-discuss] Quota related (Anilkumar Naik)

2020-11-30 Thread
Sorry, I typed the wrong word.
You should replace "qouta" with "quota"; the corrected commands are shown below.
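For reference, the corrected commands (with "your_fsname" as a placeholder for the
actual filesystem name) would be:

  lctl conf_param your_fsname.quota.mdt=u
  lctl conf_param your_fsname.quota.ost=u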

Anilkumar Naik wrote on Mon, Nov 30, 2020 at 2:41 PM:

> The commands below give errors for me. Based on our Lustre details, could you
> please provide the exact commands to run on our server? Thank you.
>
> Regards,
> Anilkumar
>
> On Mon, 30 Nov, 2020, 6:59 am 肖正刚,  wrote:
>
>> Hi,
>> you can enable user quota on mgs by
>> "
>> lctl conf_param your_fsname.qouta.mdt=u
>> lctl conf_param your_fsname.qouta.ost=u
>> "
>> details about quota in lustre manual chapter 25
>> https://doc.lustre.org/lustre_manual.xhtml#configuringquotas
>>
>>
>>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Quota related (Anilkumar Naik)

2020-11-29 Thread
Hi,
you can enable user quota on mgs by
"
lctl conf_param your_fsname.qouta.mdt=u
lctl conf_param your_fsname.qouta.ost=u
"
details about quota in lustre manual chapter 25
https://doc.lustre.org/lustre_manual.xhtml#configuringquotas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre QoS using TBF (Strikwerda, Ger)

2020-10-26 Thread
Hi,
You can find details in the Lustre manual, chapter 34.6.5.
When enabling the TBF policy, you can specify one of the types (NID, JOBID,
OPCode, or UID/GID), or just use "tbf" to enable all of them for fine-grained
RPC request classification; this feature also supports logical conjunction and
disjunction operations among the different types.
What's more:
1) you need to set the rules on every OSS;
2) once an OSS is rebooted or an OST is unmounted, the rules are gone;
3) TBF only limits the RPC rate, not IOPS or bandwidth directly.
A minimal example is shown after this list.
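For example, a minimal NID-based setup on an OSS might look like this (the rule name,
NID range, and rate are placeholders, and the exact rule syntax varies slightly
between Lustre versions; see manual section 34.6.5):

  # enable the TBF policy for OST I/O RPCs on each OSS
  lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
  # limit RPCs from a NID range to 50 RPCs per second
  lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start loginnodes nid={192.168.1.[1-32]@o2ib} rate=50"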

Hope this helps.

Best Regards.

>
>
>
> --
>
> Message: 1
> Date: Mon, 26 Oct 2020 15:40:23 +0100
> From: "Strikwerda, Ger" 
> To: Lustre discussion 
> Subject: [lustre-discuss] Lustre QoS using TBF
> Message-ID:
>  mfvqwmn4sf7+mzzrmlqosuq+x5ne...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Dear community,
>
> We at the university of Groningen (the Netherlands) are looking into doing
> QoS on our Lustre file system to prevent users from suffocating our
> filesystems. Lustre QoS using TBF is mentioned in a couple of
> presentations/slides, but I failed to get/find some useful documentation on
> how to implement such a feature.
>
> My question: Does anybody have experience with Lustre QoS in production that
> you want to share? Does it work, and what are best practices?
>
> --
>
> Vriendelijke groet,
>
> Ger Strikwerda
> senior expert multidisciplinary HPC enabler
> simple solution architect
> Rijksuniversiteit Groningen
> CIT/HPC beheer
>
> Smitsborg
> Nettelbosje 1
> 9747 AJ Groningen
> Tel. 050 363 9276
> "God is hard, God is fair
>  some men he gave brains, others he gave hair"
>
> --
>
> Subject: Digest Footer
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
> --
>
> End of lustre-discuss Digest, Vol 175, Issue 10
> ***
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ls command blocked in some dir

2020-09-23 Thread
Hi,
I disabled auto_scrub on all OSSs using "lctl set_param
osd-ldiskfs.*.auto_scrub=0", then traced the file command on the client again;
nothing new was returned, and no newer messages appeared on the OSS.

Franke, Knut wrote on Wed, Sep 23, 2020 at 11:59 PM:

> Hi,
>
> we've had a similar issue, though in our case there were "FID-in-LMA
> [...] does not match the object self-fid" errors in the OSS logs. See
> LU-13392 for details.
>
> You could try disabling auto_scrub (lctl set_param
> osd_ldiskfs.*.auto_scrub=0) and check whether this causes lstat() to
> return something.
>
> Regards,
> Knut
>
> On Wednesday, 23.09.2020, at 22:56 +0800, 肖正刚 wrote:
> > Hi, all
> > In one of our lustre filesystems,we found that
> > 1) ls command blocked in some dir but ls --color=never  worked.
> > 2) some files can not be accessed, like cat/head/vim/file(i use
> strace to trace command "strace file .xxx" ,  stucked in lstat).
> > "
> > fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) =
> 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b02fbb27000
> > mmap(NULL, 266240, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b02fbb28000
> > lstat("animation-2_0043.hsf",
> > "
> > 3) at the same time,no message found on client/mds/oss
> >
> > Are these files corrupted? How to repair?
> > PS: We have used lfsck,not worked and we cannot bring down the
> filesystem to run e2fsck at present.
> >
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ls command blocked in some dir

2020-09-23 Thread
Hi, all
In one of our Lustre filesystems, we found that:
1) the ls command blocks in some directories, but "ls --color=never" works;
2) some files cannot be accessed with cat/head/vim/file (I used strace to
trace the command, "strace file .xxx", and it gets stuck in lstat):
"
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x2b02fbb27000
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x2b02fbb28000
lstat("animation-2_0043.hsf",
"
3) at the same time, no messages were found on the client/MDS/OSS.

Are these files corrupted? How can we repair them?
PS: We have tried LFSCK, which did not help, and we cannot bring down the
filesystem to run e2fsck at present.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] client syslog flood with "client_bulk_callback()) event type 2, status -103"

2020-09-09 Thread
Hi, all
After upgrading the Lustre client from 2.12.2 to 2.12.5, we found some clients
flooded with messages like
"
[Wed Sep  9 15:49:05 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:06 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:06 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:07 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:07 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:08 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:08 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:08 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:09 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:10 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:11 2020] LNetError:
0:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 10.10.41.11@o2ib: Async QP event
type 3
[Wed Sep  9 15:49:11 2020] LNetError:
0:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) Skipped 599 previous similar
messages
[Wed Sep  9 15:49:11 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:11 2020] Lustre:
3845:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1599637713/real 1599637713]
 req@9a35c1e83f00 x1676624438686784/t0(0)
o3->public1-OST000a-osc-9a3f3b3af000@10.10.41.11@o2ib:6/4 lens 488/440
e 1 to 1 dl 1599637816 ref 2 fl Rpc:eX/2/ rc -11/-1
[Wed Sep  9 15:49:11 2020] Lustre:
3845:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 599 previous
similar messages
[Wed Sep  9 15:49:11 2020] Lustre: public1-OST000a-osc-9a3f3b3af000:
Connection to public1-OST000a (at 10.10.41.11@o2ib) was lost; in progress
operations using this service will wait for recovery to complete
[Wed Sep  9 15:49:11 2020] Lustre: Skipped 599 previous similar messages
[Wed Sep  9 15:49:12 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:12 2020] Lustre: public1-OST000a-osc-9a3f3b3af000:
Connection restored to 10.10.41.11@o2ib (at 10.10.41.11@o2ib)
[Wed Sep  9 15:49:12 2020] Lustre: Skipped 599 previous similar messages
[Wed Sep  9 15:49:13 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:13 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:14 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:14 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:14 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:14 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:15 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:15 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
[Wed Sep  9 15:49:16 2020] LustreError:
3476:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
desc 9a1f3cafd000
"

 Any suggestions?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] some clients dmesg filled up with "dirty page discard"

2020-08-29 Thread
Hi, Andreas,
Thanks for your reply.
Maybe this is a bug?
We never hit this before updating the clients to 2.12.5.

Andreas Dilger wrote on Sat, Aug 29, 2020 at 6:37 PM:

> On Aug 25, 2020, at 17:42, 肖正刚  wrote:
>
>
> No, on the OSS we found that only the client which reported "dirty page discard"
> was being evicted.
> We hit this again last night, and on the OSS we can see logs like:
> "
> [Tue Aug 25 23:40:12 2020] LustreError:
> 14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
> expired after 100s: evicting client at 10.10.3.223@o2ib  ns:
> filter-public1-OST_UUID lock: 9f1f91cba880/0x3fcc67dad1c65842 lrc:
> 3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 3 type: EXT
> [0->18446744073709551615] (req 0->270335) flags: 0x6400020020 nid:
> 10.10.3.223@o2ib remote: 0xd713b7b417045252 expref: 7081 pid: 25923
> timeout: 21386699 lvb_type: 0
>
>
> It isn't clear what the question is here.  The "dirty page discard"
> message means that unwritten data from the client was discarded because the
> client was evicted and the lock covering this data was revoked by the
> server because the client was not responsive.
>
> Anymore , we exec lfsck on all servers,  result is
>
>
> There is no need for LFSCK in this case.  The file data was not written,
> but a client eviction does not result in the filesystem becoming
> inconsistent.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
>
>
>
>
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] some clients dmesg filled up with "dirty page discard"

2020-08-25 Thread
k callback timer
expired after 147s: evicting client at 10.10.3.223@o2ib  ns:
filter-public1-OST0002_UUID lock: 9f16e6f95c40/0x3fcc67dad1dea822 lrc:
3/0,0 mode: PR/PR res: [0xdd5d4bb:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->24575) flags: 0x6400020020 nid:
10.10.3.223@o2ib remote: 0xd713b7b417900639 expref: 8633 pid: 25993
timeout: 21388514 lvb_type: 0
"

Additionally, we ran LFSCK on all servers; the result is
"
layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 1
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 0
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 8
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 2253861
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 0
"

Colin Faber wrote on Wed, Aug 26, 2020 at 12:17 AM:

> The I/O was not fully committed after close() from the client. Are you
> experiencing high numbers of evictions?
>
> On Tue, Aug 25, 2020 at 9:12 AM 肖正刚  wrote:
>
>> Hi, all
>>
>> We found that some clients' dmesg filled up with messages like
>> "
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13565:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x1680f:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13547:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x14246:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13545:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12018:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13567:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12c86:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13566:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12c76:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13550:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12c8e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13568:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12c66:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13569:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12c7e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13548:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12c6e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13570:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12ca6:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13549:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
>> [0x27a82:0x12cbe

[lustre-discuss] some clients dmesg filled up with "dirty page discard"

2020-08-25 Thread
Hi, all

We found that some clients' dmesg output is filled up with messages like
"
Aug 24 19:54:34 ln5 kernel: Lustre:
13565:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x1680f:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13547:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x14246:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13545:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12018:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13567:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c86:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13566:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c76:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13550:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c8e:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13568:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c66:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13569:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c7e:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13548:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c6e:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13570:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12ca6:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13549:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cbe:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13571:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cb6:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13551:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cae:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13572:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cce:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13573:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cc6:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13574:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12d56:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13575:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12d36:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13576:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x1429e:0x0]/ may get corrupted (rc -108)

"
Then we checked the disk array, SAS links, and multipath, but no errors were found.
Has anyone ever met the same problem?
Any suggestions will help!

Regards.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] depmod error when upgrade 2.12.2 to 2.12.5

2020-08-10 Thread
Hi, all
When upgrading from 2.12.2 to 2.12.5 we hit depmod errors; can these be ignored,
or how can they be resolved?
Error info:
depmod: ERROR: fstatat(4, ptlrpc.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, fld.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, mgs.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, osp.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, obdclass.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, obdecho.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, lod.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, lnet.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, ksocklnd.ko.xz): No such file or directory
depmod: ERROR: fstatat(5, mst_pci.ko): No such file or directory
depmod: ERROR: fstatat(5, mst_pciconf.ko): No such file or directory
depmod: ERROR: fstatat(4, lmv.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, libcfs.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, lquota.ko.xz): No such file or directory
depmod: ERROR: fstatat(5, rshim.ko): No such file or directory
depmod: ERROR: fstatat(5, rshim_pcie_lf.ko): No such file or directory
depmod: ERROR: fstatat(5, rshim_net.ko): No such file or directory
depmod: ERROR: fstatat(5, rshim_pcie.ko): No such file or directory
depmod: ERROR: fstatat(4, lov.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, mdd.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, ost.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, ofd.ko.xz): No such file or directory
depmod: ERROR: fstatat(6, scsi_transport_srp.ko): No such file or directory
depmod: ERROR: fstatat(4, lnet_selftest.ko.xz): No such file or directory
depmod: ERROR: fstatat(4, llog_test.ko.xz): No such file or directory
depmod: ERROR: fstatat(6, mlx_compat.ko): No such file or directory
depmod: ERROR: fstatat(8, nvme-rdma.ko): No such file or directory
depmod: ERROR: fstatat(8, nvmet-rdma.ko): No such file or directory
depmod: ERROR: fstatat(9, ib_iser.ko): No such file or directory
depmod: ERROR: fstatat(9, ib_srp.ko): No such file or directory
depmod: ERROR: fstatat(9, ib_srpt.ko): No such file or directory
depmod: ERROR: fstatat(9, ib_isert.ko): No such file or directory
depmod: ERROR: fstatat(9, opa_vnic.ko): No such file or directory
depmod: ERROR: fstatat(9, rdmavt.ko): No such file or directory
depmod: ERROR: fstatat(9, rdma_rxe.ko): No such file or directory
depmod: ERROR: fstatat(9, iw_cxgb4.ko): No such file or directory
depmod: ERROR: fstatat(9, iw_cxgb3.ko): No such file or directory
depmod: ERROR: fstatat(9, vmw_pvrdma.ko): No such file or directory
depmod: ERROR: fstatat(9, ib_ipath.ko): No such file or directory

Thanks!
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] infiniband mlx5_0: dump_cqe:286:(pid 25761): dump error cqe

2020-07-30 Thread
Hi,
Thanks for your suggestion.
But rebooting the OSSs in production under massive IO pressure would be
another long story.

Regards.
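For anyone who can schedule a maintenance window, applying the iommu=pt workaround
suggested below on CentOS 7 would look roughly like this (a sketch; each OSS still
needs a reboot afterwards):

  # append iommu=pt to the kernel command line on every OSS
  grubby --update-kernel=ALL --args="iommu=pt"
  # verify the change, then reboot during the maintenance window
  grubby --info=ALL | grep args
  reboot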


Weiss, Karsten wrote on Thu, Jul 30, 2020 at 11:31 PM:

> Hi!
>
>
>
> (Caveat: I ran into this issue not on Lustre but on HPC MPI jobs on CentOS
> 7.7. They only run stable
>
> with the workaround.)
>
>
>
> I’ve opened a bug with Red Hat at
> https://bugzilla.redhat.com/show_bug.cgi?id=1796825 but unfortunately,
>
> it is no longer public (or fixed/closed) i.e. you probably won’t be able
> to read it.
>
>
>
> To make a long story short: You may try to boot with the kernel parameter
> “iommu=pt” as a workaround(!).
>
>
>
> Please let me know if this “fixes” the problem for you. YMMV.
>
>
>
> Best regards,
>
> Karsten
>
>
>
> --
>
> *Dipl.-Inf. Karsten Weiss *s+c / Atos
>
> T +49 7071 9457 452
>
> karsten.we...@atos.net
>
> https://atos.net/de/deutschland/sc-en
>
>
>
> *From:* lustre-discuss  *On
> Behalf Of *???
> *Sent:* Thursday, July 30, 2020 16:05
> *To:* lustre-discuss 
> *Subject:* [lustre-discuss] infiniband mlx5_0: dump_cqe:286:(pid 25761):
> dump error cqe
>
>
>
> Hi, all
>
>
>
> we installed lustre-2.12.2 both server and clients ,recently,our oss's
> syslog flooding with messages like below:
>
> “
>
> infiniband mlx5_0: dump_cqe:286:(pid 25761): dump error cqe
> : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0030: 00 00 00 00 00 00 88 13 08 00 84 79 01 04 4c d0
> LustreError: 25762:0:(events.c:450:server_bulk_callback()) event type 5,
> status -5, desc 9ffdf58c0a00
> LustreError: 25755:0:(events.c:450:server_bulk_callback()) event type 5,
> status -103, desc 9ffdf58c0a00
> LustreError: 25755:0:(events.c:450:server_bulk_callback()) event type 5,
> status -103, desc 9ffdf58c0a00
> LustreError: 25755:0:(events.c:450:server_bulk_callback()) event type 5,
> status -103, desc 9ffdf58c0a00
>
> ”
>
> Does anyone hit this beforce or any suggestions?
>
>
>
> Thanks?
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] infiniband mlx5_0: dump_cqe:286:(pid 25761): dump error cqe

2020-07-30 Thread
Hi, all

We installed lustre-2.12.2 on both the servers and the clients; recently, our OSSs'
syslogs have been flooding with messages like the ones below:
“
infiniband mlx5_0: dump_cqe:286:(pid 25761): dump error cqe
: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0030: 00 00 00 00 00 00 88 13 08 00 84 79 01 04 4c d0
LustreError: 25762:0:(events.c:450:server_bulk_callback()) event type 5,
status -5, desc 9ffdf58c0a00
LustreError: 25755:0:(events.c:450:server_bulk_callback()) event type 5,
status -103, desc 9ffdf58c0a00
LustreError: 25755:0:(events.c:450:server_bulk_callback()) event type 5,
status -103, desc 9ffdf58c0a00
LustreError: 25755:0:(events.c:450:server_bulk_callback()) event type 5,
status -103, desc 9ffdf58c0a00
”
Has anyone hit this before, or does anyone have any suggestions?

Thanks!
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Is there aceiling of lustre filesystem a client can mount

2020-07-20 Thread
Hi, Alastair & Mark Hahn,
Can mounted Lustre filesystems (of the same version) impact each other?
Can the network become a bottleneck?

regards.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] client /server version compatibility (Peeples, Heath)

2020-07-18 Thread
Hi,
You can get something from the changelog, for example
http://wiki.lustre.org/Lustre_2.12.2_Changelog.
BTW, we have tried a 2.12.2 server with 2.7.x and 2.5.x clients; it did not work.

Regards.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Is there aceiling of lustre filesystem a client can mount

2020-07-16 Thread
Hi, Mark Hahn

I really appreciate your detailed reply,
and sorry for the ambiguous description.
For various reasons, we decided not to expand the Lustre filesystem that already
exists; so what I want to know is the number of Lustre filesystems that a
client can mount at the same time.

Best regards.

Mark Hahn wrote on Thu, Jul 16, 2020 at 3:00 PM:

> > On Jul 15, 2020, at 12:29 AM, ???  wrote:
> >> Is there a ceiling for a Lustre filesystem that can be mounted in a
> cluster?
>
> It is very high, as Andreas said.
>
> >> If so, what's the number?
>
> The following contains specific limits:
>
>
> https://build.whamcloud.com/job/lustre-manual//lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140436304680016
>
> You'll notice that you must assume some aspects of configuration, such as
> the
> size and number of your OSTs.  I see OSTs in the range of 75-400TB (and OST
> counts between 58 and 187).
>
> >> If not, how much is proper?
>
> Lustre is designed to scale.  So a config with a small number of OSTs,
> on very few OSSes doesn't make that much sense.  OSTs are pretty much
> expected to be decent-sized RAIDs.  There would be tradeoffs among cost-
> efficient disk sizes (maybe 16T today) and RAID overhead (usually N+2),
> and how that trades off with bandwidth (HBA and OSS network).
>
> >> Does mount multiple filesystems  can affect the stability of each file
> system or cause other problems?
>
> My experience is that the main factor in reliability is device count,
> rather than how the devices are organized.  For instance, if you
> have more OSSes, you may get linearly nicer performance, but
> you also increase your chance of having components crash or fail.
>
> The main reason for separate filesystems is usually that the MDS
> (maybe the MDT) can be a bottleneck.  But you can scale MDSes, instead.
>
> regards, mark hahn.
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Is there aceiling of lustre filesystem a client can mount

2020-07-15 Thread
  Hi, Jongwoo &  Andreas

Sorry for the ambiguous description.
What I want to know is the number of Lustre filesystems that a client can
mount at the same time.

Thanks



>
>
> Message: 1
> Date: Wed, 15 Jul 2020 14:29:10 +0800
> From: ??? 
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Is there aceiling of lustre filesystem a
> client can mount
> Message-ID:
>  hxr...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi, all
> Is there a ceiling for a Lustre filesystem that can be mounted in a
> cluster?
> If so, what's the number?
> If not, how much is proper?
> Does mount multiple filesystems  can affect the stability of each file
> system or cause other problems?
>
> Thanks!
>
> --
>
>
> Message: 3
> Date: Wed, 15 Jul 2020 23:45:57 +0900
> From: Jongwoo Han 
> To: ??? 
> Cc: lustre-discuss 
> Subject: Re: [lustre-discuss] Is there aceiling of lustre filesystem a
> client can mount
> Message-ID:
>  kgfbw9qrommea3xscmy33l...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> I think your question is ambiguous.
>
> What ceiling do you mean? Total storage capacity? number of disks? number
> of clients? number of filesystems?
>
> Please be more clear about it.
>
> Regards,
> Jongwoo Han
>
> On Wed, Jul 15, 2020 at 3:29 PM, ??? wrote:
>
> > Hi, all
> > Is there a ceiling for a Lustre filesystem that can be mounted in a
> > cluster?
> > If so, what's the number?
> > If not, how much is proper?
> > Does mount multiple filesystems  can affect the stability of each file
> > system or cause other problems?
> >
> > Thanks!
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
>
>
> --
> Jongwoo Han
> +82-505-227-6108
>
> --
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Is there aceiling of lustre filesystem a client can mount

2020-07-15 Thread
Hi, all
Is there a ceiling for a Lustre filesystem that can be mounted in a cluster?
If so, what's the number?
If not, how much is proper?
Does mounting multiple filesystems affect the stability of each file
system or cause other problems?

Thanks!
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] getcwd() fails (Leonardo Saavedra)

2020-07-13 Thread
Hi, Leonardo
Thanks for your reply,
but I found that using vasp-5.4.4 can work around this issue, so we do not
intend to upgrade the kernel for now.



 wrote on Tue, Jul 14, 2020 at 9:14 AM:

> Send lustre-discuss mailing list submissions to
> lustre-discuss@lists.lustre.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> or, via email, send a message with subject or body 'help' to
> lustre-discuss-requ...@lists.lustre.org
>
> You can reach the person managing the list at
> lustre-discuss-ow...@lists.lustre.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of lustre-discuss digest..."
>
>
> Today's Topics:
>
>1. Re: getcwd() fails (Leonardo Saavedra)
>
>
> --
>
> Message: 1
> Date: Mon, 13 Jul 2020 10:07:29 -0600
> From: Leonardo Saavedra 
> To: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] getcwd() fails
> Message-ID: <76dcb65b-09fc-654e-afb3-967a69a77...@nrao.edu>
> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>
> you have to upgrade your kernel to kernel-3.10.0-1127.8.2.el7.x86_64.
>
> The getcwd issue was fixed in :
>
> [...]
> - [fs] vfs: close race between getcwd() and d_move() (Miklos Szeredi)
> [1631631]
> [...]
>
>
> Leo Saavedra
> National Radio Astronomy Observatory
> http://www.nrao.edu
> +1-575-8357033
>
> On 7/12/20 9:46 AM, ??? wrote:
> > Hi, Alex
> > Thanks for your suggestion.
> > I did some tests this weekend and found that it may be an Intel MPI or
> > vasp issue.
> > When I ran another version of vasp which was compiled with Intel MPI 2017
> > the error disappeared.
> > Then I switched back to the original version of vasp and the errors occurred again.
> >
> >
> > Alex Zarochentsev (zamena...@gmail.com) wrote on Fri, Jul 10, 2020 at 5:01 PM:
> >
> > Hello,
> >
> >
> > On Fri, Jul 10, 2020 at 11:28 AM ???  > > wrote:
> >
> > Hi all,
> >
> > We run lustre 2.12.2(both server) on CentOS 7.6, we
> > hits getcwd error when ran vasp.
> > Error message:
> > forrtl: severe (121): Cannot access current working directory
> > for unit 7, file "Unknown"
> > Image              PC                Routine            Line        Source
> > vasp_std           00D1CAC9          Unknown            Unknown     Unknown
> > vasp_std           00D36DCF          Unknown            Unknown     Unknown
> > vasp_std           007B2620          Unknown            Unknown     Unknown
> > vasp_std           0086E57A          Unknown            Unknown     Unknown
> > vasp_std           00935DB1          Unknown            Unknown     Unknown
> > vasp_std           00AEE62D          Unknown            Unknown     Unknown
> > vasp_std           00BA6239          Unknown            Unknown     Unknown
> > vasp_std           0040921E          Unknown            Unknown     Unknown
> > libc-2.17.so       2B7D2ABBA3D5      __libc_start_main  Unknown     Unknown
> > vasp_std           00409129          Unknown            Unknown     Unknown
> >
> > I can not find any errors about lustre in that client.
> > Any more , in mds I can't find any errors about that client.
> >
> > Any suggestions?
> >
> >
> >
> > similar to
> > https://jira.whamcloud.com/browse/LU-12997 ?
> > the first comment in LU-12997 says it is a kernel bug.
> >
> > Thanks,
> > Zam.
> >
> >
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > 
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
> >
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> -- next part --
> An HTML attachment was scrubbed...
> URL: <
> http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20200713/a0852c66/attachment-0001.html
> >
>
> --
>
> Subject: Digest Footer
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
> --
>
> End of lustre-discuss Digest, Vol 172, Issue 17
> ***
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org

Re: [lustre-discuss] getcwd() fails

2020-07-12 Thread
Hi, Alex
Thanks for your suggestion.
I did some tests this weekend and found that it may be an Intel MPI or vasp
issue.
When I ran another version of vasp which was compiled with Intel MPI 2017, the
error disappeared.
Then I switched back to the original version of vasp, and the errors occurred again.


Alex Zarochentsev wrote on Fri, Jul 10, 2020 at 5:01 PM:

> Hello,
>
>
> On Fri, Jul 10, 2020 at 11:28 AM 肖正刚  wrote:
>
>> Hi all,
>>
>> We run lustre 2.12.2(both server) on CentOS 7.6, we hits getcwd
>> error when ran vasp.
>> Error message:
>> forrtl: severe (121): Cannot access current working directory for unit 7,
>> file "Unknown"
>> Image  PCRoutineLine
>> Source
>> vasp_std   00D1CAC9  Unknown   Unknown
>> Unknown
>> vasp_std   00D36DCF  Unknown   Unknown
>> Unknown
>> vasp_std   007B2620  Unknown   Unknown
>> Unknown
>> vasp_std   0086E57A  Unknown   Unknown
>> Unknown
>> vasp_std   00935DB1  Unknown   Unknown
>> Unknown
>> vasp_std   00AEE62D  Unknown   Unknown
>> Unknown
>> vasp_std   00BA6239  Unknown   Unknown
>> Unknown
>> vasp_std   0040921E  Unknown   Unknown
>> Unknown
>> libc-2.17.so   2B7D2ABBA3D5  __libc_start_main Unknown
>> Unknown
>> vasp_std   00409129  Unknown   Unknown
>> Unknown
>>
>> I can not find any errors about lustre in that client.
>> Any more , in mds I can't find any errors about that client.
>>
>> Any suggestions?
>>
>
>
> similar to
> https://jira.whamcloud.com/browse/LU-12997 ?
> the first comment in LU-12997 says it is a kernel bug.
>
> Thanks,
> Zam.
>
>>
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] getcwd() fails

2020-07-10 Thread
Hi all,

We run Lustre 2.12.2 (on both servers) on CentOS 7.6, and we hit a getcwd
error when running vasp.
Error message:
forrtl: severe (121): Cannot access current working directory for unit 7,
file "Unknown"
Image              PC                Routine            Line        Source

vasp_std   00D1CAC9  Unknown   Unknown  Unknown
vasp_std   00D36DCF  Unknown   Unknown  Unknown
vasp_std   007B2620  Unknown   Unknown  Unknown
vasp_std   0086E57A  Unknown   Unknown  Unknown
vasp_std   00935DB1  Unknown   Unknown  Unknown
vasp_std   00AEE62D  Unknown   Unknown  Unknown
vasp_std   00BA6239  Unknown   Unknown  Unknown
vasp_std   0040921E  Unknown   Unknown  Unknown
libc-2.17.so   2B7D2ABBA3D5  __libc_start_main Unknown  Unknown
vasp_std   00409129  Unknown   Unknown  Unknown

I cannot find any errors about Lustre on that client.
Furthermore, on the MDS I can't find any errors about that client.

Any suggestions?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] mlx4 and mxl5 mix environment

2020-06-22 Thread
Hi, all
We set up a cluster that mixes the mlx4 and mlx5 drivers, and everything works well.
Later I found some related information in the wiki:
http://wiki.lustre.org/Infiniband_Configuration_Howto and
http://lists.onebuilding.org/pipermail/lustre-devel-lustre.org/2016-May/003842.html
which was last edited in 2016.
So do I need to change the LNet configuration as described in that page?
Or has the problem been resolved in newer versions (like 2.12.x)?
Also, where can I find more details?

Any suggestions would be appreciated.
Thanks!
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] how to mapping of RPC rate to bandwidth/IOPS?

2020-06-13 Thread
Hi, Andreas
Thanks for your reply.
Things are much clearer to me now.

Thanks
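For reference, a back-of-the-envelope check of the numbers above (a sketch assuming
the default 4 MB RPC size, i.e. 1024 pages of 4 KiB, as described in Andreas' reply
quoted below):

  # check the RPC size on a client (1024 pages x 4 KiB = 4 MiB by default)
  lctl get_param osc.*.max_pages_per_rpc
  # streaming I/O:    rpcrate 10 x 4 MiB per RPC             = ~40 MiB/s upper bound
  # random 4 KiB I/O: rpcrate 10 x up to 256 writes per RPC  = 2560 IOPS, ~10 MiB/s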

Andreas Dilger wrote on Wed, Jun 10, 2020 at 11:00 AM:

> On Jun 2, 2020, at 02:30, 肖正刚  wrote:
>
>
> Hi all,
> we use TBF policy(details:
> https://jira.whamcloud.com/secure/attachment/14201/Lustre%20NRS%20TBF%20documentation%200.1.pdf)
> to limit rpcrate coming from clients; but I do not know how to mapping of
> rpcrate to bandwidth or iops.
> For example:
> if I set a client's rpcrate=10,how much bandwith or iops the client can
> get  in theory?
>
>
> Currently, the TBF policies only deal with RPCs.  For most systems today,
> you are probably using 4MB RPC size (osc.*.max_pages_per_rpc=1024), so if
> you set rpcrate=10 the clients will be able to get at most 40MB/s (assuming
> applications do relatively linear IO).  If applications have small random
> IOPS then rpcrate=10 may get up to 256 4KB writes per RPC, or about 2560
> IOPS = 10MB/s.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
>
>
>
>
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] how to mapping of RPC rate to bandwidth/IOPS?

2020-06-02 Thread
Hi all,
We use the TBF policy (details:
https://jira.whamcloud.com/secure/attachment/14201/Lustre%20NRS%20TBF%20documentation%200.1.pdf)
to limit the RPC rate coming from clients, but I do not know how to map the
RPC rate to bandwidth or IOPS.
For example:
if I set a client's rpcrate=10, how much bandwidth or how many IOPS can the
client get in theory?

Any suggestions will help.

Thanks.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] file restored after modify

2020-05-08 Thread
Hi all,
We use Lustre 2.12.2 and hit a strange problem this morning: some files were
restored (reverted) after being modified.

I modified job.sh, and about 1 minute later the file was restored.
When I copied job.sh to test.sh and then modified test.sh, test.sh was not restored.
When I used root to modify job.sh, the file was not restored; then I used the user
who owns job.sh to modify job.sh again, and the file was not restored either.

How can this happen?
Any suggestion will help.

Thanks!
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] can not rebuild lustre 2.12.4 use src rpm package

2020-04-03 Thread
I got around this by disabling the building of the Lustre tests:

 cp rpmbuild/SOURCES/lustre-2.12.4.tar.gz .
 tar -xvf lustre-2.12.4.tar.gz
 cd lustre-2.12.4/
 ./configure --enable-server --disable-tests
 make rpms
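Depending on the spec file shipped with that release, it may also be possible to
disable the tests directly when rebuilding the source RPM; this is an assumption and
has not been verified against 2.12.4:

  rpmbuild --rebuild --without lustre_tests lustre-2.12.4-1.src.rpm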




肖正刚 wrote on Fri, Apr 3, 2020 at 3:33 PM:

> Hi,
>
> i rebuild 2.12.4 clients in centos 7.6 use command line :
>
> rpmbuild --rebuild lustre-2.12.4-1.src.rpm
>
> no error found while configure, but when compiling , i see error
>
> Finding  Provides: /usr/lib/rpm/redhat/find-provides
> Finding  Requires(interp):
> Finding  Requires(rpmlib):
> Finding  Requires(verify):
> Finding  Requires(pre):
> Finding  Requires(post):
> Finding  Requires(preun):
> Finding  Requires(postun):
> Finding  Requires(pretrans):
> Finding  Requires(posttrans):
> Finding  Requires: /usr/lib/rpm/redhat/find-requires
> Provides: lustre-resource-agents = 2.12.4-1.el7 
> lustre-resource-agents(x86-64) = 2.12.4-1.el7
> Requires(rpmlib): rpmlib(FileDigests) <= 4.6.0-1 
> rpmlib(PayloadFilesHavePrefix) <= 4.0-1 rpmlib(CompressedFileNames) <= 3.0.4-1
> Processing files: lustre-tests-2.12.4-1.el7.x86_64
> error: File not found by glob: 
> /root/rpmbuild/BUILDROOT/lustre-2.12.4-1.x86_64/usr/lib64/openmpi/bin/*
>
> Are there something i missed?
>
> Any suggestion will help.
>
>
> Thanks!
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] can not rebuild lustre 2.12.4 use src rpm package

2020-04-03 Thread
Hi,

I am rebuilding the 2.12.4 client on CentOS 7.6 using the command line:

rpmbuild --rebuild lustre-2.12.4-1.src.rpm

No error is reported during configure, but when compiling I see the error:

Finding  Provides: /usr/lib/rpm/redhat/find-provides
Finding  Requires(interp):
Finding  Requires(rpmlib):
Finding  Requires(verify):
Finding  Requires(pre):
Finding  Requires(post):
Finding  Requires(preun):
Finding  Requires(postun):
Finding  Requires(pretrans):
Finding  Requires(posttrans):
Finding  Requires: /usr/lib/rpm/redhat/find-requires
Provides: lustre-resource-agents = 2.12.4-1.el7
lustre-resource-agents(x86-64) = 2.12.4-1.el7
Requires(rpmlib): rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1 rpmlib(CompressedFileNames) <=
3.0.4-1
Processing files: lustre-tests-2.12.4-1.el7.x86_64
error: File not found by glob:
/root/rpmbuild/BUILDROOT/lustre-2.12.4-1.x86_64/usr/lib64/openmpi/bin/*

Is there something I missed?

Any suggestion will help.


Thanks!
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] MDS HIT LBUG

2020-04-01 Thread
Hi,
Our MDS hit a bug (a Red Hat kernel issue) as described in:
https://jira.whamcloud.com/browse/LU-10678
https://jira.whamcloud.com/browse/LU-11786

Our kernel version : 3.10.0-957.10.1.el7_lustre.x86_64
lustre version: 2.12.2
OS version: CentOS 7.6

Red Hat said the kernel bug was resolved in kernel-3.10.0-957.12.1.el7
(https://access.redhat.com/errata/RHSA-2019:0818),

but in LU-11786, Mahmoud Hanafi said they hit this bug twice with
3.10.0-957.21.3.el7 and Lustre 2.12.2, so it appears to be unresolved.

Has anyone else hit this bug and managed to fix it?

Thanks.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] confused about mdt space

2020-04-01 Thread
Now I am clear.
Thanks, Richard!
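For reference, the arithmetic behind Rick's explanation works out like this (a sketch
assuming the default 1 KB inode size mentioned in the thread):

  # bytes_per_inode = inode_size + 1536 = 1024 + 1536 = 2560 bytes (2.5 KB per inode)
  # inode overhead  = 1024 / 2560 = 40% of MDT space reserved for inodes
  # e.g. a 4.7 TB MDT formatted with defaults leaves roughly 4.7 TB x 0.6 ~= 2.8 TB usable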

Mohr Jr, Richard Frank wrote on Wed, Apr 1, 2020 at 10:59 PM:

>
>
> > On Apr 1, 2020, at 10:07 AM, Mohr Jr, Richard Frank 
> wrote:
> >
> >
> >
> >> On Apr 1, 2020, at 3:55 AM, 肖正刚  wrote:
> >>
> >> For  " the recent lustre versions use a 1KB inode size by default and
> the default format options create 1 inodes for every 2.5 KB of MDT space" :
> >> I checked the inode size is 1KB and  in my online systems,  as you said
> , about 40~41% of mdt disk space consumed by inodes.
> >> but from the manual I found the default "inode ratio" is 2K, so where
> the additional 0.5KB comes from ?
> >>
> >
> > I was basing this on info I found in this email thread:
> >
> >
> http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2018-February/015302.html
> >
> > I think the 2K/inode ratio listed in the manual may be incorrect (or
> perhaps someone can point out my mistake).
>
> Looking at the lustre source (utils/libmount_utils_ldiskfs.c), the default
> behavior is determined by this line:
>
> bytes_per_inode = inode_size + 1536
>
> So the default bytes per inode is 1K (inode size) + 1.5K = 2.5K
>
> —
> Rick Mohr
> Senior HPC System Administrator
> Joint Institute for Computational Sciences
> University of Tennessee
>
>
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] confused about mdt space

2020-04-01 Thread
Hi,
Please forget my first question; I made a mistake.
Regarding "the recent lustre versions use a 1KB inode size by default and the
default format options create 1 inode for every 2.5 KB of MDT space":
I checked that the inode size is 1 KB, and on my online systems, as you said,
about 40-41% of the MDT disk space is consumed by inodes.
But from the manual I found the default "inode ratio" is 2 KB, so where does the
additional 0.5 KB come from?

Thanks.


肖正刚 wrote on Wed, Apr 1, 2020 at 1:00 PM:

> Thanks a lot.
> I have two more questions:
> 1) Assume I consider the mdt space use the method described in lustre
> manual, by calculation, the metadata space is 400GB.
> After format(default option), about 160GB(40% of 400GB) preallocated for
> inodes, so the avalaible inodes number is less than estimated, right ?
> 2) mds need additional space for other use, like log,acls,xattrs;how to
> estimate these space ?
>
> Thanks!
>
Mohr Jr, Richard Frank wrote on Tue, Mar 31, 2020 at 9:57 PM:
>
>>
>>
>> > On Mar 30, 2020, at 10:56 PM, 肖正刚  wrote:
>> >
>> > Hello, I have some question about metadata space.
>> > 1) I have ten 960GB SAS SSDs for mdt,after done raid10,we have 4.7TB
>> space free.
>> > after formated as mdt,we only have 2.6TB space free; so where the 2.1TB
>> space go ?
>> > 2) for the 2.6TB space, what's it used for?
>>
>> That space is used by inodes.  I believe the recent lustre versions use a
>> 1KB inode size by default and the default format options create 1 inodes
>> for every 2.5 KB of MDT space.  So about 40% of your disk space will be
>> consumed by inodes.
>>
>> —
>> Rick Mohr
>> Senior HPC System Administrator
>> Joint Institute for Computational Sciences
>> University of Tennessee
>>
>>
>>
>>
>>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] confused about mdt space

2020-03-31 Thread
Thanks a lot.
I have two more questions:
1) Assume I size the MDT space using the method described in the Lustre
manual, and by calculation the metadata space is 400 GB.
After formatting (with default options), about 160 GB (40% of 400 GB) is
preallocated for inodes, so the number of available inodes is less than
estimated, right?
2) The MDS needs additional space for other uses, like logs, ACLs, and xattrs;
how do I estimate that space?

Thanks!

Mohr Jr, Richard Frank wrote on Tue, Mar 31, 2020 at 9:57 PM:

>
>
> > On Mar 30, 2020, at 10:56 PM, 肖正刚  wrote:
> >
> > Hello, I have some question about metadata space.
> > 1) I have ten 960GB SAS SSDs for mdt,after done raid10,we have 4.7TB
> space free.
> > after formated as mdt,we only have 2.6TB space free; so where the 2.1TB
> space go ?
> > 2) for the 2.6TB space, what's it used for?
>
> That space is used by inodes.  I believe the recent lustre versions use a
> 1KB inode size by default and the default format options create 1 inodes
> for every 2.5 KB of MDT space.  So about 40% of your disk space will be
> consumed by inodes.
>
> —
> Rick Mohr
> Senior HPC System Administrator
> Joint Institute for Computational Sciences
> University of Tennessee
>
>
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] confused about mdt space

2020-03-30 Thread
Hello, I have some questions about metadata space.

1) I have ten 960 GB SAS SSDs for the MDT; after setting up RAID 10, we have 4.7 TB of space free.

After formatting it as an MDT, we only have 2.6 TB of space free; so where did the
2.1 TB of space go?

2) As for the 2.6 TB of space, what is it used for?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org