Re: [lustre-discuss] some clients dmesg filled up with "dirty page discard"

2020-08-25 Thread 肖正刚
No, on the OSS we found that only the client that reported "dirty page discard"
was being evicted.
We hit this again last night, and on the OSS we can see logs like:
"
[Tue Aug 25 23:40:12 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 100s: evicting client at 10.10.3.223@o2ib  ns:
filter-public1-OST_UUID lock: 9f1f91cba880/0x3fcc67dad1c65842 lrc:
3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->270335) flags: 0x6400020020 nid:
10.10.3.223@o2ib remote: 0xd713b7b417045252 expref: 7081 pid: 25923
timeout: 21386699 lvb_type: 0
[Tue Aug 25 23:40:12 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar
messages
[Tue Aug 25 23:40:14 2020] LustreError:
26000:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@9f13259a6300 x1653628454261296/t0(0)
o106->public1-OST@10.10.3.223@o2ib:15/16 lens 296/280 e 0 to 0 dl 0 ref
1 fl Rpc:/0/ rc 0/-1
[Tue Aug 25 23:40:14 2020] LustreError:
26000:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 14 previous
similar messages
[Tue Aug 25 23:40:26 2020] LustreError:
25917:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@9f1339a5c800 x1653628454263632/t0(0)
o106->public1-OST0002@10.10.3.223@o2ib:15/16 lens 296/280 e 0 to 0 dl 0 ref
1 fl Rpc:/0/ rc 0/-1
[Tue Aug 25 23:40:26 2020] LustreError:
25917:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous
similar messages
[Tue Aug 25 23:44:59 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST: cli
3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 28672 GRANT, real grant 0
[Tue Aug 25 23:44:59 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 5755 previous similar
messages
[Tue Aug 25 23:49:18 2020] Lustre: public1-OST0002: Connection restored to
87ca2182-98a3-25dd-7d30-989d822381c6 (at 10.10.5.6@o2ib)
[Tue Aug 25 23:49:18 2020] Lustre: Skipped 102 previous similar messages
[Tue Aug 25 23:55:00 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0004: cli
3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 577536 GRANT, real grant 0
[Tue Aug 25 23:55:00 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 1121 previous similar
messages
[Tue Aug 25 23:59:25 2020] Lustre: public1-OST: Connection restored to
d45ad9f4-8903-7c80-7b35-bd32037de660 (at 10.10.7.131@o2ib)
[Tue Aug 25 23:59:25 2020] Lustre: Skipped 50 previous similar messages
[Tue Aug 25 23:59:49 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 156s: evicting client at 10.10.3.223@o2ib  ns:
filter-public1-OST_UUID lock: 9f130863a880/0x3fcc67dad1cff1d5 lrc:
3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 4 type: EXT
[0->18446744073709551615] (req 3911680->4173823) flags: 0x620020
nid: 10.10.3.223@o2ib remote: 0xd713b7b417354237 expref: 11891 pid: 26099
timeout: 21387847 lvb_type: 0
[Tue Aug 25 23:59:49 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar
messages
[Wed Aug 26 00:00:40 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 100s: evicting client at 10.10.3.223@o2ib  ns:
filter-public1-OST0004_UUID lock: 9f2df4a10d80/0x3fcc67dad1d50925 lrc:
3/0,0 mode: PR/PR res: [0xdc95179:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->266239) flags: 0x640020 nid:
10.10.3.223@o2ib remote: 0xd713b7b417549c43 expref: 14594 pid: 26181
timeout: 21387927 lvb_type: 0
[Wed Aug 26 00:00:40 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar
message
[Wed Aug 26 00:02:37 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 100s: evicting client at 10.10.3.223@o2ib  ns:
filter-public1-OST_UUID lock: 9f1359e94a40/0x3fcc67dad1dacd8b lrc:
3/0,0 mode: PR/PR res: [0xde609f1:0x0:0x0].0x0 rrc: 4 type: EXT
[0->18446744073709551615] (req 1941504->2097151) flags: 0x6400020020
nid: 10.10.3.223@o2ib remote: 0xd713b7b417780209 expref: 5626 pid: 26134
timeout: 21388044 lvb_type: 0
[Wed Aug 26 00:02:37 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar
message
[Wed Aug 26 00:05:00 2020] LustreError:
26199:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0004: cli
3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 28672 GRANT, real grant 0
[Wed Aug 26 00:05:00 2020] LustreError:
26199:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 14028 previous similar
messages
[Wed Aug 26 00:09:30 2020] Lustre: public1-OST: Connection restored to
956559c4-4e7c-e6a5-3867-83ab85699688 (at 10.10.6.91@o2ib)
[Wed Aug 26 00:09:30 2020] Lustre: Skipped 39 previous similar messages
[Wed Aug 26 00:10:27 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 147s: evicting 

[lustre-discuss] Complete list of rules for PCC

2020-08-25 Thread Pinkesh Valdria
I am looking for the various policy rules which can be applied to Lustre
Persistent Client Cache (PCC). In the docs, I see the example below using projid,
fname and uid. Where can I find a complete list of supported rules?

 

Also, is there a way for PCC to cache only the content of a few folders?

http://doc.lustre.org/lustre_manual.xhtml#pcc.design.rules

 

 

The following command adds a PCC backend on a client:

client# lctl pcc add /mnt/lustre /mnt/pcc --param
"projid={500,1000}&fname={*.h5},uid=1001 rwid=2"

The first substring of the config parameter is the auto-cache rule, where "&" 
represents the logical AND operator while "," represents the logical OR 
operator. The example rule means that new files are only auto-cached if either
of the following conditions is satisfied:
The project ID is either 500 or 1000 and the suffix of the file name is "h5";
The user ID is 1001;
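
For the "few folders" case, I was wondering whether tagging those directories
with a project ID and then matching on projid in the rule would work. A rough,
untested sketch (the paths and project ID 1000 below are just placeholders):

client# lfs project -s -p 1000 /mnt/lustre/dir1    # -r would also apply it to existing files
client# lfs project -s -p 1000 /mnt/lustre/dir2
client# lctl pcc add /mnt/lustre /mnt/pcc --param "projid={1000} rwid=2"

Is that the intended way to scope PCC to specific directories, or is there a
direct folder/path rule?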
 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] error mounting client

2020-08-25 Thread Jeff Johnson
Your output shows InfiniBand NIDs (@o2ib). If you are mounting via @tcp, what is
your TCP access method to the InfiniBand file system? Multihomed servers? An
LNet router?
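
If it is an LNet router, the TCP-only client also needs a route to the o2ib
network through that router, roughly along these lines (the gateway NID below
is only a placeholder for your router's TCP NID):

lnetctl route add --net o2ib --gateway 192.168.8.1@tcp
lnetctl route show
lctl ping 172.23.0.116@o2ib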

--Jeff

On Tue, Aug 25, 2020 at 8:32 AM Peeples, Heath 
wrote:

> We have just built a 2.12.5 cluster.  When trying to mount the fs (via
> tcp), I get the following errors.  Would anyone have an idea what the
> problem might be?  Thanks in advance
>
>
>
>
>
> [10680.535157] LustreError: 15c-8: MGC192.168.8.8@tcp: The configuration
> from log 'ldata-client' failed (-2). This may be the result of
> communication errors between this node and the MGS, a bad configuration, or
> other errors. See the syslog for more information.
>
> [10680.883649] LustreError: 12634:0:(lov_obd.c:839:lov_cleanup())
> ldata-clilov-91b118df1000: lov tgt 0 not cleaned! deathrow=0, lovrc=1
>
> [10680.886610] LustreError: 12634:0:(lov_obd.c:839:lov_cleanup()) Skipped
> 4 previous similar messages
>
> [10680.890298] LustreError: 12634:0:(obd_config.c:610:class_cleanup())
> Device 9 not setup
>
> [10680.891816] Lustre: Unmounted ldata-client
>
> [10680.895178] LustreError: 12634:0:(obd_mount.c:1608:lustre_fill_super())
> Unable to mount  (-2)
>
> [10763.516841] LustreError: 12732:0:(ldlm_lib.c:494:client_obd_setup())
> can't add initial connection
>
> [10763.518368] LustreError: 12732:0:(obd_config.c:559:class_setup()) setup
> ldata-OST0006-osc-91b125029800 failed (-2)
>
> [10763.519806] LustreError:
> 12732:0:(obd_config.c:1835:class_config_llog_handler()) MGC192.168.8.8@tcp:
> cfg command failed: rc = -2
>
> [10763.522603] Lustre:cmd=cf003 0:ldata-OST0006-osc
> 1:ldata-OST0006_UUID  2:172.23.0.116@o2ib
>
>
>
> Heath
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] error mounting client

2020-08-25 Thread Colin Faber
Was this an initial mount of a new file system or a new TCP client being
introduced to an existing file system? Can you describe your setup a little
more?
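
In particular, output from the client along these lines would help (a rough
list; the MGS NID is taken from your log):

lctl list_nids
lnetctl net show
lctl ping 192.168.8.8@tcp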

On Tue, Aug 25, 2020 at 9:32 AM Peeples, Heath 
wrote:

> We have just built a 2.12.5 cluster.  When trying to mount the fs (via
> tcp), I get the following errors.  Would anyone have an idea what the
> problem might be?  Thanks in advance
>
>
>
>
>
> [10680.535157] LustreError: 15c-8: MGC192.168.8.8@tcp: The configuration
> from log 'ldata-client' failed (-2). This may be the result of
> communication errors between this node and the MGS, a bad configuration, or
> other errors. See the syslog for more information.
>
> [10680.883649] LustreError: 12634:0:(lov_obd.c:839:lov_cleanup())
> ldata-clilov-91b118df1000: lov tgt 0 not cleaned! deathrow=0, lovrc=1
>
> [10680.886610] LustreError: 12634:0:(lov_obd.c:839:lov_cleanup()) Skipped
> 4 previous similar messages
>
> [10680.890298] LustreError: 12634:0:(obd_config.c:610:class_cleanup())
> Device 9 not setup
>
> [10680.891816] Lustre: Unmounted ldata-client
>
> [10680.895178] LustreError: 12634:0:(obd_mount.c:1608:lustre_fill_super())
> Unable to mount  (-2)
>
> [10763.516841] LustreError: 12732:0:(ldlm_lib.c:494:client_obd_setup())
> can't add initial connection
>
> [10763.518368] LustreError: 12732:0:(obd_config.c:559:class_setup()) setup
> ldata-OST0006-osc-91b125029800 failed (-2)
>
> [10763.519806] LustreError:
> 12732:0:(obd_config.c:1835:class_config_llog_handler()) MGC192.168.8.8@tcp:
> cfg command failed: rc = -2
>
> [10763.522603] Lustre:cmd=cf003 0:ldata-OST0006-osc
> 1:ldata-OST0006_UUID  2:172.23.0.116@o2ib
>
>
>
> Heath
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] some clients dmesg filled up with "dirty page discard"

2020-08-25 Thread Colin Faber
The I/O was not fully committed after close() from the client. Are you
experiencing high numbers of evictions?
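
On the server side, a rough way to get a per-client eviction count out of the
console log is something like this (just a sketch; point it at wherever your
kernel log lives):

dmesg | grep -o 'evicting client at [^ ]*' | sort | uniq -c | sort -rn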

On Tue, Aug 25, 2020 at 9:12 AM 肖正刚  wrote:

> Hi, all
>
> We found that some clients' dmesg filled up with messages like
> "
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13565:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x1680f:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13547:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x14246:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13545:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12018:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13567:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12c86:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13566:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12c76:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13550:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12c8e:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13568:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12c66:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13569:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12c7e:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13548:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12c6e:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13570:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12ca6:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13549:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12cbe:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13571:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12cb6:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13551:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12cae:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13572:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12cce:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13573:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12cc6:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13574:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12d56:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13575:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x12d36:0x0]/ may get corrupted (rc -108)
> Aug 24 19:54:34 ln5 kernel: Lustre:
> 13576:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
> discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
> [0x27a82:0x1429e:0x0]/ may get corrupted (rc -108)
>
> "
> Then we checked the disk arrays, SAS links, and multipath, but found no errors.
> Has anyone encountered the same problem?
> Any suggestions would help!
>
> Regards.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] error mounting client

2020-08-25 Thread Peeples, Heath
We have just built a 2.12.5 cluster.  When trying to mount the fs (via tcp), I 
get the following errors.  Would anyone have an idea what the problem might 
be?  Thanks in advance


[10680.535157] LustreError: 15c-8: MGC192.168.8.8@tcp: The configuration from 
log 'ldata-client' failed (-2). This may be the result of communication errors 
between this node and the MGS, a bad configuration, or other errors. See the 
syslog for more information.
[10680.883649] LustreError: 12634:0:(lov_obd.c:839:lov_cleanup()) 
ldata-clilov-91b118df1000: lov tgt 0 not cleaned! deathrow=0, lovrc=1
[10680.886610] LustreError: 12634:0:(lov_obd.c:839:lov_cleanup()) Skipped 4 
previous similar messages
[10680.890298] LustreError: 12634:0:(obd_config.c:610:class_cleanup()) Device 9 
not setup
[10680.891816] Lustre: Unmounted ldata-client
[10680.895178] LustreError: 12634:0:(obd_mount.c:1608:lustre_fill_super()) 
Unable to mount  (-2)
[10763.516841] LustreError: 12732:0:(ldlm_lib.c:494:client_obd_setup()) can't 
add initial connection
[10763.518368] LustreError: 12732:0:(obd_config.c:559:class_setup()) setup 
ldata-OST0006-osc-91b125029800 failed (-2)
[10763.519806] LustreError: 
12732:0:(obd_config.c:1835:class_config_llog_handler()) MGC192.168.8.8@tcp: cfg 
command failed: rc = -2
[10763.522603] Lustre:cmd=cf003 0:ldata-OST0006-osc  1:ldata-OST0006_UUID  
2:172.23.0.116@o2ib

Heath
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] some clients dmesg filled up with "dirty page discard"

2020-08-25 Thread 肖正刚
Hi, all

We found that some clients' dmesg filled up with messages like
"
Aug 24 19:54:34 ln5 kernel: Lustre:
13565:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x1680f:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13547:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x14246:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13545:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12018:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13567:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c86:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13566:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c76:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13550:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c8e:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13568:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c66:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13569:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c7e:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13548:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12c6e:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13570:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12ca6:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13549:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cbe:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13571:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cb6:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13551:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cae:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13572:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cce:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13573:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12cc6:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13574:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12d56:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13575:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x12d36:0x0]/ may get corrupted (rc -108)
Aug 24 19:54:34 ln5 kernel: Lustre:
13576:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid:
[0x27a82:0x1429e:0x0]/ may get corrupted (rc -108)

"
Then we checked the disk arrays, SAS links, and multipath, but found no errors.
Has anyone encountered the same problem?
Any suggestions would help!

Regards.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Disk quota exceeded while quota is not filled

2020-08-25 Thread David Cohen
Hi,
Still hoping for a reply...

It seems to me that old groups are more affected by the issue than new ones
that were created after a major disk migration.
It seems that quota enforcement is somehow based on a counter other than the
accounting, since the accounting produces the same numbers as du.
So if quota is calculated separately from accounting, it is possible that
quota is broken and keeps values from the removed disks, while accounting is
correct.
So following that suspicion I tried to force the FS to recalculate quota.
I tried:
lctl conf_param technion.quota.ost=none
and back to:
lctl conf_param technion.quota.ost=ugp

On the MDS and all OSTs, I tried running:
tune2fs -O ^quota
and then turning it back on with:
tune2fs -O quota
and after each attempt, also:
lctl lfsck_start -A -t all -o -e continue

But the problem still persists, and groups whose usage is below their quota
get blocked with "quota exceeded".

Best,
David


On Sun, Aug 16, 2020 at 8:41 AM David Cohen 
wrote:

> Hi,
> Adding some more information.
> A few months ago, the data on the Lustre fs was migrated to new physical
> storage.
> After the successful migration, the old OSTs were marked as active=0
> (lctl conf_param technion-OST0001.osc.active=0)
>
> Since then, all the clients have been unmounted and remounted.
> tunefs.lustre --writeconf was executed on the MGS/MDT and all the OSTs.
> lctl dl doesn't show the old OSTs anymore, but when querying the quota they
> still appear.
> As I see that new users are less affected by the "quota exceeded" problem
> (blocked from writing while quota is not filled),
> I suspect that the quota calculation is still summing values from the old OSTs:
>
> *lfs quota -g -v md_kaplan /storage/*
> Disk quotas for grp md_kaplan (gid 10028):
>  Filesystem  kbytes   quota   limit   grace   files   quota   limit
> grace
>   /storage/ 4823987000   0 5368709120   -  143596   0
>   0   -
> technion-MDT_UUID
>   37028   -   0   -  143596   -   0
> -
> quotactl ost0 failed.
> quotactl ost1 failed.
> quotactl ost2 failed.
> quotactl ost3 failed.
> quotactl ost4 failed.
> quotactl ost5 failed.
> quotactl ost6 failed.
> quotactl ost7 failed.
> quotactl ost8 failed.
> quotactl ost9 failed.
> quotactl ost10 failed.
> quotactl ost11 failed.
> quotactl ost12 failed.
> quotactl ost13 failed.
> quotactl ost14 failed.
> quotactl ost15 failed.
> quotactl ost16 failed.
> quotactl ost17 failed.
> quotactl ost18 failed.
> quotactl ost19 failed.
> quotactl ost20 failed.
> technion-OST0015_UUID
> 114429464*  - 114429464   -   -   -
> -   -
> technion-OST0016_UUID
> 92938588   - 92938592   -   -   -   -
>   -
> technion-OST0017_UUID
> 128496468*  - 128496468   -   -   -
> -   -
> technion-OST0018_UUID
> 191478704*  - 191478704   -   -   -
> -   -
> technion-OST0019_UUID
> 107720552   - 107720560   -   -   -
> -   -
> technion-OST001a_UUID
> 165631952*  - 165631952   -   -   -
> -   -
> technion-OST001b_UUID
> 460714156*  - 460714156   -   -   -
> -   -
> technion-OST001c_UUID
> 157182900*  - 157182900   -   -   -
> -   -
> technion-OST001d_UUID
> 102945952*  - 102945952   -   -   -
> -   -
> technion-OST001e_UUID
> 175840980*  - 175840980   -   -   -
> -   -
> technion-OST001f_UUID
> 142666872*  - 142666872   -   -   -
> -   -
> technion-OST0020_UUID
> 188147548*  - 188147548   -   -   -
> -   -
> technion-OST0021_UUID
> 125914240*  - 125914240   -   -   -
> -   -
> technion-OST0022_UUID
> 186390800*  - 186390800   -   -   -
> -   -
> technion-OST0023_UUID
> 115386876   - 115386884   -   -   -
> -   -
> technion-OST0024_UUID
> 127139556*  - 127139556   -   -   -
> -   -
> technion-OST0025_UUID
> 179666580*  - 179666580   -   -   -
> -   -
> technion-OST0026_UUID
> 147837348   - 147837356   -   -   -
> -   -
> technion-OST0027_UUID
> 129823528   - 129823536   -   -   -
> -   -
> technion-OST0028_UUID
> 158270776   - 158270784   -   -   -
> -   -
> technion-OST0029_UUID
> 168762120   - 168763104   -   -   -
> -   -
> technion-OST002a_UUID
> 164235684   - 164235688   -   -   -
> -   -
> technion-OST002b_UUID
> 147512200   - 147512204   -   -   -
> -   -
> technion-OST002c_UUID
> 158046652