[lustre-discuss] Zabbix Lustre template

2023-09-27 Thread David Cohen via lustre-discuss
Hi,
I'm looking for a Zabbix Lustre template, but couldn't find one.
Is anyone aware of such a template and can share a link?

Thanks,
David


[lustre-discuss] Project quota and project quota accounting

2022-11-29 Thread David Cohen via lustre-discuss
Hi,
We are running Lustre 2.12.7 (ldiskfs) both on the servers and the clients.

lctl get_param osd-*.*.quota_slave.info returns for all ost and mds/mdt:
quota enabled:  ugp
space acct: ug

I tried enabling project quota on a client, with no success:
chattr -p 1 /storage/test
chattr: Operation not supported while setting project on /storage/test
Or with:
lfs project -p 1 /storage/test
lfs: failed to set xattr for '/storage/test': Operation not supported
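
A note on the output above: "space acct: ug" (without "p") indicates that project quota accounting is not enabled on the backing ldiskfs targets, which would explain the "Operation not supported" errors. Below is a minimal, hedged sketch of how project accounting is typically enabled on ldiskfs targets; the device paths and mount points are illustrative, and the exact e2fsprogs/kernel requirements should be checked against the manual for your release:

# run on each server, with the target unmounted (illustrative paths)
umount /mnt/lustre/mdt0
tune2fs -O project -Q prjquota /dev/mapper/MDT0000
mount -t lustre /dev/mapper/MDT0000 /mnt/lustre/mdt0
# repeat for every MDT and OST, then verify on the servers:
lctl get_param osd-*.*.quota_slave.info   # "space acct" should now report "ugp"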

Regards,
David


Re: [lustre-discuss] Unable to mount new OST

2021-07-06 Thread David Cohen
s03 kernel: sd 15:0:0:92: [sddy] 34863054848 4096-byte
logical blocks: (142 TB/129 TiB)
Jul  6 07:59:41 oss03 kernel: sd 15:0:0:92: [sddy] Write Protect is off
Jul  6 07:59:41 oss03 kernel: sd 15:0:0:92: [sddy] Write cache: enabled,
read cache: enabled, supports DPO and FUA
Jul  6 07:59:41 oss03 kernel: sd 15:0:0:92: [sddy] Attached SCSI disk
Jul  6 07:59:42 oss03 multipathd: sddy: add path (uevent)
Jul  6 07:59:42 oss03 multipathd: sddy [128:0]: path added to devmap OST0051

On Wed, Jul 7, 2021 at 7:24 AM Jeff Johnson 
wrote:

> What devices are underneath dm-21 and are there any errors in
> /var/log/messages for those devices? (assuming /dev/sdX devices underneath)
>
> Run `ls /sys/block/dm-21/slaves` to see what devices are beneath dm-21
>
>
>
>
>
> On Tue, Jul 6, 2021 at 20:09 David Cohen 
> wrote:
>
>> Hi,
>> The index of the OST is unique in the system and free for the new one, as
>> it is increased by "1" for every new OST created, so whatever it converts
>> to should not be relevant to its refusal to mount, or am I mistaken?
>>
>> I'm pasting the log messages again, in case they were lost up the thread,
>> adding the output of "fdisk -l", should the OST size be the issue:
>>
>> lctl dk show tens of thousands of lines repeating the same error after
>> attempting to mount the OST:
>>
>> 0010:1000:26.0:1625546374.322973:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one())
>> local-OST0033: fail to set LMA for init OI scrub: rc = -30
>> 0010:1000:26.0:1625546374.322974:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one())
>> local-OST0033: fail to set LMA for init OI scrub: rc = -30
>> 0010:1000:26.0:1625546374.322975:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one())
>> local-OST0033: fail to set LMA for init OI scrub: rc = -30
>>
>> in /var/log/messages I see the following corresponding to dm21 which is
>> the new OST:
>>
>> Jul  6 07:38:37 oss03 kernel: LDISKFS-fs warning (device dm-21):
>> ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected,
>> please wait.
>> Jul  6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): file extents enabled,
>> maximum tree depth=5
>> Jul  6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21):
>> ldiskfs_clear_journal_err:4862: Filesystem error recorded from previous
>> mount: IO failure
>> Jul  6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21):
>> ldiskfs_clear_journal_err:4863: Marking fs in need of filesystem check.
>> Jul  6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): warning: mounting fs
>> with errors, running e2fsck is recommended
>> Jul  6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): recovery complete
>> Jul  6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): mounted filesystem with
>> ordered data mode. Opts:
>> user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
>> Jul  6 07:39:22 oss03 kernel: LDISKFS-fs error (device dm-21):
>> htree_dirblock_to_tree:1278: inode #2: block 21233: comm mount.lustre: bad
>> entry in directory: rec_len is too small for name_len - offset=4084(4084),
>> inode=0, rec_len=12
>> , name_len=0
>> Jul  6 07:39:22 oss03 kernel: Aborting journal on device dm-21-8.
>> Jul  6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): Remounting filesystem
>> read-only
>> Jul  6 07:39:24 oss03 kernel: LDISKFS-fs warning (device dm-21):
>> kmmpd:187: kmmpd being stopped since filesystem has been remounted as
>> readonly.
>> Jul  6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): error count since last
>> fsck: 6
>> Jul  6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): initial error at time
>> 1625367384: htree_dirblock_to_tree:1278: inode 2: block 21233
>> Jul  6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): last error at time
>> 1625546362: htree_dirblock_to_tree:1278: inode 2: block 21233
>>
>> fdisk -l /dev/mapper/OST0051
>>
>> Disk /dev/mapper/OST0051: 142799.1 GB, 142799072657408 bytes, 34863054848
>> sectors
>> Units = sectors of 1 * 4096 = 4096 bytes
>> Sector size (logical/physical): 4096 bytes / 4096 bytes
>> I/O size (minimum/optimal): 2097152 bytes / 2097152 bytes
>>
>>
>> Thanks,
>> David
>>
>> On Tue, Jul 6, 2021 at 10:35 PM Spitz, Cory James 
>> wrote:
>>
>>> What OST index (number) were you trying to add?
>>>
>>>
>>>
>>> Andreas is right:
>>>
>>> Note that your "--index=0051" value is probably interpreted as an octal
>>> number "41", it should be "--index=0x0051" or "--index=0x51" (hex, to match
>>> the OST device name) or "--index=81" (decimal).

Re: [lustre-discuss] Unable to mount new OST

2021-07-06 Thread David Cohen
Thanks Artem,
I already tried that (e2fsck) to no avail.
I even tried tunefs.lustre --writeconf --erase-params on the MDS and all
the other targets, but the behaviour remains the same.
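
For reference, the usual regenerate-config sequence that the writeconf step above refers to looks roughly like this (a hedged sketch; the device paths are illustrative, not the actual ones in this system):

# unmount all clients, then unmount every target (OSTs first, then MDT/MGS)
tunefs.lustre --writeconf /dev/mdt_device        # illustrative MDT/MGS device
tunefs.lustre --writeconf /dev/mapper/OST0051    # repeat for each OST
# remount in order: MGS/MDT first, then the OSTs, then the clients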

Best regards,
David



On Tue, Jul 6, 2021 at 10:09 AM Благодаренко Артём <
artem.blagodare...@gmail.com> wrote:

> Hello David,
>
> On 6 Jul 2021, at 08:34, David Cohen 
> wrote:
>
> Jul  6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): warning: mounting fs
> with errors, running e2fsck is recommended
>
>
>
> It looks like the LDISKFS partition is in an inconsistent state now. It is better
> to follow the recommendation and run e2fsck.
>
> Best regards,
> Artem Blagodarenko.
>
>


Re: [lustre-discuss] Unable to mount new OST

2021-07-05 Thread David Cohen
Thanks Andreas,
I'm aware that index 51 actually translates to hex 33 (local-OST0033_UUID).
I don't believe that's the reason for the failed mount as it is only an
index that I increase for every new OST and there are no duplicates.

lctl dk show tens of thousands of lines repeating the same error after
attempting to mount the OST:

0010:1000:26.0:1625546374.322973:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one())
local-OST0033: fail to set LMA for init OI scrub: rc = -30
0010:1000:26.0:1625546374.322974:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one())
local-OST0033: fail to set LMA for init OI scrub: rc = -30
0010:1000:26.0:1625546374.322975:0:248211:0:(osd_scrub.c:2039:osd_ios_scan_one())
local-OST0033: fail to set LMA for init OI scrub: rc = -30

in /var/log/messages I see the following corresponding to dm21 which is the
new OST:

Jul  6 07:38:37 oss03 kernel: LDISKFS-fs warning (device dm-21):
ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected,
please wait.
Jul  6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): file extents enabled,
maximum tree depth=5
Jul  6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21):
ldiskfs_clear_journal_err:4862: Filesystem error recorded from previous
mount: IO failure
Jul  6 07:39:19 oss03 kernel: LDISKFS-fs warning (device dm-21):
ldiskfs_clear_journal_err:4863: Marking fs in need of filesystem check.
Jul  6 07:39:19 oss03 kernel: LDISKFS-fs (dm-21): warning: mounting fs with
errors, running e2fsck is recommended
Jul  6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): recovery complete
Jul  6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): mounted filesystem with
ordered data mode. Opts:
user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Jul  6 07:39:22 oss03 kernel: LDISKFS-fs error (device dm-21):
htree_dirblock_to_tree:1278: inode #2: block 21233: comm mount.lustre: bad
entry in directory: rec_len is too small for name_len - offset=4084(4084),
inode=0, rec_len=12
, name_len=0
Jul  6 07:39:22 oss03 kernel: Aborting journal on device dm-21-8.
Jul  6 07:39:22 oss03 kernel: LDISKFS-fs (dm-21): Remounting filesystem
read-only
Jul  6 07:39:24 oss03 kernel: LDISKFS-fs warning (device dm-21): kmmpd:187:
kmmpd being stopped since filesystem has been remounted as readonly.
Jul  6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): error count since last
fsck: 6
Jul  6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): initial error at time
1625367384: htree_dirblock_to_tree:1278: inode 2: block 21233
Jul  6 07:44:22 oss03 kernel: LDISKFS-fs (dm-21): last error at time
1625546362: htree_dirblock_to_tree:1278: inode 2: block 21233

As I mentioned before, the mount never completes, so the only way out of that
is a forced reboot.

Thanks,
David

On Tue, Jul 6, 2021 at 8:07 AM Andreas Dilger  wrote:

>
>
> On Jul 5, 2021, at 09:05, David Cohen 
> wrote:
>
> Hi,
> I'm using Lustre 2.10.5 and lately tried to add a new OST.
> The OST was formatted with the command below, which other than the index
> is the exact same one used for all the other OSTs in the system.
>
> mkfs.lustre --reformat --mkfsoptions="-t ext4 -T huge" --ost
> --fsname=local  --index=0051 --param ost.quota_type=ug
> --mountfsoptions='errors=remount-ro,extents,mballoc' --mgsnode=10.0.0.3@tcp
> --mgsnode=10.0.0.1@tc
> p --mgsnode=10.0.0.2@tcp --servicenode=10.0.0.3@tcp
> --servicenode=10.0.0.1@tcp --servicenode=10.0.0.2@tcp /dev/mapper/OST0051
>
>
> Note that your "--index=0051" value is probably interpreted as an octal
> number "41", it should be "--index=0x0051" or "--index=0x51" (hex, to match
> the OST device name) or "--index=81" (decimal).
>
>
> When trying to mount it with:
> mount.lustre /dev/mapper/OST0051 /Lustre/OST0051
>
> The system stays on 100% CPU (one core) forever and the mount never
> completes, not even after a week.
>
> I tried tunefs.lustre --writeconf --erase-params on the MDS and all the
> other targets, but the behaviour remains the same.
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>
>
>
>
>
>
>
>


[lustre-discuss] Unable to mount new OST

2021-07-05 Thread David Cohen
Hi,
I'm using Lustre 2.10.5 and recently tried to add a new OST.
The OST was formatted with the command below, which other than the index is
the exact same one used for all the other OSTs in the system.

mkfs.lustre --reformat --mkfsoptions="-t ext4 -T huge" --ost
--fsname=local  --index=0051 --param ost.quota_type=ug
--mountfsoptions='errors=remount-ro,extents,mballoc' --mgsnode=10.0.0.3@tcp
--mgsnode=10.0.0.1@tc
p --mgsnode=10.0.0.2@tcp --servicenode=10.0.0.3@tcp
--servicenode=10.0.0.1@tcp --servicenode=10.0.0.2@tcp /dev/mapper/OST0051
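
As an aside, a quick shell check (illustrative only) shows how an index value maps to the hex OST name, and why a leading zero matters when a number is parsed as octal:

printf 'local-OST%04x\n' 51          # decimal 51        -> local-OST0033
printf 'local-OST%04x\n' $((0051))   # octal 0051 (= 41) -> local-OST0029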

When trying to mount it with:
mount.lustre /dev/mapper/OST0051 /Lustre/OST0051

The system stays on 100% CPU (one core) forever and the mount never
completes, not even after a week.

I tried tunefs.lustre --writeconf --erase-params on the MDS and all the
other targets, but the behaviour remains the same.

David


Re: [lustre-discuss] [EXTERNAL] Re: Disk quota exceeded while quota is not filled

2020-08-26 Thread David Cohen
Thank you Chad for answering.
We are using the patched kernel on the MDT/OSS.
The problem is with the group space quota.
In any case, I enabled project quota just for future purposes.
There are no defined projects; do you think it can still pose a problem?

Best,
David




On Wed, Aug 26, 2020 at 3:18 PM Chad DeWitt  wrote:

> Hi David,
>
> Hope you're doing well.
>
> This is a total shot in the dark, but depending on the kernel version you
> are running, you may need a patched kernel to use project quotas. I'm not
> sure what the symptoms would be, but it may be worth turning off project
> quotas and seeing if doing so resolves your issue:
>
> lctl conf_param technion.quota.mdt=none
> lctl conf_param technion.quota.mdt=ug
> lctl conf_param technion.quota.ost=none
> lctl conf_param technion.quota.ost=ug
>
> (Looks like you have been running project quota on your MDT for a while
> without issue, so this may be a dead end.)
>
> Here's more info concerning when a patched kernel is necessary for
> project quotas (25.2.  Enabling Disk Quotas):
>
> http://doc.lustre.org/lustre_manual.xhtml
>
>
> Cheers,
> Chad
>
> 
>
> Chad DeWitt, CISSP | University Research Computing
>
> UNC Charlotte *| *Office of OneIT
>
> ccdew...@uncc.edu
>
> ----
>
>
>
> On Tue, Aug 25, 2020 at 3:04 AM David Cohen 
> wrote:
>
>> [*Caution*: Email from External Sender. Do not click or open links or
>> attachments unless you know this sender.]
>>
>> Hi,
>> Still hoping for a reply...
>>
>> It seems to me that old groups are more affected by the issue than new
>> ones that were created after a major disk migration.
>> It seems that the quota enforcement is somehow based on a counter other
>> than the accounting as the accounting produces the same numbers as du.
>> So if quota is calculated separately from accounting, it is possible that
>> quota is broken and keeps values from removed disks, while accounting is
>> correct.
>> So following that suspicion I tried to force the FS to recalculate quota.
>> I tried:
>> lctl conf_param technion.quota.ost=none
>> and back to:
>> lctl conf_param technion.quota.ost=ugp
>>
>> I tried running on mds and all ost:
>> tune2fs -O ^quota
>> and on again:
>> tune2fs -O quota
>> and after each attempt, also:
>> lctl lfsck_start -A -t all -o -e continue
>>
>> But still the problem persists and groups under the quota usage get
>> blocked with "quota exceeded"
>>
>> Best,
>> David
>>
>>
>> On Sun, Aug 16, 2020 at 8:41 AM David Cohen <
>> cda...@physics.technion.ac.il> wrote:
>>
>>> Hi,
>>> Adding some more information.
>>> A Few months ago the data on the Lustre fs was migrated to new physical
>>> storage.
>>> After successful migration the old ost were marked as active=0
>>> (lctl conf_param technion-OST0001.osc.active=0)
>>>
>>> Since then all the clients were unmounted and mounted.
>>> tunefs.lustre --writeconf was executed on the mgs/mdt and all the ost.
>>> lctl dl doesn't show the old OSTs anymore, but when querying quota they
>>> still appear.
>>> As I see that new users are less affected by the "quota exceeded"
>>> problem (blocked from writing while quota is not filled),
>>> I suspect that quota calculation is still summing values from the old
>>> ost:
>>>
>>> *lfs quota -g -v md_kaplan /storage/*
>>> Disk quotas for grp md_kaplan (gid 10028):
>>>  Filesystem  kbytes   quota   limit   grace   files   quota   limit
>>>   grace
>>>   /storage/ 4823987000   0 5368709120   -  143596   0
>>> 0   -
>>> technion-MDT_UUID
>>>   37028   -   0   -  143596   -   0
>>>   -
>>> quotactl ost0 failed.
>>> quotactl ost1 failed.
>>> quotactl ost2 failed.
>>> quotactl ost3 failed.
>>> quotactl ost4 failed.
>>> quotactl ost5 failed.
>>> quotactl ost6 failed.
>>> quotactl ost7 failed.
>>> quotactl ost8 failed.
>>> quotactl ost9 failed.
>>> quotactl ost10 failed.
>>> quotactl ost11 failed.
>>> quotactl ost12 failed.
>>> quotactl ost13 failed.
>>> quotactl ost14 failed.
>>> quotactl ost15 failed.
>>> quotactl ost16 failed.
>>> quotactl ost17 failed.
>>> quotactl ost18 failed

Re: [lustre-discuss] Disk quota exceeded while quota is not filled

2020-08-25 Thread David Cohen
Hi,
Still hoping for a reply...

It seems to me that old groups are more affected by the issue than new ones
that were created after a major disk migration.
It also seems that quota enforcement is based on a counter other than the
accounting, as the accounting produces the same numbers as du.
So if quota is calculated separately from accounting, it is possible that
quota is broken and keeps values from removed disks, while the accounting is
correct.
So following that suspicion I tried to force the FS to recalculate quota.
I tried:
lctl conf_param technion.quota.ost=none
and back to:
lctl conf_param technion.quota.ost=ugp

I tried running on mds and all ost:
tune2fs -O ^quota
and on again:
tune2fs -O quota
and after each attempt, also:
lctl lfsck_start -A -t all -o -e continue

But the problem still persists, and groups whose usage is under their quota
get blocked with "quota exceeded".

Best,
David


On Sun, Aug 16, 2020 at 8:41 AM David Cohen 
wrote:

> Hi,
> Adding some more information.
> A Few months ago the data on the Lustre fs was migrated to new physical
> storage.
> After successful migration the old ost were marked as active=0
> (lctl conf_param technion-OST0001.osc.active=0)
>
> Since then all the clients were unmounted and mounted.
> tunefs.lustre --writeconf was executed on the mgs/mdt and all the ost.
> lctl dl doesn't show the old OSTs anymore, but when querying quota they
> still appear.
> As I see that new users are less affected by the "quota exceeded" problem
> (blocked from writing while quota is not filled),
> I suspect that quota calculation is still summing values from the old ost:
>
> *lfs quota -g -v md_kaplan /storage/*
> Disk quotas for grp md_kaplan (gid 10028):
>  Filesystem  kbytes   quota   limit   grace   files   quota   limit
> grace
>   /storage/ 4823987000   0 5368709120   -  143596   0
>   0   -
> technion-MDT_UUID
>   37028   -   0   -  143596   -   0
> -
> quotactl ost0 failed.
> quotactl ost1 failed.
> quotactl ost2 failed.
> quotactl ost3 failed.
> quotactl ost4 failed.
> quotactl ost5 failed.
> quotactl ost6 failed.
> quotactl ost7 failed.
> quotactl ost8 failed.
> quotactl ost9 failed.
> quotactl ost10 failed.
> quotactl ost11 failed.
> quotactl ost12 failed.
> quotactl ost13 failed.
> quotactl ost14 failed.
> quotactl ost15 failed.
> quotactl ost16 failed.
> quotactl ost17 failed.
> quotactl ost18 failed.
> quotactl ost19 failed.
> quotactl ost20 failed.
> technion-OST0015_UUID
> 114429464*  - 114429464   -   -   -
> -   -
> technion-OST0016_UUID
> 92938588   - 92938592   -   -   -   -
>   -
> technion-OST0017_UUID
> 128496468*  - 128496468   -   -   -
> -   -
> technion-OST0018_UUID
> 191478704*  - 191478704   -   -   -
> -   -
> technion-OST0019_UUID
> 107720552   - 107720560   -   -   -
> -   -
> technion-OST001a_UUID
> 165631952*  - 165631952   -   -   -
> -   -
> technion-OST001b_UUID
> 460714156*  - 460714156   -   -   -
> -   -
> technion-OST001c_UUID
> 157182900*  - 157182900   -   -   -
> -   -
> technion-OST001d_UUID
> 102945952*  - 102945952   -   -   -
> -   -
> technion-OST001e_UUID
> 175840980*  - 175840980   -   -   -
> -   -
> technion-OST001f_UUID
> 142666872*  - 142666872   -   -   -
> -   -
> technion-OST0020_UUID
> 188147548*  - 188147548   -   -   -
> -   -
> technion-OST0021_UUID
> 125914240*  - 125914240   -   -   -
> -   -
> technion-OST0022_UUID
> 186390800*  - 186390800   -   -   -
> -   -
> technion-OST0023_UUID
> 115386876   - 115386884   -   -   -
> -   -
> technion-OST0024_UUID
> 127139556*  - 127139556   -   -   -
> -   -
> technion-OST0025_UUID
> 179666580*  - 179666580   -   -   -
> -   -
> technion-OST0026_UUID
> 147837348   - 147837356   -   -   -
> -   -
> technion-OST0027_UUID
> 129823528   - 129823536   -   -   -
> -   -
> technion-OST0028_UUID
> 158270776   - 158270784   -   -   -
> -   -
> technion-OST0029_UUID
>   

Re: [lustre-discuss] Disk quota exceeded while quota is not filled

2020-08-15 Thread David Cohen
  Filesystem    used    quota   limit   grace   files   quota   limit   grace
  /storage/     4.493T  0k      5T      -       143596  0       0       -



On Tue, Aug 11, 2020 at 7:35 AM David Cohen 
wrote:

> Hi,
> I'm running Lustre 2.10.5 on the oss and mds, and 2.10.7 on the clients.
> While inode quota on the MDT has worked fine for a while now:
> lctl conf_param technion.quota.mdt=ugp
> Then, a few days ago, I turned on quota on the OSTs:
> lctl conf_param technion.quota.ost=ugp
> Users started getting "Disk quota exceeded" error messages while quota is
> not filled.
>
> Actions taken:
> Full e2fsck -f -y to all the file system, mdt and ost.
> lctl lfsck_start -A -t all -o -e continue
> turning quota to none and back.
>
> None of the above solved the problem.
>
> lctl lfsck_query
>
>
> layout_mdts_init: 0
> layout_mdts_scanning-phase1: 0
> layout_mdts_scanning-phase2: 0
> layout_mdts_completed: 0
> layout_mdts_failed: 0
> layout_mdts_stopped: 0
> layout_mdts_paused: 0
> layout_mdts_crashed: 0
> *layout_mdts_partial: 1 *# is that normal output?
> layout_mdts_co-failed: 0
> layout_mdts_co-stopped: 0
> layout_mdts_co-paused: 0
> layout_mdts_unknown: 0
> layout_osts_init: 0
> layout_osts_scanning-phase1: 0
> layout_osts_scanning-phase2: 0
> layout_osts_completed: 30
> layout_osts_failed: 0
> layout_osts_stopped: 0
> layout_osts_paused: 0
> layout_osts_crashed: 0
> layout_osts_partial: 0
> layout_osts_co-failed: 0
> layout_osts_co-stopped: 0
> layout_osts_co-paused: 0
> layout_osts_unknown: 0
> layout_repaired: 15
> namespace_mdts_init: 0
> namespace_mdts_scanning-phase1: 0
> namespace_mdts_scanning-phase2: 0
> namespace_mdts_completed: 1
> namespace_mdts_failed: 0
> namespace_mdts_stopped: 0
> namespace_mdts_paused: 0
> namespace_mdts_crashed: 0
> namespace_mdts_partial: 0
> namespace_mdts_co-failed: 0
> namespace_mdts_co-stopped: 0
> namespace_mdts_co-paused: 0
> namespace_mdts_unknown: 0
> namespace_osts_init: 0
> namespace_osts_scanning-phase1: 0
> namespace_osts_scanning-phase2: 0
> namespace_osts_completed: 0
> namespace_osts_failed: 0
> namespace_osts_stopped: 0
> namespace_osts_paused: 0
> namespace_osts_crashed: 0
> namespace_osts_partial: 0
> namespace_osts_co-failed: 0
> namespace_osts_co-stopped: 0
> namespace_osts_co-paused: 0
> namespace_osts_unknown: 0
> namespace_repaired: 99
>
>
>
>
>


[lustre-discuss] Disk quota exceeded while quota is not filled

2020-08-10 Thread David Cohen
Hi,
I'm running Lustre 2.10.5 on the oss and mds, and 2.10.7 on the clients.
While inode quota on the MDT has worked fine for a while now:
lctl conf_param technion.quota.mdt=ugp
Then, a few days ago, I turned on quota on the OSTs:
lctl conf_param technion.quota.ost=ugp
Users started getting "Disk quota exceeded" error messages while quota is
not filled.

Actions taken:
Full e2fsck -f -y on the whole file system, MDT and OSTs.
lctl lfsck_start -A -t all -o -e continue
Turning quota to none and back.

None of the above solved the problem.

lctl lfsck_query


layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 0
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
*layout_mdts_partial: 1 *# is that normal output?
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 30
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 15
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 99


Re: [lustre-discuss] MGS+MDT migration to a new storage using LVM tools

2020-07-21 Thread David Cohen
Thanks Andreas for your detailed reply.

I took your advice on the MDT naming.
As the migration is now complete, I want to share some major problems I had
along the way.

I don't know where to point the blame: the Lustre kernel, e2fsprogs, the SRP
tools, the multipath version, the LVM version, or the move from 512-byte to
4096-byte blocks.
But as soon as I created the mirror, the server went into a kernel panic,
core-dump loop.

I managed to stop it only by breaking the mirror from another server
connected to the same storage.
It took me a full day to recover the system.

Today I restarted the process, this time from a different server, not
running Lustre, which I had already used for a VM LUN LVM migration.
The exact same procedure ran flawlessly, and I only needed to refresh the
LVM on the MDS to be able to mount the migrated MDT.

Cheers,
David

On Sun, Jul 19, 2020 at 12:27 PM Andreas Dilger  wrote:

> On Jul 19, 2020, at 12:41 AM, David Cohen 
> wrote:
> >
> > Hi,
> > We have a combined MGS+MDT and I'm looking for a migration to new
> storage with a minimal disruption to the running jobs on the cluster.
> >
> > Can anyone find problems in the scenario below and/or suggest another
> solution?
> > I would appreciate also "no problems" replies to reassure the scenario
> before I proceed.
> >
> > Current configuration:
> > The mdt is a logical volume in a lustre_pool VG on a /dev/mapper/MDT0001
> PV
>
> I've been running Lustre on LVM at home for many years, and have done
> pvmove
> of the underlying storage to new devices without any problems.
>
> > Migration plan:
> > Add /dev/mapper/MDT0002 new disk (multipath)
>
> I would really recommend that you *not* use MDT0002 as the name of the PV.
> This is very confusing because the MDT itself (at the Lustre level) is
> almost certainly named "-MDT", and if you ever add new MDTs to
> this filesystem it will be confusing as to which *Lustre* MDT is on which
> underlying PV.  Instead, I'd take the opportunity to name this "MDT" to
> match the actual Lustre MDT target name.
>
> > extend the VG:
> > pvcreate /dev/mapper/MDT0002
> > vgextend  lustre_pool /dev/mapper/MDT0002
> > mirror the mdt to the new disk:
> > lvconvert -m 1 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0002
>
> I typically just use "pvmove", but doing this by adding a mirror and then
> splitting it off is probably safer.  That would still leave you with a full
> copy of the MDT on the original PV if something happened in the middle.
>
> > wait the mirrored disk to sync:
> > lvs -o+devices
> > when it's fully synced unmount the MDT, remove the old disk from the
> mirror:
> > lvconvert -m 0 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0001
> > and remove the old disk from the pool:
> > vgreduce lustre_pool /dev/mapper/MDT0001
> > pvremove /dev/mapper/MDT0001
> > remount the MDT and let the clients few minutes to recover the
> connection.
>
> In my experience with pvmove, there is no need to do anything with the
> clients,
> as long as you are not also moving the MDT to a new server, since the
> LVM/DM
> operations are totally transparent to both the Lustre server and client.
>
> After my pvmove (your "lvconvert -m 0"), I would just vgreduce the old PV
> from
> the VG, and then leave it in the system (internal HDD) until the next time
> I
> needed to shut down the server.  If you have hot-plug capability for the
> PVs,
> then you don't even need to wait for that.
>
> Cheers, Andreas
>
>
>
>
>
>


[lustre-discuss] MGS+MDT migration to a new storage using LVM tools

2020-07-19 Thread David Cohen
Hi,
We have a combined MGS+MDT and I'm looking for a migration to new storage
with a minimal disruption to the running jobs on the cluster.

Can anyone find problems in the scenario below and/or suggest another
solution?
I would also appreciate "no problems" replies to confirm the scenario
before I proceed.

Current configuration:
The mdt is a logical volume in a lustre_pool VG on a /dev/mapper/MDT0001 PV

Migration plan:
Add /dev/mapper/MDT0002 new disk (multipath)
extend the VG:
pvcreate /dev/mapper/MDT0002
vgextend  lustre_pool /dev/mapper/MDT0002
mirror the mdt to the new disk:
lvconvert -m 1 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0002
wait the mirrored disk to sync:
lvs -o+devices
when it's fully synced unmount the MDT, remove the old disk from the mirror:
lvconvert -m 0 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0001
and remove the old disk from the pool:
vgreduce lustre_pool /dev/mapper/MDT0001
pvremove /dev/mapper/MDT0001
remount the MDT and give the clients a few minutes to recover the connection.
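
For comparison, the pvmove-based variant suggested in the reply earlier in this thread would look roughly like this (a hedged sketch using the same device names; pvmove runs online and is restartable):

pvcreate /dev/mapper/MDT0002
vgextend lustre_pool /dev/mapper/MDT0002
pvmove /dev/mapper/MDT0001 /dev/mapper/MDT0002   # move all extents off the old PV
vgreduce lustre_pool /dev/mapper/MDT0001
pvremove /dev/mapper/MDT0001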

Thanks,
David


Re: [lustre-discuss] frequent Connection lost, Connection restored to mdt

2019-12-23 Thread David Cohen
Hi,
Yes, I do see load on the client side, but as the client has a 40Gb NIC and
the load comes from a 10Gb WAN link, I wouldn't expect it to overload the
network.
I can correlate the messages with loads higher than 6Gb/s from the WAN, far
from the limit of the NIC.
The client has a latest-generation Xeon processor, so I wouldn't expect that
to be the bottleneck either.

David


On Mon, Dec 23, 2019 at 5:09 PM Degremont, Aurelien 
wrote:

> Hi
>
>
>
> These messages means the client thinks it has lost the communication with
> the server and reconnect. The server only sees the reconnection and never
> thought the client was gone.
>
>
>
> It could be related to lots of things. The server could be receiving RPCs
> from this client but not processing them fast enough. Is there other errors
> on your server? Is there any high load?
>
> Same on your clients? Is there any high load that could prevent your
> client from communicating with your server properly?
>
>
>
> Do you correlate that with some specific load running on your clients?
>
>
>
> Aurélien
>
>
>
> *De : *lustre-discuss  au nom de
> David Cohen 
> *Date : *dimanche 22 décembre 2019 à 17:08
> *À : *"lustre-discuss@lists.lustre.org" 
> *Objet : *[lustre-discuss] frequent Connection lost, Connection restored
> to mdt
>
>
>
> Hi,
>
> We are running 2.10.5 on the servers and 2.10.8 on the clients.
>
> Every few minutes, we see:
>
>
>
> On client side:
>
>
>
> Dec 22 15:26:34 gftp kernel: Lustre:
> 439834:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has
> timed out for slow reply: [sent 1577021187/real 1577021187]
>  req@88160be9c6c0 x1653620348981536/t0(0)
> o36->lustre-MDT-mdc-8817d9776c00@10.0.0.1@tcp:12/10 lens 608/4768
> e 0 to 1 dl 1577021194 ref 2 fl Rpc:X/0/ rc 0/-1
> Dec 22 15:26:34 gftp kernel: Lustre:
> 439834:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 3 previous
> similar messages
> Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00:
> Connection to lustre-MDT (at 10.0.0.1@tcp) was lost; in progress
> operations using this service will wait for recovery to complete
> Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages
> Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00:
> Connection restored to 10.0.0.1@tcp (at 192.114.101.153@tcp)
> Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages
>
>
>
> On server side:
>
>
>
> Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Client
> 38d6eef1-e146-be41-bab9-409b272d0d4f (at 10.0.0.10@tcp) reconnecting
> Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Connection restored
> to ec2cdfce-353f-583a-c970-fde3f5d5189c (at 10.0.0.10@tcp)
>
>
>


[lustre-discuss] frequent Connection lost, Connection restored to mdt

2019-12-22 Thread David Cohen
Hi,
We are running 2.10.5 on the servers and 2.10.8 on the clients.
Every few minutes, we see:

On client side:

Dec 22 15:26:34 gftp kernel: Lustre:
439834:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1577021187/real 1577021187]
 req@88160be9c6c0 x1653620348981536/t0(0)
o36->lustre-MDT-mdc-8817d9776c00@10.0.0.1@tcp:12/10 lens 608/4768 e
0 to 1 dl 1577021194 ref 2 fl Rpc:X/0/ rc 0/-1
Dec 22 15:26:34 gftp kernel: Lustre:
439834:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 3 previous
similar messages
Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00:
Connection to lustre-MDT (at 10.0.0.1@tcp) was lost; in progress
operations using this service will wait for recovery to complete
Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages
Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT-mdc-8817d9776c00:
Connection restored to 10.0.0.1@tcp (at 192.114.101.153@tcp)
Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages

On server side:

Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Client
38d6eef1-e146-be41-bab9-409b272d0d4f (at 10.0.0.10@tcp) reconnecting
Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT: Connection restored
to ec2cdfce-353f-583a-c970-fde3f5d5189c (at 10.0.0.10@tcp)
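
A hedged first check for this kind of intermittent reconnect is the RPC timeout and adaptive-timeout settings (parameter names as in the 2.10 manual; the value below is only illustrative):

lctl get_param timeout at_min at_max     # current obd timeout and AT bounds, on client and server
# if the MDS is briefly too slow under peak load, raising the AT floor is a common mitigation:
lctl conf_param lustre.sys.at_min=40     # persistent form, run on the MGS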


[lustre-discuss] Limit to the number of "--servicenode="

2018-09-29 Thread David Cohen
Hi,
In all the manuals and examples there are only two "--servicenode=" options in
the creation of the MGS nodes and OSSs.
Is that a limitation, or can I create more service nodes?
Is the maximum number of servicenodes different for MGS and OSS?
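
For illustration, what I have in mind, assuming --servicenode can simply be repeated (NIDs and device are made up; I have not verified whether an upper limit exists):

mkfs.lustre --ost --fsname=fsname --index=0x15 \
    --mgsnode=oss01@tcp --mgsnode=oss02@tcp \
    --servicenode=oss01@tcp --servicenode=oss02@tcp --servicenode=oss03@tcp \
    /dev/mapper/OST0015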

David


Re: [lustre-discuss] Lustre 2.10.4 failover

2018-08-13 Thread David Cohen
the fstab line I use for mounting the Lustre filesystem:

oss03@tcp:oss01@tcp:/fsname  /storage  lustre  flock,user_xattr,defaults  0 0

The MDS is also configured for failover (unsuccessfully):
tunefs.lustre --writeconf --erase-params --fsname=fsname --mgs
--mountfsoptions='user_xattr,errors=remount-ro,acl'
--param="mgsnode=oss03@tcp mgsnode=oss01@tcp servicenode=oss01@tcp
servicenode=oss03@tcp" /dev/lustre_pool/MDT




On Mon, Aug 13, 2018 at 8:40 PM Mohr Jr, Richard Frank (Rick Mohr) <
rm...@utk.edu> wrote:

>
> > On Aug 13, 2018, at 7:14 AM, David Cohen 
> wrote:
> >
> > I installed a new 2.10.4 Lustre file system.
> > Running MDS and OSS on the same servers.
> > Failover wasn't configured at format time.
> > I'm trying to configure failover node with tunefs without success.
> > tunefs.lustre --writeconf --erase-params --param="ost.quota_type=ug"
> --mgsnode=oss03@tcp --mgsnode=oss01@tcp --servicenode=oss01@tcp
> --servicenode=oss03@tcp /dev/mapper/OST0015
> >
> > I can mount the ost on the second server but the clients won't restore
> the connection.
> > Maybe I'm missing something obvious. Do you see any typo in the command?
>
> What mount command are you using on the client?
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>


[lustre-discuss] Lustre 2.10.4 failover

2018-08-13 Thread David Cohen
Hi
I installed a new 2.10.4 Lustre file system.
Running MDS and OSS on the same servers.
Failover wasn't configured at format time.
I'm trying to configure a failover node with tunefs without success.
tunefs.lustre --writeconf --erase-params --param="ost.quota_type=ug"
--mgsnode=oss03@tcp --mgsnode=oss01@tcp --servicenode=oss01@tcp
--servicenode=oss03@tcp /dev/mapper/OST0015

I can mount the ost on the second server but the clients won't restore the
connection.
Maybe I'm missing something obvious. Do you see any typo in the command?
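
For reference, a way to verify what actually ended up on the target, and how clients should mount so they can fail over (a hedged sketch; paths and NIDs as used elsewhere in this thread):

tunefs.lustre --dryrun /dev/mapper/OST0015   # prints stored mgsnode/servicenode parameters, changes nothing
mount -t lustre oss03@tcp:oss01@tcp:/fsname /storage -o flock,user_xattr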


David


[lustre-discuss] How to support user_xattr in 2.10.4

2018-07-16 Thread David Cohen
Hi,
I'm running a newly installed Lustre 2.10.4.
The mds is configured to support acl and user_xattr:

Persistent mount opts: user_xattr,errors=remount-ro,acl

But when trying to mount (or remount) the client with "-o
remount,acl,user_xattr"

And checking the mount I get only:
type lustre (rw,lazystatfs)

While ACL seems to be available, user_xattr isn't:

rsync: rsync_xal_set:
lsetxattr(""/storage/atlas/atlasdatadisk/SAM/testfile-prep-GET-ATLASLOCALGROUPDISK.txt"","user.storm.checksum.adler32")
failed: Operation not supported (95)

David


[lustre-discuss] Mount options ignored?

2018-07-02 Thread David Cohen
Hi,
I'm running a newly installed Lustre 2.10.4.
The mds is configured to support acl and user_xattr:

Persistent mount opts: user_xattr,errors=remount-ro,acl

But when trying to mount (or remount) the client with "-o
remount,acl,user_xattr"

And checking the mount I get only:
type lustre (rw,lazystatfs)

While ACL seems to be available, user_xattr isn't:

rsync: rsync_xal_set:
lsetxattr(""/storage/atlas/atlasdatadisk/SAM/testfile-prep-GET-ATLASLOCALGROUPDISK.txt"","user.storm.checksum.adler32")
failed: Operation not supported (95)
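
If it helps to narrow this down, one hedged test is a clean mount (rather than -o remount) followed by setting a user xattr directly; the NID and mount point below are illustrative:

umount /storage
mount -t lustre -o flock,user_xattr,acl oss03@tcp:/fsname /storage
touch /storage/xattrtest
setfattr -n user.test -v 1 /storage/xattrtest && getfattr -n user.test /storage/xattrtest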

David


Re: [lustre-discuss] Lustre Client in a container

2018-01-03 Thread David Cohen
Thanks for all the answers.
I was thinking of creating a new file system, starting from a clean
configuration, implementing quotas, etc.
For that I was looking for a way in which the two systems could coexist, moving
symbolic links while the folders are synchronized to the new system,
in the process emptying disks of the old file system and moving them to the
new one.
This is a long process that might take more than a month, but it can be done
without disturbing normal cluster operation.

As that doesn't seem to be possible in real life, I will have to reevaluate
my options and come up with a different migration scheme.



On Wed, Jan 3, 2018 at 1:49 PM, Patrick Farrell <p...@cray.com> wrote:

> FWIW, as long as you don’t intend to use any interesting features (quotas,
> etc), 1.8 clients were used with 2.5 servers at ORNL for some time with no
> ill effects on the IO side of things.
>
> I’m not sure how much further that limited compatibility goes, though.
> --
> *From:* Dilger, Andreas <andreas.dil...@intel.com>
> *Sent:* Wednesday, January 3, 2018 4:20:56 AM
> *To:* David Cohen
> *Cc:* Patrick Farrell; lustre-discuss@lists.lustre.org
> *Subject:* Re: [lustre-discuss] Lustre Client in a container
>
> On Dec 31, 2017, at 01:50, David Cohen <cda...@physics.technion.ac.il>
> wrote:
> >
> > Patrick,
> > Thanks for you response.
> > I looking for a way to migrate from 1.8.9 system to 2.10.2, stable
> enough to run the several weeks or more that it might take.
>
> Note that there is no longer direct support for upgrading from 1.8 to
> 2.10.
>
> That said, are you upgrading the filesystem in place, or are you copying
> the data from the 1.8.9 filesystem to the 2.10.2 filesystem?  In the latter
> case, the upgrade compatibility doesn't really matter.  What you need is a
> client that can mount both server versions at the same time.
>
> Unfortunately, no 2.x clients can mount the 1.8.x server filesystem
> directly, so that does limit your options.  There was a time of
> interoperability with 1.8 clients being able to mount 2.1-ish servers, but
> that doesn't really help you.  You could upgrade the 1.8 servers to 2.1 or
> later, and then mount both filesystems with a 2.5-ish client, or upgrade
> the servers to 2.5.
>
> Cheers, Andreas
>
> > On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell <p...@cray.com> wrote:
> > David,
> >
> > I have no direct experience trying this, but I would imagine not -
> Lustre is a kernel module (actually a set of kernel modules), so unless the
> container tech you're using allows loading multiple different versions of
> *kernel modules*, this is likely impossible.  My limited understanding of
> container tech on Linux suggests that this would be impossible, containers
> allow userspace separation but there is only one kernel/set of
> modules/drivers.
> >
> > I don't know of any way to run multiple client versions on the same node.
> >
> > The other question is *why* do you want to run multiple client versions
> on one node...?  Clients are usually interoperable across a pretty generous
> set of server versions.
> >
> > - Patrick
> >
> >
> > From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on
> behalf of David Cohen <cda...@physics.technion.ac.il>
> > Sent: Saturday, December 30, 2017 11:45:15 AM
> > To: lustre-discuss@lists.lustre.org
> > Subject: [lustre-discuss] Lustre Client in a container
> >
> > Hi,
> > Is it possible to run Lustre client in a container?
> > The goal is to run two different client version on the same node, can it
> be done?
> >
> > David
> >
> >
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
>
>
>
>
>
>
>
>


Re: [lustre-discuss] Lustre Client in a container

2017-12-31 Thread David Cohen
Patrick,
Thanks for your response.
I'm looking for a way to migrate from a 1.8.9 system to 2.10.2, stable enough
to run for the several weeks or more that it might take.


David

On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell <p...@cray.com> wrote:

> David,
>
>
> I have no direct experience trying this, but I would imagine not - Lustre
> is a kernel module (actually a set of kernel modules), so unless the
> container tech you're using allows loading multiple different versions of
> *kernel modules*, this is likely impossible.  My limited understanding of
> container tech on Linux suggests that this would be impossible, containers
> allow userspace separation but there is only one kernel/set of
> modules/drivers.
>
>
> I don't know of any way to run multiple client versions on the same node.
>
>
> The other question is *why* do you want to run multiple client versions on
> one node...?  Clients are usually interoperable across a pretty generous
> set of server versions.
>
>
> - Patrick
>
>
>
> --
> *From:* lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on
> behalf of David Cohen <cda...@physics.technion.ac.il>
> *Sent:* Saturday, December 30, 2017 11:45:15 AM
> *To:* lustre-discuss@lists.lustre.org
> *Subject:* [lustre-discuss] Lustre Client in a container
>
> Hi,
> Is it possible to run Lustre client in a container?
> The goal is to run two different client version on the same node, can it
> be done?
>
> David
>
>


[lustre-discuss] Lustre Client in a container

2017-12-30 Thread David Cohen
Hi,
Is it possible to run a Lustre client in a container?
The goal is to run two different client versions on the same node; can it be
done?

David


Re: [Lustre-discuss] delete a undeletable file

2013-03-08 Thread David Cohen
You can move the entire folder (mv) to another location, e.g.
/lustre_fs/something/badfiles, recreate the folder, and mv back only the good
files.
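
Spelled out, that workaround looks roughly like this (a hedged sketch; paths are illustrative, and .viminfo is the damaged entry from the report quoted below):

mv /lustre_fs/user_home /lustre_fs/user_home.bad   # set the directory with the broken entry aside
mkdir /lustre_fs/user_home
# move the good entries back, skipping the damaged file
( cd /lustre_fs/user_home.bad && \
  find . -maxdepth 1 ! -name . ! -name .viminfo -exec mv {} /lustre_fs/user_home/ \; )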

 If I run unlink .viminfo I got the same error:
 
 unlink: cannot unlink `.viminfo': Invalid argument
 
 
 I can't stop the MDS/OSS to do an lfsck or e2fsck because it is a production 
 filesystem with lots of terabytes
 
 
 Any other ideas to delete the damn file?
 
 
 
 THANKS!
 
 -Mensaje original- 
 From: Bob Ball
 Sent: Thursday, March 07, 2013 6:09 PM
 To: Colin Faber
 Cc: Alfonso Pardo ; Ben Evans ; lustre-discuss@lists.lustre.org
 Subject: Re: [Lustre-discuss] delete a undeletable file
 
 You could just unlink it instead.  That will work when rm fails.
 
 bob
 
 On 3/7/2013 11:10 AM, Colin Faber wrote:
  Hi,
 
  If the file is disassociated with an OST which is offline, bring the OST
  back online, if the OST object it self is missing then you can remove
  the file using 'unlink' rather than 'rm' to unlink the object meta data.
 
  If you want to try and recover the missing OST object an lfs getstripe
  against the file should yield the the OST on which it resides. Once
  that's determined you can take that OST offline and e2fsck may
  successfully restore it.
 
  Another option as Ben correctly points out, lfsck will correct / prune
  this meta data as well as the now orphaned (if any) OST object.
 
  -cf
 
 
  On 03/07/2013 08:30 AM, Ben Evans wrote:
  The snarky reply would be to use Emacs.
 
  More seriously:
 
  When I see something like ?? in the attributes for a file, my
  first thought is that the group_upcall on the filesystem is not
  correct, so permissions are broken. If you can log on as root, you may
  be able to see it clearly.
 
  If that doesn't work, you may have to run an fsck on the MDT (which
  may take minutes to hours depending on the size of your MDT)
 
  If that doesn't work, follow the procedure for running an lfsck (which
  will take a long time, and require quite a bit of storage to execute)
 
  -Ben Evans
 
  
  *From:* lustre-discuss-boun...@lists.lustre.org
  [lustre-discuss-boun...@lists.lustre.org] on behalf of Alfonso Pardo
  [alfonso.pa...@ciemat.es]
  *Sent:* Thursday, March 07, 2013 10:09 AM
  *To:* lustre-discuss@lists.lustre.org; wc-disc...@whamcloud.com
  *Subject:* [Lustre-discuss] delete a undeletable file
 
  Hello,
  I have a corrupt file that I can't delete.
  This is my file:
  ls -la .viminfo
  -? ? ? ? ? ? .viminfo
  lfs getstripe .viminfo
  .viminfo
  lmm_stripe_count: 6
  lmm_stripe_size: 1048576
  lmm_layout_gen: 0
  lmm_stripe_offset: 18
  obdidx objid objid group
  18 1442898 0x160452 0
  22 48 0x30 0
  19 1442770 0x1603d2 0
  21 49 0x31 0
  23 48 0x30 0
  20 50 0x32 0
  And these are my OST:
  /lctl dl/
  /0 UP mgc MGC192.168.11.9@tcp f6d5b76f-a7e0-61ca-b389-cb3896b86186 5/
  /1 UP lov cetafs-clilov-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 4/
  /2 UP lmv cetafs-clilmv-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 4/
  /3 UP mdc cetafs-MDT-mdc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /4 UP osc cetafs-OST-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /5 UP osc cetafs-OST0001-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /6 UP osc cetafs-OST0002-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /7 UP osc cetafs-OST0003-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /8 UP osc cetafs-OST0004-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /9 UP osc cetafs-OST0005-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /10 UP osc cetafs-OST0006-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /11 UP osc cetafs-OST0007-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /12 UP osc cetafs-OST0012-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /13 UP osc cetafs-OST0013-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /14 UP osc cetafs-OST0008-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /15 UP osc cetafs-OST000a-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /16 UP osc cetafs-OST0009-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /17 UP osc cetafs-OST000b-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /18 UP osc cetafs-OST000c-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /19 UP osc cetafs-OST000d-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /20 UP osc cetafs-OST000e-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /21 UP osc cetafs-OST000f-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /22 UP osc cetafs-OST0010-osc-88009816e400
  a7ba6783-6ed8-2197-4ffc-fecbff9860a5 5/
  /23 UP osc cetafs-OST0011-osc-88009816e400
  

Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-06 Thread David Cohen
On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
 On 2010-01-04, at 03:02, David Cohen wrote:
  I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a
  problem
  with qlogic drivers and rolled back to 1.6.6).
  My MDS gets unresponsive each day at 4-5 am local time, no kernel
  panic or
  error messages before.

It was indeed the *locate update; a simple edit of /etc/updatedb.conf on the 
clients and the system is stable again.
Many thanks.
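
For anyone hitting the same pattern, the edit is typically just telling updatedb to skip Lustre; a hedged sketch of /etc/updatedb.conf (variable names per mlocate/slocate, paths illustrative):

# /etc/updatedb.conf
PRUNEFS="... lustre"        # add "lustre" to the existing PRUNEFS list
# or exclude the Lustre mount point explicitly:
PRUNEPATHS="... /storage"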


 
 Judging by the time, I'd guess this is slocate or mlocate running
 on all of your clients at the same time.  This used to be a source of
 extremely high load back in the old days, but I thought that Lustre
 was in the exclude list in newer versions of *locate.  Looking at the
 installed mlocate on my system, that doesn't seem to be the case...
 strange.
 
  Some errors and an LBUG appear in the log after force booting the
  MDS and
  mounting the MDT and then the log is clear until next morning:
 
  Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
  (class_hash.c:225:lustre_hash_findadd_unique_hnode())
  ASSERTION(hlist_unhashed(hnode)) failed
  Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
  (class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
  Jan  4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-
  debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
  Jan  4 06:33:31 tech-mds kernel: ll_mgs_02 R  running task
  0  6357
  16340 (L-TLB)
  Jan  4 06:33:31 tech-mds kernel: Call Trace:
  Jan  4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
  Jan  4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
  Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
  Jan  4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
  Jan  4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
  Jan  4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
  Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
  Jan  4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
 
 It shouldn't LBUG during recovery, however.
 
 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.
 

-- 
David Cohen
Grid Computing
Physics Department
Technion - Israel Institute of Technology


[Lustre-discuss] MDS crashes daily at the same hour

2010-01-04 Thread David Cohen
 active, resetting orphans
Jan  4 06:38:31 tech-mds kernel: Lustre: MDS technion-MDT: technion-
OST_UUID now active, resetting orphans
Jan  4 06:38:41 tech-mds kernel: LustreError: 6392:0:
(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18531070: cookie 
0xdcb9c7fd999ea709  r...@8100d3ed x1323646224495072/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579964 ref 1 fl Interpret:/0/0 rc 0/0
Jan  4 06:38:41 tech-mds kernel: LustreError: 6398:0:
(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18531068: cookie 
0xdcb9c7fd999e9dfc  r...@8100dc7c8c00 x1323646224495073/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579927 ref 1 fl Interpret:/0/0 rc 0/0
Jan  4 06:38:41 tech-mds kernel: LustreError: 6415:0:
(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18508458: cookie 
0xdcb9c7fd9983617e  r...@8100d4bfb400 x1323646224495345/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579927 ref 1 fl Interpret:/0/0 rc 0/0
Jan  4 06:38:41 tech-mds kernel: LustreError: 6415:0:
(mds_open.c:1665:mds_close()) Skipped 271 previous similar messages
Jan  4 06:38:42 tech-mds kernel: LustreError: 6409:0:
(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18498078: cookie 
0xdcb9c7fd99273a35  r...@810054d2e800 x1323646224496303/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579928 ref 1 fl Interpret:/0/0 rc 0/0
Jan  4 06:38:42 tech-mds kernel: LustreError: 6409:0:
(mds_open.c:1665:mds_close()) Skipped 957 previous similar messages
Jan  4 06:38:44 tech-mds kernel: LustreError: 6413:0:
(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18464618: cookie 
0xdcb9c7fd9893064a  r...@8100d39f3400 x1323646224498078/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579930 ref 1 fl Interpret:/0/0 rc 0/0
Jan  4 06:38:44 tech-mds kernel: LustreError: 6413:0:
(mds_open.c:1665:mds_close()) Skipped 1774 previous similar messages
Jan  4 06:38:48 tech-mds kernel: LustreError: 6423:0:
(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18437710: cookie 
0xdcb9c7fd9817e589  r...@8100d45b5c00 x1323646224499484/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579934 ref 1 fl Interpret:/0/0 rc 0/0
Jan  4 06:38:48 tech-mds kernel: LustreError: 6423:0:
(mds_open.c:1665:mds_close()) Skipped 1405 previous similar messages
Jan  4 06:38:53 tech-mds kernel: LustreError: 6422:0:
(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-116)  
r...@810054d38000 x1323646224500886/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579939 ref 1 fl Interpret:/0/0 rc -116/0
Jan  4 06:38:53 tech-mds kernel: LustreError: 6422:0:
(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 5838 previous similar 
messages
Jan  4 06:38:56 tech-mds kernel: LustreError: 6420:0:
(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 13567564: cookie 
0xde1fda06cd4d058c  r...@810055378800 x1323646224501408/t0 o35-5d1ee8c1-
f826-9ab3-89bf-342c4f9e2...@net_0x2c0726512_uuid:0/0 lens 408/976 e 0 to 0 
dl 1262579942 ref 1 fl Interpret:/0/0 rc 0/0
Jan  4 06:38:56 tech-mds kernel: LustreError: 6420:0:
(mds_open.c:1665:mds_close()) Skipped 1923 previous similar messages




-- 
David Cohen