[lustre-discuss] Lustre 2.5.3 - OST unable to connect to MGS

2015-10-08 Thread Murshid Azman
Hello Lustre gurus,

Recently, one of our OSSes had a faulty RAID card (3ware), which corrupted
the root filesystem and the Lustre OST.

We then reinstalled the OS, ran fsck on the Lustre OST using a backup
superblock (the primary one was corrupted), and recreated the journal (it
was also corrupted). We now have a bunch of files in lost+found, which we
can see by mounting the device as ldiskfs.
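
For reference, the recovery was roughly along these lines (the superblock
offset assumes a 4 KiB block size and the mount point is just an example):

e2fsck -fy -b 32768 /dev/sdb        # fsck against the first backup superblock
tune2fs -j /dev/sdb                 # recreate the lost journal
mount -t ldiskfs /dev/sdb /mnt/ost  # inspect lost+found via an ldiskfs mount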

However, we are having problems mounting the Lustre OST with errors as
follows:

Oct  7 13:01:45 OSS50 kernel: LDISKFS-fs (sdb): mounted filesystem with
ordered data mode. quota=off. Opts:
Oct  7 13:01:48 OSS50 kernel: LustreError: 137-5: Lustre-OST003b_UUID: not
available for connect from 172.16.4.66@tcp (no target). If you are running
an HA pair check that the target is mounted on the other server.
Oct  7 13:01:48 OSS50 kernel: LustreError: Skipped 5 previous similar
messages
Oct  7 13:01:48 OSS50 kernel: LustreError: 137-5: Lustre-OST003b_UUID: not
available for connect from 172.16.250.59@tcp (no target). If you are
running an HA pair check that the target is mounted on the other server.
Oct  7 13:01:48 OSS50 kernel: LustreError: Skipped 3 previous similar
messages
Oct  7 13:01:51 OSS50 kernel: LustreError: 137-5: Lustre-OST003b_UUID: not
available for connect from 172.16.7.199@tcp (no target). If you are running
an HA pair check that the target is mounted on the other server.
Oct  7 13:01:51 OSS50 kernel: LustreError: Skipped 15 previous similar
messages
Oct  7 13:01:55 OSS50 kernel: LustreError: 137-5: Lustre-OST003b_UUID: not
available for connect from 172.16.250.173@tcp (no target). If you are
running an HA pair check that the target is mounted on the other server.
Oct  7 13:01:55 OSS50 kernel: LustreError: Skipped 19 previous similar
messages
Oct  7 13:02:04 OSS50 kernel: LustreError: 137-5: Lustre-OST003b_UUID: not
available for connect from 172.16.5.114@tcp (no target). If you are running
an HA pair check that the target is mounted on the other server.
Oct  7 13:02:04 OSS50 kernel: LustreError: Skipped 49 previous similar
messages
Oct  7 13:02:04 OSS50 kernel: LustreError: 0-0: Trying to start OBD
Lustre-OST003b_UUID using the wrong disk <85>. Were the /dev/ assignments
rearranged?
Oct  7 13:02:04 OSS50 kernel: LustreError:
16002:0:(obd_config.c:572:class_setup()) setup Lustre-OST003b failed (-22)
Oct  7 13:02:04 OSS50 kernel: LustreError:
16002:0:(obd_config.c:1591:class_config_llog_handler()) MGC172.16.0.251@tcp:
cfg command failed: rc = -22
Oct  7 13:02:04 OSS50 kernel: Lustre:cmd=cf003 0:Lustre-OST003b  1:dev
2:0  3:f
Oct  7 13:02:04 OSS50 kernel: LustreError: 15b-f: MGC172.16.0.251@tcp: The
configuration from log 'Lustre-OST003b' failed from the MGS (-22).  Make
sure this client and the MGS are running compatible versions of Lustre.
Oct  7 13:02:05 OSS50 kernel: LustreError: 15c-8: MGC172.16.0.251@tcp: The
configuration from log 'Lustre-OST003b' failed (-22). This may be the
result of communication errors between this node and the MGS, a bad
configuration, or other errors. See the syslog for more information.
Oct  7 13:02:05 OSS50 kernel: LustreError:
15976:0:(obd_mount_server.c:1252:server_start_targets()) failed to start
server Lustre-OST003b: -22
Oct  7 13:02:05 OSS50 kernel: LustreError:
15976:0:(obd_mount_server.c:1735:server_fill_super()) Unable to start
targets: -22
Oct  7 13:02:05 OSS50 kernel: Lustre: Lustre-OST003b: Not available for
connect from 172.16.5.116@tcp (not set up)
Oct  7 13:02:05 OSS50 kernel: LustreError:
15976:0:(obd_mount_server.c:845:lustre_disconnect_lwp())
Lustre-MDT-lwp-OST003b: Can't end config log Lustre-client.
Oct  7 13:02:05 OSS50 kernel: LustreError:
15976:0:(obd_mount_server.c:1420:server_put_super()) Lustre-OST003b: failed
to disconnect lwp. (rc=-2)
Oct  7 13:02:05 OSS50 kernel: LustreError:
15976:0:(obd_config.c:619:class_cleanup()) Device 135 not setup
Oct  7 13:02:05 OSS50 kernel: Lustre: server umount Lustre-OST003b complete
Oct  7 13:02:05 OSS50 kernel: LustreError:
15976:0:(obd_mount.c:1324:lustre_fill_super()) Unable to mount /dev/sdb
(-22)
Oct  7 13:02:05 OSS50 kernel: Lustre: Skipped 1 previous similar message

Any ideas?

I would think we could eliminate any configuration errors by doing a
writeconf, but since that is a potentially destructive operation, I'd like
to check with you experts first: has anyone experienced something like
this?
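
For what it's worth, my understanding of the writeconf procedure (device
names below are placeholders) is: unmount every client and every target of
the filesystem, then run

tunefs.lustre --writeconf /dev/<mdt-device>   # on the MDS
tunefs.lustre --writeconf /dev/<ost-device>   # on each OSS, for every OST

and remount in order (MGS/MDT first, then the OSTs, then the clients) so
that the configuration logs are regenerated.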

Thank you,
Murshid.


[lustre-discuss] Inodes Quota problem

2015-10-08 Thread Götz Waschk
Hi everyone,

I have a quota problem on a file system upgraded from Lustre 1.8 to
2.5.3. The number of inodes reported by lfs getquota is obviously
wrong:

# ls -l
insgesamt 3064024
drwxr-sr-x 3 u248 icecube    1830912 28. Aug 2014  Callisto
drwxr-sr-x 2 u248 icecube    2301952 26. Aug 2014  CameraOutput
-rw-r--r-- 1 u248 icecube      21542  1. Aug 2014  LogCallistoM1_026202_.log
-rw-r--r-- 1 u248 icecube  439316480 15. Aug 2014  logs25000_35000.tar.gz
-rw-r--r-- 1 u248 icecube  224972800 18. Aug 2014  logs35000_4.tar.gz
-rw-r--r-- 1 u248 icecube  225146880 28. Aug 2014  logscal40k-45k.tar.gz
-rw-r--r-- 1 u248 icecube 1122981634 28. Jul 2014  Pr_za05to36_4_027617_ct1_w0.rfl
-rw-r--r-- 1 u248 icecube 1120948234 28. Jul 2014  Pr_za05to36_4_027617_ct2_w0.rfl
# lfs quota -u u248 .
Disk quotas for user u248 (uid 20056):
 Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
              . 4709669648       0       0       -       1       0       0       -


Is this a known bug?

Regards, Götz


Re: [lustre-discuss] Inodes Quota problem

2015-10-08 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Oct 8, 2015, at 7:33 AM, Götz Waschk  wrote:
> 
> I have a quota problem on a file system upgraded from Lustre 1.8 to
> 2.5.3. The number of inodes reported by lfs getquota is obviously
> wrong:
> 
> # ls -l
> insgesamt 3064024
> drwxr-sr-x 3 u248 icecube    1830912 28. Aug 2014  Callisto
> drwxr-sr-x 2 u248 icecube    2301952 26. Aug 2014  CameraOutput
> -rw-r--r-- 1 u248 icecube      21542  1. Aug 2014  LogCallistoM1_026202_.log
> -rw-r--r-- 1 u248 icecube  439316480 15. Aug 2014  logs25000_35000.tar.gz
> -rw-r--r-- 1 u248 icecube  224972800 18. Aug 2014  logs35000_4.tar.gz
> -rw-r--r-- 1 u248 icecube  225146880 28. Aug 2014  logscal40k-45k.tar.gz
> -rw-r--r-- 1 u248 icecube 1122981634 28. Jul 2014  Pr_za05to36_4_027617_ct1_w0.rfl
> -rw-r--r-- 1 u248 icecube 1120948234 28. Jul 2014  Pr_za05to36_4_027617_ct2_w0.rfl
> # lfs quota -u u248 .
> Disk quotas for user u248 (uid 20056):
> Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>               . 4709669648       0       0       -       1       0       0       -
> 
> 
> Is this a known bug?

Not sure if this is a known bug.  Have you tried regenerating the quota data?  
I have had to do this a few times on some OSTs.
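
Roughly, with the target unmounted on an ldiskfs OST (the device name and
mount point below are only examples):

umount /mnt/ost0
tune2fs -O ^quota /dev/sdX    # drop the stale quota accounting
tune2fs -O quota /dev/sdX     # re-enable the quota feature
e2fsck -fp /dev/sdX           # recheck and update the accounting files
mount -t lustre /dev/sdX /mnt/ost0

On targets originally formatted under 1.8, I believe the quota feature can
also be enabled with tunefs.lustre --quota.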

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
