Re: [Lustre-discuss] [wc-discuss] The ost_connect operation failed with -16
Hi,

I think you might hit this: http://jira.whamcloud.com/browse/LU-952 . The patch is available from that ticket.

Regards
Liang

On May 30, 2012, at 11:21 AM, huangql wrote:

Dear all,

Recently we found a problem on an OSS: some threads may hang when the server is under heavy I/O load. When this happens, some clients are evicted or refused by some OSTs, with error messages like the following:

Server side:

May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message
May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: Call Trace:
May 30 11:06:31 boss07 kernel: [886f5d0e] start_this_handle+0x301/0x3cb [jbd2]
May 30 11:06:31 boss07 kernel: [800a09ca] autoremove_wake_function+0x0/0x2e
May 30 11:06:31 boss07 kernel: [886f5e83] jbd2_journal_start+0xab/0xdf [jbd2]
May 30 11:06:31 boss07 kernel: [888ce9b2] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
May 30 11:06:31 boss07 kernel: [88920551] filter_version_get_check+0x91/0x2a0 [obdfilter]
May 30 11:06:31 boss07 kernel: [80036cf4] __lookup_hash+0x61/0x12f
May 30 11:06:31 boss07 kernel: [8893108d] filter_setattr_internal+0x90d/0x1de0 [obdfilter]
May 30 11:06:31 boss07 kernel: [800e859b] lookup_one_len+0x53/0x61
May 30 11:06:31 boss07 kernel: [88925452] filter_fid2dentry+0x512/0x740 [obdfilter]
May 30 11:06:31 boss07 kernel: [88924e27] filter_fmd_get+0x2b7/0x320 [obdfilter]
May 30 11:06:31 boss07 kernel: [8003027b] __up_write+0x27/0xf2
May 30 11:06:31 boss07 kernel: [88932721] filter_setattr+0x1c1/0x3b0 [obdfilter]
May 30 11:06:31 boss07 kernel: [8882677a] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8881e658] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [88822b05] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [888b0abb] ost_handle+0x25db/0x55b0 [ost]
May 30 11:06:31 boss07 kernel: [80150d56] __next_cpu+0x19/0x28
May 30 11:06:31 boss07 kernel: [800767ae] smp_send_reschedule+0x4e/0x53
May 30 11:06:31 boss07 kernel: [8883215a] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [888328a8] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8008b3bd] __wake_up_common+0x3e/0x68
May 30 11:06:31 boss07 kernel: [88833817] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8005dfb1] child_rip+0xa/0x11
May 30 11:06:31 boss07 kernel: [888328e0] ptlrpc_main+0x0/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8005dfa7] child_rip+0x0/0x11
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: LustreError: dumping log to /tmp/lustre-log.1338347191.8011

Client side:

May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16

When we get this error message, commands such as ls, df, vi and touch fail, which prevents us from doing anything in the file system. I think an ost_connect failure should report an error message to users rather than leaving interactive commands stuck. Could someone give us some advice or suggestions for solving this problem? Thank you very much in advance.

Best Regards
Qiulan Huang
2012-05-30
Computing Center, the Institute of High Energy Physics, China

Huang, Qiulan    Tel: (+86) 10 8823 6010-105
P.O. Box 918-7   Fax: (+86) 10 8823 6839
Beijing 100049   P.R. China
Email: huan...@ihep.ac.cn

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
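For reference, the "-16" in "The ost_connect operation failed with -16" is a negated Linux errno value: -EBUSY, i.e. the OST refused the connection because it was busy (here, because its service threads were stuck in the journal). A quick way to decode such return codes (generic errno lookup, not Lustre-specific):

```python
import errno
import os

# Lustre RPC handlers return negated errno values, so a failure code
# of -16 corresponds to errno 16.
rc = -16
print(errno.errorcode[-rc])   # symbolic name: EBUSY
print(os.strerror(-rc))       # human-readable message for errno 16
```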
Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
[ ... ]
> The tar backup of the MDT is taking a very long time. So far it has backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar process, pointers to small or average-size files are backed up quickly and at a consistent pace. When tar encounters a pointer/inode belonging to a very large file (100GB+), the tar process stalls on that file for a very long time, as if it were trying to archive the real file-size amount of data rather than the pointer/inode.

If you have stripes on, a 100GiB file will have 100,000 1MiB stripes, and each requires a chunk of metadata. The descriptor for that file can therefore have a very large number of extents, scattered around the MDT block device, depending on how slowly the file grew, etc.
Re: [Lustre-discuss] ofed with FDR14 support Lustre
We're using 1.8.7.80-wc1 here. It's basically 1.8.7-wc1, but with a few fixes pulled in from git a few months back to build on RHEL 6.2. It's built on top of Mellanox's OFED 1.5.3-3.0.0, and is working just fine on our FDR14 cluster.

--
Mike Shuey

On Wed, May 30, 2012 at 3:41 PM, John White jwh...@lbl.gov wrote:
> Does anyone know of a Lustre version that can build against an OFED that supports FDR14 (1.5.4+, by my understanding)? Or is this still in the pipes? The compat matrix on the Whamcloud site only talks of support up to 1.5.3.1 (confirmed to build, but doesn't support FDR14).
>
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
Re: [Lustre-discuss] ofed with FDR14 support Lustre
We have a small patch to lbuild to build against Mellanox 1.5.3-3. One patch allows specifying OFED build trees by name rather than by version; the second builds against the Mellanox OFED. These are from 2.1.2 on RHEL6, but they apply elsewhere as well.

Ashley.

On Wed, 2012-05-30 at 12:54 -0700, Michael Shuey wrote:
> We're using 1.8.7.80-wc1 here. It's basically 1.8.7-wc1, but with a few fixes pulled in from git a few months back to build on RHEL 6.2. It's built on top of Mellanox's OFED 1.5.3-3.0.0, and is working just fine on our FDR14 cluster.
>
> --
> Mike Shuey
>
> On Wed, May 30, 2012 at 3:41 PM, John White jwh...@lbl.gov wrote:
>> Does anyone know of a Lustre version that can build against an OFED that supports FDR14 (1.5.4+, by my understanding)? Or is this still in the pipes? The compat matrix on the Whamcloud site only talks of support up to 1.5.3.1 (confirmed to build, but doesn't support FDR14).
>>
>> John White
>> HPC Systems Engineer
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50C-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720

diff -r 7d5dec13571e lustre/kernel_patches/targets/2.6-rhel6.target.in
--- a/lustre/kernel_patches/targets/2.6-rhel6.target.in  Tue May 22 12:32:58 2012 +0100
+++ b/lustre/kernel_patches/targets/2.6-rhel6.target.in  Tue May 22 12:37:20 2012 +0100
@@ -7,7 +7,8 @@
 LUSTRE_VERSION=@VERSION@
 DEVEL_PATH_ARCH_DELIMETER=.
-OFED_VERSION=inkernel
+OFED_VERSION=1.5.3
+OFED_TARBALL=MLNX_OFED_SRC-1.5.3-3.0.0.tgz
 BASE_ARCHS=i686 x86_64 ia64 ppc64
 BIGMEM_ARCHS=

diff -r 7121b6da363f build/lbuild
--- a/build/lbuild  Tue May 22 12:46:07 2012 +0100
+++ b/build/lbuild  Tue May 22 12:47:05 2012 +0100
@@ -531,6 +531,15 @@
         return 0
     fi
 
+    # If a full filename has been provided instead of just a version
+    # then use that.
+    if [ -n "${OFED_TARBALL}" ]; then
+        if [ -f "${KERNELDIR}/${OFED_TARBALL}" ]; then
+            return 0
+        fi
+        fatal 1 "${OFED_TARBALL} not found in ${KERNELDIR}"
+    fi
+
     local OFED_BASE_VERSION=$OFED_VERSION
     if [[ $OFED_VERSION = *.*.*.* ]]; then
         OFED_BASE_VERSION=${OFED_VERSION%.*}
@@ -692,10 +701,17 @@
 unpack_ofed() {
-    if ! untar $KERNELDIR/OFED-${OFED_VERSION}.tgz; then
-        return 1
+    if [ -n "${OFED_TARBALL}" ]; then
+        if ! untar "$KERNELDIR/${OFED_TARBALL}"; then
+            return 1
+        fi
+    else
+        if ! untar $KERNELDIR/OFED-${OFED_VERSION}.tgz; then
+            return 1
+        fi
     fi
     [ -d OFED ] || ln -sf OFED-[0-9].[0-9]* OFED
+    [ -d OFED ] || ln -sf *OFED_SRC-[0-9].[0-9]* OFED
 }
Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
On 2012-05-29, at 1:28 PM, Peter Grandi wrote:
>> The tar backup of the MDT is taking a very long time. So far it has backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar process, pointers to small or average-size files are backed up quickly and at a consistent pace. When tar encounters a pointer/inode belonging to a very large file (100GB+), the tar process stalls on that file for a very long time, as if it were trying to archive the real file-size amount of data rather than the pointer/inode.
>
> If you have stripes on, a 100GiB file will have 100,000 1MiB stripes, and each requires a chunk of metadata. The descriptor for that file can therefore have a very large number of extents, scattered around the MDT block device, depending on how slowly the file grew, etc.

While that may be true for other distributed filesystems, it is not true for Lustre at all. The size of a Lustre object is not fixed to a chunk size like 32MB or similar; rather, it is variable, depending on the size of the file itself. The number of stripes (== objects) on a file is currently fixed at file creation time, and the MDS only needs to store the location of each stripe (at most one per OST). The actual blocks/extents of the objects are managed inside the OST itself and are never seen by the client or the MDS.

Cheers, Andreas
--
Andreas Dilger
Whamcloud, Inc.
Principal Lustre Engineer    http://www.whamcloud.com/
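As a rough illustration of the layout Andreas describes (a sketch of RAID-0 striping arithmetic, not Lustre source code), the point is that the MDS records one object per stripe, not one record per stripe-sized chunk:

```python
# Sketch: map a file byte offset to (object index, offset within that
# object) under RAID-0 striping. A 100GiB file with 8 stripes needs
# only 8 object references on the MDS, even though it contains 102,400
# one-MiB chunks; each object simply grows with the data written to it.

def chunk_of(offset, stripe_size, stripe_count):
    """Return (object index, byte offset within that object)."""
    chunk = offset // stripe_size              # which stripe-sized chunk
    obj = chunk % stripe_count                 # chunks go round-robin over objects
    # each object holds every stripe_count'th chunk of the file
    obj_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return obj, obj_offset

MiB = 1 << 20
print(chunk_of(0, MiB, 8))        # -> (0, 0): first chunk, start of object 0
print(chunk_of(9 * MiB, MiB, 8))  # -> (1, 1048576): chunk 9, second chunk of object 1
```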
Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Is this the same issue as the earlier "backup MDT" question (and follow-up), http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010151.html , due to sparse files on the MDT? Does tar take a lot of CPU?

Alex.

On May 30, 2012, at 5:02 PM, Andreas Dilger wrote:
>> The tar backup of the MDT is taking a very long time. So far it has backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar process, pointers to small or average-size files are backed up quickly and at a consistent pace. When tar encounters a pointer/inode belonging to a very large file (100GB+), the tar process stalls on that file for a very long time, as if it were trying to archive the real file-size amount of data rather than the pointer/inode.
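For context on the sparse-file angle: GNU tar's --sparse (-S) option makes tar record the holes in a sparse file instead of reading them back as real data. A quick way to see the effect with stock GNU tar on Linux (generic demonstration, not the Whamcloud-patched tar):

```shell
# Create a 100 MiB sparse file; it occupies almost no disk blocks.
truncate -s 100M sparse.dat

# Without --sparse, tar reads and archives all 100 MiB of zeros;
# with -S (--sparse) it detects the holes and stores only the map.
tar -cf plain.tar sparse.dat
tar -cSf sparse.tar sparse.dat

# plain.tar is ~100 MiB; sparse.tar is a few KiB.
stat -c '%s' plain.tar sparse.tar
```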
Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files
Following up on my original post: I switched from the /bin/tar that ships with RHEL/CentOS 5.x to the Whamcloud-patched tar utility. The entire backup was successful and took only 12 hours to complete. CPU utilization was high (90%), but only on one core. The process was much faster than the standard tar shipped in RHEL/CentOS, and the only slowdowns were on file pointers to very large files (100TB+) with large stripe counts. The files that were going very slowly when I reported the initial problem were backed up instantly with the Whamcloud version of tar. Best part: the MDT was saved and the 4PB filesystem is in production again.

--Jeff

On 5/30/12 3:02 PM, Andreas Dilger wrote:
> On 2012-05-29, at 1:28 PM, Peter Grandi wrote:
>>> The tar backup of the MDT is taking a very long time. So far it has backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar process, pointers to small or average-size files are backed up quickly and at a consistent pace. When tar encounters a pointer/inode belonging to a very large file (100GB+), the tar process stalls on that file for a very long time, as if it were trying to archive the real file-size amount of data rather than the pointer/inode.
>>
>> If you have stripes on, a 100GiB file will have 100,000 1MiB stripes, and each requires a chunk of metadata. The descriptor for that file can therefore have a very large number of extents, scattered around the MDT block device, depending on how slowly the file grew, etc.
>
> While that may be true for other distributed filesystems, it is not true for Lustre at all. The size of a Lustre object is not fixed to a chunk size like 32MB or similar; rather, it is variable, depending on the size of the file itself. The number of stripes (== objects) on a file is currently fixed at file creation time, and the MDS only needs to store the location of each stripe (at most one per OST). The actual blocks/extents of the objects are managed inside the OST itself and are never seen by the client or the MDS.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Whamcloud, Inc.
> Principal Lustre Engineer    http://www.whamcloud.com/

--
Jeff Johnson
Manager
Aeon Computing
jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101  f: 858-412-3845  m: 619-204-9061
4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117