Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files

2012-05-30 Thread Peter Grandi
[ ... ]

 The tar backup of the MDT is taking a very long time. So far it has
 backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar
 process, pointers to small or average-size files are backed up quickly
 and at a consistent pace. When tar encounters a pointer/inode
 belonging to a very large file (100GB+), the tar process stalls on that
 file for a very long time, as if it were trying to archive the full
 amount of data the file represents rather than just the pointer/inode.

If you have striping on, a 100GiB file will have roughly 100,000 1MiB
stripes, and each one requires a chunk of metadata. The descriptor for
that file can therefore reference a very large number of extents,
scattered around the MDT block device, depending on how slowly the
file grew, etc.


Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files

2012-05-30 Thread Andreas Dilger
On 2012-05-29, at 1:28 PM, Peter Grandi wrote:
 The tar backup of the MDT is taking a very long time. So far it has
 backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar
 process, pointers to small or average-size files are backed up quickly
 and at a consistent pace. When tar encounters a pointer/inode
 belonging to a very large file (100GB+), the tar process stalls on that
 file for a very long time, as if it were trying to archive the full
 amount of data the file represents rather than just the pointer/inode.
 
 If you have striping on, a 100GiB file will have roughly 100,000 1MiB
 stripes, and each one requires a chunk of metadata. The descriptor for
 that file can therefore reference a very large number of extents,
 scattered around the MDT block device, depending on how slowly the
 file grew, etc.

While that may be true for other distributed filesystems, that is
not true for Lustre at all.  The size of a Lustre object is not
fixed to a chunk size like 32MB or similar, but rather is
variable depending on the size of the file itself.  The number of
stripes (== objects) on a file is currently fixed at file
creation time, and the MDS only needs to store the location of
each stripe (at most one per OST).  The actual blocks/extents of
the objects are managed inside the OST itself and are never seen
by the client or the MDS.
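
As a quick illustration, lfs getstripe on a client shows everything the
MDS keeps for a file's layout; the path below is only a placeholder:

  # Layout metadata held by the MDS for one file: stripe count, stripe
  # size, and the list of (OST index, object ID) pairs. No block or
  # extent maps appear here; those live entirely on the OSTs.
  lfs getstripe /mnt/lustre/path/to/large/file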

Cheers, Andreas
--
Andreas Dilger   Whamcloud, Inc.
Principal Lustre Engineer   http://www.whamcloud.com/






Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files

2012-05-30 Thread Alex Kulyavtsev

Is this the same issue as the "backup MDT" question (and follow-up) at
http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010151.html
due to sparse files on the MDT?  Does tar take a lot of CPU?
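
One rough way to check for that on the ldiskfs-mounted MDT (the paths
below are placeholders) is to compare apparent size against allocated
blocks:

  # A huge apparent size with almost no allocated blocks means the inode
  # is sparse; an old GNU tar --sparse may then scan the whole apparent size.
  stat -c '%n: size=%s bytes, blocks=%b x %B bytes' /mnt/mdt/ROOT/path/to/file

  # Same comparison for a whole subtree: apparent vs. allocated usage
  du -sh --apparent-size /mnt/mdt/ROOT
  du -sh /mnt/mdt/ROOT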
Alex.

On May 30, 2012, at 5:02 PM, Andreas Dilger wrote:


The tar backup of the MDT is taking a very long time. So far it has
backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar
process, pointers to small or average-size files are backed up quickly
and at a consistent pace. When tar encounters a pointer/inode
belonging to a very large file (100GB+), the tar process stalls on that
file for a very long time, as if it were trying to archive the full
amount of data the file represents rather than just the pointer/inode.




Re: [Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files

2012-05-30 Thread Jeff Johnson
Following up on my original post. I switched from the /bin/tar that ships
with RHEL/CentOS 5.x to the Whamcloud-patched tar utility. The entire
backup was successful and took only 12 hours to complete. CPU utilization
was in the high 90s, but only on one core. The process was much faster
than the standard tar shipped in RHEL/CentOS, and the only slowdowns were
on file pointers to very large files (100TB+) with large stripe counts.
The files that were going very slowly when I reported the initial problem
were backed up instantly with the Whamcloud version of tar.
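
For anyone hitting the same thing, timing a single problematic entry with
each binary makes the difference easy to see; the file path and the
install location of the patched tar below are placeholders:

  cd /mnt/mdt
  # Stock RHEL/CentOS 5 tar (the slow case reported above)
  time /bin/tar cf - --sparse ./ROOT/path/to/huge/file | wc -c
  # Whamcloud-patched tar (assumed installed under /usr/local here)
  time /usr/local/bin/tar cf - --sparse ./ROOT/path/to/huge/file | wc -c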

Best part, the MDT was saved and the 4PB filesystem is in production again.

--Jeff



On 5/30/12 3:02 PM, Andreas Dilger wrote:
 On 2012-05-29, at 1:28 PM, Peter Grandi wrote:
 The tar backup of the MDT is taking a very long time. So far it has
 backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar
 process, pointers to small or average-size files are backed up quickly
 and at a consistent pace. When tar encounters a pointer/inode
 belonging to a very large file (100GB+), the tar process stalls on that
 file for a very long time, as if it were trying to archive the full
 amount of data the file represents rather than just the pointer/inode.
 If you have striping on, a 100GiB file will have roughly 100,000 1MiB
 stripes, and each one requires a chunk of metadata. The descriptor for
 that file can therefore reference a very large number of extents,
 scattered around the MDT block device, depending on how slowly the
 file grew, etc.
 While that may be true for other distributed filesystems, that is
 not true for Lustre at all.  The size of a Lustre object is not
 fixed to a chunk size like 32MB or similar, but rather is
 variable depending on the size of the file itself.  The number of
 stripes (== objects) on a file is currently fixed at file
 creation time, and the MDS only needs to store the location of
 each stripe (at most one per OST).  The actual blocks/extents of
 the objects are managed inside the OST itself and are never seen
 by the client or the MDS.

 Cheers, Andreas
 --
 Andreas Dilger   Whamcloud, Inc.
 Principal Lustre Engineer   http://www.whamcloud.com/






-- 
--
Jeff Johnson
Manager
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845
m: 619-204-9061

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117



[Lustre-discuss] Tar backup of MDT runs extremely slow, tar pauses on pointers to very large files

2012-05-28 Thread Jeff Johnson
Greetings,

I am aiding in the recovery of a multi-petabyte Lustre filesystem
(1.8.7) that went down hard due to a site-wide power loss. The power
loss put the MDT RAID volume into a critical state; I was able to get
the md-raid-based MDT device mounted read-only and the MDT mounted
read-only as type ldiskfs.
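
Concretely, that was along these lines (the device and mount point are
the ones in the df output further down):

  # md array already assembled read-only; mount the MDT as ldiskfs, read-only
  mount -t ldiskfs -o ro /dev/md0 /mnt/mdt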

I was able to successfully back up the extended attributes of the MDT.
This process took about 10 minutes.
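
That step is essentially the getfattr pass from the manual, along these
lines (the output file path is a placeholder):

  cd /mnt/mdt
  # Dump all extended attributes, hex-encoded, without following symlinks
  getfattr -R -d -m '.*' -e hex -P . > /backup/mdt-ea.bak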

The tar backup of the MDT is taking a very long time. So far it has
backed up 1.6GB of the 5.0GB used in nine hours. Watching the tar
process, pointers to small or average-size files are backed up quickly
and at a consistent pace. When tar encounters a pointer/inode
belonging to a very large file (100GB+), the tar process stalls on that
file for a very long time, as if it were trying to archive the full
amount of data the file represents rather than just the pointer/inode.

During this process there are no errors reported by the kernel, ldiskfs,
md, or tar, and nothing that would indicate why things are so slow on
pointers to large files. Watching the tar process, CPU utilization is at
or near 100%, so it is doing something. Running iostat at the same time
shows that while tar is at or near 100% CPU, there are no reads taking
place on the MDT device and no writes to the device where the tarball is
being written.

It appears that the tar process goes to outer space when it encounters
pointers to very large files. Is this expected behavior?
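
If it helps, running strace against the tar process while it is stuck on
one of these entries should show where the time goes (the PID below is a
placeholder):

  # Summarize which syscalls tar is spending its time in
  strace -c -p <tar-pid>
  # Or watch the calls live; a tight read()/lseek() loop over a sparse
  # inode would explain high CPU with no disk reads
  strace -tt -e trace=read,lseek -p <tar-pid>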

The backup command used is the one from the MDT backup procedure in the
1.8 manual: 'tar zcvf tarfile --sparse .'
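
Spelled out, with the tarball path as a placeholder, that is:

  cd /mnt/mdt
  tar zcvf /backup/mdt-backup.tgz --sparse .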

df reports the ldiskfs MDT as 5GB used:
/dev/md0   2636788616   5192372 2455778504   1% /mnt/mdt

df -i reports the ldiskfs MDT as having 10,300,000 inodes used:
/dev/md0   1758199808 10353389 1747846419    1% /mnt/mdt

Any feedback is appreciated!

--Jeff


--
--
Jeff Johnson
Partner
Aeon Computing

jeff dot johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss