Re: identifying sparse files and get ride of them trick available?

2007-11-14 Thread David Zeillinger

Hi Daniel,

Did you happen to investigate why rsync -S is taking so much time? If it 
doesn't deal with sparse files the way one expects, this option is probably 
broken. Also, have you already tried something like the advice in 
http://lists.samba.org/archive/rsync/2003-August/007000.html ?


Anyway, I think the way to go is using tar, since it can preserve the 
sparseness of the files, so something like this could work: if you tar the 
file without using compression, you get an archive the size of the sparse 
file, with a lot of zeroes in it. Then run a run-length encoding over it 
to collapse the zeroes.


Sync this file with rsync.

On the destination machine do it in reverse. Using pipes you don't even need 
the physical space of the whole sparse file, just the space required by the 
actual data in the sparse file; and if you transfer it immediately you don't 
need any extra space at all.


Example (with a pseudo rle program):

tar cf - sparsefile | rle -input - -output sparsefile.tar.rle

The code for run-length encoding is there in zlib, but unfortunately 
compress/gzip doesn't have an option to use only that. You'd either need to 
hack it in yourself, or use one of the many implementations found when 
searching for rle. That would also give you the option to modify it to 
rl-encode only the zeroes.
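A minimal sketch of that pipeline, with gzip standing in for the pseudo rle 
program (gzip is not a pure run-length encoder, but it collapses long runs of 
zeroes very effectively). The host, paths, and the use of GNU tar (gtar from 
ports) are assumptions; plain tar will not recreate the holes on extraction, 
while an archive created with gtar -S carries a data map so the holes come 
back on extract:

  # source side: archive with sparse-file support, collapse the zeroes
  gtar -S -cf - sparsefile | gzip -1 > sparsefile.tar.gz

  # ship only the small compressed archive
  rsync -av sparsefile.tar.gz user@dest:/tmp/

  # destination side: unpack; the sparse map in the archive recreates the holes
  ssh user@dest 'gunzip -c /tmp/sparsefile.tar.gz | (cd /restore/path && gtar -xf -)'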


David 



Re: identifying sparse files and get ride of them trick available?

2007-11-14 Thread Daniel Ouellet

David Zeillinger wrote:
Did you happen to investigate why rsync -S is taking so much time? If it 
doesn't deal with sparse file the way one expects, this option is 
probably broken. Also have you already tried something like the advice 
in http://lists.samba.org/archive/rsync/2003-August/007000.html ?


It takes a long time because it needs to process the full file anyway. 
You can test that yourself if you want to see it.
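If you want to reproduce the timing difference, a rough sketch (paths and 
host are placeholders, not the actual setup from these tests):

  # with -S rsync recreates the holes on the destination
  time rsync -avS /home/sparsefile user@dest:/var/www/sites/testing/

  # without -S the destination needs the full apparent size of the file
  time rsync -av /home/sparsefile user@dest:/var/www/sites/testing/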


Anyway, I think the way to go is using tar. It preserves the sparseness 
property of the files, so something like this could work: If you tar the 
file without using compression, you would get a file the size of the 
sparsefile, with a lot of zeroes in it. Then use a run-length encoding 
on it to collapse the zeroes.


That goes against the first goal of the question, which is to find the sparse 
files in the first place. Tar doesn't remove the resource cost of processing 
the file on either side, and rsync can already compress on the fly, which I 
am already using. I am also not sure it would work anyway, if you think about 
it for a minute: what's the difference between copying a sparse file via scp, 
rsync, or untarring it? Why would scp and rsync run out of space in the first 
place if they are only copying empty space, or pointers to empty blocks, and 
jam in the process???


If you think about that, why would tar do a better job? No, I didn't try it, 
and maybe I will just to know, but not as a solution. The real solution is to 
not copy the sparse file in the first place when possible, and that's what I 
am working on.


The interesting question for me, still pending and still unanswered, is more 
this:


If a sparse file is only pointers to empty, unused drive space, then why does 
scp run out of space copying empty pointers in the first place?


That's really the interesting question this whole process raises for me.


There were more questions, but I have gotten answers to them so far.


Sync this file with rsync.


In my opinion, you only make the problem worse; however, I would need to 
test it to speak knowingly. rsync already compresses at the source, and I 
showed in the various tests I sent to the list that it doesn't send more 
data across the link when -S is used, only when it isn't. So tarring it 
first wouldn't change that; it would only add more steps to the 
process.


Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread knitti
On 11/11/07, Daniel Ouellet [EMAIL PROTECTED] wrote:
 2.3 ==
 Now using scp as many times it's can also be use for quick sync of
 changed files. Here however, we are up for a big surprise as well for
 sure. Here we can't even do it as the sparse file like in rsync example
 #1 will stop as it is to big in size, even if the data however is not.
 And we will also waist way more bandwidth trying to do it in the process
 as well. If the file was smaller in sparse size, then the copy process
 would work, however the waisted bandwidth would be present anyway making
 the point of trying to avoid the problem in the first place of
 transferring sparse files across file systems. Or at best trying to use
 something that would minimize the problem.


If I'm not completely wrong, you could always tar -czf the sparse file, scp the
archive and then tar -xzf the file in place on the other side. This should also
create a new sparse file. Of course, you lose the rsyncability and you have to
identify your sparse files in advance. But 16GB of nothing should compress
very well  ;)
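A concrete sketch of that sequence (host and paths are placeholders; whether 
the extracted copy ends up sparse again depends on the tar implementation, 
GNU tar needs -S/--sparse at archive creation to recreate the holes):

  # on the source, after identifying the sparse file
  tar -czf /tmp/sparse.tgz sparsefile

  # copy only the compressed archive
  scp /tmp/sparse.tgz user@dest:/tmp/

  # on the destination, extract in place
  ssh user@dest 'cd /destination/path && tar -xzf /tmp/sparse.tgz'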


--knitti



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread Daniel Ouellet

knitti wrote:

If I'm not completely wrong, you could always tar -czf the sparse file, scp the
archive and then tar -xzf the file in place on the other side. This should also
create a new sparse file. Of course, you lose the rsyncability and you have to
identify your sparse files in advance. But 16GB of nothing should compress
very well  ;)


Only two things here.

1. You have to identify your sparse files in advance.

That is the question; look at the title.

2. The point of being able to use rsync is to put it in a cronjob, have it 
transfer only what changed, and forget it.

Neither is accomplished with tar.

I appreciate the thought nevertheless.

Thanks

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread richardtoohey
Quoting Daniel Ouellet [EMAIL PROTECTED]:

 Only two things here.
 
 1. You have to identify your sparse files in advance.
 
 That is the question; look at the title.
 

Hi, Daniel.

Did you look at the Perl script I sent?

[code]
use strict;
use warnings;
use File::Find;

sub process_file {
    my $f = $File::Find::name;
    my ($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size,
        $atime, $mtime, $ctime, $blksize, $blocks) = stat($f);
    if ($blocks * 512 < $size) {
        print "\t$f => SZ: $size BLSZ: $blksize BLKS: $blocks\n";
        print "\t" . -s $f;
        print "\n";
    }
}

find(\&process_file, ("/home/sparse-files"));
[/code]

Change /home/sparse-files to a directory that has sparse files, and see if
it works - it should only list files where blocks * 512 is less than the
reported size (and according to Otto, those are sparse files.)

If this DOES work, it might be a building block or an approach that can be
extended.  Or it might be useless - only one way to find out.
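Roughly the same check can also be done from the shell with stat(1)'s format
option; a sketch (the %z/%b format letters are the BSD stat(1) ones for size
and allocated blocks, and the directory is a placeholder):

  #!/bin/sh
  # list files whose allocated blocks (512-byte units) cover less than their size
  find /home/sparse-files -type f | while read f; do
      set -- $(stat -f "%z %b" "$f")
      size=$1 blocks=$2
      [ $((blocks * 512)) -lt "$size" ] && echo "possibly sparse: $f ($size bytes, $blocks blocks)"
  done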

(P.S. - and I'm trying to be helpful here so PLEASE take it as that, not me
being mean  - witch is a person on a broomstick and waist is around your
hips - you usually intend which and waste in your postings.)



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread Daniel Ouellet

[EMAIL PROTECTED] wrote:

Quoting Daniel Ouellet [EMAIL PROTECTED]:


Only two things here.

1. you have to identify your sparse file in advance.

That is the question. Look at the title.



Hi, Daniel.

Did you look at the Perl script I sent?


I am playing with it, looking at whether it can help, and maybe at 
modifying it to recreate the sparse files when they are found.


I didn't reply yet because I had neither good nor bad things to say about the 
results yet.


But it looks like it may work well for finding them.


If this DOES work, it might be a building block or an approach that can be
extended.  Or it might be useless - only one way to find out.

(P.S. - and I'm trying to be helpful here so PLEASE take it as that, not me
being mean  - witch is a person on a broomstick and waist is around your
hips - you usually intend which and waste in your postings.)


And I do appreciate it too! (; Help is always good, and your efforts 
haven't passed unnoticed, trust me. (;


Best regards,

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread Daniel Ouellet

[EMAIL PROTECTED] wrote:

Did you look at the Perl script I sent?


I should also add, in regard to the good and bad parts mentioned in my 
previous emails, that it is actually a much better idea than what I was 
doing, by the way! I don't think my emails came out right in regard to the 
idea expressed, however. Sorry about that.


The idea is good and much better than what I use, but I am still testing, 
and that's why I hadn't replied to it. Sorry.


At first glance, however, I can say that it looks like a much better way to 
do it; it allows finding the files and addresses the first part of my 
question, and for that I thank you!


Best,

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread Douglas A. Tutty
On Sun, Nov 11, 2007 at 09:18:34PM +0100, knitti wrote:
 
 if I'm not completely wrong, you could always tar -czf the sparse file, scp 
 the
 archive and then tar -xzf the file in place in the other side. this should 
 also
 create a new sparse file. of course, you lose the rsyncabilty and you have to
 identify your sparse file in advance. But 16GB of nothing should compress
 very well  ;)
 

I tried making a very sparse file (100 MB data, 1000 GB sparseness) and
gave up trying to compress it.  gzip has to process the whole thing,
sparseness and all.  Sure it would probably end up with a very small
file, but the whole thing has to be processed.
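If anyone wants to reproduce that kind of test file, a quick sketch; the sizes
mirror the ones described above, the file name is a placeholder, and the
count=0/seek trick is the same one used elsewhere in this thread:

  # ~100 MB of real data
  dd if=/dev/zero of=bigsparse bs=1m count=100
  # extend the same file to ~1 TB of hole (count=0 writes nothing, seek sets the size)
  dd if=/dev/zero of=bigsparse bs=1m seek=1048576 count=0
  ls -lh bigsparse   # apparent size, about 1 TB
  du -sh bigsparse   # blocks actually allocated, about 100 MB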

I imagine it's no less time than what rsync takes to process it.
Rsync takes lots of time and computation but saves on bandwidth.

Doug.



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread Daniel Ouellet

Douglas A. Tutty wrote:

I tried making a very sparse file (100 MB data, 1000 GB sparseness) and
gave up trying to compress it.  gzip has to process the whole thing,
sparseness and all.  Sure it would probably end up with a very small
file, but the whole thing has to be processed.


Yes it does, and I am not sure anyone said it would be less work. I sure 
didn't. Yes, it needs to be processed, and I demonstrated that with the 
time it takes to rsync with a sparse file and without. In my test, 45+ 
minutes as opposed to 17 seconds.



I imagine that its no less time than that which rsync takes to process.
Rsync takes lots of time and computation but saves on bandwidth.


Yes, it is a lot of processing, a lot of time wasted, and a lot of CPU 
power wasted; and if you don't use -S in the case of rsync, you can't even 
sync it unless the destination has space for the full apparent size of the 
sparse file, not just the real data.


The short of it is that sparse files are a good thing as long as you don't 
have to copy them across file systems on different servers, in which case 
it's a very different ball game.


It's been interesting learning and testing anyway.

Hopefully it was useful to others; if not, it was to me anyway.

Best,

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread RW
On Sun, 11 Nov 2007 22:31:13 -0500, Daniel Ouellet wrote:

Douglas A. Tutty wrote:
 I tried making a very sparse file (100 MB data, 1000 GB sparseness) and
 gave up trying to compress it.  gzip has to process the whole thing,
 sparseness and all.  Sure it would probably end up with a very small
 file, but the whole thing has to be processed.

Yes it does, and I am not sure anyone said it would be less work. I sure 
didn't. Yes, it needs to be processed, and I demonstrated that with the 
time it takes to rsync with a sparse file and without. In my test, 45+ 
minutes as opposed to 17 seconds.

 I imagine that its no less time than that which rsync takes to process.
 Rsync takes lots of time and computation but saves on bandwidth.

Yes, it is a lot of processing, a lot of time wasted, and a lot of CPU 
power wasted; and if you don't use -S in the case of rsync, you can't even 
sync it unless the destination has space for the full apparent size of the 
sparse file, not just the real data.

The short of it is that sparse files are a good thing as long as you don't 
have to copy them across file systems on different servers, in which case 
it's a very different ball game.

It's been interesting learning and testing anyway.

Hopefully it was useful to others, if not, it was to me anyway.

Best,

Daniel


Daniel,
It is more years than I care to calculate since I last did anything
with sparse files. Certainly it was before any of today's *BSD tribe.

What has not been addressed here is the question of what created those
files. It isn't something you usually do with a shell script.

So if you have, just as an example, a database program that does make
such a file, it is often possible to dump the database in such a way as
to load it into another instance. Maybe a remote replication is
possible.

So, what evil little daemon do you have toiling away making TB files
that only use 2k (joke!) and, is it not possible to teach the little
bastard how to reconstruct its data on another drive?

Rod/
/earth: write failed, file system is full
cp: /earth/creatures: No space left on device



Re: identifying sparse files and get ride of them trick available?

2007-11-11 Thread Daniel Ouellet

RW wrote:

What has not been addressed here is the question of what created those
files. It isn't something you do with a shell script usually.


Many things can do this, or could use this.


So if you have, just as an example, a database program that does make
such a file it is often possible to dump the database in such a way as
to load it into another instance. Maybe a remote replication is
possible.


Here is an example I can certainly provide. I'm not sure how many people are 
still using it, but a quick example I came up with is the DNS cache file 
used by webalizer. Not a big deal if your servers are not busy, but on a 
very busy web server it can grow pretty quickly. In my case, to address this 
sparse file, I simply delete that file after a few days. Yes, in my case it 
grew to multiple GB in just a few weeks; and yes, reverse DNS lookups are 
also faster using it, and I also run a caching DNS, etc.



So, what evil little daemon do you have toiling away making TB files
that only use 2k (joke!) and, is it not possible to teach the little
bastard how to reconstruct its data on another drive?


So I am not sure if that was what you were asking, but I thought I would 
give you that simple example, as it is probably more common than whatever 
else I may be using, I think. (; In my case, give it about 3 weeks if you 
want and I will have that file grow to maybe 15GB with only 1GB or 2GB of 
valid data in it.


Then I delete it and restart it from scratch.
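A cron-style sketch of that housekeeping (the path is a placeholder and 
dns_cache.db is only webalizer's default DNSCache file name; adjust both to 
the actual configuration):

  #!/bin/sh
  # run from cron every few weeks: drop webalizer's DNS cache so it starts from scratch
  rm -f /var/www/htdocs/stats/dns_cache.db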

As for databases, there isn't any need for that, as MySQL for example 
handles replication pretty well, so there is no need to copy files across 
servers. Or if you do want to copy files, you just optimize the tables 
before doing so, and then the empty space is taken care of beforehand.
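If you do need to move a database between hosts without copying the raw 
files, a dump-and-load sketch along the lines RW suggested (host, credentials 
and scope are placeholders):

  # compact the tables first so free space is reclaimed
  mysqlcheck --optimize --all-databases

  # dump on the source and load into the remote instance in one pipe
  mysqldump --all-databases | ssh user@dest mysql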


Hope that answers your question and gives you an example.

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Otto Moerbeek
On Fri, Nov 09, 2007 at 08:40:15PM +0200, Enache Adrian wrote:

 On Fri, Nov 09, 2007 at 11:03:31AM +0100, Otto Moerbeek wrote:

  So your problem seems to be that rsync -S is inefficient to the point
  where it is not useable.  I do not use rsync a lot, so I do not know
  if there's a solution to that problem. It does seem strange that a
  feature to solve a problem actually make the problem worse. 
 
 Anything is inefficient in that case.
 
 Just create a huge dummy file:
 
 $ dd if=/dev/null seek=1m bs=1m of=file
 
 Then copy it (with cp, or any sparse-file aware program) to another
 filesystem. Watch how much time and power it takes to copy nothing
 from one place to another.
 
 Any way to obtain a 'map' of the file that tell you exactly where the
 written sectors are would make for a BIG improvement.
 
 You can't do that on OpenBSD without raw low-level fs hacks and
 reinventing half of dump(8) and fsck(8).
 
 Adi

Your example just shows copying big files takes long. The point being,
if the file was not sparse, it would take at least the same time.
Blaming sparseness for the long cp time is not fair. 

-Otto



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Richard Toohey

On 10/11/2007, at 10:05 AM, Daniel Ouellet wrote:


Otto Moerbeek wrote:
stat -s gives the raw info in one go. Some shell script hacking  
should

make it easy to detect sparse files.


Thanks Otto for the suggestion. That might help until it can be  
address for good. It would help speed up some of it. (;




This looked interesting (curiosity killed the cat?), so I started  
looking at sparse files (not heard of them before.)


Is this a sparse file?

# dd if=/dev/zero of=sparsefile bs=1024 seek=10240 count=0
0+0 records in
0+0 records out
0 bytes transferred in 0.000 secs (0 bytes/sec)
# ls -lh
[--cut--]
-rw-r--r--  1 root  wheel  10.0M Nov 11 08:43 sparsefile
# du -hsc sparsefile
32.0K   sparsefile
32.0K   total
# du sparsefile
64  sparsefile
# stat -s sparsefile
st_dev=7 st_ino=51969 st_mode=0100644 st_nlink=1 st_uid=0 st_gid=0  
st_rdev=0 st_size=10485760 st_atime=1194723829 st_mtime=1194723829  
st_ctime=1194723829 st_blksize=16384 st_blocks=64 st_flags=0


So because blocks allocated = 64, and a block is (usually) 512  
bytes, the file uses 32K on disk (but ls and others will report a 10MB size.)


So if you scanned whatever director(y|ies) you are interested in,

If st_size > (st_blocks * 512) Then
*** this may be a sparse file?

(BUT - blocksize of 16384 is reported so I must be missing something?)

A stab at it in Perl (lifted from Perl Cookbook):

use strict;
use warnings;
use File::Find;

sub process_file {
    my $f = $File::Find::name;
    my ($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size,
        $atime, $mtime, $ctime, $blksize, $blocks) = sat($f);
    if ($blocks * 512 < $size) {
        print "\t$f => SZ: $size BLSZ: $blksize BLKS: $blocks\n";
        print "\t" . -s $f;
        print "\n";
    }
}

find(\&process_file, ("/home/sparse-files"));

The output is:

# perl check.pl
	/home/sparse-files/sparsefile => SZ: 10485760 BLSZ: 16384 BLKS: 64
	10485760

Thanks.



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Otto Moerbeek
On Fri, Nov 09, 2007 at 03:47:10PM -0500, Daniel Ouellet wrote:

 Ted Unangst wrote:
 On 11/9/07, Daniel Ouellet [EMAIL PROTECTED] wrote:
 Just for example, a source file that is sparse badly, don't really have
 allocated disk block yet, but when copy over, via scp, or rsync will
 actually use that space on the destination servers. All the servers are
 identical (or suppose to be anyway) but what is happening is the copy of
 them are running out of space at time in the copy process. Like when it
 is copying them, it may easy use twice the amount of space in the
 process and sadly filling up the destinations then then the sync process
 stop making the distribution of the load unusable. I need to increase
 the capacity yes, except that it will take me times to do so.
 
 so what are you going to do when you find these sparse files?
 
 So far. When I find them. Not all of them, but huge waisting space one. 
 I delete them and replace them. with the original one, or even with the 

I am confused by what you say. A sparse file does NOT waste space, it
REDUCES disk usage, compared to a non-sparse (dense?) file with the
same contents. 

 one copy using rsync -S back to the original reduce it's size in 1/2 and 

If the size is reduced, it is not the same file. Please be more
accurate in your description. A file's size is not the same as its
disk usage. 

 more at times. So, yes, very inefficiently, but manageable anyway. It's 
 a plaster for now if you want. Don't get me wrong. Sparse files makes no 
 problem what so ever when they stay on the same systems. It's when you 
 need to move them around servers, and specially across Internet 
 connected locations and keep them in sync as much as possible in as 
 shorter time as possible that it becomes unmanageable. That's really the 
 issue at hands. Not that sparse files are bad in any ways. Keeping them 
 in sync across multiples system is however.

You cannot blame sparse files for that. If the same file were not
sparse, your problem would be at least as big.

-Otto


 
 I was looking if there was a more intelligent ways to do it. (; Like 
 finding them about some level of sparse, like let say 25% and then 
 compact them at the source to be none sparse again, or something 
 similar. Doesn't need to do every single one, even if that might be a 
 good thing in special cases, not all obviously.
 
 The problem is that some customers end up running out of space and I 
 really didn't know, plus the huge factor of waisted bandwidth and 
 filling up their connections transferring empty files if you like and 
 taking much longer in sync time that other wise it wouldn't if you sync 
 as is.
 
 Still is an interesting problem after I found out what it really was.
 
 I hope it explained the issue somewhat better.
 
 Thanks for the feedback never the less.
 
 Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Richard Toohey

On 10/11/2007, at 9:11 PM, Richard Toohey wrote:

(my $dev,my $ino,my $mode,my $nlink,my $uid,my $gid,my  
$rdev,my $size,my $atime,my $mtime,my $ctime,my $blksize,my $blocks) 
=sat($f);


Oops - should end with:

=stat($f);

not

=sat($f);



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Otto Moerbeek
On Sat, Nov 10, 2007 at 09:11:27PM +1300, Richard Toohey wrote:
 On 10/11/2007, at 10:05 AM, Daniel Ouellet wrote:
 
 Otto Moerbeek wrote:
 stat -s gives the raw info in one go. Some shell script hacking  
 should
 make it easy to detect sparse files.
 
 Thanks Otto for the suggestion. That might help until it can be  
 address for good. It would help speed up some of it. (;
 
 
 This looked interesting (curiosity killed the cat?), so I started  
 looking at sparse files (not heard of them before.)
 
 Is this a sparse file?

yes.

 
 # dd if=/dev/zero of=sparsefile bs=1024 seek=10240 count=0
 0+0 records in
 0+0 records out
 0 bytes transferred in 0.000 secs (0 bytes/sec)
 # ls -lh
 [--cut--]
 -rw-r--r--  1 root  wheel  10.0M Nov 11 08:43 sparsefile
 # du -hsc sparsefile
 32.0K   sparsefile
 32.0K   total
 # du sparsefile
 64  sparsefile
 # stat -s sparsefile
 st_dev=7 st_ino=51969 st_mode=0100644 st_nlink=1 st_uid=0 st_gid=0  
 st_rdev=0 st_size=10485760 st_atime=1194723829 st_mtime=1194723829  
 st_ctime=1194723829 st_blksize=16384 st_blocks=64 st_flags=0
 
 So because blocks allocated = 64, and block size is (usually) 512  
 bytes = file is 32K (but ls and others will report 10Mb size.)
 
 So if you scanned whatever director(y|ies) you are interested in,
 
   If st_size > (st_blocks * 512) Then
   *** this may be a sparse file?
 
 (BUT - blocksize of 16384 is reported so I must be missing something?)

yeah, look at stat(2):

 int64_tst_blocks;  /* blocks allocated for file */
 u_int32_t  st_blksize; /* optimal file sys I/O ops blocksize */

actually st_blocks's unit is disk sectors, to be precise.

I don't read perl, so I cannot comment on the script below.

-Otto
 
 A stab at it in Perl (lifted from Perl Cookbook):
 
 use strict;
 use warnings;
 use File::Find;

 sub process_file {
     my $f = $File::Find::name;
     my ($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size,
         $atime, $mtime, $ctime, $blksize, $blocks) = sat($f);
     if ($blocks * 512 < $size) {
         print "\t$f => SZ: $size BLSZ: $blksize BLKS: $blocks\n";
         print "\t" . -s $f;
         print "\n";
     }
 }

 find(\&process_file, ("/home/sparse-files"));
 
 The output is:
 
 # perl check.pl
 /home/sparse-files/sparsefile = SZ: 10485760 BLSZ: 16384  
 BLKS: 64
 10485760
 
 Thanks.



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Richard Toohey

On 10/11/2007, at 9:32 PM, Otto Moerbeek wrote:


yeah, look at stat(2):

 int64_tst_blocks;  /* blocks allocated for file */
 u_int32_t  st_blksize; /* optimal file sys I/O ops blocksize */

actually st_blocks's unit is disk sectors, to be precise.

I don't read perl, so I cannot comment on the script below.

-Otto


Thanks for the feedback.

I tried in C, but could not get past getting 0 for st_blocks every  
time (will be my C, but I can't see (C?) what it is yet ...)


# man -s 2 stat:
[cut]
 struct timespec st_ctimespec;  /* time of last file status  
change */

 off_t  st_size;   /* file size, in bytes */
 int64_tst_blocks; /* blocks allocated for file */
[cut]

check.c
---

#include <sys/stat.h>
#include <stdio.h>

int main(void) {
	struct stat stat_stuff;
	int result;
	result = stat("/home/sparse-files/sparsefile", &stat_stuff);
	printf("%d %d\n", stat_stuff.st_size, stat_stuff.st_blocks);
}

# cc check.c -o check
# ./check
10485760 0



[EMAIL PROTECTED]: Re: identifying sparse files and get ride of them trick available?]

2007-11-10 Thread Otto Moerbeek
Forgot to send to the list.

-Otto

- Forwarded message from Otto Moerbeek [EMAIL PROTECTED] -

Date: Sat, 10 Nov 2007 10:36:20 +0100
From: Otto Moerbeek [EMAIL PROTECTED]
To: Richard Toohey [EMAIL PROTECTED]
Subject: Re: identifying sparse files and get ride of them trick available?

On Sat, Nov 10, 2007 at 09:44:46PM +1300, Richard Toohey wrote:
 
 On 10/11/2007, at 9:32 PM, Otto Moerbeek wrote:
 
 yeah, look at stat(2):
 
  int64_tst_blocks;  /* blocks allocated for file */
  u_int32_t  st_blksize; /* optimal file sys I/O ops blocksize */
 
 actually st_blocks's unit is disk sectors, to be precise.
 
 I don't read perl, so I cannot comment on the script below.
 
  -Otto
 
 Thanks for the feedback.
 
 I tried in C, but could not get past getting 0 for st_blocks every  
 time (will be my C, but I can't see (C?) what it is yet ...)

Wrong format specifier. -Wall is your friend.

-Otto

 
 # man -s 2 stat:
 [cut]
  struct timespec st_ctimespec;  /* time of last file status  
 change */
  off_t  st_size;   /* file size, in bytes */
  int64_tst_blocks; /* blocks allocated for file */
 [cut]
 
 check.c
 ---
 
  #include <sys/stat.h>
  #include <stdio.h>

  int main(void) {
  	struct stat stat_stuff;
  	int result;
  	result = stat("/home/sparse-files/sparsefile", &stat_stuff);
  	printf("%d %d\n", stat_stuff.st_size, stat_stuff.st_blocks);
  }
 
 # cc check.c -o check
 # ./check
 10485760 0

- End forwarded message -



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread ropers
Would people say that this edit is a decent description of these issues?

http://en.wikipedia.org/w/index.php?title=Sparse_file&diff=170645177&oldid=168346326



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread ropers
On 10/11/2007, Otto Moerbeek [EMAIL PROTECTED] wrote:

 Your example just shows copying big files takes long. The point being,
 if the file was not sparse, it would take at least the same time.
 Blaming sparseness for the long cp time is not fair.

 -Otto

But of course it would be semi-nice if the copy/sync commands were not
only aware that they are copying a sparse file, but if they also only
copied the data/space that the sparse file actually occupies (as
opposed to copying the full apparent size).

I say semi-nice because the benefits in speed and decreased bandwidth
requirements would come at the expense of extra special-case code,
i.e. added complexity, which as we all know might not necessarily
always be worth it.



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Daniel Ouellet

Hi,

Before we go nuts on this issue, look for the wrong things, or create 
misunderstandings:


Just allow me a little more time to try to come up with a visible example 
showing the problem, or the issue, here.


Obviously, as Otto pointed out to me, it looks like I can't explain it too well.

I have spent many hours today trying to get a better example of this and to 
show it better, so as not to create any misunderstanding, and I think I may 
have found a better way to explain it.


I will send it a bit later with examples, as I have been able to isolate a 
good case, among many, that may be relevant to a lot of people because of 
its possible use, as opposed to my personal issue.


I will update misc@ with hopefully a better example.

But I have a way around the problem, so that works for me; however, I think 
it might be of interest to others, and as such I will send a better 
example.


Thanks

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Daniel Ouellet

Hi,

I will try to make this very simple and show the issue by example where 
possible. I used two old servers on the Internet for the tests. The source 
uses a real sparse file, but one that has only ~1GB of usable data in it; 
the size shown by 'ls -al' is ~17GB. That's what we will use for the 
demonstration and explanation of the issues presented at first: network 
bandwidth wasted, server resources wasted far beyond what they should be, 
and in some cases not being able to complete the sync at all. Also, a 
misjudgment at the start: looking at df output to quickly estimate the 
space I would need to transfer the content, I forgot the possible issue 
with sparse files. If I use 20% or 30% of a file system, it was fair for me 
to assume that I would definitely be able to copy these files onto other 
systems that provide at least the same size or more. That's where I went 
wrong and had to work around it.


A simple mistake, one would say, but it is interesting to find out why, as 
it's not obvious at first, especially if you only look at the final df 
output on the remote servers as well. If the remote system is much bigger 
than the source, there isn't any problem with the transfer. It still wastes 
bandwidth in some cases, but it will work, as the remote file system will 
grow but not fill up like I show in the example below. However, you would 
never see it at first glance: doing df at the source or at the destination, 
you see no difference in size, forgetting the space required to transfer 
the data. But again, who would think that when you use less than 50% of the 
available space, you could be in a situation where you couldn't transfer 
it? I could even have thought that, if I wanted to, I could have done a 
full backup, since I use only 30%, so there is plenty of space, right?


So, that's what created the look for what the hell could be wrong 
process, and prompted me to look for a possible way to eliminate sparse 
files in specific cases.


Again, just to be sure no one goes crazy: sparse files are not a bad 
thing. They are pretty useful at times. But you can get bitten by them 
too, in some cases. (;


1. ==
Some numbers:
# df /home
Filesystem  512-blocks  Used Avail Capacity  Mounted on
/dev/wd0h      2096316   1973256     18248    99%    /home

# ls -al /home/sparcefile
-rw-r--r--  1 douellet  douellet  17416290304 Nov 10 02:02 /home/sparcefile

You can see that the partition can only hold ~1GB of real data and the 
file is showing up at ~17GB. So, definitely a sparse file. Nothing wrong 
there.


2. ==
To answer the question about network data transfer usage: yes, a sparse 
file can waste that much bandwidth, depending on what you use to 
transfer the file. Three examples:


2.1 ==
rsync without -S: just can't do it. The remote disk fills itself up and then 
the transfer session crashes, plain and simple. Plus it takes an insane 
amount of time:


And just to see it live as well, I kept doing df /var/www/sites while it 
was running, to see it in action. It filled up, then crashed the transfer.


Destination is:
Filesystem  512-blocks  Used Avail Capacity  Mounted on
/dev/wd1a     41280348  12668632  26547700    32%    /var/www/sites
Filesystem  512-blocks  Used Avail Capacity  Mounted on
/dev/wd1a     41280348  39216344       -12   100%    /var/www/sites
Filesystem  512-blocks  Used Avail Capacity  Mounted on
/dev/wd1a     41280348  12668632  26547700    32%    /var/www/sites


Sat Nov 10 18:06:19 EST 2007
rsync: writefd_unbuffered failed to write 16385 bytes: phase unknown 
[sender]: Broken pipe (32)
rsync: write failed on /var/www/sites/testing/sparcefile: No space 
left on device (28)
rsync error: error in file IO (code 11) at 
/usr/obj/i386/rsync-2.6.6p0/rsync-2.6.6/receiver.c(291)
rsync: connection unexpectedly closed (1140235 bytes received so far) 
[generator]
rsync error: error in rsync protocol data stream (code 12) at 
/usr/obj/i386/rsync-2.6.6p0/rsync-2.6.6/io.c(434)
rsync: connection unexpectedly closed (1056064 bytes received so far) 
[sender]
rsync error: error in rsync protocol data stream (code 12) at 
/usr/obj/ports/rsync-2.6.6p0/rsync-2.6.6/io.c(434)

Sat Nov 10 18:39:22 EST 2007

2.2 ==
With -S you can transfer it, but it still takes a lot of time; the bandwidth 
usage represents the real data content of the file, not its size. So rsync 
is doing an OK job here and will send only the changes, which is what we 
want. PF also reflects that.


# ./fullsync-test
Sat Nov 10 20:09:52 EST 2007
Sat Nov 10 20:40:51 EST 2007

# pfctl -si | grep Bytes
  Bytes In   3697287510
  Bytes Out   103655890

Note, I did reset pf stats just before starting the test.
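The fullsync-test wrapper itself isn't shown here; a minimal sketch of what 
such a wrapper might look like (an assumption, not the actual script; host 
and paths are placeholders):

  #!/bin/sh
  # fullsync-test: reset pf counters, time one rsync -S pass, print pf byte counts
  pfctl -F info                    # clear the statistics shown by pfctl -si
  date
  rsync -aS /home/sparcefile user@dest:/var/www/sites/testing/
  date
  pfctl -si | grep Bytes           # bytes actually sent/received during the run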

2.3 ==
Now using scp, as it can also often be used for a quick sync of changed 
files. Here however, we are up for a big surprise as well. Here we can't 
even do it, as the sparse file, like in rsync example #1, will stop the 
copy because it is too big in size, even if the data in it is not. And we 
will also waste way more bandwidth trying to do it in the process. If the 
file were smaller in sparse size, the copy would work, but the wasted 
bandwidth would still be there, which makes the point of trying to avoid 
transferring sparse files across file systems in the first place, or at 
best of using something that minimizes the problem.

Re: identifying sparse files and get ride of them trick available?

2007-11-10 Thread Daniel Ouellet

ropers wrote:

Would people say that this edit is a decent description of these issues?

http://en.wikipedia.org/w/index.php?title=Sparse_file&diff=170645177&oldid=168346326


I can't really comment on the proper wording, for sure. (;

But one thing is not right, as Otto pointed out to me and as my tests showed 
only too well: the size of the file doesn't change. The transferred data, 
however, is different depending on what utility or options you use to 
transfer it. That part I didn't express properly in my previous emails, and 
Otto kindly corrected it as well. In some cases it might be good to have the 
capability to compact a file, meaning making it non-sparse again, but I 
can't pass good judgment as to whether that would be good in most cases; I 
am sure it isn't for many, database files for example come to mind.


I guess what I can conclude is that in some cases there is a substantial 
waste of bandwidth (depending on the utility used to sync), of CPU 
resources, and of time in the sync process (in my example, up to 50+ 
minutes instead of possibly under 2 minutes), and possibly a sync process 
that will break or stop if the sparse files get too big. But that's case by 
case, obviously; no rules that I can think of right now. And the last point 
I have to include is that in some cases you maybe can't even do it, or it 
will stop working when you least expect it. (;


Best,

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-09 Thread Otto Moerbeek
On Fri, Nov 09, 2007 at 02:00:14AM -0500, Daniel Ouellet wrote:
 Hi,
 
 I am trying to find a way ti identifying sparse files properly and 
 quickly and find a way to rectify the situation.
 
 Any trick to do this?
 
 The problem is that overtime looks like I am ending up with lots of them 
 and because I have to sync multiples servers together the sparse files 
 makes the sync painful over time as well as huge obviously and slow. I 
 am talking multiple GB's here.
 
 So far the only way I have do it is with rsync and -S options, but then 
 the sync process takes a lots of time and when you need to sync multiple 
 boxes multiple times per hours, it end up not be able to do it anymore 
 and the process is not finish and it is suppose to start again.
 
 The other way that I found is to use dump and then restore, but that 
 also is painful to do on live systems obviously. I need to find a way to 
 clean the source, so that the sync system do their stuff easy. If I 
 simply sync with the sparse file, sure I can do that, but then, the 
 problem is the destinations runs out of space as the sparse gets to big 
 over time.
 
 Google also pointed out that may be FIBMAP ioctl may have done to job, 
 may be, but that was kill by Theo on 2007/06/02 09:14:36. I assume for 
 many good reason for sure, so I didn't pursue that anymore.
 
 Then may be filefrag -v might work, but not much success there either.
 
 So, I am running out of ideas and may be there isn't any way to do this, 
 I however hope there is.
 
 If it is not possible to correct the problem in a cronjob fashion or 
 something, may be how could I possible find sparse files efficiently?
 
 At a minimum, if I could find the file getting out of control, then I 
 could at a minimum delete them and copy them from the source again and 
 reduce the problem of the sparse files.

I do not get you at all. Unsparsing the file will only make it use
more disk space.

Actually, for some time now cp(1) has actively created sparse files when it can.
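A quick way to see that behaviour (a sketch; file names are placeholders):

  # make a 10 MB file that is all hole, then copy it locally
  dd if=/dev/zero of=orig bs=1024 seek=10240 count=0
  cp orig copy
  ls -l orig copy     # both report the 10 MB apparent size
  du -k orig copy     # both stay tiny: cp skips the zero blocks and re-creates the holes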

 
 Any clue as to how to tackle this problem, or any trick around it?

I really do not understand the problem here. But you might be able to
detect sparse files by comparing the size vs. the number of blocks they use.

-Otto



Re: identifying sparse files and get ride of them trick available?

2007-11-09 Thread Daniel Ouellet

Any clue as to how to tackle this problem, or any trick around it?


I really do not understand the problem here. But you might be able to
detect sparse files by comparing the size vs. the number of blocks they use.


Without making a big write-up out of it: let's say that the problem is, for 
now, a storage capacity problem on the destination servers, a timing 
problem in the extended transfer process, and the additional bandwidth 
required at some of the destination points, together with the volume of 
files. Let's just say that if it were syncing 100K files it would be a 
piece of cake, but it's much bigger.


Just as an example: a badly sparse source file doesn't really have its disk 
blocks allocated yet, but when copied over via scp or rsync it will actually 
use that space on the destination servers. All the servers are identical 
(or are supposed to be anyway), but what is happening is that the copies 
are running out of space at times during the copy process. When copying 
them, it may easily use twice the amount of space in the process, sadly 
filling up the destinations; then the sync process stops, making the 
distribution of the load unusable. I need to increase the capacity, yes, 
except that it will take me time to do so.


Sparse files for a database, for example, are a very good thing, but not for 
everything.


The problem is not the sparse file at the source. It can certainly stay as 
is; it's just offset pointers anyway.


The problem is in the sync process between multiple servers over the 
Internet: the bandwidth wasted, as well as the lack of space available at 
the destination. Plus, because the copy is different in size, the sync 
process sees them as different files and as such will copy them again.


Or it can be copied using -S with rsync; however, this process will inflate 
the file at the destination, run out of space during the process, and only 
make it smaller at the end. Plus this obviously takes a lot more time, and 
as such the timely sync process that was good for a long time is now, 
well... let's say, not reliable. Let's say a sync without concern for 
sparseness is done in just a few minutes, but then uses a lot more space on 
the destination. Doing it with -S to address the capacity issue fixes that, 
but then it takes a HUGE amount of extra time, and sadly there is useless 
transfer of null data caused by the sparse source's empty space.


I can manage: I found ways to use ls -laR or du -k and do diffs between 
them to find the files that are getting out of whack, replace them, and 
then continue, but this really is painful.


Obviously when the capacity is there, it will be a non-issue; however, I am 
sadly not at that point yet and it will take me some time.


Not sure if that explains it any better; I hope so.

But I was looking at whether it is possible to identify these files in a 
more efficient way.


If not, I will just deal with it.

It's just going to be painful for some time, that's all.

The issue is really in the transfer process and at the final 
destination. Not at the source.


I hope it makes more sense explained this way; if not, I apologize for 
the lack of better thinking at the moment in explaining it.


Best,

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-09 Thread Otto Moerbeek
On Fri, Nov 09, 2007 at 04:27:49AM -0500, Daniel Ouellet wrote:

 Any clue as to how to tackle this problem, or any trick around it?
 
 I really do not understand the problem here. But you might be able to
 detect sparse files compartaring the size vs the number of blocks it uses.
 
 Without making a bit writing out of it. Let say that the problem is for 
 now a storage capacity problem on the destinations servers, a timing one 
 in the extended transfer process and the additional bandwidth required 
 at some of the destination point and the volumes of files. Let just say 
 that if it was syncing 100K files, it would be a piece of cake, but it's 
 much bigger.
 
 Just for example, a source file that is sparse badly, don't really have 
 allocated disk block yet, but when copy over, via scp, or rsync will 
 actually use that space on the destination servers. All the servers are 
 identical (or suppose to be anyway) but what is happening is the copy of 
 them are running out of space at time in the copy process. Like when it 
 is copying them, it may easy use twice the amount of space in the 
 process and sadly filling up the destinations then then the sync process 
 stop making the distribution of the load unusable. I need to increase 
 the capacity yes, except that it will take me times to do so.
 
 Sparse file for database example is a very good thing, but not for 
 everything however.
 
 The problem is not the sparse file at the source. It sure can stay as 
 is. It's just offset pointers anyway.
 
 The problem is in the sync process between multiple servers using the 
 Internet to sync them and the bandwidth waisted as well as the lack of 
 space available at the destination. Plus because the copy is different 
 in size, then the sync process see it as different files and as such 
 will copy them again.

The size will not be different, just the disk space used.

 
 Or it can be copy using -S with rsync, however this process will inflate 
 the file at the destination and run out of space during the process and 
 make them smaller at the end. Plus this obviously take a lots more time 
 and as such, the timely sync process that was good for a long time now, 
 well... Let say, not reliable. Let say, sync without concern for sparse 
 is done just in a few minutes, but then use lots more space on the 
 destination. Doing it with -S to address the capacity issue fix that, 
 but then it takes a HUGE amount of time more and sadly there is useless 
 transfer of null data cause from the sparse source empty space.

So your problem seems to be that rsync -S is inefficient to the point
where it is not usable.  I do not use rsync a lot, so I do not know
if there's a solution to that problem. It does seem strange that a
feature meant to solve a problem actually makes the problem worse. 

 I can manage, I find ways to use ls -laR, or du -k and do diff's between 
 them and fine the files that are getting out of wack, replace them and 
 then continue, but this really is painful.

stat -s gives the raw info in one go. Some shell script hacking should
make it easy to detect sparse files.
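For instance, a rough sketch of that shell script hacking, relying on the 
fact that stat -s prints its fields as shell variable assignments (the 
directory is a placeholder):

  #!/bin/sh
  # flag files whose allocated 512-byte blocks do not cover their reported size
  for f in /home/sparse-files/*; do
      eval $(stat -s "$f")             # sets st_size, st_blocks, etc.
      if [ $((st_blocks * 512)) -lt "$st_size" ]; then
          echo "$f looks sparse: size $st_size, allocated $((st_blocks * 512)) bytes"
      fi
  done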

-Otto

 Obviously when the capacity will be there, it will be a none issue, 
 however I am sadly not at that point yet and it will take me some time.
 
 Not sure if that explain it any better, I hope so.
 
 But I was looking if it was possible to identify these files in a more 
 efficient way.
 
 If not, I will just deal with it.
 
 It's just going to be painful for sometime that's all.
 
 The issue is really in the transfer process and at the final 
 destination. Not at the source.
 
 I hope it make more sense explaining it this way, if not I apologists 
 for the lack of better thinking at the moment in explaining it.
 
 Best,
 
 Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-09 Thread Enache Adrian
On Fri, Nov 09, 2007 at 11:03:31AM +0100, Otto Moerbeek wrote:
 So your problem seems to be that rsync -S is inefficient to the point
 where it is not useable.  I do not use rsync a lot, so I do not know
 if there's a solution to that problem. It does seem strange that a
 feature to solve a problem actually make the problem worse. 

Anything is inefficient in that case.

Just create a huge dummy file:

$ dd if=/dev/null seek=1m bs=1m of=file

Then copy it (with cp, or any sparse-file aware program) to another
filesystem. Watch how much time and power it takes to copy nothing
from one place to another.

Any way to obtain a 'map' of the file that tells you exactly where the
written sectors are would make for a BIG improvement.

You can't do that on OpenBSD without raw low-level fs hacks and
reinventing half of dump(8) and fsck(8).

Adi



Re: identifying sparse files and get ride of them trick available?

2007-11-09 Thread Daniel Ouellet

Ted Unangst wrote:

On 11/9/07, Daniel Ouellet [EMAIL PROTECTED] wrote:

Just for example, a source file that is sparse badly, don't really have
allocated disk block yet, but when copy over, via scp, or rsync will
actually use that space on the destination servers. All the servers are
identical (or suppose to be anyway) but what is happening is the copy of
them are running out of space at time in the copy process. Like when it
is copying them, it may easy use twice the amount of space in the
process and sadly filling up the destinations then then the sync process
stop making the distribution of the load unusable. I need to increase
the capacity yes, except that it will take me times to do so.


so what are you going to do when you find these sparse files?


So far, when I find them (not all of them, just the ones wasting huge 
amounts of space), I delete them and replace them, either with the original 
one, or with a copy made with rsync -S back over the original, which 
reduces its size by half or more at times. So yes, very inefficient, but 
manageable anyway. It's a plaster for now, if you want. Don't get me wrong: 
sparse files cause no problem whatsoever when they stay on the same system. 
It's when you need to move them around between servers, especially across 
Internet-connected locations, and keep them in sync as much as possible in 
as short a time as possible, that it becomes unmanageable. That's really 
the issue at hand. Not that sparse files are bad in any way; keeping them 
in sync across multiple systems is, however.


I was looking for a more intelligent way to do it. (; Like finding the ones 
above some level of sparseness, let's say 25%, and then compacting them at 
the source to be non-sparse again, or something similar. It doesn't need to 
be done for every single one, even if that might be a good thing in special 
cases; not all of them, obviously.


The problem is that some customers end up running out of space and I really 
didn't know; plus there is the huge factor of wasted bandwidth, filling up 
their connections transferring empty files, if you like, and the sync 
taking much longer than it otherwise would if you synced as is.


It is still an interesting problem now that I have found out what it really was.

I hope it explained the issue somewhat better.

Thanks for the feedback nevertheless.

Daniel



Re: identifying sparse files and get ride of them trick available?

2007-11-09 Thread Daniel Ouellet

Otto Moerbeek wrote:

So your problem seems to be that rsync -S is inefficient to the point
where it is not useable.  I do not use rsync a lot, so I do not know
if there's a solution to that problem. It does seem strange that a
feature to solve a problem actually make the problem worse. 


Well, I don't want to create a misunderstanding here about rsync. -S does 
address the issue of making the sparse file copy smaller at the end, if you 
like. It doesn't help with the fact that it still has to process the whole 
file, however; and if you don't use -S, the sync is much, much faster, but 
the destination copy actually uses the full space on the drive and fills it 
up. So they each do their respective job, but with different side effects: 
faster sync, much lower bandwidth usage, but a bigger end result in used 
disk space; the other one is the opposite. (; But the biggest side effect 
is that using -S will redo the transfer even if the end result would be the 
same, because it doesn't see it as the same. In the end, that's really the 
problem, and I don't think there is a solution for it in the design of 
rsync anyway. If there is one, I can't think of it at the moment for sure. 
It's just a catch-22 situation at the moment.


A stupid solution that I just thought of while writing this might be as 
simple as temporarily putting a box between the original one and the remote 
bunch, which would be synced to first. This way it adds more delay, yes, 
but maybe in the end it would be much better.


I can manage, I find ways to use ls -laR, or du -k and do diff's between 
them and fine the files that are getting out of wack, replace them and 
then continue, but this really is painful.


stat -s gives the raw info in one go. Some shell script hacking should
make it easy to detect sparse files.


Thanks Otto for the suggestion. That might help until it can be addressed 
for good. It would help speed up some of it. (;


Many thanks, as trying to explain the problem better may have given me a 
temporary workaround that is not brilliant, but that might just work 
until the problem can be addressed better.


Daniel



identifying sparse files and get ride of them trick available?

2007-11-08 Thread Daniel Ouellet

Hi,

I am trying to find a way to identify sparse files properly and 
quickly, and a way to rectify the situation.


Any trick to do this?

The problem is that over time it looks like I am ending up with lots of 
them, and because I have to sync multiple servers together, the sparse 
files make the sync painful over time, as well as obviously huge and slow. 
I am talking multiple GBs here.


So far the only way I have done it is with rsync and the -S option, but 
then the sync process takes a lot of time, and when you need to sync 
multiple boxes multiple times per hour, it ends up not being able to keep 
up: the process is not finished when it is supposed to start again.


The other way that I found is to use dump and then restore, but that is 
also painful to do on live systems, obviously. I need to find a way to 
clean the source, so that the sync systems can do their stuff easily. If I 
simply sync with the sparse file, sure, I can do that, but then the 
problem is that the destinations run out of space as the sparse files get 
too big over time.


Google also pointed out that maybe the FIBMAP ioctl might have done the 
job, but that was killed by Theo on 2007/06/02 09:14:36. I assume for 
many good reasons, so I didn't pursue that any further.


Then maybe filefrag -v might work, but I had not much success there either.

So I am running out of ideas, and maybe there isn't any way to do this; 
I hope there is, however.


If it is not possible to correct the problem in a cronjob fashion or 
something, how could I possibly find sparse files efficiently?


At a minimum, if I could find the files getting out of control, I 
could delete them and copy them from the source again, and 
reduce the problem of the sparse files.


Any clue as to how to tackle this problem, or any trick around it?

Best,

Daniel