Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Daniel Ouellet

OK,

Here is the source of the problem: the cache file generated by 
webazolver. Based on the information from the webalizer software, 
which says:


Cached DNS addresses have a TTL (time to live) of 3 days.  This may be
changed at compile time by editing the dns_resolv.h header file and
changing the value for DNS_CACHE_TTL.

The cache file is processed each night, and the records older than 3 days 
are removed, but somehow that file becomes a sparse file in the process, 
and when copied elsewhere it shows its real size. In my case that file was 
using a bit over 4 million blocks more than it should have, which gave me 
the 4GB+ difference in mirroring the content.


So, as far as I can see, this process of expiring the records from 
the cache file, which is always reused, doesn't really shrink the file, 
but somehow just marks the records inside the file as bad, or something 
like that.
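
For what it's worth, you can see the mismatch by comparing the apparent 
size with the blocks actually allocated (a minimal sketch; the cache file 
name here is made up):

$ ls -l dns_cache.db   # apparent size: the largest offset ever written
$ du -k dns_cache.db   # disk usage: only the blocks actually allocated

If du reports far fewer kilobytes than ls -l suggests, the file is sparse.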


So, nothing to do with OpenBSD at all, but based on what I see of its 
usage, I would think there is a bug in that part of webalizer.


Now the source of the problem has been found, and many thanks to all who 
stuck with me along the way.


It always feels good to know in the end!

Thanks to Otto, Ted and Tom.

Daniel



Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Otto Moerbeek
On Tue, 17 Jan 2006, Daniel Ouellet wrote:

 [...]
 
 The cache file is processed each night, and the records older than 3 days
 are removed, but somehow that file becomes a sparse file in the process,
 and when copied elsewhere it shows its real size. In my case that file was
 using a bit over 4 million blocks more than it should have, which gave me
 the 4GB+ difference in mirroring the content.
 
 [...]

You are wrong in thinking sparse files are a problem. Having sparse
files is quite a nifty feature, I would say. 


-Otto



Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Joachim Schipper
On Tue, Jan 17, 2006 at 02:15:57PM +0100, Otto Moerbeek wrote:
 On Tue, 17 Jan 2006, Daniel Ouellet wrote:
 
  [...]
 
 You are wrong in thinking sparse files are a problem. Having sparse
 files is quite a nifty feature, I would say. 

Are we talking about webazolver or OpenBSD?

I'd argue that relying on the OS handling sparse files this way instead
of handling your own log data in an efficient way *is* a problem, as
evidenced by Daniel's post. After all, it's reasonable to copy data to,
say, a different drive and expect it to take about as much space as the
original.

On the other hand, I agree with you that handling sparse files
efficiently is rather neat in an OS.

Joachim



Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Otto Moerbeek
On Tue, 17 Jan 2006, Joachim Schipper wrote:

 On Tue, Jan 17, 2006 at 02:15:57PM +0100, Otto Moerbeek wrote:

  You are wrong in thinking sparse files are a problem. Having sparse
  files is quite a nifty feature, I would say. 
 
 Are we talking about webazolver or OpenBSD?
 
 I'd argue that relying on the OS handling sparse files this way instead
 of handling your own log data in an efficient way *is* a problem, as
 evidenced by Daniel's post. After all, it's reasonable to copy data to,
 say, a different drive and expect it to take about as much space as the
 original.

Now that's a wrong assumption. A file is a row of bytes. The only
thing I can assume is that if I write a byte at a certain position, I
will get the same byte back when reading the file. Furthermore, the
file size (not the disk space used!) is the largest position written.
If I assume anything more, I'm assuming too much.

For an application, having sparse files is completely transparent. The
application doesn't even know the difference. How the OS stores the
file is up to the OS.

Again, assuming a copy of a file takes up as much space as the
original is wrong. 
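
To illustrate (a minimal sketch; the file name is made up): writing a
single block far past the start of a new file gives a large file size
but a small disk usage:

$ dd if=/dev/zero of=sparse bs=1m seek=1024 count=1   # write 1MB at a ~1GB offset
$ ls -l sparse   # file size: about 1GB, the largest position written
$ du -k sparse   # disk usage: only the few blocks actually allocated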


-Otto



Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Daniel Ouellet

 You are wrong in thinking sparse files are a problem. Having sparse
 files is quite a nifty feature, I would say. 



 Are we talking about webazolver or OpenBSD?
 
 I'd argue that relying on the OS handling sparse files this way instead
 of handling your own log data in an efficient way *is* a problem, as
 evidenced by Daniel's post. After all, it's reasonable to copy data to,
 say, a different drive and expect it to take about as much space as the
 original.


Just as feedback: the original file on OpenBSD showed something like 
150MB or so in size. Using rsync to copy it over made it almost 5GB in 
size; well, I wouldn't call that good. But again, before I say no 
definitively, there is always something that I may not understand, so I 
am willing to leave some space for that here. But not much! (:



 On the other hand, I agree with you that handling sparse files
 efficiently is rather neat in an OS.


I am not sure whether the OS handles it well or not. Again, no offense 
intended, but if it does, why copy the empty data then? Obviously there 
is something I don't understand, for sure.


However, here is something I didn't include in my previous email with 
all the stats that may be very interesting to know. I didn't think it 
was so important at the time, but if we are talking about handling it 
properly, it might be relevant.


The tests were done with three servers. The file showing ~150MB in size 
was on www1. Copying it to www2, even with the -S switch in rsync, got 
it to ~5GB. Then copying the same file from www2 to www3 using the same 
rsync -S setup got that file back to the size it was on www1. So why not 
on www2 in that case? Is it the OS, or is it rsync? Was it handled 
properly or wasn't it? I am not sure. If it was, then the www2 file 
should not have been ~5GB, should it?


So the picture was:

www1 -> www2 -> www3

www1 cache DB shows ~150MB

rsync -e ssh -aSuqz --delete /var/www/sites/ [EMAIL PROTECTED]:/var/www/sites

www2 cache DB shows ~5GB

rsync -e ssh -aSuqz --delete /var/www/sites/ [EMAIL PROTECTED]:/var/www/sites

www3 cache DB shows ~150MB

Why not ~150MB on www2?
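
A per-host check of both numbers would make the difference visible at a 
glance (a sketch; the cache file path is hypothetical):

$ for h in www1 www2 www3; do ssh $h 'ls -l /var/www/sites/stats/dns_cache.db'; done
$ for h in www1 www2 www3; do ssh $h 'du -k /var/www/sites/stats/dns_cache.db'; done

If the ls -l sizes match everywhere but the du numbers differ, the files 
differ only in how many blocks the filesystem actually allocated.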

One thing that I haven't tried, and regret not having done, is just 
copying that file on www1 to a different name, then copying it again to 
its original name and checking the size at the end; and also 
transferring that file without the -S switch, to see if the OS did copy 
the empty data or not.
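
That untried test would look something like this (a sketch; the file 
name is hypothetical, and the expectation that cp fills in the holes is 
an assumption about how plain read/write copies behave):

$ du -k dns_cache.db              # small: the sparse original
$ cp dns_cache.db dns_cache.tmp   # cp reads the holes back as zero bytes
$ du -k dns_cache.tmp             # probably much larger: holes written as real blocks
$ ls -l dns_cache.db dns_cache.tmp   # the apparent sizes should be identical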


I guess the question would be: should it, or shouldn't it, do that?

My own opinion right now is that the file should show the size it really 
is. So, if it is 5GB and only 100MB of it is good, shouldn't it show as 
5GB? I don't know; better minds than me surely have the answer to this 
one. Right now, I do not, for sure.




Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Joachim Schipper
On Tue, Jan 17, 2006 at 05:49:24PM +0100, Otto Moerbeek wrote:
 On Tue, 17 Jan 2006, Joachim Schipper wrote:
 
  [...]
 
 Now that's a wrong assumption. A file is a row of bytes. The only
 thing I can assume is that if I write a byte at a certain position, I
 will get the same byte back when reading the file. Furthermore, the
 file size (not the disk space used!) is the largest position written.
 If I assume anything more, I'm assuming too much.
 
 For an application, having sparse files is completely transparent. The
 application doesn't even know the difference. How the OS stores the
 file is up to the OS.
 
 Again, assuming a copy of a file takes up as much space as the
 original is wrong. 
 

Okay - I understand your logic, and yes, I do know about sparse files
and how they are typically handled. And yes, you are right that
there are very good reasons for handling sparse files this way.

And yes, applications are right to make use of this feature where
applicable.

However, in this case, it's a simple log file, and what the application
did, while very much technically correct, clearly violated the principle
of least astonishment, for no real reason I can see. Sure, trying to
make efficient use of every single byte may not be worth the trouble - but
just zeroing out the first five GB of the file is more than a little
hackish, and not really necessary.

Joachim



Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Matthias Kilian
On Tue, Jan 17, 2006 at 02:36:44PM -0500, Daniel Ouellet wrote:
 [...] But having a 
 file that is, let's say, 1MB of valid data grow very quickly to 4 and 
 6GB, and take time to rsync between servers, which in one instance 
 filled the file system and created other problems (: I wouldn't call 
 that a feature.

As Otto noted, you have to distinguish between file size (that's what
stat(2) and friends report; at the same time, it's the number of
bytes you can read sequentially from the file) and a file's
disk usage.

For more explanations, see the RATIONALE section at

http://www.opengroup.org/onlinepubs/009695399/utilities/du.html

(You may have to register, but it doesn't hurt)

See also the reference to lseek(2) mentioned there.
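
Both numbers are visible side by side with ls -ls (a minimal sketch; the
file name is made up): the first column is the blocks actually
allocated, the size column is the byte count from stat(2):

$ ls -ls dns_cache.db
$ du -k dns_cache.db   # disk usage again, in 1K blocks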


 But at the same time, I wasn't using the -S switch in rsync, 
 so my own stupidity there. However, why it spends lots of time 
 processing empty blocks, I still don't understand.

Please note that -S in rsync does not *guarantee* that source and
destination files are *identical* in terms of holes or disk usage.

For example:

$ dd if=/dev/zero of=foo bs=1m count=42   # 42MB of real zero blocks, no holes
$ rsync -S foo host:                      # -S: turn runs of zeroes into holes
$ du foo                                  # local copy: all blocks allocated
$ ssh host du foo                         # remote copy: fewer blocks, holes punched

Got it? The local foo is *not* sparse (no holes), but the remote
one has been optimized by rsync's -S switch.

We recently had a very controversial (and flaming) discussion at our
local UG on such optimizations (or heuristics, as in GNU cp).
IMO, if they have to be explicitly enabled (like `-S' for rsync),
that's o.k. The other direction (a copy is *not* sparse by default)
is exactly what I would expect.

Telling whether a sequence of zeroes is a hole or just a (real) block
of zeroes isn't possible in userland -- it's a filesystem implementation
detail.

To copy the *exact* contents of an existing filesystem including
all holes to another disk (or system), you *have* to use
filesystem-specific tools, such as dump(8) and restore(8). Period.
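
For example (a sketch, assuming the source filesystem is /var and the
destination is mounted at /mnt):

$ dump -0af - /var | (cd /mnt && restore -rf -)

restore recreates the holes on the destination, because dump works at
the filesystem level and knows which blocks were never allocated.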


 I did some research on Google for sparse files to try to get more 
 information about them. In some cases, like if you do round-robin 
 database type of stuff where you have a fixed file that you write 
 into at various places, I would assume it would be good and useful; 
 but a sparse file that keeps growing over time, uncontrolled, I may 
 be wrong, but I don't call that a useful feature.

Sparse files for databases under heavy load (many insertions and
updates) are the death of performance -- you'll get files with blocks
spread all over your filesystem.

OTOH, *sparse* databases such as quota files (potentially large, but
growing very slowly) are good candidates for sparse files.

Ciao,
Kili



Re: df -h stats for same file systems display different results on AMD64 than on i386 (Source solved)

2006-01-17 Thread Daniel Ouellet

Hi all,

First, let me start with my apology to some of you for having wasted 
your time!


As much as this was/is interesting and puzzling to me, and as much as I 
am obviously trying to get my head around this issue and the usage of 
sparse files, there is obviously something missing in my understanding 
of the big picture at this time.


I am doing more research on my own, so let's kill this thread, and sorry 
to have wasted any of your time with my lack of understanding of this aspect!


I am not trying to be a fucking idiot on the list, but it's obvious 
that I don't understand this at this time.


So, let's drop it and I will continue my homework!

Big thanks to all who tried to help me as well!

Daniel