Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Roy Sigurd Karlsbakk
> If anyone has any ideas, be it ZFS based or any useful scripts that
> could help here, I am all ears.

Something like this one-liner will show what would be allocated by everything 
if hardlinks weren't used:

# size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do 
size=$(( $size + $i )); done; echo $size

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases, adequate and relevant synonyms exist
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
On Mon, Jun 13, 2011 at 5:50 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net 
wrote:
>> If anyone has any ideas, be it ZFS based or any useful scripts that
>> could help here, I am all ears.
>
> Something like this one-liner will show what would be allocated by everything
> if hardlinks weren't used:
>
> # size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do
> size=$(( $size + $i )); done; echo $size

Oh, you don't want to do that: you'll run into max argument list size issues.

Try this instead:

(echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+ $p"; echo p) | dc

;)

xargs is your friend (and so is dc... RPN FTW!).  Note that I'm not
printing the number of links because find will print a name for every
link (well, if you do the find from the root of the relevant
filesystem), so we'd be counting too much space.

You'll need the GNU stat(1).  Or you could do something like this
using the ksh stat builtin:

(
echo 0
find . -type f \! -links 1 | while read p; do
    stat -c " %b %B *+" "$p"
done
echo p
) | dc

Nico


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
On Mon, Jun 13, 2011 at 12:59 PM, Nico Williams n...@cryptonector.com wrote:
> Try this instead:
>
> (echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+ $p"; echo p) | dc

s/\$p//


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
And, without a sub-shell:

find . -type f \! -links 1 | xargs stat -c " %b %B *+p" /dev/null | dc 2>/dev/null | tail -1

(The stderr redirection is because otherwise dc whines once that the
stack is empty, and the tail is because we print interim totals as we
go.)

Also, this doesn't quite work, since it counts every link, when we want
to count all but one link.  This, then, is what will tell you how
much space you saved due to hardlinks:

find . -type f \! -links 1 | xargs stat -c " 8k %b %B * %h 1 - * %h /+p" /dev/null 2>/dev/null | dc
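
If you'd rather not involve dc at all, the same sum can be expressed in
awk - an untested sketch, again assuming GNU stat(1):

find . -type f \! -links 1 -exec stat -c '%b %B %h' {} + |
awk '{ saved += $1 * $2 * ($3 - 1) / $3 } END { printf "%.0f\n", saved }'

Each inode appears once per link, so adding bytes * (links - 1) / links
for every occurrence totals bytes * (links - 1) per file, the same
quantity the dc program computes.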

Excuse my earlier brainfarts :)

Nico


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Nico Williams
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson
scott.law...@manukau.ac.nz wrote:
> I have an interesting question that may or may not be answerable from some
> internal ZFS semantics.

This is really standard Unix filesystem semantics.

> [...]
>
> So total storage used is around ~7.5MB due to the hard linking taking place
> on each store.
>
> If hard linking capability had been turned off, this same message would have
> used 1500 x 2MB = 3GB worth of storage.
>
> My question is: are there any simple ways of determining the space savings on
> each of the stores from the usage of hard links?  [...]

But... you just did!  :)  It's: number of hard links * (file size +
sum(size of link names and/or directory slot size)).  For sufficiently
large files (say, larger than one disk block) you could approximate
that as: number of hard links * file size.  The key is the number of
hard links, which will typically vary, but for e-mails that go to all
users, well, you know the number of links then is the number of users.

You could write a script to do this -- just look at the size and
hard-link count of every file in the store, apply the above formula,
add up the inflated sizes, and you're done.
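
A minimal sketch of such a script (untested; assumes GNU stat(1), where
%i is the inode number, %h the hard-link count and %s the file size):

find /store1 -type f -exec stat -c '%i %h %s' {} + |
sort -u | awk '{ inflated += $2 * $3 } END { print inflated }'

The sort -u keeps one line per inode, so each file is counted exactly
once and inflated by its link count.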

Nico

PS: Is it really the case that Exchange still doesn't deduplicate
e-mails?  Really?  It's much simpler to implement dedup in a mail
store than in a filesystem...


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Jim Klimov

Some time ago I wrote a script to find duplicate files and replace
them with hardlinks to one inode. Apparently this is only good for
files which won't change separately in the future, such as distro archives.

I can send it to you off-list, but it would be slow in your case because it
is not quite the tool for the job (it would start by calculating checksums
of all of your files ;) )

What you might want to do and script up yourself is a recursive listing:
find /var/opt/SUNWmsqsr/store/partition... -ls. This would print the
inode numbers, file sizes and link counts. Pipe it through
something like this:

find ... -ls | awk '{print $1, $4, $7}' | sort | uniq

And you'd get 3 columns: inode, link count, size.

My AWK math is a bit rusty today, so I present a monster-script like
this to multiply and sum up the values:

( find ... -ls | awk '{print $1, $4, $7}' | sort | uniq | awk '{ print $2*$3"+\\" }'; echo 0 ) | bc
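
(For reference, the bc step can be avoided by doing the sum in a second
awk pass - an untested sketch:)

find ... -ls | awk '{print $1, $4, $7}' | sort | uniq | awk '{ sum += $2 * $3 } END { print sum }'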


Can be done more cleanly, e.g. in a Perl one-liner, and if you have
many values that would probably complete faster too. But as
a prototype this would do.
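
For instance, something like this untested sketch does the same
multiply-and-sum while counting each inode only once (it assumes no
newlines in file names):

find ... -type f | perl -lne '($ino, $nl, $sz) = (stat)[1,3,7];
    next if $seen{$ino}++; $tot += $sz * $nl; END { print $tot }'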

HTH,
//Jim

PS: Why are you replacing the cool Sun Mail? Is it about Oracle
licensing and the now-required purchase and support cost?


2011-06-13 1:14, Scott Lawson wrote:

> Hi All,
>
> I have an interesting question that may or may not be answerable from
> some internal ZFS semantics.
>
> I have a Sun Messaging Server which has 5 ZFS based email stores. The
> Sun Messaging Server uses hard links to link identical messages together.
> Messages are stored in standard SMTP MIME format, so the binary
> attachments are included in the message ASCII. Each individual message
> is stored in a separate file.
>
> So as an example, if a user sends an email with a 2MB attachment to the
> staff mailing list, and there are 3 staff stores with 500 users on each,
> it will generate a space usage like:
>
> /store1 = 1 x 2MB + 499 x 1KB
> /store2 = 1 x 2MB + 499 x 1KB
> /store3 = 1 x 2MB + 499 x 1KB
>
> So total storage used is around ~7.5MB due to the hard linking taking
> place on each store.
>
> If hard linking capability had been turned off, this same message would
> have used 1500 x 2MB = 3GB worth of storage.
>
> My question is: are there any simple ways of determining the space
> savings on each of the stores from the usage of hard links? The reason
> I ask is that our educational institute wishes to migrate these stores
> to M$ Exchange 2010, which doesn't do message single instancing. I need
> to try and project what the storage requirement will be on the new
> target environment.
>
> If anyone has any ideas, be it ZFS based or any useful scripts that
> could help here, I am all ears.
>
> I may post this to Sun Managers as well to see if anyone there might
> have any ideas on this.
>
> Regards,
>
> Scott.



--


+----------------------------------------------------+
| Климов Евгений / Jim Klimov                        |
| технический директор / CTO                         |
| ЗАО ЦОС и ВТ / JSC COSHT                           |
|                                                    |
| +7-903-7705859 (cellular)  mailto:jimkli...@cos.ru |
| CC: ad...@cos.ru, jimkli...@mail.ru                |
+----------------------------------------------------+
| ()  ascii ribbon campaign - against html mail      |
| /\                        - against ms attachments |
+----------------------------------------------------+





Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Jim Klimov

2011-06-13 2:28, Nico Williams wrote:

> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails?  Really?  It's much simpler to implement dedup in a mail
> store than in a filesystem...


That's especially strange, because NTFS has hardlinks and softlinks...
Not that Microsoft provides any tools for using them, but there are
third-party programs like Cygwin ls and the FAR file manager.

Well, enough off-topic ;)
//Jim



Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

On 13/06/11 10:28 AM, Nico Williams wrote:

> On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson
> scott.law...@manukau.ac.nz wrote:
>> I have an interesting question that may or may not be answerable from some
>> internal ZFS semantics.
>
> This is really standard Unix filesystem semantics.
I understand this; just wanting to see if there is any easy way before I trawl
through 10 million little files.. ;)

>> [...]
>>
>> So total storage used is around ~7.5MB due to the hard linking taking place
>> on each store.
>>
>> If hard linking capability had been turned off, this same message would have
>> used 1500 x 2MB = 3GB worth of storage.
>>
>> My question is: are there any simple ways of determining the space savings on
>> each of the stores from the usage of hard links?  [...]
>
> But... you just did!  :)  It's: number of hard links * (file size +
> sum(size of link names and/or directory slot size)).  For sufficiently
> large files (say, larger than one disk block) you could approximate
> that as: number of hard links * file size.  The key is the number of
> hard links, which will typically vary, but for e-mails that go to all
> users, well, you know the number of links then is the number of users.

Yes, this number varies based on the number of recipients, so it could be as
many as all 500 users on a store.

> You could write a script to do this -- just look at the size and
> hard-link count of every file in the store, apply the above formula,
> add up the inflated sizes, and you're done.
Looks like I will have to; I was just looking for a tried and tested method
before creating my own one, if possible. I was hoping for an easy option
before having to sit down and develop and test a script. I have resigned from
my current job of 9 years and finish in 15 days, and I have a heck of a lot
of documentation and knowledge transfer to do around other UNIX systems,
so I am running very short on time...

> Nico
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails?  Really?  It's much simpler to implement dedup in a mail
> store than in a filesystem...
As a side note, Exchange 2002 and Exchange 2007 do do this. But apparently
M$ decided in Exchange 2010 that they no longer wished to, and dropped the
capability. Bizarre to say the least, but it may come down to changes they
have made in the underlying store technology..




Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

On 13/06/11 11:36 AM, Jim Klimov wrote:

> Some time ago I wrote a script to find duplicate files and replace
> them with hardlinks to one inode. Apparently this is only good for
> files which won't change separately in the future, such as distro archives.
>
> I can send it to you off-list, but it would be slow in your case because it
> is not quite the tool for the job (it would start by calculating checksums
> of all of your files ;) )
>
> What you might want to do and script up yourself is a recursive listing:
> find /var/opt/SUNWmsqsr/store/partition... -ls. This would print the
> inode numbers, file sizes and link counts. Pipe it through
> something like this:
>
> find ... -ls | awk '{print $1, $4, $7}' | sort | uniq
>
> And you'd get 3 columns: inode, link count, size.
>
> My AWK math is a bit rusty today, so I present a monster-script like
> this to multiply and sum up the values:
>
> ( find ... -ls | awk '{print $1, $4, $7}' | sort | uniq | awk '{ print $2*$3"+\\" }'; echo 0 ) | bc
This looks something like what I thought would have to be done; I was just
looking to see if there was something tried and tested before I had to invent
something. I was really hoping there might be some magic information in zdb
that I could tap into.. ;)


> Can be done more cleanly, e.g. in a Perl one-liner, and if you have
> many values that would probably complete faster too. But as
> a prototype this would do.
>
> HTH,
> //Jim
>
> PS: Why are you replacing the cool Sun Mail? Is it about Oracle
> licensing and the now-required purchase and support cost?
Yes, it is mostly about cost. We had Sun Mail for our staff and students,
with 20,000+ students on it up until Christmas time. We have now migrated
them to M$ Live@EDU. This leaves us with 1500 staff, who all like to use
LookOut. The Sun connector for LookOut is a bit flaky at best, and the
Oracle licensing for Messaging and Calendar starts at 10,000 users and up,
so it is now rather expensive for the mailboxes we have left. M$ also
heavily discounts Exchange CALs for education, and Oracle is not as friendly
as Sun was with their JES licensing. So it is bye bye Sun Messaging Server
for us.



> 2011-06-13 1:14, Scott Lawson wrote:
> [... full quote of the original message trimmed ...]


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Tim Cook
On Sun, Jun 12, 2011 at 5:28 PM, Nico Williams n...@cryptonector.com wrote:

> [...]
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails?  Really?  It's much simpler to implement dedup in a mail
> store than in a filesystem...



MS has had SIS since Exchange 4.0.  They dumped it in 2010 because it was a
huge source of their small random I/Os.  In an effort to allow Exchange to
be more storage friendly (i.e. more of a large sequential I/O profile),
they've done away with SIS.  The defense for it is that you can buy more
cheap storage for less money than you'd save with SIS and 15k rpm disks.
Whether that's factual I suppose is for the reader to decide.

--Tim