[zfs-discuss] strangeness after resilvering disk from raidz1 on disks with no EFI GPTs

2010-03-06 Thread Ethan
I have a zpool of five 1.5TB disks in raidz1. They are on c?t?d?p0 devices -
using the full disk, not any slice or partition, because the pool was
created in zfs-fuse in linux and no partition tables were ever created. (for
the full saga of my move from that to opensolaris, anyone who missed out on
the fun can read the thread
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg34813.html - but
I will try to include all relevant information so that's not necessary).
When I had gotten things working in opensolaris, and done a scrub, I had
gotten some errors on one disk and so I offlined it, overwrote the whole
disk with random-looking data, and read the data back to check that the disk
was behaving. It was, and I resilvered, and things have seemed fine since.

I just noticed, though (some time later, with things working correctly in
the meantime), that I now have an EFI partition table on the disk that I
resilvered. None of the others have any partition table. This confuses me
greatly, for a few reasons.
One, why did zfs create a partition table? I thought it only did that when
you gave it a shorthand disk in the form c?t?d? with no slice or partition
number - but I did the replace giving it the full path /dev/dsk/c9t4d0p0. Doesn't
this mean that zfs must actually be using s0 of the drive, not p0? c9t4d0p0
is what shows up in the zpool status, along with p0 devices for the other
four drives.
Two, given that this one disk has an EFI partition table - including the 8MB
reserved slice 8 - the actual device that zfs is using is more than 8MB
smaller than the other four. How am I using a raidz1 with unequally sized
devices?
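(For anyone who wants to poke at the same thing, this is roughly what I looked at; c9t4d0 is the resilvered disk, the second device name is just a stand-in for one of the untouched disks:

  prtvtoc /dev/rdsk/c9t4d0s0     # shows the EFI label, including the 8MB reserved slice 8
  prtvtoc /dev/rdsk/c9t0d0s0     # fails on the disks that have no label at all
  zpool status                   # still lists c9t4d0p0 alongside the other p0 devices
)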

Since this has been running without a problem for a few weeks now, I'm not
actually concerned about it being a problem - just rather confused. Can
anybody explain what's up with this?

Thanks,
-Ethan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Abdullah Al-Dahlawi
Hi ALL

I might be a little bit confused !!!

I will try to ask my question in a simple way ...

Why would a 16GB L2ARC device get filled by running a benchmark that uses a
2GB working set while having a 2GB ARC max?

I know I am missing something here !
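(In case it helps, these are roughly the numbers I am comparing - commands approximate, pool name as in my earlier mail:

  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max zfs:0:arcstats:l2_size
  zpool iostat -v hdd
)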

Thanks


On Sun, Mar 7, 2010 at 12:05 AM, Richard Elling wrote:

> On Mar 6, 2010, at 8:05 PM, Eric D. Mudama wrote:
> > On Sat, Mar  6 at 15:04, Richard Elling wrote:
> >> On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:
> >>> On Sat, Mar  6 at  3:15, Abdullah Al-Dahlawi wrote:
> 
>  hdd ONLINE   0 0 0
>    c7t0d0p3  ONLINE   0 0 0
> 
>  rpool   ONLINE   0 0 0
>    c7t0d0s0  ONLINE   0 0 0
> >>>
> >>> I trimmed your zpool status output a bit.
> >>>
> >>> Are those two the same device?  I'm barely familiar with solaris
> >>> partitioning and labels... what's the difference between a slice and a
> >>> partition?
> >>
> >> In this context, "partition" is an fdisk partition and "slice" is a
> >> SMI or EFI labeled slice.  The SMI or EFI labeling tools (format,
> >> prtvtoc, and fmthard) do not work on partitions.  So when you
> >> choose to use ZFS on a partition, you have no tools other than
> >> fdisk to manage the space. This can lead to confusion... a bad
> >> thing.
> >
> > So in that context, is the above 'zpool status' snippet a "bad thing
> > to do"?
>
> If the partition containing c7t0d0s0 was p3, then it could be exceedingly
> bad. Normally, if you try to create a zpool on a slice which already has a
> zpool, then you will get an error message to that effect, which you can
> override with the "-f" flag.  However, that checking is done on slices, not
> fdisk partitions. Hence, there is an opportunity for confusion... a bad
> thing.
>  -- richard
>
> ZFS storage and performance consulting at http://www.RichardElling.com
> ZFS training on deduplication, NexentaStor, and NAS performance
> http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
>
>
>
>
>


-- 
Abdullah Al-Dahlawi
PhD Candidate
George Washington University
Department. Of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide
http://www.top500.org/list/2009/11/100
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS aclmode property

2010-03-06 Thread Paul B. Henson
On Sat, 6 Mar 2010, Ralf Utermann wrote:

> we recently started to look at a ZFS based solution as a possible
> replacement for our DCE/DFS based campus filesystem (yes, this is still
> in production here).

Hey, a fellow DFS shop :)... We finally migrated the last production files
off of DFS last month; I'm actually going to pull the plug on the
infrastructure within a couple of weeks. It will be nice not to have to
worry that software that's been unsupported for years will go blooey :(.

> The ACL model of the combination OpenSolaris+ZFS+in-kernel-CIFS+NFSv4
> looks like a really promising setup, something which could place it high
> up on our list ...

Indeed, while we're currently running S10 with samba (our development
started before OpenSolaris support was announced; we're hoping to migrate
sometime this year), Solaris/ZFS was the best option we could find to
replace our DFS infrastructure. The main thing I miss is the location
independence and ability to migrate data between servers while it's in use.
Other than this annoying chmod/ACL issue, our only other major problem is
lack of scalability in NFS sharing: it takes a good 45 minutes to
share/unshare the 8000 filesystems on each of our X4500's (we have 5),
resulting in about a 2-hour reboot cycle :(. There's an open bug on it, but
they say it will never be addressed in Solaris 10; hopefully it will be someday
in OpenSolaris.

> So from this site: we very much support the idea of adding ignore and
> deny values for the aclmode property!

If you have a Sun support contract, open a support call and ask to be added
to SR #72456444, which is the case I have open to try and get a better
solution to chmod/ACL interaction. If you're thinking of spending a lot of
money on Sun hardware, bring this issue up to your sales guy and push for a
solution. I think part of the problem is very few sites actually use ACLs,
particularly to the extent people coming from a DFS background are used to
:(.

> However, reading PSARC/2010/029, it looks like we will get
> aclmode=discard for everybody and the property removed. I hope this is
> not the end of the story ...

As do I, but so far it's not looking too good. I discussed my proposal with
Mark Shellenbaum (the author of that PSARC case), and he was pretty
strongly against it. I thought I made some rather good points, but as I'm
sure you saw from the threads you referenced there are quite strong
opinions on both sides. He seems to be Sun's main guy when it comes to
ACL's; if he was on board it would be a lot more likely to happen, but I
never heard back from him on my counter response to his initial reply
detailing his reasons he thought it was a bad idea, and he was
conspicuously absent during the recent list free-for-all...

As I've offered before, I'll implement it if they'll merge it...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Richard Elling
On Mar 6, 2010, at 8:05 PM, Eric D. Mudama wrote:
> On Sat, Mar  6 at 15:04, Richard Elling wrote:
>> On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:
>>> On Sat, Mar  6 at  3:15, Abdullah Al-Dahlawi wrote:
 
 hdd ONLINE   0 0 0
   c7t0d0p3  ONLINE   0 0 0
 
 rpool   ONLINE   0 0 0
   c7t0d0s0  ONLINE   0 0 0
>>> 
>>> I trimmed your zpool status output a bit.
>>> 
>>> Are those two the same device?  I'm barely familiar with solaris
>>> partitioning and labels... what's the difference between a slice and a
>>> partition?
>> 
>> In this context, "partition" is an fdisk partition and "slice" is a
>> SMI or EFI labeled slice.  The SMI or EFI labeling tools (format,
> >> prtvtoc, and fmthard) do not work on partitions.  So when you
>> choose to use ZFS on a partition, you have no tools other than
>> fdisk to manage the space. This can lead to confusion... a bad
>> thing.
> 
> So in that context, is the above 'zpool status' snippet a "bad thing
> to do"?

If the partition containing c7t0d0s0 was p3, then it could be exceedingly
bad. Normally, if you try to create a zpool on a slice which already has a
zpool, then you will get an error message to that effect, which you can 
override with the "-f" flag.  However, that checking is done on slices, not
fdisk partitions. Hence, there is an opportunity for confusion... a bad thing.
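A quick sketch of the difference in practice (device names purely illustrative):

  fdisk /dev/rdsk/c7t0d0p0      # manages the fdisk partitions (p1-p4; p0 is the whole disk)
  prtvtoc /dev/rdsk/c7t0d0s2    # shows the SMI/EFI slices inside the Solaris partition
                                # format and fmthard likewise operate on slices, never on pN devices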
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Eric D. Mudama

On Sat, Mar  6 at 15:04, Richard Elling wrote:

On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:

On Sat, Mar  6 at  3:15, Abdullah Al-Dahlawi wrote:


 hdd ONLINE   0 0 0
   c7t0d0p3  ONLINE   0 0 0

 rpool   ONLINE   0 0 0
   c7t0d0s0  ONLINE   0 0 0


I trimmed your zpool status output a bit.

Are those two the same device?  I'm barely familiar with solaris
partitioning and labels... what's the difference between a slice and a
partition?


In this context, "partition" is an fdisk partition and "slice" is a
SMI or EFI labeled slice.  The SMI or EFI labeling tools (format,
prtvtoc, and fmthard) do not work on partitions.  So when you
choose to use ZFS on a partition, you have no tools other than
fdisk to manage the space. This can lead to confusion... a bad
thing.


So in that context, is the above 'zpool status' snippet a "bad thing
to do"?



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool on sparse files

2010-03-06 Thread Edward Ned Harvey
> You are running into this bug:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6929751
> Currently, building a pool from files is not fully supported.

I think Cindy and I interpreted the question differently.  If you want the
zpool inside a file to stay mounted while the system is running, and come up
again after reboot, then I think she's right.  You're running into that bug.

If you want to dismount your zpool, for the sake of backing it up to tape or
something like that ... and then you're seeing this error on reboot, I think
you need to export the pool before you do your backups or reboot.
Then when you want to mount it again, you just import it.
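Something along these lines, with a made-up pool name and path just to show the sequence:

  zpool export filepool                      # before the backup or the reboot
  zpool import -d /tank/images filepool      # point import at the directory holding the backing file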

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-discuss] WriteBack versus SSD-ZIL

2010-03-06 Thread Edward Ned Harvey
>  From everything I've seen, an SSD wins simply because it's 20-100x the
> size. HBAs almost never have more than 512MB of cache, and even fancy
> SAN boxes generally have 1-2GB max. So, HBAs are subject to being
> overwhelmed with heavy I/O. The SSD ZIL has a much better chance of
> being able to weather a heavy I/O period without being filled. Thus,
> SSDs are better at "average" performance - they provide a relatively
> steady performance profile, whereas HBA cache is very spiky.

This is a really good point.  So you think I may actually get better
performance by disabling the WriteBack on all the spindle disks, and
enabling it on the SSD instead.  This is precisely the opposite of what I
was thinking.

I'm planning to publish some more results soon, but haven't gathered it all
yet.  But see these:
Just naked disks, no acceleration.
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteThrough.txt
Same configuration as above, but WriteBack enabled.
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack.txt
Same configuration as the naked disks, but a ramdrive was created for ZIL
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-ramZIL.txt
Using the ramdrive for ZIL, and also WriteBack enabled on PERC
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack_and_ramZIL.txt

This result shows that enabling WriteBack makes a huge performance difference
(3-4x higher) for writes, compared to the naked disks.  I don't think it's
because an entire write operation fits into the HBA DRAM, or the HBA is
remaining un-saturated.  The PERC has 256M, but the test includes 8 threads
all simultaneously writing separate 4G files in various sized chunks and
patterns.  I think when the PERC ram is full of stuff queued for write to
disk, it's simply able to order and organize and optimize the write
operations to leverage the disks as much as possible.
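For anyone curious, the runs are iozone; an invocation of roughly this shape produces the 8-threads-by-4G pattern described above (options illustrative, not my exact command line):

  iozone -R -t 8 -s 4g -r 64k -i 0 -i 1 -i 2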


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts pls. : Create 3 way rpool mirror and shelve one mirror as a backup

2010-03-06 Thread Richard Elling
On Mar 6, 2010, at 5:38 PM, tomwaters wrote:

> Hi guys,
>   On my home server (2009.06) I have 2 HDDs in a mirrored rpool.
> 
> I just added a 3rd to the mirror and made all disks bootable (ie. installgrub 
> on the mirror disks).
> 
> My thought is this: I remove the 3rd mirror disk and offsite it as a backup.
> 
> That way if I mess up the rpool, I can get back the offsite HDD, boot from it 
> and re-mirror this to the other 2 HDD's and I am back in business.
> 
> I plan to leave the 3rd mirror device in the rpool (just no HDD loaded so it 
> will show as degraded all the time). On a monthly basis, I'll physically 
> insert the 3rd HDD and get it to resilver and then remove the 3rd hdd offsite 
> again - ie. refresh the backup.
> 
> Anyone see any flaws in this plan?


To do this either:
1. upgrade to a later version where the "zpool split" command is available
2. zfs send/receive to the disk to be stored offsite
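Rough sketches of both, with placeholder pool, snapshot, and device names:

  # 1. split the third side of the mirror into its own pool, then export it
  zpool split rpool rpool-backup c0t2d0s0
  zpool export rpool-backup

  # 2. replicate to a pool living on the offsite disk
  zfs snapshot -r rpool@offsite
  zfs send -R rpool@offsite | zfs receive -Fd backup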

IMHO, splitting mirrors for backups is a waste of time, but it is a popular way
of backing up non-ZFS file systems.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does OpenSolaris mpt driver support LSI 2008 controller

2010-03-06 Thread norm.tallant
I'm about to try it!  My LSI SAS 9211-8i should arrive Monday or Tuesday.  I 
bought the cable-less version, opting instead to save a few $ and buy Adaptec 
2247000-R SAS to SATA cables.

My rig will be based off of fairly new kit, so it should be interesting to see 
how 2009.06 deals with it all :)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Thoughts pls. : Create 3 way rpool mirror and shelve one mirror as a backup

2010-03-06 Thread tomwaters
Hi guys,
   On my home server (2009.06) I have 2 HDDs in a mirrored rpool.

I just added a 3rd to the mirror and made all disks bootable (ie. installgrub 
on the mirror disks).

My thought is this: I remove the 3rd mirror disk and offsite it as a backup.

That way if I mess up the rpool, I can get back the offsite HDD, boot from it 
and re-mirror this to the other 2 HDD's and I am back in business.

I plan to leave the 3rd mirror device in the rpool (just no HDD loaded so it 
will show as degraded all the time). On a monthly basis, I'll physically insert 
the 3rd HDD and get it to resilver and then remove the 3rd hdd offsite again - 
ie. refresh the backup.

Anyone see any flaws in this plan?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Richard Elling
On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:
> On Sat, Mar  6 at  3:15, Abdullah Al-Dahlawi wrote:
>> 
>>  hdd ONLINE   0 0 0
>>c7t0d0p3  ONLINE   0 0 0
>> 
>>  rpool   ONLINE   0 0 0
>>c7t0d0s0  ONLINE   0 0 0
> 
> I trimmed your zpool status output a bit.
> 
> Are those two the same device?  I'm barely familiar with solaris
> partitioning and labels... what's the difference between a slice and a
> partition?

In this context, "partition" is an fdisk partition and "slice" is a
SMI or EFI labeled slice.  The SMI or EFI labeling tools (format,
prtvtoc, and fmthard) do not work on partitions.  So when you 
choose to use ZFS on a partition, you have no tools other than
fdisk to manage the space. This can lead to confusion... a bad 
thing.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring my disk activity

2010-03-06 Thread Richard Elling
On Mar 6, 2010, at 1:02 PM, Edward Ned Harvey wrote:
> Recently, I’m benchmarking all kinds of stuff on my systems.  And one 
> question I can’t intelligently answer is what blocksize I should use in these 
> tests.
>  
> I assume there is something which monitors present disk activity, that I 
> could run on my production servers, to give me some statistics of the block 
> sizes that the users are actually performing on the production server.  And 
> then I could use that information to make an informed decision about block 
> size to use while benchmarking.
>  
> Is there a man page I should read, to figure out how to monitor and get 
> statistics on my real life users’ disk activity?

It all depends on how they are connecting to the storage.  iSCSI, CIFS, NFS, 
database, rsync, ...?

The reason I say this is because ZFS will coalesce writes, so just looking at
iostat data (ops versus size) will not be appropriate.  You need to look at the
data flowing between ZFS and the users. fsstat works for file systems, but
won't work for zvols, as an example.
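For example, something like this reports per-filesystem operation counts and bytes at an interval, from which an average I/O size can be worked out (path is illustrative):

  fsstat /tank/home 5     # one mounted file system
  fsstat zfs 5            # or aggregated over all ZFS file systems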
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WriteBack versus SSD-ZIL

2010-03-06 Thread Richard Elling
On Mar 6, 2010, at 1:38 AM, Zhu Han wrote:

> On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble  wrote:
> This is true.  SSDs and HDs differ little in their ability to handle raw 
> throughput. However, we often still see problems in ZFS associated with 
> periodic system "pauses" where ZFS effectively monopolizes the HDs to write 
> out it's current buffered I/O.  People have been complaining about this for 
> quite awhile.  SSDs have a huge advantage where IOPS are concerned, and given 
> that the backing store HDs have to service both read and write requests, 
> they're severely limited on the number of IOPs they can give to incoming data.
> 
> You have a good point, but I'd still be curious to see what an async cache 
> would do.  After all, that is effectively what the HBA cache is, and we see a 
> significant improvement with it, and not just for sync write.
> 
> 
> I might see what you mean here. Because ZFS has to aggregate some write data 
> during a short period (txn alive time) to avoid generating too many random 
> write HDD requests, the bandwidth of HDD during this time is wasted. For 
> write heavy streaming workload, especially those who can saturate the HDD 
> pool bandwidth easily, ZFS will make the performance worse than those legacy 
> file systems, e.g. UFS or EXT3. The IOPS of the HDD is not the limitation 
> here. The bandwidth of the HDD is the root cause. 

This statement is too simple, and thus does not represent reality very well.
For a fully streaming workload where the load is near the capacity of the
storage, the algorithms in ZFS will work to optimize the match.  There is 
still some work to be done, but I don't believe UFS has beaten ZFS on Solaris
for a significant streaming benchmark for several years now.

What we do see is that high performance SSDs can saturate the SAS/SATA
link for extended periods of time. For example, a Western Digital SiliconEdge
Blue (a new, midrange model) can read at 250 MB/sec in contrast to a 
WD RE4 which has a media transfer rate of 138 MB/sec.  High-speed SSDs
are already putting the hurt on 6Gbps SAS/SATA -- the Micron models claim
370 MB/sec sustained.  Since this can be easily parallelized, expect that
the high-end SSDs will saturate whatever you can connect them to.  This is 
one reason why the F5100 has 64 SAS channels for host connections.

> This is the design choice of ZFS. Reducing the length of period during txn 
> commit can alleviate the problem. So that the size of data needing to flush 
> to the disk every time could be smaller. Replace the HDD with some high-end 
> FC disks may solve this problem.

Properly matching I/O source and sink is still important, no file system can
relieve you of that duty :-)

>  I also don't know what the threshold is in ZFS for it to consider it time to 
> do an async buffer flush.  Is it time based?  % of RAM based? Absolute amount? 
> All of that would impact whether an SSD async cache would be useful.

The answer is "yes" to all of these questions, but there are many variables
to consider, so YMMV.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Eric D. Mudama

On Sat, Mar  6 at  3:15, Abdullah Al-Dahlawi wrote:


  hdd ONLINE   0 0 0
c7t0d0p3  ONLINE   0 0 0

  rpool   ONLINE   0 0 0
c7t0d0s0  ONLINE   0 0 0


I trimmed your zpool status output a bit.

Are those two the same device?  I'm barely familiar with solaris
partitioning and labels... what's the difference between a slice and a
partition?


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?

2010-03-06 Thread Richard Elling
On Mar 5, 2010, at 5:10 PM, James Dickens wrote:
> On Fri, Mar 5, 2010 at 4:48 PM, Tonmaus  wrote:
> Hi,
> 
> so, what would be a critical test size in your opinion? Are there any other 
> side conditions?
> 
>  
> when your dedup hash table (a table that holds a checksum of every block 
> seen on filesystems/zvols after dedup was enabled) exceeds memory, your 
> performance degrades exponentially, probably even before that point. 

More important is the small, random I/O performance of your pool.
For fast devices, like 15krpm disks, SSDs, or array controllers with
nonvolatile caches, performance should be good.  For big, slow JBOD
drives, the small, random I/O performance is poor and you pay for
that cost savings with time spent waiting.
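A rough way to see where a pool stands (pool name is a placeholder; output format varies by build):

  zdb -DD tank             # DDT histogram plus in-core and on-disk entry sizes
  zpool get dedupratio tank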
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Monitoring my disk activity

2010-03-06 Thread Edward Ned Harvey
Recently, I'm benchmarking all kinds of stuff on my systems.  And one
question I can't intelligently answer is what blocksize I should use in
these tests.

 

I assume there is something which monitors present disk activity, that I
could run on my production servers, to give me some statistics of the block
sizes that the users are actually performing on the production server.  And
then I could use that information to make an informed decision about block
size to use while benchmarking.

 

Is there a man page I should read, to figure out how to monitor and get
statistics on my real life users' disk activity?

 

Thanks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS aclmode property

2010-03-06 Thread Dick Hoogendijk

On 6-3-2010 18:41, Ralf Utermann wrote:

So from this site: we very much support the idea of adding ignore
and deny values for the aclmode property!

However, reading PSARC/2010/029, it looks like we will get
aclmode=discard for everybody and the property removed.
I hope this is not the end of the story ...


+1
Carefully constructed ACL's should -never- be destroyed by an 
(unwanted/unexpected) chmod. Extra aclmode properties should not be so 
hard to implement.
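For anyone who missed the earlier threads, the request boils down to something like this (the ignore/deny values are proposed only, not in any current build; today's choices are discard, groupmask and passthrough):

  zfs set aclmode=ignore tank/export/home    # proposed: chmod leaves the ACL untouched
  zfs get aclmode tank/export/home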


--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | OpenSolaris 2010.03 b131
+ All that's really worth doing is what we do for others (Lewis Carrol)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread James Dickens
Hi

Okay, it's not what I feared. It is probably caching every bit of data and
metadata you have written so far - why shouldn't it? You have the space in the
L2 cache, and it can't offer to return data if it's not in the cache. After the
cache is full or near full it will choose more carefully what to keep and
what to throw away.

James Dickens
http://uadmin.blogspot.com


On Sat, Mar 6, 2010 at 2:15 AM, Abdullah Al-Dahlawi wrote:

> hi James
>
>
> Here is the output you've requested:
>
> abdul...@hp_hdx_16:~/Downloads# zpool status -v
>   pool: hdd
>  state: ONLINE
>  scrub: none requested
> config:
>
> NAMESTATE READ WRITE CKSUM
> hdd ONLINE   0 0 0
>   c7t0d0p3  ONLINE   0 0 0
> cache
>   c8t0d0p0  ONLINE   0 0 0
>
> errors: No known data errors
>
>   pool: rpool
>  state: ONLINE
>  scrub: none requested
> config:
>
> NAMESTATE READ WRITE CKSUM
> rpool   ONLINE   0 0 0
>   c7t0d0s0  ONLINE   0 0 0
>
> ---
>
> abdul...@hp_hdx_16:~/Downloads# zpool iostat -v hdd
>capacity operationsbandwidth
> pool used  avail   read  write   read  write
> --  -  -  -  -  -  -
> hdd 1.96G  17.7G 10 64  1.27M  7.76M
>   c7t0d0p3  1.96G  17.7G 10 64  1.27M  7.76M
> cache   -  -  -  -  -  -
>   c8t0d0p0  *2.87G*  12.0G  0 17103  2.19M
> --  -  -  -  -  -  -
>
> abdul...@hp_hdx_16:~/Downloads# kstat -m zfs
> module: zfs instance: 0
> name:   arcstatsclass:misc
> c   2147483648
> c_max   2147483648
> c_min   268435456
> crtime  34.558539423
> data_size   2078015488
> deleted 9816
> demand_data_hits382992
> demand_data_misses  20579
> demand_metadata_hits74629
> demand_metadata_misses  6434
> evict_skip  21073
> hash_chain_max  5
> hash_chains 7032
> hash_collisions 31409
> hash_elements   36568
> hash_elements_max   36568
> hdr_size7827792
> hits481410
> l2_abort_lowmem 0
> l2_cksum_bad0
> l2_evict_lock_retry 0
> l2_evict_reading0
> l2_feeds1157
> l2_free_on_write475
> l2_hdr_size 0
> l2_hits 0
> l2_io_error 0
> l2_misses   14997
> l2_read_bytes   0
> l2_rw_clash 0
> l2_size 588342784
> l2_write_bytes  3085701632
> l2_writes_done  194
> l2_writes_error 0
> l2_writes_hdr_miss  0
> l2_writes_sent  194
> memory_throttle_count   0
> mfu_ghost_hits  9410
> mfu_hits343112
> misses  33011
> mru_ghost_hits  4609
> mru_hits116739
> mutex_miss  90
> other_size  51590832
> p   1320449024
> prefetch_data_hits  4775
> prefetch_data_misses1694
> prefetch_metadata_hits  19014
> prefetch_metadata_misses4304
> recycle_miss484
> size2137434112
> snaptime1945.241664714
>
> module: zfs instance: 0
> name:   vdev_cache_statsclass:misc
> crtime  34.558587713
> delegations 3415
> hits5578
> misses  3647
> snaptime1945.243484925
>
>
>
>
> On Fri, Mar 5, 2010 at 9:02 PM, James Dickens  wrote:
>
>> please post the output of zpool status -v.
>>
>>
>> Thanks
>>
>> James Dickens
>>
>>
>> On Fri, Mar 5, 2010 at 3:46 AM, Abdullah Al-Dahlawi wrote:
>>
>>> Greeting All
>>>
>>> I have created a pool that consists of a hard disk and an SSD as a cache
>>>
>>> zpool create hdd c11t0d0p3
>>> zpool add hdd cache c8t0d0p0 - cache device
>>>
>>> I ran an OLTP benchmark to emulate a DBMS
>>>
>>> Once I ran the benchmark, the pool started creating the database file on the
>>> ssd cache device ???
>>>
>>>
>>> can any one explain why this happening ?
>>>
>>> is not L2ARC is used to absorb the evicted data from ARC ?

[zfs-discuss] ZFS aclmode property

2010-03-06 Thread Ralf Utermann

we recently started to look at a ZFS based solution as a possible
replacement for our DCE/DFS based campus filesystem (yes, this is
still in production here). The ACL model of the combination
OpenSolaris+ZFS+in-kernel-CIFS+NFSv4 looks like a really
promising setup, something which could place it high
up on our list ...

So we had our test system installed (build 133) and were happily
manipulating ACLs from Windows and also from our standard Debian
client using the Linux nfsv4 utilities ... transparently! We
were impressed ... until an applicaton issued a chmod and destroyed
the ACL.
We then of course found Paul Henson's proposal for aclmode ignore
and deny values 
[http://mail.opensolaris.org/pipermail/zfs-discuss/2010/February/037206.html] 
and the ZFS ACL thread he started in
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037863.html 
.


So from this site: we very much support the idea of adding ignore
and deny values for the aclmode property!

However, reading PSARC/2010/029, it looks like we will get
aclmode=discard for everybody and the property removed.
I hope this is not the end of the story ...

- Ralf

--
Ralf Utermann
_
Universität Augsburg, Institut für Physik   --   EDV-Betreuer
Universitätsstr.1
D-86135 Augsburg Phone:  +49-821-598-3231
SMTP: ralf.uterm...@physik.uni-augsburg.de Fax: -3411

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Henrik Johansson
Hello,

On Mar 6, 2010, at 6:02 PM, Andrey Kuzmin wrote:

> This is purely tactical, to avoid l2arc write penalty on eviction. You seem 
> to have missed the very next paragraph:
> 
>3644  * 2. The L2ARC attempts to cache data from the ARC before it is 
> evicted.
>3645  * It does this by periodically scanning buffers from the 
> eviction-end of
>3646  * the MFU and MRU ARC lists, copying them to the L2ARC devices if 
> they are
>3647  * not already there.  
> 
> 

My point was just that nothing is evicted from the ARC to the L2ARC; of course 
things that have been evicted can be available in the L2ARC, but they are not 
pushed there when evicted. I commented on the question "is not L2ARC is used to 
absorb the evicted data from ARC ?" Then no: the L2ARC absorbs non-evicted data 
from the ARC, which possibly gets evicted later. But it's just semantics.
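If anyone wants to watch this on a live system, the feed thread's activity shows up in the arcstats kstats, e.g.:

  kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_write_bytes 5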

Regards

Henrik
http://sparcv9.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Andrey Kuzmin
This is purely tactical, to avoid l2arc write penalty on eviction. You seem
to have missed the very next paragraph:

    3644  * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
    3645  * It does this by periodically scanning buffers from the eviction-end of
    3646  * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
    3647  * not already there.



Regards,
Andrey



On Sat, Mar 6, 2010 at 3:58 PM, Henrik Johansson  wrote:

> Hello,
>
> On Mar 5, 2010, at 10:46 AM, Abdullah Al-Dahlawi wrote:
>
> Greeting All
>
> I have created a pool that consists of a hard disk and an SSD as a cache
>
> zpool create hdd c11t0d0p3
> zpool add hdd cache c8t0d0p0 - cache device
>
> I ran an OLTP benchmark to emulate a DBMS
>
> Once I ran the benchmark, the pool started creating the database file on the
> ssd cache device ???
>
>
> can any one explain why this happening ?
>
> is not L2ARC is used to absorb the evicted data from ARC ?
>
>
> No, it is not. If we look in the source there is a very good description of
> the L2ARC behavior:
>
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c
>
> "1. There is no eviction path from the ARC to the L2ARC.  Evictions
> from the ARC behave as usual, freeing buffers and placing headers on
> ghost lists.  The ARC does not send buffers to the L2ARC during eviction
> as this would add inflated write latencies for all ARC memory pressure."
>
> Regards
>
>  Henrik
> http://sparcv9.blogspot.com
>
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Henrik Johansson
Hello,

On Mar 5, 2010, at 10:46 AM, Abdullah Al-Dahlawi wrote:

> Greeting All
> 
> I have created a pool that consists of a hard disk and an SSD as a cache
> 
> zpool create hdd c11t0d0p3
> zpool add hdd cache c8t0d0p0 - cache device
> 
> I ran an OLTP benchmark to emulate a DBMS
> 
> Once I ran the benchmark, the pool started creating the database file on the ssd 
> cache device ???
> 
> 
> can any one explain why this happening ?
> 
> is not L2ARC is used to absorb the evicted data from ARC ?

No, it is not. If we look in the source there is a very good description of the 
L2ARC behavior:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

"1. There is no eviction path from the ARC to the L2ARC.  Evictions from the 
ARC behave as usual, freeing buffers and placing headers on ghost lists.  The 
ARC does not send buffers to the L2ARC during eviction as this would add 
inflated write latencies for all ARC memory pressure."

Regards

Henrik
http://sparcv9.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Fajar A. Nugraha
On Sat, Mar 6, 2010 at 3:15 PM, Abdullah Al-Dahlawi  wrote:
> abdul...@hp_hdx_16:~/Downloads# zpool iostat -v hdd
>    capacity operations    bandwidth
> pool used  avail   read  write   read  write
> --  -  -  -  -  -  -
> hdd 1.96G  17.7G 10 64  1.27M  7.76M
>   c7t0d0p3  1.96G  17.7G 10 64  1.27M  7.76M

you only have 17.7GB of free space there, not the 50GB you said earlier.

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hardware for high-end ZFS NAS file server - 2010 March edition

2010-03-06 Thread Brandon High
2010/3/4 Michael Shadle :
> Typically rackmounts are not designed for quiet. He said quietness is
> #2 in his priorities...

I have a Supermicro 743 case, also 4U. The one I used is the "Super
Quiet" variant, which uses fewer & slower PWM fans. It's got 8 hot
swap bays and an additional 3x 5.25" bays which you can put an
additional hot swap bay in.

It's quiet enough to have in my home office without being a distraction.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WriteBack versus SSD-ZIL

2010-03-06 Thread Zhu Han
On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble  wrote:

> This is true.  SSDs and HDs differ little in their ability to handle raw
> throughput. However, we often still see problems in ZFS associated with
> periodic system "pauses" where ZFS effectively monopolizes the HDs to write
> out its current buffered I/O.  People have been complaining about this for
> quite awhile.  SSDs have a huge advantage where IOPS are concerned, and
> given that the backing store HDs have to service both read and write
> requests, they're severely limited on the number of IOPs they can give to
> incoming data.
>
> You have a good point, but I'd still be curious to see what an async cache
> would do.  After all, that is effectively what the HBA cache is, and we see
> a significant improvement with it, and not just for sync write.
>
>
I might see what you mean here. Because ZFS has to aggregate some write
data during a short period (txn alive time) to avoid generating too many
random write HDD requests, the bandwidth of HDD during this time is wasted.
For write heavy streaming workload, especially those who can saturate the
HDD pool bandwidth easily, ZFS will make the performance worse than those
legacy file systems, e.g. UFS or EXT3. The IOPS of the HDD is not the
limitation here. The bandwidth of the HDD is the root cause.

This is the design choice of ZFS. Reducing the length of period during txn
commit can alleviate the problem. So that the size of data needing to flush
to the disk every time could be smaller. Replace the HDD with some high-end
FC disks may solve this problem.


> I also don't know what the threshold is in ZFS for it to consider it time
> to do an async buffer flush.  Is it time based?  % of RAM based? Absolute
> amount? All of that would impact whether an SSD async cache would be useful.
>
>
IMHO, ZFS flushes the data back to disk asynchronously every 5 seconds, which
is the default txn commit period.  ZFS will also flush the data back to disk
even before the 5-second period is up, based on an estimate of the amount of
memory used by the current txn.  This is called write throttling. See the
link below:
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
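A minimal illustration of the knob being described, assuming the tunables of that era (check your build before relying on it):

  echo zfs_txg_timeout/D | mdb -k     # current maximum txn commit interval, in seconds
  # /etc/system entry to shorten it; takes effect at next boot
  set zfs:zfs_txg_timeout = 1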



> --
> Erik Trimble
> Java System Support
> Mailstop:  usca22-123
> Phone:  x17195
> Santa Clara, CA
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Abdullah Al-Dahlawi
hi James


Here is the output you've requested:

abdul...@hp_hdx_16:~/Downloads# zpool status -v
  pool: hdd
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
hdd ONLINE   0 0 0
  c7t0d0p3  ONLINE   0 0 0
cache
  c8t0d0p0  ONLINE   0 0 0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  c7t0d0s0  ONLINE   0 0 0

---

abdul...@hp_hdx_16:~/Downloads# zpool iostat -v hdd
               capacity     operations    bandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
hdd 1.96G  17.7G 10 64  1.27M  7.76M
  c7t0d0p3  1.96G  17.7G 10 64  1.27M  7.76M
cache   -  -  -  -  -  -
  c8t0d0p0  *2.87G*  12.0G  0 17103  2.19M
--  -  -  -  -  -  -

abdul...@hp_hdx_16:~/Downloads# kstat -m zfs
module: zfs instance: 0
name:   arcstatsclass:misc
c   2147483648
c_max   2147483648
c_min   268435456
crtime  34.558539423
data_size   2078015488
deleted 9816
demand_data_hits382992
demand_data_misses  20579
demand_metadata_hits74629
demand_metadata_misses  6434
evict_skip  21073
hash_chain_max  5
hash_chains 7032
hash_collisions 31409
hash_elements   36568
hash_elements_max   36568
hdr_size7827792
hits481410
l2_abort_lowmem 0
l2_cksum_bad0
l2_evict_lock_retry 0
l2_evict_reading0
l2_feeds1157
l2_free_on_write475
l2_hdr_size 0
l2_hits 0
l2_io_error 0
l2_misses   14997
l2_read_bytes   0
l2_rw_clash 0
l2_size 588342784
l2_write_bytes  3085701632
l2_writes_done  194
l2_writes_error 0
l2_writes_hdr_miss  0
l2_writes_sent  194
memory_throttle_count   0
mfu_ghost_hits  9410
mfu_hits343112
misses  33011
mru_ghost_hits  4609
mru_hits116739
mutex_miss  90
other_size  51590832
p   1320449024
prefetch_data_hits  4775
prefetch_data_misses1694
prefetch_metadata_hits  19014
prefetch_metadata_misses4304
recycle_miss484
size2137434112
snaptime1945.241664714

module: zfs instance: 0
name:   vdev_cache_statsclass:misc
crtime  34.558587713
delegations 3415
hits5578
misses  3647
snaptime1945.243484925



On Fri, Mar 5, 2010 at 9:02 PM, James Dickens  wrote:

> please post the output of zpool status -v.
>
>
> Thanks
>
> James Dickens
>
>
> On Fri, Mar 5, 2010 at 3:46 AM, Abdullah Al-Dahlawi wrote:
>
>> Greeting All
>>
>> I have created a pool that consists of a hard disk and an SSD as a cache
>>
>> zpool create hdd c11t0d0p3
>> zpool add hdd cache c8t0d0p0 - cache device
>>
>> I ran an OLTP benchmark to emulate a DBMS
>>
>> Once I ran the benchmark, the pool started creating the database file on the
>> ssd cache device ???
>>
>>
>> can any one explain why this happening ?
>>
>> is not L2ARC is used to absorb the evicted data from ARC ?
>>
>> why it is used this way ???
>>
>>
>>
>>
>>
>> --
>> Abdullah Al-Dahlawi
>> George Washington University
>> Department. Of Electrical & Computer Engineering
>> 
>> Check The Fastest 500 Super Computers Worldwide
>> http://www.top500.org/list/2009/11/100
>>
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>>
>


-- 
Abdullah Al-Dahlawi
PhD Candidate
George Washington University
Department. Of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide
http://www.top500.org/list/2009/11/100