Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Tim Cook
On Sun, Jun 12, 2011 at 5:28 PM, Nico Williams wrote:

> On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson
>  wrote:
> > I have an interesting question that may or may not be answerable from
> some
> > internal
> > ZFS semantics.
>
> This is really standard Unix filesystem semantics.
>
> > [...]
> >
> > So total storage used is around ~7.5MB due to the hard linking taking
> place
> > on each store.
> >
> > If hard linking capability had been turned off, this same message would
> have
> > used 1500 x 2MB =3GB
> > worth of storage.
> >
> > My question is: is there any simple way of determining the space savings on
> > each of the stores from the usage of hard links?  [...]
>
> But... you just did!  :)  It's: number of hard links * (file size +
> sum(size of link names and/or directory slot size)).  For sufficiently
> large files (say, larger than one disk block) you could approximate
> that as: number of hard links * file size.  The key is the number of
> hard links, which will typically vary, but for e-mails that go to all
> users, well, you know the number of links then is the number of users.
>
> You could write a script to do this -- just look at the size and
> hard-link count of every file in the store, apply the above formula,
> add up the inflated sizes, and you're done.
>
> Nico
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails?  Really?  It's much simpler to implement dedup in a mail
> store than in a filesystem...
>


MS has had SIS (single-instance storage) since Exchange 4.0.  They dumped it
in Exchange 2010 because it was a huge source of their small random I/Os.  In
an effort to make Exchange more "storage friendly" (i.e., more of a large
sequential I/O profile), they've done away with SIS.  The defense for it is
that you can buy more "cheap" storage for less money than you'd save with SIS
and 15k rpm disks.  Whether that's factual, I suppose, is for the reader to
decide.

--Tim


Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Edmund White
On 6/12/11 7:25 PM, "Richard Elling"  wrote:


>> Here's the timeline:
>> 
>> - The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
>> detected by NexentaStor.
>
>Is the volume-check runner enabled? All of the check runner results are
>logged in the report database and sent to the system administrator via
>email. I will assume that you have configured email for delivery, as it is
>a required step in the installation procedure.
>
>In any case, a disk declared FAULTED is no longer used by ZFS, except when
>a pool is cleared. The volume-check runner can do this on your behalf, if
>it is configured to do so. See Data Management -> Runners -> volume-check.
>And, of course, these actions are recorded in the logs and report database.
>
>-- richard


I checked seven of my NexentaStor installations (3.0.4 and 3.0.5). Six of
them had the disk-check fault trigger disabled by default. volume-check is
enabled on all and is set to run hourly. Email notification is configured,
and I actively receive other alerts (DDT table, auto-sync) and reports.

-- 
Edmund White
ewwh...@mac.com
847-530-1605




Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Richard Elling
On Jun 12, 2011, at 5:04 PM, Edmund White wrote:
> On 6/12/11 6:18 PM, "Jim Klimov"  wrote:
>> 2011-06-12 23:57, Richard Elling wrote:
>>> 
>>> How long should it wait? Before you answer, read through the thread:
>>> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>>> Then add your comments :-)
>>>  -- richard
>> 
>> But the point of my previous comment was that, according
>> to the original poster, after a while his disk did get
>> marked as "faulted" or "offlined". IF this happened
>> during the system's initial uptime, but it froze anyway,
>> it is a problem.
>> 
>> What I do not know is if he rebooted the box within the
>> 5 minutes set aside for the timeout, or if some other
>> processes gave up during the 5 minutes of no IO and
>> effectively hung the system.
>> 
>> If it is somehow the latter - that the inaccessible drive
>> did (lead to) hang(ing) the system past any set IO retry
>> timeouts - that is a bug, I think.
>> 
> 
> Here's the timeline:
> 
> - The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
> detected by NexentaStor.

Is the volume-check runner enabled? All of the check runner results are
logged in the report database and sent to the system administrator via
email. I will assume that you have configured email for delivery, as it is
a required step in the installation procedure.

In any case, a disk declared FAULTED is no longer used by ZFS, except when
a pool is cleared. The volume-check runner can do this on your behalf, if
it is configured to do so. See Data Management -> Runners -> volume-check.
And, of course, these actions are recorded in the logs and report database.

> - The storage system performance diminished at 9am the next morning.
> Intermittent spikes in system load (of the VMs hosted on the unit).

This is consistent with reset storms.

> - By 11am, the Nexenta interface and console were unresponsive and the
> virtual machines dependent on the underlying storage stalled completely.

Also consistent with reset storms.

> - At 12pm, I gained physical access to the server, but I could not acquire
> console access (shell or otherwise). I did see the FMA error output on the
> screen indicating the actual device FAULT time.
> - I powered the system off, removed the Intel X-25M, and powered back on.
> The VMs picked up where they left off and the system stabilized.
> 
> The total impact to end-users was 3 hours of either poor performance or
> straight downtime. 

Yes, this is consistent with reset storms. Older Intel SSDs are not the
only devices that handle this poorly. In my experience a number of SATA
devices are poorly designed :-(
 -- richard



Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

On 13/06/11 11:36 AM, Jim Klimov wrote:

> Some time ago I wrote a script to find any "duplicate" files and replace
> them with hardlinks to one inode. Apparently this is only good for same
> files which don't change separately in future, such as distro archives.
>
> I can send it to you offlist, but it would be slow in your case because it
> is not quite the tool for the job (it will start by calculating checksums
> of all of your files ;) )
>
> What you might want to do and script up yourself is a recursive listing
> "find /var/opt/SUNWmsqsr/store/partition... -ls". This would print you
> the inode numbers and file sizes and link counts. Pipe it through
> something like this:
>
> find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq
>
> And you'd get 3 columns - inode, count, size
>
> My AWK math is a bit rusty today, so I present a monster-script like
> this to multiply and sum up the values:
>
> ( find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq | awk '{ print $2"*"$3"+\\" }'; echo 0 ) | bc
This looks something like what I thought would have to be done; I was just
looking to see if there was something tried and tested before I had to
invent something. I was really hoping there might be some magic information
in zdb that I could have tapped into.. ;)


> Can be done cleaner, i.e. in a PERL one-liner, and if you have
> many values - that would probably complete faster too. But as
> a prototype this would do.
>
> HTH,
> //Jim
>
> PS: Why are you replacing the cool Sun Mail? Is it about Oracle
> licensing and the now-required purchase and support cost?
Yes, it is mostly about cost. We had Sun Mail for our staff and students,
with 20,000+ students on it up until Christmas time as well. We have now
migrated them to M$ Live@EDU. This leaves us with 1500 staff, who all like
to use LookOut. The Sun connector for LookOut is a bit flaky at best, and
the Oracle licensing cost for Messaging and Calendar starts at 10,000 users
plus, so it is now rather expensive for the mailboxes we have left. M$ also
heavily discounts Exchange CALs for Edu, and Oracle is not as friendly as
Sun was with their JES licensing. So it is bye-bye Sun Messaging Server for
us.



2011-06-13 1:14, Scott Lawson wrote:

> Hi All,
>
> I have an interesting question that may or may not be answerable from
> some internal ZFS semantics.
>
> I have a Sun Messaging Server which has 5 ZFS-based email stores. The
> Sun Messaging Server uses hard links to link identical messages together.
> Messages are stored in standard SMTP MIME format, so the binary
> attachments are included in the message as ASCII. Each individual message
> is stored in a separate file.
>
> So, as an example, if a user sends an email with a 2MB attachment to the
> staff mailing list and there are 3 staff stores with 500 users on each,
> it will generate space usage like:
>
> /store1 = 1 x 2MB + 499 x 1KB
> /store2 = 1 x 2MB + 499 x 1KB
> /store3 = 1 x 2MB + 499 x 1KB
>
> So total storage used is around ~7.5MB due to the hard linking taking
> place on each store.
>
> If hard linking capability had been turned off, this same message would
> have used 1500 x 2MB = 3GB worth of storage.
>
> My question is: is there any simple way of determining the space savings
> on each of the stores from the usage of hard links? The reason I ask is
> that our educational institute wishes to migrate these stores to M$
> Exchange 2010, which doesn't do message single-instancing. I need to try
> and project what the storage requirement will be in the new target
> environment.
>
> If anyone has any ideas, be it ZFS-based or any useful scripts that could
> help here, I am all ears.
>
> I may post this to Sun Managers as well, to see if anyone there has any
> ideas on this.
>
> Regards,
>
> Scott.







Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Edmund White
On 6/12/11 6:18 PM, "Jim Klimov"  wrote:


>2011-06-12 23:57, Richard Elling wrote:
>>
>> How long should it wait? Before you answer, read through the thread:
>>  http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>> Then add your comments :-)
>>   -- richard
>
>But the point of my previous comment was that, according
>to the original poster, after a while his disk did get
>marked as "faulted" or "offlined". IF this happened
>during the system's initial uptime, but it froze anyway,
>it is a problem.
>
>What I do not know is if he rebooted the box within the
>5 minutes set aside for the timeout, or if some other
>processes gave up during the 5 minutes of no IO and
>effectively hung the system.
>
>If it is somehow the latter - that the inaccessible drive
>did (lead to) hang(ing) the system past any set IO retry
>timeouts - that is a bug, I think.
>

Here's the timeline:

- The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
detected by NexentaStor.
- The storage system performance diminished at 9am the next morning.
Intermittent spikes in system load (of the VMs hosted on the unit).
- By 11am, the Nexenta interface and console were unresponsive and the
virtual machines dependent on the underlying storage stalled completely.
- At 12pm, I gained physical access to the server, but I could not acquire
console access (shell or otherwise). I did see the FMA error output on the
screen indicating the actual device FAULT time.
- I powered the system off, removed the Intel X-25M, and powered back on.
The VMs picked up where they left off and the system stabilized.

The total impact to end-users was 3 hours of either poor performance or
straight downtime. 

-- 
Edmund White
ewwh...@mac.com




Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

On 13/06/11 10:28 AM, Nico Williams wrote:

> On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson wrote:
>> I have an interesting question that may or may not be answerable from
>> some internal ZFS semantics.
>
> This is really standard Unix filesystem semantics.
   
I understand this; I just want to see if there is any easy way before I
trawl through 10 million little files.. ;)
   

>> [...]
>>
>> So total storage used is around ~7.5MB due to the hard linking taking
>> place on each store.
>>
>> If hard linking capability had been turned off, this same message would
>> have used 1500 x 2MB = 3GB worth of storage.
>>
>> My question is: is there any simple way of determining the space savings
>> on each of the stores from the usage of hard links?  [...]
 

> But... you just did!  :)  It's: number of hard links * (file size +
> sum(size of link names and/or directory slot size)).  For sufficiently
> large files (say, larger than one disk block) you could approximate
> that as: number of hard links * file size.  The key is the number of
> hard links, which will typically vary, but for e-mails that go to all
> users, well, you know the number of links then is the number of users.
   

Yes, this number varies based on the number of recipients, so could be as many a

> You could write a script to do this -- just look at the size and
> hard-link count of every file in the store, apply the above formula,
> add up the inflated sizes, and you're done.
   
Looks like I will have to; I was just looking for a tried and tested method
before I have to create my own, and for an easy option before I have to sit
down and develop and test a script. I have resigned from my current job of
9 years, finish in 15 days, and have a heck of a lot of documentation and
knowledge transfer to do around other UNIX systems, so I am running very
short on time...

> Nico
>
> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails?  Really?  It's much simpler to implement dedup in a mail
> store than in a filesystem...
   
As a side note, Exchange 2002 and Exchange 2007 do do this. But apparently
M$ decided in Exchange 2010 that they no longer wished to do this and
dropped the capability. Bizarre to say the least, but it may come down to
changes they have made in the underlying store technology..




Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Jim Klimov

2011-06-13 2:28, Nico Williams wrote:

> PS: Is it really the case that Exchange still doesn't deduplicate
> e-mails?  Really?  It's much simpler to implement dedup in a mail
> store than in a filesystem...


That's especially strange, because NTFS has hardlinks and softlinks...
Not that Microsoft provided any tools for using them, but there are
third-party programs like Cygwin "ls" and the FAR File Manager.

Well, enough off-topicking ;)
//Jim



Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Jim Klimov

Some time ago I wrote a script to find any "duplicate" files and replace
them with hardlinks to one inode. Apparently this is only good for
identical files which won't change separately in the future, such as
distro archives.

I can send it to you offlist, but it would be slow in your case because it
is not quite the tool for the job (it will start by calculating checksums
of all of your files ;) )

What you might want to do and script up yourself is a recursive listing,
"find /var/opt/SUNWmsqsr/store/partition... -ls". This would print the
inode numbers, file sizes and link counts. Pipe it through something like
this:

find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq

And you'd get 3 columns - inode, count, size

My AWK math is a bit rusty today, so I present a monster-script like
this to multiply and sum up the values:

( find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq | awk '{ print $2"*"$3"+\\" }'; echo 0 ) | bc


Can be done cleaner, i.e. in a PERL one-liner, and if you have
many values - that would probably complete faster too. But as
a prototype this would do.
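
For what it's worth, a one-pass awk version of the same idea (a sketch, not
the Perl one-liner itself; it uses the same $1/$4/$7 fields, and the
partition* glob below is just a stand-in for the real store paths):

find /var/opt/SUNWmsqsr/store/partition* -type f -ls | awk '
    !seen[$1]++ { total += $4 * $7 }    # first time an inode is seen: links * size
    END { printf("%.0f\n", total) }     # projected bytes if every link were a copy
'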

HTH,
//Jim

PS: Why are you replacing the cool Sun Mail? Is it about Oracle
licensing and the now-required purchase and support cost?


2011-06-13 1:14, Scott Lawson wrote:

> Hi All,
>
> I have an interesting question that may or may not be answerable from
> some internal ZFS semantics.
>
> I have a Sun Messaging Server which has 5 ZFS-based email stores. The
> Sun Messaging Server uses hard links to link identical messages together.
> Messages are stored in standard SMTP MIME format, so the binary
> attachments are included in the message as ASCII. Each individual message
> is stored in a separate file.
>
> So, as an example, if a user sends an email with a 2MB attachment to the
> staff mailing list and there are 3 staff stores with 500 users on each,
> it will generate space usage like:
>
> /store1 = 1 x 2MB + 499 x 1KB
> /store2 = 1 x 2MB + 499 x 1KB
> /store3 = 1 x 2MB + 499 x 1KB
>
> So total storage used is around ~7.5MB due to the hard linking taking
> place on each store.
>
> If hard linking capability had been turned off, this same message would
> have used 1500 x 2MB = 3GB worth of storage.
>
> My question is: is there any simple way of determining the space savings
> on each of the stores from the usage of hard links? The reason I ask is
> that our educational institute wishes to migrate these stores to M$
> Exchange 2010, which doesn't do message single-instancing. I need to try
> and project what the storage requirement will be in the new target
> environment.
>
> If anyone has any ideas, be it ZFS-based or any useful scripts that could
> help here, I am all ears.
>
> I may post this to Sun Managers as well, to see if anyone there has any
> ideas on this.
>
> Regards,
>
> Scott.



--


Jim Klimov (Климов Евгений)
Technical Director / CTO, ЗАО "ЦОС и ВТ" (JSC COS&HT)
+7-903-7705859 (cellular)   mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@mail.ru

()  ascii ribbon campaign - against html mail
/\                         - against microsoft attachments





Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Richard Elling
On Jun 12, 2011, at 4:18 PM, Jim Klimov wrote:

> 2011-06-12 23:57, Richard Elling wrote:
>> 
>> How long should it wait? Before you answer, read through the thread:
>>  http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>> Then add your comments :-)
>>  -- richard
> 
> Interesting thread. I did not quite get the resentment against
> a tunable value instead of a hard-coded #define, though.

Tunables are evil. They increase complexity and lead to local optimizations
that interfere with systemic optimizations.

> Especially if we might want to somehow tune it per-device,
> i.e. CDROM, enterprise SAS and some commodity drive or a
> USB stick (or a VMWare emulated HDD, as Ceri pointed out)
> might all be plugged into the same box and require different
> timeouts only the sysadmin might know about (the numeric
> values per-device). So I'd rather go with some hardcoded
> default and many tuned lines in sd.conf, probably.

yuck. I'd rather have my eye poked out with a sharp stick.

> But the point of my previous comment was that, according
> to the original poster, after a while his disk did get
> marked as "faulted" or "offlined". IF this happened
> during the system's initial uptime, but it froze anyway,
> it is a problem.
> 
> What I do not know is if he rebooted the box within the
> 5 minutes set aside for the timeout, or if some other
> processes gave up during the 5 minutes of no IO and
> effectively hung the system.

Not likely. Much more likely that whatever you were expecting was blocked.

> If it is somehow the latter - that the inaccessible drive
> did (lead to) hang(ing) the system past any set IO retry
> timeouts - that is a bug, I think.
> 
> But maybe I'm just too annoyed with my box hanging with
> a more-or-less reproducible scenario, and now I'm barking
> up any tree that looks like system freeze related to IO ;)

Yep, a common reaction.

I think we can be more creative...
 -- richard



Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Jim Klimov

2011-06-12 23:57, Richard Elling wrote:


> How long should it wait? Before you answer, read through the thread:
> http://lists.illumos.org/pipermail/developer/2011-April/001996.html
> Then add your comments :-)
>   -- richard


Interesting thread. I did not quite get the resentment against
a tunable value instead of a hard-coded #define, though.

Especially if we might want to somehow tune it per-device,
i.e. CDROM, enterprise SAS and some commodity drive or a
USB stick (or a VMWare emulated HDD, as Ceri pointed out)
might all be plugged into the same box and require different
timeouts only the sysadmin might know about (the numeric
values per-device). So I'd rather go with some hardcoded
default and many tuned lines in sd.conf, probably.
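
To illustrate, such tuning might look something like this (a sketch only:
sd_io_time is the classic global /etc/system knob, the sd-config-list
property names are from sd(7D) on newer builds, and the vendor/product
strings here are made up):

* /etc/system: shorten the global per-command timeout (default is 60s)
set sd:sd_io_time=10

# /kernel/drv/sd.conf: hypothetical per-device overrides, "VID (8 chars) + PID"
sd-config-list = "ATA     INTEL SSDSA2M080", "retries-timeout:2",
                 "ATA     SOME USB STICK  ", "retries-timeout:20";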

But the point of my previous comment was that, according
to the original poster, after a while his disk did get
marked as "faulted" or "offlined". If this happened
during the system's initial uptime, but it froze anyway,
it is a problem.

What I do not know is if he rebooted the box within the
5 minutes set aside for the timeout, or if some other
processes gave up during the 5 minutes of no IO and
effectively hung the system.

If it is somehow the latter - that the inaccessible drive
did (lead to) hang(ing) the system past any set IO retry
timeouts - that is a bug, I think.

But maybe I'm just too annoyed with my box hanging in a more-or-less
reproducible scenario, and now I'm barking up any tree that looks like a
system freeze related to IO ;)

//Jim




Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Nico Williams
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson
 wrote:
> I have an interesting question that may or may not be answerable from some
> internal
> ZFS semantics.

This is really standard Unix filesystem semantics.

> [...]
>
> So total storage used is around ~7.5MB due to the hard linking taking place
> on each store.
>
> If hard linking capability had been turned off, this same message would have
> used 1500 x 2MB =3GB
> worth of storage.
>
> My question is: is there any simple way of determining the space savings on
> each of the stores from the usage of hard links?  [...]

But... you just did!  :)  It's: number of hard links * (file size +
sum(size of link names and/or directory slot size)).  For sufficiently
large files (say, larger than one disk block) you could approximate
that as: number of hard links * file size.  The key is the number of
hard links, which will typically vary, but for e-mails that go to all
users, well, you know the number of links then is the number of users.

You could write a script to do this -- just look at the size and
hard-link count of every file in the store, apply the above formula,
add up the inflated sizes, and you're done.
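
A rough sketch of such a script (it assumes the usual find -ls column
layout, with the inode in $1, the link count in $4 and the size in $7, the
same fields Jim's pipeline elsewhere in this thread uses, and it takes the
store path as an argument):

#!/bin/sh
# Usage: inflate.sh /path/to/store
# Prints actual bytes (each inode counted once) and the "inflated" bytes
# you would need if every hard link became a full copy. Note the link
# count is whatever stat() reports, so links outside the scanned store
# are counted too.
find "$1" -type f -ls | awk '
    { links[$1] = $4; size[$1] = $7 }       # one array entry per inode
    END {
        for (i in links) {
            actual   += size[i]             # stored once, shared by all links
            inflated += links[i] * size[i]  # one copy per link
        }
        printf("actual bytes:   %.0f\n", actual)
        printf("inflated bytes: %.0f\n", inflated)
        printf("difference:     %.0f\n", inflated - actual)
    }'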

Nico

PS: Is it really the case that Exchange still doesn't deduplicate
e-mails?  Really?  It's much simpler to implement dedup in a mail
store than in a filesystem...


[zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

Hi All,

I have an interesting question that may or may not be answerable from
some internal ZFS semantics.

I have a Sun Messaging Server which has 5 ZFS-based email stores. The
Sun Messaging Server uses hard links to link identical messages together.
Messages are stored in standard SMTP MIME format, so the binary
attachments are included in the message as ASCII. Each individual message
is stored in a separate file.

So, as an example, if a user sends an email with a 2MB attachment to the
staff mailing list and there are 3 staff stores with 500 users on each,
it will generate space usage like:

/store1 = 1 x 2MB + 499 x 1KB
/store2 = 1 x 2MB + 499 x 1KB
/store3 = 1 x 2MB + 499 x 1KB

So total storage used is around ~7.5MB due to the hard linking taking
place on each store.

If hard linking capability had been turned off, this same message would
have used 1500 x 2MB = 3GB worth of storage.

My question is: is there any simple way of determining the space savings
on each of the stores from the usage of hard links? The reason I ask is
that our educational institute wishes to migrate these stores to M$
Exchange 2010, which doesn't do message single-instancing. I need to try
and project what the storage requirement will be in the new target
environment.

If anyone has any ideas, be it ZFS-based or any useful scripts that could
help here, I am all ears.

I may post this to Sun Managers as well, to see if anyone there has any
ideas on this.


Regards,

Scott.


Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Richard Elling
On Jun 11, 2011, at 9:26 AM, Jim Klimov wrote:

> 2011-06-11 19:15, Pasi Kärkkäinen wrote:
>> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>>>I've had two incidents where performance tanked suddenly, leaving the VM
>>>guests and Nexenta SSH/Web consoles inaccessible and requiring a full
>>>reboot of the array to restore functionality. In both cases, it was the
>>>Intel X-25M L2ARC SSD that failed or was "offlined". NexentaStor failed 
>>> to
>>>alert me on the cache failure, however the general ZFS FMA alert was
>>>visible on the (unresponsive) console screen.
>>> 
>>>The "zpool status" output showed:
>>> 
>>>  cache
>>>  c6t5001517959467B45d0 FAULTED  2   542 0  too many errors
>>> 
>>>This did not trigger any alerts from within Nexenta.
>>> 
>>>I was under the impression that an L2ARC failure would not impact the
>>>system. But in this case, it was the culprit. I've never seen any
>>>recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>>>entirely from the server got me back running, but I'm concerned about the
>>>impact of the device failure and the lack of notification from
>>>NexentaStor.
>> IIRC recently there was discussion on this list about firmware bug
>> on the Intel X25 SSDs causing them to fail under high disk IO with "reset 
>> storms".
> Even if so, this does not forgive ZFS hanging - especially
> if it detected the drive failure, and especially if this drive
> is not required for redundant operation.

How long should it wait? Before you answer, read through the thread:
http://lists.illumos.org/pipermail/developer/2011-April/001996.html
Then add your comments :-)
 -- richard



Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Richard Elling
On Jun 11, 2011, at 6:35 AM, Edmund White wrote:

> Posted in greater detail at Server Fault - 
> http://serverfault.com/q/277966/13325
> 
Replied in greater detail at same.

> I have an HP ProLiant DL380 G7 system running NexentaStor. The server has 
> 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system 
> drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache and 
> a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple VMWare 
> hosts. I also have about 90-100GB of deduplicated data on the array.
> 
> I've had two incidents where performance tanked suddenly, leaving the VM 
> guests and Nexenta SSH/Web consoles inaccessible and requiring a full reboot 
> of the array to restore functionality.
> 
The reboot is your decision; the software will, eventually, recover.

> In both cases, it was the Intel X-25M L2ARC SSD that failed or was 
> "offlined". NexentaStor failed to alert me on the cache failure, however the 
> general ZFS FMA alert was visible on the (unresponsive) console screen.
> 
> 

NexentaStor fault triggers run in addition to the existing FMA and syslog 
services.

> The "zpool status" output showed:
> 
> 
> cache
> c6t5001517959467B45d0 FAULTED  2   542 0  too many errors
> 
> This did not trigger any alerts from within Nexenta.
> 
> 

The NexentaStor volume-check runner looks for zpool status error messages.
Check your configuration for the runner schedule; by default it is hourly.


> I was under the impression that an L2ARC failure would not impact the system.
> 
With all due respect, that is a naive assumption. Any system failure can
impact the system. The worst kinds of failures are those that impact
performance. In this case, the broken SSD firmware causes very slow
responses to I/O requests. It does not return an error code that says "I'm
broken"; it just responds very slowly, perhaps after other parts of the
system ask it to reset and retry a few times.

> But in this case, it was the culprit. I've never seen any recommendations to 
> RAID L2ARC for resiliency. Removing the bad SSD entirely from the server got 
> me back running, but I'm concerned about the impact of the device failure and 
> the lack of notification from NexentaStor.
> 
> 

We have made some improvements in notification for this type of failure in
the 3.1 release. Why? Because we have seen a large number of these errors
from various disk and SSD manufacturers recently. You will notice that
Nexenta does not support these SSDs behind SAS expanders for this very
reason. At the end of the day, the resolution is to get the device fixed or
replaced. Contact your hardware provider for details.

> What's the current best-choice SSD for L2ARC cache applications these days? 
> It seems as though the Intel units are no longer well-regarded. 
> 
> 

No device is perfect. Some have better firmware, components, or design than
others. YMMV.
 -- richard



Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-12 Thread Richard Elling
On Jun 11, 2011, at 5:46 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>> 
>> See FEC suggestion from another poster ;)
> 
> Well, of course, all storage mediums have built-in hardware FEC.  At least 
> disk & tape for sure.  But naturally you can't always trust it blindly...
> 
> If you simply want to layer on some more FEC, there must be some standard 
> generic FEC utilities out there, right?
>   zfs send | fec > /dev/...
> Of course this will inflate the size of the data stream somewhat, but 
> improves the reliability...

The problem is that many FEC algorithms are good at correcting a few bits.
For example, disk drives tend to correct somewhere on the order of 8 bytes
per block. Tapes can correct more bytes per block. I've collected a large
number of error reports showing the bitwise analysis of data corruption
we've seen in ZFS, and there is only one case where a stuck bit was
detected. Most of the corruptions I see are multiple bytes, and many are
zero-filled.

In other words, if you are expecting to use FEC and FEC only corrects a few
bits, you might be disappointed.
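
(If someone does want to experiment with the "zfs send | fec" idea quoted
above, a sketch using par2 as one example of an off-the-shelf parity tool;
the tool choice, the dataset and file names, and the 10% redundancy figure
are illustrative only, not from this thread:

zfs send tank/fs@snap > /backup/fs.zfs                # capture the stream to a file
par2 create -r10 /backup/fs.zfs.par2 /backup/fs.zfs   # ~10% parity data alongside it
par2 verify /backup/fs.zfs.par2                       # later: check the stream
par2 repair /backup/fs.zfs.par2                       # and attempt repair if damaged

Whether that redundancy is enough for the multi-byte, zero-filled
corruptions described above depends entirely on how much parity you are
willing to store.)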
 -- richard



Re: [zfs-discuss] Tuning disk failure detection?

2011-06-12 Thread Richard Elling
On May 10, 2011, at 9:18 AM, Ray Van Dolson wrote:

> We recently had a disk fail on one of our whitebox (SuperMicro) ZFS
> arrays (Solaris 10 U9).
> 
> The disk began throwing errors like this:
> 
> May  5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING: 
> /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
> May  5 04:33:44 dev-zfs4mptsas_handle_event_sync: IOCStatus=0x8000, 
> IOCLogInfo=0x31110610

These are commonly seen when hardware is having difficulty and devices are
being reset.

> 
> And errors for the drive were incrementing in iostat -En output.
> Nothing was seen in fmdump.

That is unusual, because the ereports are sent along with the code that
increments the error counters in sd. Are you sure you ran "fmdump -e" as
root or with appropriate privileges?
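
For reference (these are the standard Solaris FMA tools; run them as root
or via pfexec):

pfexec fmdump -e      # one-line summary of the error telemetry (ereports)
pfexec fmdump -eV     # the same events with full detail
fmadm faulty          # resources currently diagnosed as faulty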

> 
> Unfortunately, it took about three hours for ZFS (or maybe it was MPT)
> to decide the drive was actually dead:
> 
> May  5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING: 
> /scsi_vhci/disk@g5000c5002cbc76c0 (sd4):
> May  5 07:41:06 dev-zfs4drive offline
> 
> During this three hours the I/O performance on this server was pretty
> bad and caused issues for us.  Once the drive "failed" completely, ZFS
> pulled in a spare and all was well.
> 
> My question is -- is there a way to tune the MPT driver or even ZFS
> itself to be more/less aggressive on what it sees as a "failure"
> scenario?

The mpt driver is closed source. Contact the source author for such details.

mpt_sas is open source, but the decision to retire a device on
Solaris-derived OSes is made via the Fault Management Architecture (FMA)
agents. Many of these have tunable algorithms, but AFAIK they are only
documented in source.

That said, there are failure modes that do not fit the current algorithms
very well. Feel free to propose alternatives.
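
To at least see which diagnosis engines and retire agents are in play on a
given box (standard FMA commands; module names vary by release):

fmadm config      # loaded FMA modules, e.g. zfs-diagnosis, zfs-retire, disk-transport
fmstat            # per-module event statistics, shows which engine is doing the work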

> 
> I suppose this would have been handled differently / better if we'd
> been using real Sun hardware?

Maybe, maybe not. These are generic conditions and can be seen on all
sorts of hardware under a wide variety of failure conditions.
 -- richard

> 
> Our other option is to watch better for log entries similar to the
> above and either alert someone or take some sort of automated action
> .. I'm hoping there's a better way to tune this via driver or ZFS
> settings however.
> 
> Thanks,
> Ray


Re: [zfs-discuss] zpool import crashs SX11 trying to recovering a corrupted zpool

2011-06-12 Thread Jim Klimov
Did you try a read-only import as well? I THINK it goes like this:

zpool import -o ro -o cachefile=none -F -f badpool

Did you manage to capture any error output? For example, is it an option for 
you to set up a serial console and copy-paste the error text from the serial 
terminal on another machine?

As far as I know there are no other releases of Solaris 11 yet, and since
the code is now developed behind closed doors (due to be opened after the
Solaris 11 public GA release), no other implementation of ZFS matches its
v31 features.

In part (another part being licensing) this is why the open community at
large shunned the S11X release: its ZFS version is not yet interoperable
(unlike the options with OpenIndiana on different kernels and FreeBSD, all
supporting zpool v28; not so sure about Linux-FUSE, but maybe it is close
too), and you can't look at the source code to see what might go wrong, or
build a debug version which would not kernel-panic.

Speaking of which, you might try to force a dump of kernel memory to a
dedicated volume/slice/partition, so that you can stack-trace it afterwards
with the kernel debugger. Maybe that would yield something...
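
Something along these lines (a sketch; device and file names are
placeholders, see dumpadm(1M) and savecore(1M) for your build):

dumpadm -d /dev/zvol/dsk/rpool/dump   # point the dump device at a dedicated zvol or slice
dumpadm -y                            # run savecore automatically on reboot
# after the next panic, from the savecore directory:
mdb unix.0 vmcore.0                   # then ::status, ::msgbuf and ::stack at the mdb prompt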

//Jim


Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

2011-06-12 Thread Pasi Kärkkäinen
On Sat, Jun 11, 2011 at 08:26:34PM +0400, Jim Klimov wrote:
> 2011-06-11 19:15, Pasi Kärkkäinen wrote:
>> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>>> I've had two incidents where performance tanked suddenly, leaving the VM
>>> guests and Nexenta SSH/Web consoles inaccessible and requiring a full
>>> reboot of the array to restore functionality. In both cases, it was the
>>> Intel X-25M L2ARC SSD that failed or was "offlined". NexentaStor failed 
>>> to
>>> alert me on the cache failure, however the general ZFS FMA alert was
>>> visible on the (unresponsive) console screen.
>>>
>>> The "zpool status" output showed:
>>>
>>>   cache
>>>   c6t5001517959467B45d0 FAULTED  2   542 0  too many errors
>>>
>>> This did not trigger any alerts from within Nexenta.
>>>
>>> I was under the impression that an L2ARC failure would not impact the
>>> system. But in this case, it was the culprit. I've never seen any
>>> recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>>> entirely from the server got me back running, but I'm concerned about 
>>> the
>>> impact of the device failure and the lack of notification from
>>> NexentaStor.
>> IIRC recently there was discussion on this list about firmware bug
>> on the Intel X25 SSDs causing them to fail under high disk IO with "reset 
>> storms".
> Even if so, this does not forgive ZFS hanging - especially
> if it detected the drive failure, and especially if this drive
> is not required for redundant operation.
>
> I've seen similar bad behaviour on my oi_148a box when
> I tested USB flash devices as L2ARC caches and
> occasionally they died by slightly moving out of the
> USB socket due to vibration or whatever reason ;)
>
> Similarly, this oi_148a box hung upon loss of SATA
> connection to a drive in the raidz2 disk set due to
> unreliable cable connectors, while it should have
> stalled IOs to that pool but otherwise the system
> should have remained remain responsive (tested
> failmode=continue and failmode=wait on different
> occasions).
>
> So I can relate - these things happen, they do annoy,
> and I hope they will be fixed sometime soon so that
> ZFS matches its docs and promises ;)
>

True, definitely sounds like a bug in ZFS as well..

-- Pasi



Re: [zfs-discuss] Q: pool didn't expand. why? can I force it?

2011-06-12 Thread Johan Eliasson
Indeed it was!

Thanks!!


Re: [zfs-discuss] Q: pool didn't expand. why? can I force it?

2011-06-12 Thread Tim Cook
On Sun, Jun 12, 2011 at 3:54 AM, Johan Eliasson <
johan.eliasson.j...@gmail.com> wrote:

> I replaced a smaller disk in my tank2, so now they're all 2TB. But look,
> zfs still thinks it's a pool of 1.5 TB disks:
>
> nebol@filez:~# zpool list tank2
> NAMESIZE  ALLOC   FREECAP  DEDUP  HEALTH  ALTROOT
> tank2  5.44T  4.20T  1.24T77%  1.00x  ONLINE  -
>
> nebol@filez:~# zpool status tank2
>  pool: tank2
>  state: ONLINE
>  scrub: none requested
> config:
>
>NAMESTATE READ WRITE CKSUM
>tank2   ONLINE   0 0 0
>  raidz1-0  ONLINE   0 0 0
>c8t0d0  ONLINE   0 0 0
>c8t1d0  ONLINE   0 0 0
>c8t2d0  ONLINE   0 0 0
>c8t3d0  ONLINE   0 0 0
>
> errors: No known data errors
>
> and:
>
>   6. c8t0d0 
>  /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@0,0
>   7. c8t1d0 
>  /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@1,0
>   8. c8t2d0 
>  /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@2,0
>   9. c8t3d0 
>
> So the question is, why didn't it expand? And can I fix it?
>
>
Autoexpand is likely turned off.
http://download.oracle.com/docs/cd/E19253-01/819-5461/githb/index.html
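
If so, something along these lines should do it (zpool online -e needs a
reasonably recent build; device names taken from your zpool status output
above):

zpool set autoexpand=on tank2
zpool online -e tank2 c8t0d0 c8t1d0 c8t2d0 c8t3d0   # ask each vdev to grow into the new space
zpool list tank2                                    # SIZE should now reflect the 2TB disks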

--Tim


[zfs-discuss] Q: pool didn't expand. why? can I force it?

2011-06-12 Thread Johan Eliasson
I replaced a smaller disk in my tank2, so now they're all 2TB. But look, zfs 
still thinks it's a pool of 1.5 TB disks:

nebol@filez:~# zpool list tank2
NAMESIZE  ALLOC   FREECAP  DEDUP  HEALTH  ALTROOT
tank2  5.44T  4.20T  1.24T77%  1.00x  ONLINE  -

nebol@filez:~# zpool status tank2
  pool: tank2
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
tank2   ONLINE   0 0 0
  raidz1-0  ONLINE   0 0 0
c8t0d0  ONLINE   0 0 0
c8t1d0  ONLINE   0 0 0
c8t2d0  ONLINE   0 0 0
c8t3d0  ONLINE   0 0 0

errors: No known data errors

and:

   6. c8t0d0 
  /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@0,0
   7. c8t1d0 
  /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@1,0
   8. c8t2d0 
  /pci@0,0/pci8086,29f1@1/pci8086,32c@0/pci11ab,11ab@1/disk@2,0
   9. c8t3d0 

So the question is, why didn't it expand? And can I fix it?