[zfs-discuss] Trick to keeping NFS file references in kernel memory for DTrace?

2012-10-03 Thread Mark

Hey all,

So I have a couple of storage boxes (NexentaCore & Illumian) and have 
been playing with some DTrace scripts to monitor NFS usage.  Initially I 
ran into the (seemingly common) problem of basically everything showing 
up as '<unknown>', and after some searching online I found a workaround: 
run a 'find' over the file system from the remote end, which refreshes 
the kernel's knowledge of the file names.  This works... however, it 
doesn't hold for good.  It sometimes lasts a couple of hours (and 
sometimes much less), and then we are back to receiving 
'<unknown>'s.
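For reference, the workaround is just a full walk of the share from an NFS 
client; the mount point below is a placeholder for whatever you use:

  find /mnt/share > /dev/null   # look up every name so the server-side kernel re-caches the paths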


Has anyone else come across something similar?  Does anyone know what 
may be causing the kernel to lose the references?  There is plenty of 
memory in the main system (72 GB, with the ARC sitting at ~53 GB and 11 GB 
'free'), so I don't think an OOM situation is causing it.



Otherwise, does anyone have any other tips for monitoring usage?  I 
wonder how they have it all working in Fishworks gear, as some of the 
analytics demos show you drilling down through file 
activity in real time.



Any advice or suggestions greatly appreciated.

Cheers,
Mark


Re: [zfs-discuss] Making ZIL faster

2012-10-03 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Schweiss, Chip
> 
> How can I determine for sure that my ZIL is my bottleneck?  If it is the
> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
> make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.

Temporarily set sync=disabled.
Or, depending on your application, leave it that way permanently.  I know that, for 
the work I do, most systems I support at most locations have sync=disabled.  It 
all depends on the workload.
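It's a one-liner per dataset (or on the pool root, letting it inherit); 'tank' 
below is a placeholder, and keep in mind that sync writes acknowledged but not 
yet committed to a txg (a few seconds' worth) can be lost on power failure:

  zfs set sync=disabled tank    # treat all writes as asynchronous
  zfs get sync tank             # verify the setting
  zfs set sync=standard tank    # revert to the default behavior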



Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-03 Thread Jim Klimov

2012-10-03 16:04, Fajar A. Nugraha wrote:

On Ubuntu + zfsonlinux + root/boot on zfs, the boot script helper is
"smart" enough to try all available device nodes, so it wouldn't
matter if the dev path/id/name changed. But ONLY if there's no
zpool.cache in the initramfs.

Not sure how easy it would be to port that functionality to solaris.



Thanks, I thought of zpool.cache too, but it is only listed in
/boot/solaris/filelist.safe, which ironically still exists -
though proper failsafe archives are not generated anymore.
Even bringing those back would be a huge step forward: a locally
hosted, self-sufficient, interactive mini OS image in an archive,
unpacked and booted by GRUB independently of Solaris's view
of the hardware, is much simpler than external live media...

Unfortunately, so far I haven't seen a way of fixing the boot
procedure short of hacking the binaries by compiling new ones,
i.e. I did not find any easily changeable scripted logic.
Admittedly, I have not yet looked much further than unpacking
the boot archive file itself and inspecting the files there.
There are not even any binaries in it, which I'm afraid means
the logic is in the kernel monofile... :(

//Jim


Re: [zfs-discuss] Making ZIL faster

2012-10-03 Thread Timothy Coalson
To answer your questions more directly, zilstat is what I used to check
what the ZIL was doing:

http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
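For reference, it takes an interval and count like the other *stat tools 
(options may differ between versions of the script), and plain per-vdev stats 
make a useful cross-check; 'tank' is a placeholder pool name:

  ./zilstat 10 6            # six 10-second samples of ZIL write activity
  zpool iostat -v tank 10   # compare bandwidth on the 'logs' vdev against the data vdevs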

While I have added a mirrored log device, I haven't tried adding multiple
sets of mirrored log devices, but I think it should work.  I believe that a
failed unmirrored log device is only a problem if the pool is ungracefully
closed before ZFS notices that the log device failed (i.e., simultaneous
power failure and log device failure), so mirroring them may not be
required.
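Something along these lines, if you try it (device names are placeholders; 
untested by me beyond a single pair):

  zpool add tank log mirror c4t0d0 c4t1d0   # first mirrored slog pair
  zpool add tank log mirror c4t2d0 c4t3d0   # second pair; ZIL allocations should spread across log vdevs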

Tim

On Wed, Oct 3, 2012 at 2:54 PM, Timothy Coalson  wrote:

> I found something similar happening when writing over NFS (at
> significantly lower throughput than available on the system directly),
> specifically that effectively all data, even asynchronous writes, were
> being written to the ZIL, which I eventually traced (with help from Richard
> Elling and others on this list) at least partially to the linux NFS client
> issuing commit requests before ZFS wanted to write the asynchronous data to
> a txg.  I tried fiddling with zfs_write_limit_override to get more data
> onto normal vdevs faster, but this reduced performance (perhaps setting a
> tunable to make ZFS not throttle writes while hitting the write limit could
> fix that), and didn't cause it to go significantly easier on the ZIL
> devices.  I decided to live with the default behavior, since my main
> bottleneck is ethernet anyway, and the projected lifespan of the ZIL
> devices was fairly large due to our workload.
>
> I did find that setting logbias=throughput on a zfs filesystem caused it
> to act as though the ZIL devices weren't there, which actually reduced
> commit times under continuous streaming writes (mostly due to having more
> throughput for the same amount of data to commit, in large chunks, but the
> zilstat script also reported less writing to the ZIL blocks (which are
> allocated from normal vdevs without a ZIL device, or with
> logbias=throughput) under this condition, so perhaps there is more to the
> story), so if you have different workloads for different datasets, this
> could help (since it isn't a poolwide setting).  Obviously, small
> synchronous writes to that zfs filesystem will take a large hit from this
> setting.
>
> It would be nice if there was a feature in ZFS that could direct small
> commits to ZIL blocks on log devices, but behave like logbias=throughput
> for large commits.  It would probably need manual tuning, but it would
> treat SSD log devices more gently, and increase performance for large
> contiguous writes.
>
> If you can't configure ZFS to write less data to the ZIL, I think a RAM
> based ZIL device would be a good way to get throughput up higher (and less
> worries about flash endurance, etc).
>
> Tim
>
> On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip  wrote:
>
>> I'm in the planning stages of a rather large ZFS system to house
>> approximately 1 PB of data.
>>
>> I have only one system with SSDs for L2ARC and ZIL.  The ZIL seems to be
>> the bottleneck for large bursts of data being written.  I can't confirm
>> this for sure, but when I throw enough data at my storage pool that the
>> write latency starts rising, the ZIL write speed hangs close to the max
>> sustained throughput I've measured on the SSD (~200 MB/s).
>>
>> When empty, and without L2ARC or ZIL, the pool was tested with Bonnie++ and
>> showed ~1300 MB/s serial read and ~800 MB/s serial write speed.
>>
>> How can I determine for sure that my ZIL is my bottleneck?  If it is the
>> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
>> to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.
>>
>> Thanks for any input,
>> -Chip
>>
>>
>>


Re: [zfs-discuss] Making ZIL faster

2012-10-03 Thread Timothy Coalson
I found something similar happening when writing over NFS (at significantly
lower throughput than available on the system directly), specifically that
effectively all data, even asynchronous writes, were being written to the
ZIL, which I eventually traced (with help from Richard Elling and others on
this list) at least partially to the linux NFS client issuing commit
requests before ZFS wanted to write the asynchronous data to a txg.  I
tried fiddling with zfs_write_limit_override to get more data onto normal
vdevs faster, but this reduced performance (perhaps setting a tunable to
make ZFS not throttle writes while hitting the write limit could fix that),
and didn't cause it to go significantly easier on the ZIL devices.  I
decided to live with the default behavior, since my main bottleneck is
ethernet anyway, and the projected lifespan of the ZIL devices was fairly
large due to our workload.

I did find that setting logbias=throughput on a zfs filesystem caused it to
act as though the ZIL devices weren't there, which actually reduced commit
times under continuous streaming writes (mostly due to having more
throughput for the same amount of data to commit, in large chunks, but the
zilstat script also reported less writing to the ZIL blocks (which are
allocated from normal vdevs without a ZIL device, or with
logbias=throughput) under this condition, so perhaps there is more to the
story), so if you have different workloads for different datasets, this
could help (since it isn't a poolwide setting).  Obviously, small
synchronous writes to that zfs filesystem will take a large hit from this
setting.
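For reference, it's set per dataset, e.g. (dataset names are placeholders):

  zfs set logbias=throughput tank/scratch   # bulk/streaming data: bypass the slog
  zfs set logbias=latency tank/vmstore      # the default: small sync writes go to the log device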

It would be nice if there was a feature in ZFS that could direct small
commits to ZIL blocks on log devices, but behave like logbias=throughput
for large commits.  It would probably need manual tuning, but it would
treat SSD log devices more gently, and increase performance for large
contiguous writes.

If you can't configure ZFS to write less data to the ZIL, I think a RAM
based ZIL device would be a good way to get throughput up higher (and less
worries about flash endurance, etc).

Tim

On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip  wrote:

> I'm in the planning stages of a rather large ZFS system to house
> approximately 1 PB of data.
>
> I have only one system with SSDs for L2ARC and ZIL.  The ZIL seems to be
> the bottleneck for large bursts of data being written.  I can't confirm
> this for sure, but when I throw enough data at my storage pool that the
> write latency starts rising, the ZIL write speed hangs close to the max
> sustained throughput I've measured on the SSD (~200 MB/s).
>
> When empty, and without L2ARC or ZIL, the pool was tested with Bonnie++ and
> showed ~1300 MB/s serial read and ~800 MB/s serial write speed.
>
> How can I determine for sure that my ZIL is my bottleneck?  If it is the
> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
> to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.
>
> Thanks for any input,
> -Chip
>
>
>


[zfs-discuss] Making ZIL faster

2012-10-03 Thread Schweiss, Chip
I'm in the planning stages of a rather large ZFS system to house
approximately 1 PB of data.

I have only one system with SSDs for L2ARC and ZIL.  The ZIL seems to be
the bottleneck for large bursts of data being written.  I can't confirm
this for sure, but when I throw enough data at my storage pool that the
write latency starts rising, the ZIL write speed hangs close to the max
sustained throughput I've measured on the SSD (~200 MB/s).

When empty, and without L2ARC or ZIL, the pool was tested with Bonnie++ and
showed ~1300 MB/s serial read and ~800 MB/s serial write speed.

How can I determine for sure that my ZIL is my bottleneck?  If it is the
bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.

Thanks for any input,
-Chip


Re: [zfs-discuss] vm server storage mirror

2012-10-03 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
> 
> it doesn't work right - It turns out, iscsi
> devices (And I presume SAS devices) are not removable storage.  That
> means, if the device goes offline and comes back online again, it doesn't just
> gracefully resilver and move on without any problems, it's in a perpetual
> state of IO error, device unreadable.  

I am revisiting this issue today.  I've tried everything I can think of to 
recreate it, and haven't been able to.  I have certainly 
encountered some bad behaviors - which I'll expound upon momentarily - but they 
all seem to be addressable, fixable, logical problems, and none of them result 
in a supposedly good pool (as reported in zpool status) returning scsi IO 
errors or halting the system.  The most likely explanation right now for the 
bad behavior I saw before - a perpetual IO error even after restoring the 
connection - is that I screwed something up in my iscsi config the first time.

Herein lie the new problems:

If I don't export the pool before rebooting, then either the iscsi target or 
initiator is shut down before the filesystems are unmounted.  So the system 
spews all sorts of error messages while trying to go down, but it eventually 
succeeds.  It's somewhat important to know whether it was the target or initiator 
that went down first - if it was the target, then only the local disks became 
inaccessible, but if it was the initiator, then both the local and remote disks 
became inaccessible.  I don't know yet.

Upon reboot, the pool fails to import, so the svc:/system/filesystem/local 
service fails and comes up in maintenance mode.  The whole world is a mess: 
you have to log in at the physical text console to export the pool, and reboot.  But 
it comes up cleanly the second time.
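In other words, something like this at the console (pool name is a 
placeholder; the svcadm step may or may not be needed, depending on how far 
boot got):

  zpool export tank                            # let go of the half-imported pool
  svcadm clear svc:/system/filesystem/local    # clear the maintenance state, if set
  reboot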

These sorts of problems seem like they should be solvable by introducing some 
service manifest dependencies...  But there's no way to make it a 
generalization for the distribution as a whole (illumos/openindiana/oracle).  
It's just something that should be solvable on a case-by-case basis.

If you are going to be an initiator only, then it makes sense for 
svc:/network/iscsi/initiator to be required by svc:/system/filesystem/local 

If you are going to be a target only, then it makes sense for 
svc:/system/filesystem/local to be required by svc:/network/iscsi/target

If you are going to be a target & initiator, then you could get yourself into a 
deadlock situation.  Make the filesystem depend on the initiator, and make the 
initiator depend on the target, and make the target depend on the filesystem.  
Uh-oh.

But we can break that cycle easily enough in a lot of situations - if you're 
doing as I'm doing, where the only targets are raw devices (not zvols), then it 
should be ok to make the filesystem depend on the initiator, which depends on 
the target, and the target doesn't depend on anything.

If you're both a target and an initiator, but all of your targets are zvols 
that you export to other systems (you're not nesting a filesystem in a zvol of 
your own, are you?), then it's ok to say that the target needs the filesystem and 
the filesystem needs the initiator, but the initiator doesn't need anything.

So in my case, since I'm sharing raw disks, I'm going to try to make the filesystem 
need the initiator, the initiator need the target, and the target need nothing.

Haven't tried yet... Hopefully Google will help accelerate me figuring out how 
to do that.
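From memory, the generic SMF recipe should be something like this (untested 
sketch; the property group name 'iscsi-initiator' is arbitrary):

  # make svc:/system/filesystem/local wait for the iscsi initiator
  svccfg -s svc:/system/filesystem/local addpg iscsi-initiator dependency
  svccfg -s svc:/system/filesystem/local setprop iscsi-initiator/grouping = astring: require_all
  svccfg -s svc:/system/filesystem/local setprop iscsi-initiator/restart_on = astring: none
  svccfg -s svc:/system/filesystem/local setprop iscsi-initiator/type = astring: service
  svccfg -s svc:/system/filesystem/local setprop iscsi-initiator/entities = fmri: svc:/network/iscsi/initiator
  svcadm refresh svc:/system/filesystem/local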



Re: [zfs-discuss] Failure to zfs destroy - after interrupting zfs receive

2012-10-03 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Ariel T. Glenn
> 
> I have the same issue as described by Ned in his email.  I had a zfs
> recv going that deadlocked against a zfs list; after a day of leaving
> them hung I finally had to hard reset the box (shutdown wouldn't, since
> it couldn't terminate the processes).  When it came back up, I wanted to
> zfs destroy that last snapshot but I got the dreaded

For what it's worth - that is precisely the behavior I saw.  No "zfs" or 
"zpool" commands would return, and eventually the system hung badly enough that I 
had to power cycle.  Afterward, I was unable to destroy the 
filesystem, the snapshot, or any clones.  I posted here but didn't get any 
response...  In the end, I did a "zfs send" of my filesystem somewhere else, 
destroyed & recreated the pool, and did a "zfs send" of the filesystem back.

http://mail.opensolaris.org/pipermail/zfs-discuss/2012-September/052412.html
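The round trip was roughly the following; all names here are placeholders, 
and the new pool layout is whatever you need:

  zfs snapshot -r tank/data@evac
  zfs send -R tank/data@evac | ssh otherhost zfs recv -d backup
  zpool destroy tank
  zpool create tank mirror c0t0d0 c0t1d0    # example layout only
  ssh otherhost zfs send -R backup/data@evac | zfs recv -d tank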



Re: [zfs-discuss] Failure to zfs destroy - after interrupting zfs receive

2012-10-03 Thread Ariel T. Glenn
I have the same issue as described by Ned in his email.  I had a zfs
recv going that deadlocked against a zfs list; after a day of leaving
them hung I finally had to hard reset the box (shutdown wouldn't work, since
it couldn't terminate the processes).  When it came back up, I wanted to
zfs destroy that last snapshot, but I got the dreaded

cannot destroy 'export/upload@partial-2012-10-01_08:00:00': snapshot is
cloned

but there are no clones:

root@ms8 # zdb -d export/upload | grep '%'
root@ms8 #

and an attempt to remove what the clone ought to be fails:

zfs destroy export/upload/%partial-2012-10-01_08:00:00
cannot open 'export/upload/%partial-2012-10-01_08:00:00': dataset does
not exist

This isn't opensolaris, it's SunOS 5.10 Generic_142901-06 from before
Oracle took it over, but that's not going to make any difference as to
the bug, I think.  Any ideas?





Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-03 Thread Fajar A. Nugraha
On Wed, Oct 3, 2012 at 5:43 PM, Jim Klimov  wrote:
> 2012-10-03 14:40, Ray Arachelian wrote:
>
>> On 10/03/2012 05:54 AM, Jim Klimov wrote:
>>>
>>> Hello all,
>>>
>>>It has often been asked and discussed on the list how to
>>> change rpool HDDs from AHCI to IDE mode and back, with the
>>> modern routine involving reconfiguration of the BIOS, bootup
>>> from separate live media, a simple import and export of the
>>> rpool, and bootup from the rpool.

IIRC, when working with Xen I had to boot with a live CD, import the
pool, then power off (without exporting the pool). Then it could boot.
Somewhat in line with what you described.

>> The documented way is to
>>> reinstall the OS upon HW changes. Both are inconvenient to
>>> say the least.
>>
>>
>> Any chance to touch /reconfigure, power off, then change the BIOS
>> settings and reboot, like in the old days?   Or maybe with passing -r
>> and optionally -s and -v from grub like the old way we used to
>> reconfigure Solaris?
>
>
> Tried that, does not help. Adding forceloads to /etc/system
> and remaking the boot archive - also no.

On Ubuntu + zfsonlinux + root/boot on zfs, the boot script helper is
"smart" enough to try all available device nodes, so it wouldn't
matter if the dev path/id/name changed. But ONLY if there's no
zpool.cache in the initramfs.

Not sure how easy it would be to port that functionality to solaris.
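On Ubuntu that amounts to roughly this (a sketch; I haven't checked it on 
every release):

  rm /etc/zfs/zpool.cache   # or move it out of the way
  update-initramfs -u       # rebuild the initramfs without the cached pool config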

-- 
Fajar


Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-03 Thread Jim Klimov

2012-10-03 14:40, Ray Arachelian wrote:

On 10/03/2012 05:54 AM, Jim Klimov wrote:

Hello all,

   It has often been asked and discussed on the list how to
change rpool HDDs from AHCI to IDE mode and back, with the
modern routine involving reconfiguration of the BIOS, bootup
from separate live media, a simple import and export of the
rpool, and bootup from the rpool. The documented way is to
reinstall the OS upon HW changes. Both are inconvenient, to
say the least.


Any chance to touch /reconfigure, power off, then change the BIOS
settings and reboot, like in the old days?   Or maybe with passing -r
and optionally -s and -v from grub like the old way we used to
reconfigure Solaris?


Tried that, does not help. Adding forceloads to /etc/system
and remaking the boot archive - also no.

//Jim


Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-03 Thread Ray Arachelian
On 10/03/2012 05:54 AM, Jim Klimov wrote:
> Hello all,
>
>   It has often been asked and discussed on the list how to
> change rpool HDDs from AHCI to IDE mode and back, with the
> modern routine involving reconfiguration of the BIOS, bootup
> from separate live media, a simple import and export of the
> rpool, and bootup from the rpool. The documented way is to
> reinstall the OS upon HW changes. Both are inconvenient, to
> say the least.

Any chance to touch /reconfigure, power off, then change the BIOS
settings and reboot, like in the old days?   Or maybe by passing -r
(and optionally -s and -v) from GRUB, the old way we used to
reconfigure Solaris?
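For reference, the old sequence was roughly this (the GRUB kernel line is 
from memory, so check it against your menu.lst):

  touch /reconfigure   # schedule a reconfiguration boot
  init 6

  # or, one-off from the GRUB menu: edit the kernel line and append -r (plus -s/-v if wanted)
  kernel$ /platform/i86pc/kernel/$ISADIR/unix -r -v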


[zfs-discuss] Changing rpool device paths/drivers

2012-10-03 Thread Jim Klimov

Hello all,

  It has often been asked and discussed on the list how to
change rpool HDDs from AHCI to IDE mode and back, with the
modern routine involving reconfiguration of the BIOS, bootup
from separate live media, a simple import and export of the
rpool, and bootup from the rpool. The documented way is to
reinstall the OS upon HW changes. Both are inconvenient, to
say the least.

  Linux and recent Windows are much more careless about
total changes of hardware underneath the OS image between
boots; they just boot up and work. Why do we shoot ourselves
in the foot with this boot-up problem?

  Now that I'm trying to dual-boot my OI-based system, I hit
the problem hard: I have either HW SATA (AMD Hudson, often
not recognized upon bootup, but that's another story) and
VirtualBox SATA on different pci dev/vendor IDs, or physical
and virtual IDE, which result in the same device path via cmdk
and pci-ide - so I'm stuck with IDE mode, at least for these
compatibility reasons.

  So the basic question is: WHY does the OS insist on using the
device path (the /pci... string) coded into the rpool's vdevs
mid-way through bootup, during the VFS root-import routine, and
fail with a panic if the device naming has changed, when the
loader (GRUB), for example, already had no problem reading
the same rpool? Is there any rationale or historic baggage
behind this situation? Is it a design error or an oversight?

  Isn't it possible to use the same routine as for other
pool imports, including the import of this same rpool from a
live-media boot - just find the component devices (starting
with the one passed by the loader and/or matching by pool
name and/or GUID) and import the resulting pool? Perhaps
this could be attempted if the current method fails, before
resorting to a kernel panic - try another method first.

  Would this be a sane thing to change, or are there known
beasts lurking in the dark?

Thanks,
//Jim Klimov