Re: [zfs-discuss] Periodic flush

2008-04-15 Thread Mark Maybee
ZFS has always done a certain amount of write throttling.  In the past
(or the present, for those of you running S10 or pre build 87 bits) this
throttling was controlled by a timer and the size of the ARC: we would
cut a transaction group every 5 seconds based on our timer, and
we would also cut a transaction group if we had more than 1/4 of the
ARC size worth of dirty data in the transaction group.  So, for example,
if you have a machine with 16GB of physical memory it wouldn't be
unusual to see an ARC size of around 12GB.  This means we would allow
up to 3GB of dirty data into a single transaction group (if the writes
complete in less than 5 seconds).  Now we can have up to three
transaction groups in progress at any time: open context, quiesce
context, and sync context.  As a final wrinkle, we also don't allow more
than 1/2 the ARC to be composed of dirty write data.  All taken
together, this means that there can be up to 6GB of writes in the pipe
(using the 12GB ARC example from above).
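The old limits can be sanity-checked with a little arithmetic (a sketch; the helper name is hypothetical, the 1/4 and 1/2 fractions come from the description above):

```python
GB = 1 << 30

def old_throttle_limits(arc_bytes):
    """Dirty-data limits under the old (pre-build-87) scheme:
    up to 1/4 of the ARC per transaction group, and no more than
    1/2 of the ARC dirty overall across the open/quiesce/sync groups."""
    per_txg = arc_bytes // 4
    total_dirty = arc_bytes // 2
    return per_txg, total_dirty

# The 16GB-machine example from above: a ~12GB ARC allows 3GB per
# txg and up to 6GB of writes in the pipe.
per_txg, total = old_throttle_limits(12 * GB)
print(per_txg // GB, total // GB)  # 3 6
```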

Problems with this design start to show up when the write-to-disk
bandwidth can't keep up with the application: if the application is
writing at a rate of, say, 1GB/sec, it will fill the pipe within
6 seconds.  But if the IO bandwidth to disk is only 512MB/sec, it's
going to take 12 seconds to get this data onto the disk.  This impedance
mismatch is going to manifest as pauses:  the application fills
the pipe, then waits for the pipe to empty, then starts writing again.
Note that this won't be smooth, since we need to complete an entire
sync phase before allowing things to progress.  So you can end up
with IO gaps.  This is probably what the original submitter is
experiencing.  Note there are a few other subtleties here that I
have glossed over, but the general picture is accurate.

The new write throttle code put back into build 87 attempts to
smooth out the process.  We now measure the amount of time it takes
to sync each transaction group, and the amount of data in that group.
We dynamically resize our write throttle to try to keep the sync
time constant (at 5secs) under write load.  We also introduce
fairness delays on writers when we near pipeline capacity: each
write is delayed 1/100sec when we are about to fill up.  This
prevents a single heavy writer from starving out occasional
writers.  So instead of coming to an abrupt halt when the pipeline
fills, we slow down our write pace.  The result should be a constant
even IO load.
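The new sizing rule can be sketched in a toy model (not the kernel code; the helper names and the "near capacity" threshold are hypothetical, while the 5-second sync target and 1/100-second delay come from the text above):

```python
MB = 1 << 20

def dirty_limit(pool_bw_bytes_per_sec, target_sync_secs=5):
    """Size the write throttle to pool bandwidth: admit only as much
    dirty data as the pool can sync in ~target_sync_secs."""
    return pool_bw_bytes_per_sec * target_sync_secs

def writer_delay(dirty_bytes, limit_bytes, delay_secs=0.01):
    """Fairness delay: near pipeline capacity, each write is held
    1/100 sec instead of letting the pipeline fill abruptly.
    (The 90% trigger point here is an assumption for illustration.)"""
    return delay_secs if dirty_bytes >= 0.9 * limit_bytes else 0.0

# The 512MB/sec pool from the example: ~2.5GB of dirty data keeps
# the sync phase busy for about 5 seconds.
limit = dirty_limit(512 * MB)
print(limit // MB)                        # 2560
print(writer_delay(0.95 * limit, limit))  # 0.01
```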

There is one downside to this new model: if a write load is very
bursty, e.g., a large 5GB write followed by 30secs of idle, the
new code may be less efficient than the old.  In the old code, all
of this IO would be let in at memory speed and then more slowly make
its way out to disk.  In the new code, the writes may be slowed down.
The data makes its way to the disk in the same amount of time, but
the application takes longer.  Conceptually: we are sizing the write
buffer to the pool bandwidth, rather than to the memory size.

Robert Milkowski wrote:
 Hello eric,
 
 Thursday, March 27, 2008, 9:36:42 PM, you wrote:
 
 ek On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote:
 On Thu, 27 Mar 2008, Neelakanth Nadgir wrote:
 This causes the sync to happen much faster, but as you say,  
 suboptimal.
 Haven't had the time to go through the bug report, but probably
 CR 6429205 each zpool needs to monitor its throughput
 and throttle heavy writers
 will help.
 I hope that this feature is implemented soon, and works well. :-)
 
 ek Actually, this has gone back into snv_87 (and no we don't know which  
 ek s10uX it will go into yet).
 
 
 Could you share more details how it works right now after change?
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID-Z resilver broken

2007-04-11 Thread Mark Maybee

ugh, thanks for exploring this and isolating the problem.  We will look
into what is going on (wrong) here.  I have filed bug:

6545015 RAID-Z resilver broken

to track this problem.

-Mark

Marco van Lienen wrote:

On Sat, Apr 07, 2007 at 05:05:18PM -0500, in a galaxy far far away, Chris 
Csanady said:

In a recent message, I detailed the excessive checksum errors that
occurred after replacing a disk.  It seems that after a resilver
completes, it leaves a large number of blocks in the pool which fail
to checksum properly.  Afterward, it is necessary to scrub the pool in
order to correct these errors.

After some testing, it seems that this only occurs with RAID-Z.  The
same behavior can be observed on both snv_59 and snv_60, though I do
not have any other installs to test at the moment.


A colleague at work and I have followed the same steps, including running a 
digest on the /test/file, on an SXCE 61 build today, and can confirm the exact 
same (and disturbing) result.
My colleague mentioned to me he has witnessed the same 'resilver' behavior on
builds 57 and 60.

The box these steps were performed on was 'luupgraded' from SXCE 60 to 61 
using the SUNWlu* packages from 61!

# cat /etc/release
            Solaris Nevada snv_61 X86
  Copyright 2007 Sun Microsystems, Inc.  All Rights Reserved.
               Use is subject to license terms.
                  Assembled 26 March 2007

# mkdir /tmp/test
# mkfile 64m /tmp/test/0 /tmp/test/1
# zpool create test raidz /tmp/test/0 /tmp/test/1
# mkfile 16m /test/file
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
# 
# zpool export test

# rm /tmp/test/0
# zpool import -d /tmp/test test
# mkfile 64m /tmp/test/0
# zpool replace test /tmp/test/0
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
# zpool status test
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Wed Apr 11 15:19:15 2007
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            /tmp/test/0  ONLINE       0     0     0
            /tmp/test/1  ONLINE       0     0     0

errors: No known data errors
# zpool scrub test
#
# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Wed Apr 11 15:22:30 2007
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            /tmp/test/0  ONLINE       0     0    17
            /tmp/test/1  ONLINE       0     0     0

errors: No known data errors

I don't think these checksum errors are a good sign. 
The sha1 digest on the file *does* come out the same, so the question arises:

is the resilver process truly broken (even though in this test case the test
file does appear to be unchanged based on the sha1 digest)?

Marco


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Something like spare sectors...

2007-04-10 Thread Mark Maybee

Anton B. Rang wrote:

This sounds a lot like:

6417779 ZFS: I/O failure (write on ...) -- need to
reallocate writes

Which would allow us to retry write failures on
alternate vdevs.


Of course, if there's only one vdev, the write should be retried to a different 
block on the original vdev ... right?


Yes, although it depends on the nature of the write failure.  If the
write failed because the device is no longer available, ZFS will not
continue to try different blocks.

-Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs snapshot issues.

2007-04-10 Thread Mark Maybee

Joseph Barbey wrote:

Matthew Ahrens wrote:

Joseph Barbey wrote:

Robert Milkowski wrote:

JB So, normally, when the script runs, all snapshots finish in 
maybe a minute
JB total.  However, on Sundays, it continues to take longer and 
longer.   On
JB 2/25 it took 30 minutes, and this last Sunday, it took 2:11.  
The only
JB thing special thing about Sunday's snapshots is that they are 
the first
JB ones created since the full backup (using NetBackup) on 
Saturday. All

JB other backups are incrementals.

hm do you have atime property set to off?
Maybe you spend most of the time in destroying snapshots due to much
larger delta coused by atime updates? You can possibly also gain some
performance by setting atime to off.


Yep, atime is set to off for all pools and filesystems.  I looked 
through the other possible properties, and nothing really looked like 
it would really affect things.


One additional weird thing.  My script hits each filesystem 
(email-pool/A..Z) individually, so I can run zfs list -t snapshot and 
find out how long each snapshot actually takes.  Everything runs fine 
until I get to around V or (normally) W.  Then it can take a couple 
of hours on the one FS.  After that, the rest go quickly.


So, what operation exactly is taking a couple of hours on the one 
FS?  The only one I can imagine taking more than a minute would be 
'zfs destroy', but even that should be very rare on a snapshot.  Is it 
always the same FS that takes longer than the rest?  Is the pool busy 
when you do the slow operation?


I've now determined that renaming the previous snapshot seems to be the 
problem in certain instances.


What we are currently doing through the script is to keep 2 weeks of 
daily snapshots of the various pool/filesystems.  These snapshots are 
named {fs}.$Day-1, {fs}.$Day-2, and {fs}.snap.  Specifically, for our 
'V' filesystem, which is created under the email-pool, I will have the 
following snapshots:


  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]
  email-pool/[EMAIL PROTECTED]

So, my script does the following for each FS:
  Check for FS.$Day-2.  If exists, then destroy it.
  Check if there is a FS.$Day-1.  If so, rename it to $DAY-2.
  Check for FS.snap. If so, rename to FS.$Yesterday-1 (day it was created).
  Create FS.snap
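The four steps above could be sketched as command generation (a sketch only; the snapshot names are hypothetical stand-ins for the redacted names in the log below, and a real script would check that each snapshot exists before acting on it):

```python
def rotation_cmds(fs, day, yesterday):
    """Build the zfs commands for one filesystem's daily snapshot
    rotation, following the four steps described above."""
    return [
        ["zfs", "destroy",  f"{fs}@{day}-2"],                  # drop oldest
        ["zfs", "rename",   f"{fs}@{day}-1", f"{fs}@{day}-2"],
        ["zfs", "rename",   f"{fs}@snap",    f"{fs}@{yesterday}-1"],
        ["zfs", "snapshot", f"{fs}@snap"],                     # new daily
    ]

for cmd in rotation_cmds("email-pool/V", "Sun", "Sat"):
    print(" ".join(cmd))
```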

I added logging to a file, along with the action just run and the time 
that it completed:


  Destroy email-pool/[EMAIL PROTECTED]   Sun Apr  8 00:01:04 CDT 2007
  Rename email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED]   Sun Apr  8 00:01:05 CDT 2007
  Rename email-pool/[EMAIL PROTECTED] email-pool/[EMAIL PROTECTED]   Sun Apr  8 00:54:52 CDT 2007
  Create email-pool/[EMAIL PROTECTED]   Sun Apr  8 00:54:53 CDT 2007

Looking at the above, Rename took from 00:01:05 until 00:54:52, so 
almost 54 minutes.


So, any ideas on why a rename should take so long?  And again, why is 
this only happening on Sunday?  Any other information I can provide that 
might help diagnose this?



This could be an instance of:

6509628 unmount of a snapshot (from 'zfs destroy') is slow

The fact that this bug comes from a destroy op is not relevant; what is
relevant is the required unmount (also required in a rename op).  Has
there been recent activity in the Sunday-1 snapshot (like a backup or
'find' perhaps)?  This will cause the unmount to proceed very slowly.

-Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: [zfs-code] Contents of transaction group?

2007-04-09 Thread Mark Maybee

Atul Vidwansa wrote:

Hi,
   I have few questions about the way a transaction group is created.

1. Is it possible to group transactions related to multiple operations
in same group? For example, an rmdir foo followed by mkdir bar,
can these end up in same transaction group?


Yes.


2. Is it possible for an operation (say write()) to occupy multiple
transaction groups?


Yes.  Writes are broken into transactions at block boundaries.  So it
is possible for a large write to span multiple transaction groups.


3. Is it possible to know the thread id(s) for every committed txg_id?


No.

-Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: zfs blocks numbers for small files

2007-04-04 Thread Mark Maybee

Frederic Payet - Availability Services wrote:

Hi gurus,
When creating some small files in a ZFS directory, the number of used blocks
is not what might be expected:

hinano# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
pool2  702K  16.5G  26.5K  /pool2
pool2/new  604K  16.5G34K  /pool2/new
pool2/new/fs2  570K  16.5G   286K  /pool2/new/fs2
pool2/new/fs2/subfs2   284K  16.5G   284K  /pool2/new/fs2/subfs2


hinano# pwd
/pool2/new/fred

hinano# zfs get all pool2/new
NAME       PROPERTY       VALUE                  SOURCE
pool2/new  type           filesystem             -
pool2/new  creation       Tue Mar 20 13:27 2007  -
pool2/new  used           603K                   -
pool2/new  available      16.5G                  -
pool2/new  referenced     33.5K                  -
pool2/new  compressratio  1.00x                  -
pool2/new  mounted        yes                    -
pool2/new  quota          none                   default
pool2/new  reservation    none                   default
pool2/new  recordsize     128K                   default
pool2/new  mountpoint     /pool2/new             default
pool2/new  sharenfs       off                    default
pool2/new  checksum       on                     default
pool2/new  compression    off                    default
pool2/new  atime          on                     default
pool2/new  devices        on                     default
pool2/new  exec           on                     default
pool2/new  setuid         on                     default
pool2/new  readonly       off                    default
pool2/new  zoned          off                    default
pool2/new  snapdir        hidden                 default
pool2/new  aclmode        groupmask              default
pool2/new  aclinherit     secure                 default
hinano# mkfile 9 file9bytes
hinano# mkfile 9 file9bytes

hinano# mkfile 520 file520bytes
hinano# mkfile 1025 file1025bytes
hinano# mkfile 1023 file1023bytes
hinano# mkfile 10 file10bytes

hinano# ls -ls
total 14
  3 -rw--T   1 root root1023 Apr  4 13:34 file1023bytes
  4 -rw--T   1 root root1025 Apr  4 13:34 file1025bytes
  1 -rw--T   1 root root  10 Apr  4 13:38 file10bytes
  3 -rw--T   1 root root 520 Apr  4 13:33 file520bytes
  2 -rw--T   1 root root   9 Apr  4 13:33 file9bytes

After 2 seconds :
hinano# ls -ls
total 13
  3 -rw--T   1 root root1023 Apr  4 13:34 file1023bytes
  4 -rw--T   1 root root1025 Apr  4 13:34 file1025bytes
  2 -rw--T   1 root root  10 Apr  4 13:38 file10bytes
  3 -rw--T   1 root root 520 Apr  4 13:33 file520bytes
  2 -rw--T   1 root root   9 Apr  4 13:33 file9bytes

2 questions :
- Could somebody explain why a file of 9 bytes takes two 512-byte blocks?


One block for the znode (the meta-data), one block for the data.


- Why has the block count of file10bytes changed after a while, doing
 nothing more than 'ls -ls'?


The block count reflects the actual allocated storage on disk.  The
first time you did an 'ls' the data block had not yet been allocated
(i.e., the data was still in transit to the disk).
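Both answers can be checked against the listing with a little arithmetic (a simplification: one 512-byte sector for the znode plus the data rounded up to whole sectors, which matches the settled 'ls -ls' output above):

```python
import math

def ls_blocks(size_bytes, sector=512):
    """Approximate 512-byte block count reported by 'ls -ls' for a
    small ZFS file: one sector of metadata (the znode) plus the data
    block rounded up to whole sectors."""
    return 1 + max(1, math.ceil(size_bytes / sector))

for size in (9, 10, 520, 1023, 1025):
    print(size, ls_blocks(size))
# 9->2, 10->2, 520->3, 1023->3, 1025->4: the counts in the
# second listing, once all data blocks reached the disk.
```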



Please reply to me directly as I'm not on this alias.

Best,
fred





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HELIOS and ZFS cache

2007-02-22 Thread Mark Maybee

This issue has been discussed a number of times in this forum.
To summarize:

ZFS (specifically, the ARC) will try to use *most* of the systems
available memory to cache file system data.  The default is to
max out at physmem-1GB (i.e., use all of physical memory except
for 1GB).  In the face of memory pressure, the ARC will give up
memory, however there are some situations where we are unable to
free up memory fast enough for an application that needs it (see
example in the HELIOS note below).  In these situations, it may
be necessary to lower the ARC's maximum memory footprint, so that
there is a larger amount of memory immediately available for
applications.  This is particularly relevant in situations where
there is a known amount of memory that will always be required for
use by some application (databases often fall into this category).
The tradeoff here is that the ARC will not be able to cache as much
file system data, and that could impact performance.

For example, if you know that an application will need 5GB on a
36GB machine, you could set the arc maximum to 30GB (0x780000000).

In ZFS on s10 prior to update 4, you can only change the arc max
size via explicit actions with mdb(1):

# mdb -kw
> arc::print -a c_max
address c_max = current-max
> address/Z new-max

In the current opensolaris nevada bits, and in s10u4, you can use
the system variable 'zfs_arc_max' to set the maximum arc size.  Just
set this in /etc/system.
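For instance, capping the ARC at 30GB on the 36GB machine from the example might look like this in /etc/system (the tunable name comes from the text above; 0x780000000 is 30 * 2^30):

```
set zfs:zfs_arc_max = 0x780000000
```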

-Mark

Erik Vanden Meersch wrote:


Could someone please provide comments or solution for this?

Subject: Solaris 10 ZFS problems with database applications


HELIOS TechInfo #106



Tue, 20 Feb 2007

Solaris 10 ZFS problems with database applications
--

We have tested Solaris 10 release 11/06 with ZFS without any problems
using all HELIOS UB based products, including very high load tests.

However we learned from customers that some database solutions (known
are Sybase and Oracle), when allocating a large amount of memory may
slow down or even freeze the system for up to a minute. This can
result in RPC timeout messages and service interrupts for HELIOS
processes. ZFS is basically using most memory for file caching.
Freeing this ZFS memory for the database memory allocation can result
into serious delays. This does not occur when using HELIOS products
only.

HELIOS tested system was using 4GB memory.
Customer production machine was using 16GB memory.


Contact your SUN representative how to limit the ZFS cache and what
else to consider using ZFS in your workflow.

Check also with your application vendor for recommendations using ZFS
with their applications.


Best regards,

HELIOS Support

HELIOS Software GmbH
Steinriede 3
30827 Garbsen (Hannover)
Germany

Phone:  +49 5131 709320
FAX:+49 5131 709325
http://www.helios.de

--
http://www.sun.com/solaris  * Erik Vanden Meersch *
Solution Architect

*Sun Microsystems, Inc.*
Phone x48835/+32-2-704 8835
Mobile 0479/95 05 98
Email [EMAIL PROTECTED]




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] zpool export consumes whole CPU and takes more than 30 minutes to complete

2007-02-13 Thread Mark Maybee

Robert,

This doesn't look like cache flushing, rather it looks like we are
trying to finish up some writes... but are having a hard time allocating
space for them.  Is this pool almost 100% full?  There are lots of
instances of zio_write_allocate_gang_members(), which indicates a very
high degree of fragmentation in the pool.

-Mark

Robert Milkowski wrote:

Hi.

T2000 1.2GHz 8-core, 32GB RAM, S10U3,  zil_disable=1.

Command 'zpool export f3-2' is hung for 30 minutes now and still is going.
Nothing else is running on the server. I can see one CPU being 100% in SYS like:

bash-3.00# mpstat 1
[...]
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   0   67   220  110   200000 00   0   0 100
  10   00 1000000 00   0   0 100
  20   00 2020000 50   0   0 100
  30   00 1000000 00   0   0 100
  40   00 60   100000 00   0   0 100
  50   00 60   100000 00   0   0 100
  60   00 1000000 00   0   0 100
  70   0   62 80   140000 00   0   0 100
  80   00 60   100010 00   0   0 100
  90   01 70   120000 00   0   0 100
 100   00 80   140000 00   0   0 100
 110   00 4060000 00   0   0 100
 120   01210   390020 00   0   0 100
 130   01 5080000 00   0   0 100
 140   0   1842   35   120000 00   0   0 100
 150   01 5320000 00   0   0 100
 160   0   1111960010 00   0   0 100
 170   00 70   120000130   0   0 100
 180   00 60   100000 00   0   0 100
 190   00 1000000 00   0   0 100
 200   00 5080000200   0   0 100
 210   03 60   100020   1880   0   0 100
 220   00 2020000 00   0   0 100
 230   01 5080000   2190   0   0 100
 240   00 1000000 00   0   0 100
 250   00 1000000 00 100   0   0
 260   00 1000000 00   0   0 100
 270   00 1000000 00   0   0 100
 280   00 5080000 00   3   0  97
 290   01 4060010   2260   0   0 100
 300   01 2020000 00   0   0 100
 310   00 1000000 00   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   0   61   219  108   220020 00   1   0  99
  10   00 2020100 10   0   0 100
  20   00 2020000 50   0   0 100
  30   00 1000000 00   0   0 100
  40   00 5080000 00   0   0 100
  50   00 5080000 00   0   0 100
  60   00 1000000 00   0   0 100
  70   03 90   160040 00   0   0 100
  80   00 4060000 00   0   0 100
  90   00 60   100000 00   0   0 100
 100   01 70   120000 00   0   0 100
 110   00 4060000 00   0   0 100
 120   02220   410010 00   0   0 100
 130   00 60   100000 00   0   0 100
 140   0   2046   38   140010 00   0   0 100
 150   00 2100000 00   0   0 100
 160   0   1210960000 00   0   0 100
 170   00 90   160020130   0   0 100
 180   00 4060000 00   0   0 100
 190   00 2020000 00   0   0 100
 200   00 3040000100   0   0 100
 210   03 70   101040   1850   0   0 100
 220   00 1000000 00   0   0 100

Re: [zfs-discuss] file not persistent after node bounce when there is a bad disk?

2007-01-31 Thread Mark Maybee

Peter Buckingham wrote:

Hi Eric,

eric kustarz wrote:
The first thing i would do is see if any I/O is happening ('zpool 
iostat 1').  If there's none, then perhaps the machine is hung (which 
you then would want to grab a couple of '::threadlist -v 10's from mdb 
to figure out if there are hung threads).


there seems to be no IO after the initial IO according to zpool iostat. 
When we run zpool status it hangs:


HON hcb116 ~ $ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
   hang

I'll send you the mdb output privately since it's quite big.

60 seconds should be plenty of time for the async write(s) to 
complete.  We try to push out txg (transaction groups) every 5 
seconds.  However, if the system is overloaded, then the txgs could 
take longer.


That's what I would have thought.

The 'sync' hanging is intriguing.  Perhaps the system is just 
overloaded and the sync command is making it worse.  Seeing what 'fsync' 
would do would be interesting.


I've not tried this yet.


What else is the machine doing?


we are running the honeycomb environment (you can see when I send you 
the mdb output).


is there some issue for the zpool mirrors if one of the slices
disappears or is unresponsive after the pool has been brought online?


This can be a problem if an IO issued to the device never completes
(i.e., hangs).  This can hang up the pool.  A well-behaved device/driver
should eventually time out the IO, but we have seen instances where
this never seems to happen.

-Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-11 Thread Mark Maybee

Al Hopper wrote:

On Wed, 10 Jan 2007, Mark Maybee wrote:


Jason J. W. Williams wrote:

Hi Robert,

Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?


In a worst-case scenario, Robert's calculations are accurate to a
certain degree:  If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of related data held in memory: vnodes/znodes/
dnodes/etc.  This related data is the in-core data associated with
an accessed file.  It's not quite true that this data is not evictable,
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the free of
the related data structures (and even then, the kernel will need to
do a kmem_reap to reclaim the memory from the caches).  The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in slabs.

We are in the process of trying to improve this situation.

 snip .

Understood (and many Thanks).  In the meantime, is there a rule-of-thumb
that you could share that would allow mere humans (like me) to calculate
the best values of zfs:zfs_arc_max and ncsize, given the that machine has
nGb of RAM and is used in the following broad workload scenarios:

a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single user Gnome desktop running U3 with home dirs on a ZFS
filesystem

It would seem, from reading between the lines of previous emails,
particularly the ones you've (Mark M) written, that there is a rule of
thumb that would apply given a standard or modified ncsize tunable??

I'm primarily interested in a calculation that would allow settings that
would reduce the possibility of the machine descending into swap hell.


Ideally, there would be no need for any tunables; ZFS would always do
the right thing.   This is our grail.  In the meantime, I can give some
recommendations, but there is no rule of thumb that is going to work
in all circumstances.

ncsize: As I have mentioned previously, there are overheads
associated with caching vnode data in ZFS.  While
the physical on-disk data for a znode is only 512bytes,
the related in-core cost is significantly higher.
Roughly, you can expect that each ZFS vnode held in
the DNLC will cost about 3K of kernel memory.

So, you need to set ncsize appropriately for how much
memory you are willing to devote to it.  500,000 entries
is going to cost you 1.5GB of memory.
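That arithmetic, as a quick check (the ~3K-per-entry figure is the estimate from the text above; the helper name is hypothetical):

```python
def dnlc_cost_bytes(ncsize, per_entry=3 * 1024):
    """Approximate kernel memory consumed when ncsize ZFS vnodes are
    held in the DNLC, at roughly 3K per entry."""
    return ncsize * per_entry

# 500,000 entries -> about 1.5GB (1.43 GiB) of kernel memory
print(round(dnlc_cost_bytes(500_000) / 2**30, 2))  # 1.43
```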

zfs_arc_max: This is the maximum amount of memory you want the
ARC to be able to use.  Note that the ARC won't
necessarily use this much memory: if other applications
need memory, the ARC will shrink to accommodate.
Although, also note that the ARC *can't* shrink if all
of its memory is held.  For example, data in the DNLC
cannot be evicted from the ARC, so this data must first
be evicted from the DNLC before the ARC can free up
space (this is why it is dangerous to turn off the ARC's
ability to evict vnodes from the DNLC).

Also keep in mind that the ARC size does not account for
many in-core data structures used by ZFS (znodes/dnodes/
dbufs/etc).  Roughly, for every 1MB of cached file
pointers, you can expect another 3MB of memory used
outside of the ARC.  So, in the example above, where
ncsize is 500,000, the ARC is only seeing about 400MB
of the 1.5GB consumed.  As I have stated previously,
we consider this a bug in the current ARC accounting
that we will soon fix.  This is only an issue in
environments where many files are being accessed.  If
the number of files accessed is relatively low, then
the ARC size will be much closer to the actual memory
consumed by ZFS.

So, in general, you should not really need to tune
zfs_arc_max.  However, in environments where you have
specific applications that consume known quantities of
memory (e.g. database), it will likely help to set the
ARC max size down, to guarantee that the necessary
kernel memory is available.  There may be other times
when it will be beneficial to explicitly set the ARCs
maximum

Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-11 Thread Mark Maybee

Jason J. W. Williams wrote:

Hi Mark,

That does help tremendously. How does ZFS decide which zio cache to
use? I apologize if this has already been addressed somewhere.


The ARC caches data blocks in the zio_buf_xxx() cache that matches
the block size.  For example, dnode data is stored on disk in 16K
blocks (32 dnodes/block), so zio_buf_16384() is used for those blocks.
Most file data blocks (in large files) are stored in 128K blocks, so
zio_buf_131072() is used, etc.
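So the cache is selected purely by block size (a sketch of the naming pattern described above; the helper function is hypothetical):

```python
def zio_cache_name(block_bytes):
    """Name of the kmem cache the ARC uses for a data block of the
    given size, following the zio_buf_<size-in-bytes> pattern."""
    return f"zio_buf_{block_bytes}"

print(zio_cache_name(16 * 1024))   # zio_buf_16384  (16K dnode blocks)
print(zio_cache_name(128 * 1024))  # zio_buf_131072 (128K file data)
```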

-Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Mark Maybee

Jason J. W. Williams wrote:

Hi Robert,

Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?


In a worst-case scenario, Robert's calculations are accurate to a
certain degree:  If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of related data held in memory: vnodes/znodes/
dnodes/etc.  This related data is the in-core data associated with
an accessed file.  It's not quite true that this data is not evictable,
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the free of
the related data structures (and even then, the kernel will need to
do a kmem_reap to reclaim the memory from the caches).  The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in slabs.

We are in the process of trying to improve this situation.


That kind of memory loss makes ZFS almost unusable for a database system.


Note that you are not going to experience these sorts of overheads
unless you are accessing *many* files.  In a database system, there are
only going to be a few files = no significant overhead.


I agree that a page cache similar to UFS would be much better.  Linux
works similarly to free pages, and it has been effective enough in the
past. Though I'm equally unhappy about Linux's tendency to grab every
bit of free RAM available for filesystem caching, and then cause
massive memory thrashing as it frees it for applications.


The page cache is much better in the respect that it is more tightly
integrated with the VM system, so you get more efficient response to
memory pressure.  It is *much worse* than the ARC at caching data for
a file system.  In the long-term we plan to integrate the ARC into the
Solaris VM system.


Best Regards,
Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote:

Hello Jason,

Wednesday, January 10, 2007, 9:45:05 PM, you wrote:

JJWW Sanjeev  Robert,

JJWW Thanks guys. We put that in place last night and it seems to be 
doing
JJWW a lot better job of consuming less RAM. We set it to 4GB and 
each of
JJWW our 2 MySQL instances on the box to a max of 4GB. So hopefully 
slush

JJWW of 4GB on the Thumper is enough. I would be interested in what the
JJWW other ZFS modules memory behaviors are. I'll take a perusal through
JJWW the archives. In general it seems to me that a max cap for ZFS 
whether
JJWW set through a series of individual tunables or a single root 
tunable

JJWW would be very helpful.

Yes it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) would behave similar to page cache
like with UFS so applications will be able to get back almost all
memory used for ZFS caches if needed.

I guess (and it's really a guess only based on some emails here) that
in worst case scenario ZFS caches would consume about:

  arc_max + 3*arc_max + memory lost for fragmentation

So I guess with arc_max set to 1GB you can lose even 5GB (or more) and
currently only that first 1GB can be get back automatically.


--
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53

2007-01-05 Thread Mark Maybee

Thomas,

This could be fragmentation in the meta-data caches.  Could you
print out the results of ::kmastat?

-Mark

Tomas Ögren wrote:

On 05 January, 2007 - Robert Milkowski sent me these 3,8K bytes:


Hello Tomas,

I saw the same behavior here when ncsize was increased from default.
Try with the default and let's see what happens - if it works then it's
better than hanging every hour or so.


That's still not the point.. It was fine with ncsize=500k (and all of it
used) for a while.. then all of a sudden it just went haywire.. and when
it freed up the dnlc, I got back 250MB.. where's the rest, ~1750MB, tied up?

/Tomas



Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53

2007-01-05 Thread Mark Maybee

So it looks like this data does not include ::kmastat info from *after*
you reset arc_reduce_dnlc_percent.  Can I get that?

What I suspect is happening:
1 with your large ncsize, you eventually ran the machine out
  of memory because (currently) the arc is not accounting for
  the space consumed by auxiliary caches (dnode_t, etc.).
2 the arc could not reduce at this point since almost all of
  its memory was tied up by the dnlc refs.
3 when you eventually allowed the arc to reduce the dnlc size,
  it managed to free up some space, but much of this did not
  appear because it was tied up in slabs in the auxiliary
  caches (fragmentation).

We are working on a fix for number 1 above.
You should *not* be setting arc_reduce_dnlc_percent to zero, even if
you want a large number of dnlc entries.  You are tying the arc's hands
here, so it has no ability to reduce its size.
Number 3 is the most difficult issue.  We are looking into that at the
moment as well.

-Mark

Tomas Ögren wrote:

On 05 January, 2007 - Mark Maybee sent me these 0,8K bytes:


Thomas,

This could be fragmentation in the meta-data caches.  Could you
print out the results of ::kmastat?


http://www.acc.umu.se/~stric/tmp/zfs-dumps.tar.bz2

memstat, kmastat and dnlc_nentries from 10 minutes after boot up until
the near death experience..

I've got vmcore as well if needed..

/Tomas



Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53

2007-01-05 Thread Mark Maybee

Tomas Ögren wrote:

On 05 January, 2007 - Mark Maybee sent me these 1,5K bytes:


So it looks like this data does not include ::kmastat info from *after*
you reset arc_reduce_dnlc_percent.  Can I get that?


Yeah, attached. (although about 18 hours after the others)


Excellent, this confirms #3 below.


What I suspect is happening:
1 with your large ncsize, you eventually ran the machine out
  of memory because (currently) the arc is not accounting for
  the space consumed by auxiliary caches (dnode_t, etc.).
2 the arc could not reduce at this point since almost all of
  its memory was tied up by the dnlc refs.
3 when you eventually allowed the arc to reduce the dnlc size,
  it managed to free up some space, but much of this did not
  appear because it was tied up in slabs in the auxiliary
  caches (fragmentation).

We are working on a fix for number 1 above.


Great!


You should *not* be setting arc_reduce_dnlc_percent to zero, even if
you want a large number of dnlc entries.  You are tying the arc's hands
here, so it has no ability to reduce its size.
Number 3 is the most difficult issue.  We are looking into that at the
moment as well.


Any idea where all the memory is going? I sure hope that 500k dnlc
entries (+dnode_t's etc belonging to that) isn't using up about 2GB
ram..?


Actually, that's pretty much what is happening:
500k dnlc = 170MB in the vnodes (vn_cache)
   + 320MB in znode_phys data (zio_buf_512)
   + 382MB in dnode_phys data (zio_buf_16384)
   + 208MB in dmu bufs (dmu_buf_impl_t)
   + 400MB in dnodes (dnode_t)
   + 120MB in znodes (zfs_znode_cache)
  -
total   1600MB

These numbers come from the last ::kmastat you ran before reducing the
DNLC size.  Note below that much of this space is still consumed by
these caches, even after the DNLC has dropped it references.  This is
largely due to fragmentation in the caches.


/Tomas




cachebufbufbufmemory alloc alloc 
namesize in use  totalin use   succeed  fail 
- -- -- -- - - - 
vn_cache 240 405388 657696 173801472948191 0 

...
zio_buf_512  512 137801 294975 161095680  43660052 0 

...
zio_buf_16384  16384   6692   6697 109723648   5877279 0 

...
dmu_buf_impl_t   328 145260 622392 212443136  65461261 0 
dnode_t  640 137799 512508 349872128  37995548 0 

...
zfs_znode_cache  200 137763 568040 116334592  35683478 0 
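Mark's per-cache breakdown adds up as stated; a quick shell check (all figures in MB, copied from the message above):

```shell
# Re-adding the per-cache breakdown for 500k DNLC entries
# (values in MB, taken from Mark's message above).
vn_cache=170      # vnodes (vn_cache)
znode_phys=320    # znode_phys data (zio_buf_512)
dnode_phys=382    # dnode_phys data (zio_buf_16384)
dmu_bufs=208      # dmu bufs (dmu_buf_impl_t)
dnodes=400        # dnodes (dnode_t)
znodes=120        # znodes (zfs_znode_cache)
total=$((vn_cache + znode_phys + dnode_phys + dmu_bufs + dnodes + znodes))
echo "total: ${total} MB"
```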



Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53

2007-01-03 Thread Mark Maybee

Tomas,

There are a couple of things going on here:

1. There is a lot of fragmentation in your meta-data caches (znode,
dnode, dbuf, etc).  This is burning up about 300MB of space in your
hung kernel.  This is a known problem that we are currently working
on.

2. While the ARC has set its desired size down to c_min (64MB), it's
actually still consuming ~800MB in the hung kernel.  This is odd.
The bulk of this space is in the 32K and 64K data caches.  Could
you print out the contents of ARC_anon, ARC_mru, ARC_mfu, ARC_mru_ghost,
and ARC_mfu_ghost?

-Mark

Tomas Ögren wrote:

Hello.

Having some hangs on a snv53 machine which is quite probably ZFS+NFS
related, since that's all the machine do ;)

The machine is a 2x750MHz Blade1000 with 2GB ram, using a SysKonnect
9821 GigE card (with their 8.19.1.3 skge driver) and two HP branded MPT
SCSI cards. Normal load is pretty much read all you can with misc
tarballs and isos since it's an NFS backend to our caching http/ftp
cluster delivering free software to the world.

What happens is that the machine just stops responding.. it can respond
to ping for a while (while userland is non-responsive, including
console) but after a while, that stops too..

Produced a panic to get a dump and tried ::memstat;
unterweser:/scratch/070103# mdb unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs dtrace ufs scsi_vhci pcisch
ssd fcp fctl qlc md ip hook neti sctp arp usba s1394 nca lofs zfs random
sd nfs ptm cpc ]

::memstat

Page Summary             Pages      MB   %Tot
------------------     -------   -----   ----
Kernel                  250919    1960    98%
Anon                       888       6     0%
Exec and libs              247       1     0%
Page cache                  38       0     0%
Free (cachelist)           405       3     0%
Free (freelist)           4370      34     2%

Total                   256867    2006
Physical                253028    1976

That doesn't seem too healthy to me.. probably something in the kernel
eating up everything and the machine is just swapping to death or something..

A dump from live kernel with mdb -k after 1.5h uptime;
Page Summary             Pages      MB   %Tot
------------------     -------   -----   ----
Kernel                  212310    1658    83%
Anon                     11307      88     4%
Exec and libs             2418      18     1%
Page cache               18400     143     7%
Free (cachelist)          4383      34     2%
Free (freelist)           8049      62     3%


The tweaks I have are:
set ncsize = 50
set nfs:nrnode = 50
set zfs:zil_disable=1
set zfs:zfs_vdev_cache_bshift=14
set zfs:zfs_vdev_cache_size=0

Which according to ::kmem_cache results in about:
030002e30008 dmu_buf_impl_t 00  328   487728
030002e30288 dnode_t    00  640   453204
030002e30508 arc_buf_hdr_t  00  144   103544
030002e30788 arc_buf_t  00   4036743
030002e30a08 zil_lwb_cache  00  2000
030002e30c88 zfs_znode_cache    00  200   453200

but those buffers equal to about 550MB..
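Multiplying each cache's buffer size by its buffers-in-use count from the `::kmem_cache` excerpt above roughly confirms the figure (this ignores slab overhead and fragmentation, which push the real footprint higher):

```shell
# Rough cross-check of the "about 550MB" figure: sum of
# buf_size * bufs_in_use for each cache listed above.
out=$(awk 'BEGIN {
  bytes = 328*487728 + 640*453204 + 144*103544 + 40*36743 + 200*453200
  printf "%.0f MB", bytes / (1024*1024)
}')
echo "$out"
```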

dnlc_nentries on the hung kernel has gone down to 15000.. (where are the
rest of the ~450k-15k dnodes/znodes hanging out?)

Hung kernel:

arc::print

{
anon = ARC_anon
mru = ARC_mru
mru_ghost = ARC_mru_ghost
mfu = ARC_mfu
mfu_ghost = ARC_mfu_ghost
size = 0x358a0600
p = 0x400
c = 0x400
c_min = 0x400
c_max = 0x5e114800
hits = 0xbc860fd
misses = 0x2f296e1
deleted = 0x1d88739
recycle_miss = 0xf7f30c
mutex_miss = 0x24b13d
evict_skip = 0x21501d02
hash_elements = 0x27f97
hash_elements_max = 0x27f97
hash_collisions = 0x1651b43
hash_chains = 0x7ac3
hash_chain_max = 0x12
no_grow = 0x1
}


Live kernel:

arc::print

{
anon = ARC_anon
mru = ARC_mru
mru_ghost = ARC_mru_ghost
mfu = ARC_mfu
mfu_ghost = ARC_mfu_ghost
size = 0x1b279400
p = 0x1a1dcaa4
c = 0x1a1dcaa4
c_min = 0x400
c_max = 0x5e114800
hits = 0xef7c96
misses = 0x25efa8
deleted = 0x1db537
recycle_miss = 0xa6221
mutex_miss = 0x12b59
evict_skip = 0x70d62b
hash_elements = 0xcda1
hash_elements_max = 0x1b589
hash_collisions = 0x18e58a
hash_chains = 0x3d16
hash_chain_max = 0xf
no_grow = 0x1
}


Should I post ::kmem_cache and/or ::kmastat somewhere? It's about
2*(20+30)kB..

/Tomas



Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53

2007-01-03 Thread Mark Maybee

Hmmm, so there is lots of evictable cache here (mostly in the MFU
part of the cache)... could you make your core file available?
I would like to take a look at it.

-Mark

Tomas Ögren wrote:

On 03 January, 2007 - Mark Maybee sent me these 5,0K bytes:


Tomas,

There are a couple of things going on here:

1. There is a lot of fragmentation in your meta-data caches (znode,
dnode, dbuf, etc).  This is burning up about 300MB of space in your
hung kernel.  This is a known problem that we are currently working
on.


Great!


2. While the ARC has set its desired size down to c_min (64MB), it's
actually still consuming ~800MB in the hung kernel.  This is odd.
The bulk of this space is in the 32K and 64K data caches.  Could
you print out the contents of ARC_anon, ARC_mru, ARC_mfu, ARC_mru_ghost,
and ARC_mfu_ghost?


Like this?


ARC_anon::print

{
list = {
list_size = 0
list_offset = 0
list_head = {
list_next = 0
list_prev = 0
}
}
lsize = 0
size = 0x19c000
hits = 0
mtx = {
_opaque = [ 0 ]
}
}

ARC_mru::print

{
list = {
list_size = 0x90
list_offset = 0x70
list_head = {
list_next = 0x30072a5b5f8
list_prev = 0x300758b6c70
}
}
lsize = 0x1f88200
size = 0x3e5c200
hits = 0x44c48ad
mtx = {
_opaque = [ 0 ]
}
}

ARC_mfu::print

{
list = {
list_size = 0x90
list_offset = 0x70
list_head = {
list_next = 0x30099c7a730
list_prev = 0x300dc11fee0
}
}
lsize = 0x2f2e4400
size = 0x318a8400
hits = 0x466bbec
mtx = {
_opaque = [ 0 ]
}
}

ARC_mru_ghost::print

{
list = {
list_size = 0x90
list_offset = 0x70
list_head = {
list_next = 0x300758b6eb0
list_prev = 0x300d65faa10
}
}
lsize = 0x97a3cc00
size = 0x97a3cc00
hits = 0xfa4a49
mtx = {
_opaque = [ 0 ]
}
}

ARC_mfu_ghost::print

{
list = {
list_size = 0x90
list_offset = 0x70
list_head = {
list_next = 0x3006c7c8ce0
list_prev = 0x300d65fa2c0
}
}
lsize = 0x879ddc00
size = 0x879ddc00
hits = 0x3b37c8
mtx = {
_opaque = [ 0 ]
}
}

/Tomas



Re: [zfs-discuss] ZFS related (probably) hangs due to memory exhaustion(?) with snv53

2007-01-03 Thread Mark Maybee

Ah yes!  Thank you Casper.  I knew this looked familiar! :-)
Yes, this is almost certainly what is happening here.   The
bug was introduced in build 51 and fixed in build 54.

[EMAIL PROTECTED] wrote:

Hmmm, so there is lots of evictable cache here (mostly in the MFU
part of the cache)... could you make your core file available?
I would like to take a look at it.


Isn't this just like:
6493923 nfsfind on ZFS filesystem quickly depletes memory in a 1GB system

Which was introduced in b51(or 52) and fixed in snv_54.

Casper



Re: [zfs-discuss] Uber block corruption?

2006-12-12 Thread Mark Maybee

[EMAIL PROTECTED] wrote:

Hello Casper,

Tuesday, December 12, 2006, 10:54:27 AM, you wrote:



So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will
become corrupt through something that doesn't also make all the data
also corrupt or inaccessible.



CDSC So how does this work for data which is freed and overwritten; does
CDSC the system make sure that none of the data referenced by any of the
CDSC old ueberblocks is ever overwritten?

Why should it? If blocks are not used by the current UB, I guess you
can safely assume they are free.




What if a newer UB is corrupted and you fall back to an older one?

Casper


A block freed in transaction group N cannot be reused until transaction
group N+3; so there is no possibility of referencing an overwritten
block unless you have to back off more than two uberblocks.  At this
point, blocks that have been overwritten will show up as corrupted (bad
checksums).

-Mark
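Mark's reuse rule can be sketched arithmetically (the txg number here is an arbitrary example, not from the thread):

```shell
# A block freed in txg N cannot be reallocated before txg N+3, so
# rolling back one or two uberblocks never lands on an overwritten
# block.  (txg 100 is a made-up example value.)
freed_txg=100
first_reuse_txg=$((freed_txg + 3))
echo "freed in txg ${freed_txg}, reusable from txg ${first_reuse_txg}"
# The uberblocks for txgs N, N+1 and N+2 therefore all predate any
# reuse; only backing off further than two uberblocks risks reading
# overwritten (checksum-failing) blocks.
```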


Re: [zfs-discuss] ZFS compression / ARC interaction

2006-12-07 Thread Mark Maybee

Andrew Miller wrote:
Quick question about the interaction of ZFS filesystem compression and the filesystem cache.  We have an Opensolaris (actually Nexenta alpha-6) box running RRD collection.   These files seem to be quite compressible.  A test filesystem containing about 3,000 of these files shows a compressratio of 12.5x.  


My question is about how the filesystem cache works with compressed files.  
Does the fscache keep a copy of the compressed data, or the uncompressed 
blocks?   To update one of these RRD files, I believe the whole contents are 
read into memory, modified, and then written back out.   If the filesystem 
cache maintained a copy of the compressed data, a lot more, maybe more than 10x 
more, of these files could be maintained in the cache.  That would mean we 
could have a lot more data files without ever needing to do a physical read.

Looking at the source code overview, it looks like the compression happens 
underneath the ARC layer, so by that I am assuming the uncompressed blocks 
are cached, but I wanted to ask to be sure.

Thanks!
-Andy
 

Yup, your assumption is correct.  We currently do compression below the
ARC.  We have contemplated caching data in compressed form, but have not
really explored the idea fully yet.

-Mark
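A back-of-envelope estimate shows why caching compressed blocks is tempting for workloads like Andrew's. The 1GB cache size is an arbitrary assumption; the 12.5x ratio is the one reported above:

```shell
# At a 12.5x compressratio, a cache of a given size could hold roughly
# 12.5x as much logical file data if blocks were cached compressed
# (metadata overhead and decompression cost ignored).
cache_mb=1024          # assume a 1GB data cache (illustrative)
ratio_tenths=125       # 12.5x, in tenths so the arithmetic stays integral
logical_mb=$((cache_mb * ratio_tenths / 10))
echo "~${logical_mb} MB of logical data cacheable"
```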


Re: [zfs-discuss] replacing a drive in a raidz vdev

2006-12-05 Thread Mark Maybee

Jeremy Teo wrote:

On 12/5/06, Bill Sommerfeld [EMAIL PROTECTED] wrote:


On Mon, 2006-12-04 at 13:56 -0500, Krzys wrote:
 mypool2/[EMAIL PROTECTED]  34.4M  -   151G  -
 mypool2/[EMAIL PROTECTED]  141K  -   189G  -
 mypool2/d3 492G   254G  11.5G  legacy

 I am so confused with all of this... Why its taking so long to 
replace that one

 bad disk?

To workaround a bug where a pool traverse gets lost when the snapshot
configuration of a pool changes, both scrubs and resilvers will start
over again any time you create or delete a snapshot.

Unfortunately, this workaround has problems of its own -- If your
inter-snapshot interval is less than the time required to complete a
scrub, the resilver will never complete.

The open bug is:

6343667 scrub/resilver has to start over when a snapshot is taken

if it's not going to be fixed any time soon, perhaps we need a better
workaround:



Anyone internal working on this?


Yes.  But it's going to be a few months.

-Mark


Re: [zfs-discuss] df -e in ZFS

2006-11-09 Thread Mark Maybee

Robert Milkowski wrote:

Hello John,

Thursday, November 9, 2006, 12:03:58 PM, you wrote:

JC Hi all,
JC When testing our programs, I got a problem. On UFS, we get the number of
JC free inode via 'df -e', then do some things based this value, such as 
JC create an empty file, the value will decrease by 1. But on ZFS, it does

JC not work. I still can get a number via 'df -e', and create a same empty
JC file, the value is not my expectation. So I use a loop to produce empty
JC files and watch the output of 'df -e'. After some long time, the number
JC is 671, then 639, 641, 603, 605, 609, 397, 607...
JC I check the number of files, yes, it increases steadily.

JC Could you explain it?

UFS has a static number of inodes in a given file system, so it's easy to
say how many free inodes are left.

ZFS creates inodes on demand so you can't say how many inodes you can
create - however I guess one could calculate the maximum possible number
of inodes to be created given the free space in a pool/fs.


Yup, this is what ZFS does.  It makes a *very rough* estimate of how
many empty files could be created given the amount of available space.
This number may be useful as some sort of upper bound, but no more than
that.

-Mark
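The kind of estimate Mark describes can be sketched in a few lines. The 512-byte per-empty-file cost and the 64GB free-space figure are assumptions for illustration, not the exact formula ZFS uses:

```shell
# Minimal sketch of a free-inode upper bound derived from free space:
# divide free bytes by an assumed minimal per-file cost.
free_bytes=$((64 * 1024 * 1024 * 1024))   # e.g. 64GB of free pool space
per_file_bytes=512                        # assumed cost of one empty file
max_files=$((free_bytes / per_file_bytes))
echo "rough upper bound: ${max_files} files"
```

As Mark notes, a number like this is only a loose upper bound; it shrinks as real data consumes the space.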


Re: [zfs-discuss] Re: reproducible zfs panic on Solaris 10 06/06

2006-11-03 Thread Mark Maybee

Matthew Flanagan wrote:

Matt,



Matthew Flanagan wrote:


mkfile 100m /data
zpool create tank /data


...


rm /data


...


panic[cpu0]/thread=2a1011d3cc0: ZFS: I/O failure


(write on unknown off 0: zio 60007432bc0 [L0
unallocated] 4000L/400P DVA[0]=0:b000:400
DVA[1]=0:120a000:400 fletcher4 lzjb BE contiguous
birth=6 fill=0
cksum=672165b9e7:328e78ae25fd:ed007c9008f5f:34c05b1090
0b668): error 6
...


is there a fix for this?


Um, don't do that?

This is a known bug that we're working on.



What is the bugid for this an ETA for fix?


6417779 ZFS: I/O failure (write on ...) -- need to reallocate writes
and
6322646 ZFS should gracefully handle all devices failing (when writing)

These bugs are actively being worked on, but it will probably be a while
before fixes appear.

-Mark

I'm extremely surprised that this kind of bug can make it into a Solaris 
release. This is the second zfs related panic that I've found in testing it in 
our labs. The first caused the system to panic when the ZFS volume got 
close to 100% full (Sun case id #10914593).

I've just replicated this panic with a USB flash drive as well by creating the 
zpool and then yanking the drive out. This is probably a common situation for 
desktop/laptop users who would not be impressed that their otherwise robust 
Solaris system crashed.

regards

matthew



Re: [zfs-discuss] Automounting ? (idea ?)

2006-09-27 Thread Mark Maybee

Patrick wrote:

Hi,

So recently, i decided to test out some of the ideas i've been toying
with, and decided to create 50 000 and 100 000 filesystems, the test
machine was a nice V20Z with dual 1.8 opterons, 4gb ram, connecting a
scsi 3310 raid array, via two scsi controllers.

Now creating the mass of filesystems, and the mass of properties i
randomly assigned them was pretty easy, and i must say, i LOVE zfs, i
really do LOVE zfs !

the script i created, basically created /data/clients/clientID, and
then randomly set a quota, as well as randomly decided if compression
was to be on, basically just to set properties for it, and such.
clientID is a numeric value which starts at 1 and continues
upwards.

Now, creating, i was quite surprised to see the amount of IO
generated on the array's management console, but nevertheless it
created them without a hitch, although it took a little while, in the
real world one wouldn't create 100 000 filesystems over night, and
even if one did, one could wait an hour, or two...

The problem came in when i had to reboot the machine, and well...
yes, a few hours later, it came up :)

So this got me thinking, ZFS makes a perfect solution for massive user
directory type solutions, and gives you the ability to have quota's
and such stored on the filesystem, and then export the root
filesystem, alas, some systems have thousands, if not hundreds of
thousands of users, where that would be an awesome solution, mounting
ALL of those filesystems on boot, becomes a pain.

So ... how about an automounter? Is this even possible? Does it exist ?

Helll!!

Patrick


*sigh*, one of the issues we recognized, when we introduced the new
cheap/fast file system creation, was that this new model would stress
the scalability (or lack thereof) of other parts of the operating
system.  This is a prime example.  I think the notion of an automount
option for zfs directories is an excellent one.  Solaris does support
automount, and it should be possible, by setting the mountpoint property
to legacy, to set up automount tables to achieve what you want now;
but it would be nice if zfs had a property to do this for you
automatically.

-Mark


Re: [zfs-discuss] Re: zfs panic installing a brandz zone

2006-09-15 Thread Mark Maybee

Yup, its almost certain that this is the bug you are hitting.

-Mark

Alan Hargreaves wrote:
I know, bad form replying to myself, but I am wondering if it might be 
related to


 6438702 error handling in zfs_getpage() can trigger page not 
locked


Which is marked fix in progress with a target of the current build.

alan.

Alan Hargreaves wrote:

Folks, before I start delving too deeply into this crashdump, has 
anyone seen anything like it?


The background is that I'm running a non-debug open build of b49 and 
was in the process of running the zoneadm -z redlx install 


After a bit, the machine panics, initially looking at the crashdump, 
I'm down to 88mb free (out of a gig) and see the following stack.


fe8000de7800 page_unlock+0x3b(180218720)
fe8000de78d0 zfs_getpage+0x236(89b84d80, 12000, 2000, 
fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, 
fe808180a000, 1,

80826dc8)
fe8000de7950 fop_getpage+0x52(89b84d80, 12000, 2000, 
fe8000de7a1c, fe8000de79b8, 2000, fbc29b20, 
fe8081818000, 1,

80826dc8)
fe8000de7a50 segmap_fault+0x1d6(801a6f38, 
fbc29b20, fe8081818000, 2000, 0, 1)
fe8000de7b30 segmap_getmapflt+0x67a(fbc29b20, 
89b84d80, 12000, 2000, 1, 1)

fe8000de7bd0 lofi_strategy_task+0x14b(959d2400)
fe8000de7c60 taskq_thread+0x1a7(84453da8)
fe8000de7c70 thread_start+8()

%rax = 0x %r9  = 0x0300430e
%rbx = 0x000e %r10 = 0x1000
%rcx = 0xfe8081819000 %r11 = 0x113709b0
%rdx = 0xfe8000de7c80 %r12 = 0x000180218720
%rsi = 0x00013000 %r13 = 0xfbc52160 
pse_mutex+0x200

%rdi = 0xfbc52160 pse_mutex+0x200 %r14 = 0x4000
%r8  = 0x0200 %r15 = 0xfe8000de79d8

%rip = 0xfb8474fb page_unlock+0x3b
%rbp = 0xfe8000de7800
%rsp = 0xfe8000de77e0
%rflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=of,df,IF,tf,sf,ZF,af,PF,cf

%cs = 0x0028%ds = 0x0043%es = 0x0043
%trapno = 0xe   %fs = 0xfsbase = 0x8000
   %err = 0x0   %gs = 0x01c3gsbase = 0xfbc27b70

While the panic string says NULL pointer dereference, it appears that 
0x180218720 is not mapped. The dereference looks like the first 
dereference in page_unlock(), which looks at pp->p_selock.


I can spend a little time looking at it, but was wondering if anyone 
had seen this kind of panic previously?


I have two identical crashdumps created in exactly the same way.

alan.







Re: [zfs-discuss] Re: Re: ZFS forces system to paging to the point it is

2006-09-13 Thread Mark Maybee

Robert Milkowski wrote:

Hello Philippe,

It was recommended to lower ncsize and I did (to default ~128K).
So far it works ok for last days and staying at about 1GB free ram
(fluctuating between 900MB-1,4GB).

Do you think it's a long term solution or with more load and more data
the problem can surface again even with current ncsize value?


Robert,

I don't think this should be impacted too much by load/data, as long
as the DNLC is able to evict, you should be in good shape.  We are
still working on a fix for the root cause of this issue however.

-Mark


[zfs-discuss] Re: when zfs enabled java

2006-09-13 Thread Mark Maybee

Jill Manfield wrote:

My customer is running java on a ZFS file system.  His platform is Solaris 10 
x86 SF X4200.  When he enabled ZFS his memory of 18 gigs drops to 2 gigs rather 
quickly.  I had him do a #  ps -e -o pid,vsz,comm | sort -n +1 and it came back:

The culprit application you see is java:
507 89464 /usr/bin/postmaster
515 89944 /usr/bin/postmaster
517 91136 /usr/bin/postmaster
508 96444 /usr/bin/postmaster
516 98088 /usr/bin/postmaster
503 3449580 /usr/jre1.5.0_07/bin/amd64/java
512 3732468 /usr/jre1.5.0_07/bin/amd64/java

Here is what the customer responded:
Well, Java is a memory hog, but it's not the leak -- it's the
application.  Even after it fails due to lack of memory, the memory is
not reclaimed and we can no longer restart it.
Is there a bug on zfs?  I did not find one in sunsolve but then again I might 
have been searching the wrong thing.

We have done some sleuth work and are starting to think our problem
might be ZFS -- the new file system Sun supports.  The documentation for
ZFS states that it tries to cache as much as it can, and it uses kernel
memory for the cache.  That would explain memory gradually disappearing.
ZFS can give memory back, but it does not do so quickly.


Yup, this is likely your problem.  ZFS takes a little time to give
back memory, and the app may fail with ENOMEM before this happens.


So, is there any way to check that?  If turns out to be the problem...

1) Is there a way to limit the size of ZFS's caches?


Well... sort of.  You can set the size of arc.c_max and this will
put an upper bound on the cache.  But this is a bit of a hack.


If not, then

2) Is there a way to clear ZFS's cache?


Try unmounting/mounting the file system, if that does not work,
try export/import of the pool.

-Mark


Re: [zfs-discuss] Memory Usage

2006-09-12 Thread Mark Maybee

Thomas Burns wrote:

Hi,

We have been using zfs for a couple of months now, and, overall, really
like it.  However, we have run into a major problem -- zfs's memory
requirements crowd out our primary application.  Ultimately, we have to
reboot the machine so there is enough free memory to start the application.

What I would like is:

1) A way to limit the size of the cache (a gig or two would be fine for us)

2) A way to clear the caches -- hopefully, something faster than rebooting
the machine.

Is there any way I can do either of these things?

Thanks,
Tom Burns


Tom,

What version of solaris are you running?  In theory, ZFS should not
be hogging your system memory to the point that it crowds out your
primary applications... but this is still an area that we are working
out the kinks in.  If you could provide a core dump of the machine
when it gets to the point that you can't start your app, it would
help us.

As to your questions; I will give you some ways to do these things,
but these are not considered best practice:

1) You should be able to limit your cache max size by setting
arc.c_max.  It's currently initialized to be phys-mem-size - 1GB.

2) First try unmount/remounting your file system to clear the
cache.  If that doesn't work, try exporting/importing your pool.

-Mark
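For reference, Mark's two workarounds translate into commands roughly like the following. This is a sketch only: the mdb approach is unsupported and build-dependent (symbol layout varies between releases), and the pool/file-system names are hypothetical:

```shell
# Unsupported, build-dependent workarounds (sketch only; verify symbol
# addresses against your own kernel before writing anything):
#
# 1) Cap the ARC on a live system with mdb:
#    echo "arc::print -a c_max" | mdb -k     # find the c_max address
#    echo "<addr>/Z 0x40000000" | mdb -kw    # write a 1GB cap at <addr>
#
# 2) Flush cached data for one file system, or for the whole pool
#    ("tank" and "tank/fs" are placeholder names):
#    zfs unmount tank/fs && zfs mount tank/fs
#    zpool export tank && zpool import tank
```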


Re: [zfs-discuss] Memory Usage

2006-09-12 Thread Mark Maybee

Thomas Burns wrote:


On Sep 12, 2006, at 2:04 PM, Mark Maybee wrote:


Thomas Burns wrote:


Hi,
We have been using zfs for a couple of months now, and, overall,  really
like it.  However, we have run into a major problem -- zfs's  memory  
requirements
crowd out our primary application.  Ultimately, we have to reboot  
the  machine

so there is enough free memory to start the application.
What I would like is:
1) A way to limit the size of the cache (a gig or two would be  fine  
for us)
2) A way to clear the caches -- hopefully, something faster than   
rebooting

the machine.
Is there any way I can do either of these things?
Thanks,
Tom Burns



Tom,

What version of solaris are you running?  In theory, ZFS should not
be hogging your system memory to the point that it crowds out your
primary applications... but this is still an area that we are working
out the kinks in.  If you could provide a core dump of the machine
when it gets to the point that you can't start your app, it would
help us.



We are running the jun 06 version of solaris (10/6?).  I don't have a  core
dump now -- but can probably get one in the next week or so.  Where  should
I send it?


You can drop cores via ftp to:

   sunsolve.sun.com
   login as anonymous or ftp
   deposit into /cores

Also, where do I set arc.c_max?  In /etc/system?  Out of curiosity, why
isn't limiting arc.c_max considered best practice (I just want to make
sure I am not missing something about the effect limiting it will have)?
My guess is that in our case (lots of small groups -- 50 people or less --
sharing files over the web) file system caches are not that useful.  The
small groups mean that no one file gets used that often and, since access
is over the web, their response time will be largely limited by their
internet connection.



We don't want users to need to tune a bunch of knobs to get performance
out of ZFS.  We want it to work well out of the box.  So we are trying
to discourage using these tunables, and instead figure out what the root
problem is and fix it.  There is really no reason why zfs shouldn't be
able to adapt itself appropriately to the available memory.


Thanks a lot for the response!



As to your questions; I will give you some ways to do these things,
but these are not considered best practice:

1) You should be able to limit your cache max size by setting
arc.c_max.  It's currently initialized to be phys-mem-size - 1GB.

2) First try unmount/remounting your file system to clear the
cache.  If that doesn't work, try exporting/importing your pool.

-Mark



Tom Burns





Re: [zfs-discuss] Re: Re: Re: ZFS forces system to paging to the point it is

2006-09-07 Thread Mark Maybee

Jürgen Keil wrote:

We are trying to obtain a mutex that is currently held
by another thread trying to get memory.



Hmm, reminds me a bit of the zvol swap hang I got
some time ago:

http://www.opensolaris.org/jive/thread.jspa?threadID=11956tstart=150

I guess if the other thead is stuck trying to get memory, then
it is allocating the memory with KM_SLEEP, while holding
a mutex?
 

Yup, this is essentially another instance of this problem.

-Mark



Re: [zfs-discuss] Performance problem of ZFS ( Sol 10U2 )

2006-09-07 Thread Mark Maybee

Ivan,

What mail clients use your mail server?  You may be seeing the
effects of:

6440499 zil should avoid txg_wait_synced() and use dmu_sync() to issue 
parallel IOs when fsyncing


This bug was fixed in nevada build 43, and I don't think it made it into
s10 update 2.  It will, of course, be in update 3 and be available in
a patch at some point.

Ivan Debnár wrote:

Hi,
 
I deployed ZFS on our mailserver recently, hoping for eternal peace after running on UFS and moving files with each TB added.
 
It is a mailserver - its mdirs are on a ZFS pool:

                           capacity     operations    bandwidth
pool                     used  avail   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
mailstore                3.54T  2.08T    280    295  7.10M  5.24M
  mirror                  590G   106G     34     31   676K   786K
    c6t3d0                   -      -     14     16   960K   773K
    c8t22260001552EFE2Cd0    -      -     16     18  1.06M   786K
  mirror                  613G  82.9G     51     37  1.44M   838K
    c6t3d1                   -      -     20     19  1.57M   824K
    c5t1d1                   -      -     20     24  1.40M   838K
    c8t227C0001559A761Bd0    -      -      5    101   403K  4.63M
  mirror                  618G  78.3G    133     60  6.23M   361K
    c6t3d2                   -      -     40     27  3.21M   903K
    c4t2d0                   -      -     23     81  1.91M  2.98M
    c8t221200015599F2CFd0    -      -      6    108   442K  4.71M
  mirror                  613G  83.2G    110     51  3.66M   337K
    c6t3d3                   -      -     36     25  2.72M   906K
    c5t2d1                   -      -     29     65  1.80M  2.92M
  mirror                  415G  29.0G     30     28   460K   278K
    c6t3d4                   -      -     11     19   804K   268K
    c4t1d2                   -      -     15     22   987K   278K
  mirror                  255G   441G     26     49   536K  1.02M
    c8t22110001552F3C46d0    -      -     12     27   835K  1.02M
    c8t224B0001559BB471d0    -      -     12     29   835K  1.02M
  mirror                  257G   439G     32     52   571K  1.04M
    c8t22480001552D7AF8d0    -      -     14     28  1003K  1.04M
    c4t1d0                   -      -     14     32  1002K  1.04M
  mirror                  251G   445G     28     53   543K  1.02M
    c8t227F0001552CB892d0    -      -     13     28   897K  1.02M
    c8t22250001559830A5d0    -      -     13     30   897K  1.02M
  mirror                 17.4G   427G     22     38   339K   393K
    c8t22FA00015529F784d0    -      -      9     19   648K   393K
    c5t2d2                   -      -      9     23   647K   393K


It is 3x dual-iSCSI + 2x dual SCSI DAS arrays (RAID0, 13x250).

I have problem however:
The 2 SCSI arrays were able to handle the mail traffic fine with UFS on them.
The new config with 3 additional arrays seems to have problems using ZFS.
The writes are waiting 10-15 seconds to get to disk - so the queue fills very
quickly; reads are quite ok.
I assume this is the problem of ZFS preferring reads to writes.

I also see in 'zpool iostat -v 1' that writes are issued to disk only once
every 10 secs, and then it's ~2000 rq in one sec.
Reads are sustained at ca. 800 rq/s.

Is there a way to tune this read/write ratio? Is this a known problem?

I tried changing vq_max_pending as suggested by Eric in
http://blogs.sun.com/erickustarz/entry/vq_max_pending
but saw no change in this write behaviour.

Iostat shows ca. 20-30ms asvc_t, 0%w, and ca. 30% busy on all drives, so these
do not seem saturated. (Before, with UFS, they were at 90% busy, 1% wait.)

System is Sol 10 U2, sun x4200, 4GB RAM.

Please, could you give me some hint to really make this work, as the way
back to UFS is almost impossible on a live system.







Re: [zfs-discuss] Re: ZFS forces system to paging to the point it is

2006-09-06 Thread Mark Maybee

Hmmm, interesting data.  See comments in-line:

Robert Milkowski wrote:

Yes, server has 8GB of RAM.
Most of the time there's about 1GB of free RAM.

bash-3.00# mdb 0
Loading modules: [ unix krtld genunix dtrace specfs ufs sd md ip sctp usba fcp 
fctl qlc ssd lofs zfs random logindmux ptm cpc nfs ipc ]


arc::print


{
anon = ARC_anon
mru = ARC_mru
mru_ghost = ARC_mru_ghost
mfu = ARC_mfu
mfu_ghost = ARC_mfu_ghost
size = 0x8b72ae00


We are referencing about 2.2GB of data from the ARC.


p = 0xfe41b00
c = 0xfe51b00


We are trying to get down to our minimum target size (c == c_min, about
254MB here).  So we are obviously feeling memory pressure and trying to react.


c_min = 0xfe51b00
c_max = 0x1bca36000

...

::kmastat

cache               buf      buf      buf      memory      alloc  alloc
name               size   in use    total      in use    succeed   fail
---------------  ------  -------  -------  ----------  ---------  -----
...

vn_cache            240  2400324  2507745   662691840    6307891      0


This is very interesting: 2.4 million vnodes are active.

...

zio_buf_512         512  2388292  2388330  1304346624  176134688      0
zio_buf_1024       1024       18       96       98304    17058709      0
zio_buf_1536       1536        0       30       49152     2791254      0
zio_buf_2048       2048        0       20       40960     1051435      0
zio_buf_2560       2560        0       33       90112     1716360      0
zio_buf_3072       3072        0       40      122880     1902497      0
zio_buf_3584       3584        0      225      819200     3918593      0
zio_buf_4096       4096        3       34      139264    20336550      0
zio_buf_5120       5120        0      144      737280     8932632      0
zio_buf_6144       6144        0       36      221184     5274922      0
zio_buf_7168       7168        0       16      114688     3350804      0
zio_buf_8192       8192        0       11       90112     9131264      0
zio_buf_10240     10240        0       12      122880     2268700      0
zio_buf_12288     12288        0        8       98304     3258896      0
zio_buf_14336     14336        0       60      860160    15853089      0
zio_buf_16384     16384   142762   142793  2339520512    74889652      0
zio_buf_20480     20480        0        6      122880     1299564      0
zio_buf_24576     24576        0        5      122880     1063597      0
zio_buf_28672     28672        0        6      172032      712545      0
zio_buf_32768     32768        0        4      131072     1339604      0
zio_buf_40960     40960        0        6      245760     1736172      0
zio_buf_49152     49152        0        4      196608      609853      0
zio_buf_57344     57344        0        5      286720      428139      0
zio_buf_65536     65536      520      522    34209792     8839788      0
zio_buf_73728     73728        0        5      368640      284979      0
zio_buf_81920     81920        0        5      409600      133392      0
zio_buf_90112     90112        0        6      540672       96787      0
zio_buf_98304     98304        0        4      393216      133942      0
zio_buf_106496   106496        0        5      532480       91769      0
zio_buf_114688   114688        0        5      573440       72130      0
zio_buf_122880   122880        0        5      614400       52151      0
zio_buf_131072   131072      100      107    14024704     7326248      0
dmu_buf_impl_t      328  2531066  2531232   863993856   237052643      0
dnode_t             648  2395209  2395212  1635131392    83304588      0
arc_buf_hdr_t       128   142786   390852    50823168   155745359      0
arc_buf_t            40   142786   347333    14016512   160502001      0
zil_lwb_cache       208       28      468       98304    30507668      0
zfs_znode_cache     192  2388224  2388246   465821696    83149771      0

...
Because of all of those vnodes, we are seeing a lot of extra memory
being used by ZFS:
- about 1.5GB for the dnodes
- another 800MB for dbufs
- plus 1.3GB for the bonus buffers (not accounted for in the arc)
- plus about 400MB for znodes

This totals another ~4GB, plus the .6GB held in vnodes.
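As a sanity check, those figures follow directly from the ::kmastat columns above (bufsize times bufs-in-use; slab overhead is ignored, which is why the bonus-buffer line shows 1.3GB of slab memory in use against ~1.2GB of live buffers):

```shell
# Rough per-cache totals from the ::kmastat listing above
# (bufsize * bufs-in-use; slab overhead not counted):
echo "dnode_t:         $((648 * 2395209)) bytes"  # ~1.5GB of dnodes
echo "dmu_buf_impl_t:  $((328 * 2531066)) bytes"  # ~0.8GB of dbufs
echo "zio_buf_512:     $((512 * 2388292)) bytes"  # ~1.2GB of bonus buffers
echo "zfs_znode_cache: $((192 * 2388224)) bytes"  # ~0.45GB of znodes
```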

The question is who is holding these vnodes in memory... Could you do a
::dnlc!wc
and let me know what it comes back with?
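The same count can also be taken from a normal root shell rather than an interactive mdb session; a hedged sketch (::dnlc prints one line per name-cache entry, so counting lines approximates the number of vnodes the DNLC is holding):

```shell
# Count DNLC entries against the running kernel (run as root):
echo '::dnlc' | mdb -k | wc -l
```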

-Mark


Re: [zfs-discuss] Re: Re: ZFS forces system to paging to the point it is

2006-09-06 Thread Mark Maybee

Robert Milkowski wrote:



On Wed, 6 Sep 2006, Mark Maybee wrote:


Robert Milkowski wrote:


::dnlc!wc



 1048545 3145811 76522461


Well, that explains half your problem... and maybe all of it:




After I reduced the vdev prefetch from 64K to 8K, for the last few hours the
system has been working properly without the workaround, and free memory stays
at about 1GB.


Reducing the vdev prefetch to 8K also reduced read throughput 10x.

I believe this is somehow related - maybe the vdev cache was so aggressive (I
got 40-100MB/s of reads) and consumed memory so fast that the thread which
is supposed to reclaim memory couldn't keep up?


I suppose, although the data volume doesn't seem that high... maybe you
are just operating at the hairy edge here.  Anyway, I have filed a bug
to track this issue:

6467963 do_dnlc_reduce_cache() can be blocked by ZFS_OBJ_HOLD_ENTER()

-Mark


Re: [zfs-discuss] ZFS forces system to paging to the point it is unresponsive

2006-09-05 Thread Mark Maybee

Robert,

I would be interested  in seeing your crash dump.  ZFS will consume much
of your memory *in the absence of memory pressure*, but it should be
responsive to memory pressure, and give up memory when this happens.  It
looks like you have 8GB of memory on your system?  ZFS should never
consume more than 7GB of this under any circumstances.  Note there are
a few outstanding bugs that could be coming into play here:

6456888 zpool scrubbing leads to memory exhaustion and system hang
6416757 zfs could still use less memory
6447701 ZFS hangs when iSCSI Target attempts to initialize its backing store

-Mark

P.S. It would be useful to see the output of:
 arc::print
and
 ::kmastat
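A hedged sketch of gathering that output, either from the running kernel or from a saved crash dump:

```shell
# Against the live kernel (as root):
echo 'arc::print' | mdb -k
echo '::kmastat'  | mdb -k

# Or against a saved crash dump (unix.0/vmcore.0 in the current directory):
echo 'arc::print' | mdb 0
```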

Robert Milkowski wrote:

Hi.

   v440, S10U2 + patches

OS and Kernel Version: SunOS X 5.10 Generic_118833-20 sun4u sparc 
SUNW,Sun-Fire-V440

NFS server with ZFS as a local storage.


We were rsyncing a UFS filesystem to a ZFS filesystem exported over NFS. After some
time the server which exports ZFS over NFS became unresponsive. The operator decided
to force a panic and reboot the server. Further examination showed that the system
was heavily paging, probably due to ZFS, as no other services are running there.

I just had another problem - it looks similar to the last one.
I decided to put nfsd into the RT class.

I guess ZFS is using all memory for its caches and after some time it fails to
free it and forces the system to page. This is BAD, really BAD.


More details to previous problem.

bash-3.00# savecore /f3-1/
System dump time: Sat Sep 2 03:31:18 2006
Constructing namelist /f3-1//unix.0
Constructing corefile /f3-1//vmcore.0
100% done: 1043993 of 1043993 pages saved
bash-3.00# cd /f3-1/
bash-3.00#
bash-3.00# mdb 0
Loading modules: [ unix krtld genunix dtrace specfs ufs sd md ip sctp usba fcp 
fctl qlc ssd lofs zfs random logindmux ptm
cpc nfs ipc ]


::status


debugging crash dump vmcore.0 (64-bit) from XX
operating system: 5.10 Generic_118833-20 (sun4u)
panic message: sync initiated
dump content: kernel pages only


::spa


ADDR STATE NAME
060001271680 ACTIVE f3-1
060003bd4dc0 ACTIVE f3-2


::memstat


Page Summary          Pages      MB  %Tot
----------------   --------  ------  ----
Kernel              1016199    7939   98%
Anon                   4420      34    0%
Exec and libs           736       5    0%
Page cache               36       0    0%
Free (cachelist)       1962      15    0%
Free (freelist)       18338     143    2%

Total               1041691    8138
Physical            1024836    8006


::swapinfo


ADDR VNODE PAGES FREE NAME
0600034ab5a0 600012ff8c0 1048763 1028489 /dev/md/dsk/d15



We were synchronizing a lot of small files over NFS and writing to f3-1/d611. I
would say that with ZFS it's expected to be low on memory most of the time, but
not to the point where the host starts paging.

bash-3.00# sar -g

SunOS X 5.10 Generic_118833-20 sun4u 09/02/2006

00:00:00 pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
[...]
02:15:01 0.03 0.04 0.02 0.00 0.00
02:20:00 0.04 0.04 0.02 0.00 0.00
02:25:00 0.02 0.03 0.01 0.00 0.00
02:30:00 0.02 0.03 0.01 0.00 0.00
02:35:00 0.03 0.03 0.01 0.00 0.00
02:40:01 0.03 0.04 0.03 0.00 0.00
02:45:02 5.98 82.77 93.20 65115.59 0.00
03:39:28 unix restarts
03:40:00 0.35 0.61 0.61 0.00 60.00
03:45:00 0.03 0.06 0.06 0.00 0.00
03:50:00 0.02 0.03 0.02 0.00 0.00
03:55:00 0.02 0.02 0.02 0.00 0.00

bash-3.00# sar -u

SunOS  5.10 Generic_118833-20 sun4u 09/02/2006

00:00:00 %usr %sys %wio %idle
[...]
02:00:00 0 1 0 99
02:05:00 0 1 0 99
02:10:00 0 1 0 99
02:15:01 0 1 0 99
02:20:00 0 15 0 85
02:25:00 0 34 0 66
02:30:00 0 20 0 80
02:35:00 0 22 0 78
02:40:01 0 45 0 55
02:45:02 0 61 0 38
03:39:28 unix restarts
03:40:00 5 10 0 84
03:45:00 1 1 0 98
03:50:00 0 0 0 100

bash-3.00# sar -q

SunOS xxx 5.10 Generic_118833-20 sun4u 09/02/2006

00:00:00 runq-sz %runocc swpq-sz %swpocc
[...]
02:00:00 0.0 0 0.0 0
02:05:00 1.0 0 0.0 0
02:10:00 0.0 0 0.0 0
02:15:01 0.0 0 0.0 0
02:20:00 1.1 5 0.0 0
02:25:00 1.4 12 0.0 0
02:30:00 2.1 6 0.0 0
02:35:00 3.4 9 0.0 0
02:40:01 2.8 25 0.0 0
02:45:02 4.0 44 116.6 12
03:39:28 unix restarts
03:40:00 1.0 3 0.0 0
03:45:00 0.0 0 0.0 0
03:50:00 0.0 0 0.0 0


Crashdump could be provided off-the list and not for public eyes.
 
 
This message posted from opensolaris.org



Re: [zfs-discuss] Re: [fbsd] Porting ZFS file system to FreeBSD.

2006-08-22 Thread Mark Maybee

Michael Schuster - Sun Microsystems wrote:

Pawel Jakub Dawidek wrote:


On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote:


I don't know much about ZFS, but Sun states this is a 128 bits
filesystem.  How will you handle this in regards to the FreeBSD
kernel interface that is already struggling to be 64 bits
compliant ?  (I'm stating this based on this URL [1], but maybe
it's not fully up-to-date.)



128 bits is not my goal, but I do want all the other goodies:)



are you going to attempt on-disk compatibility?

Michael


Amazing work Pawel!  Please do try to maintain on-disk compatibility!
Let us know if you run into anything that might prevent that (or any
other issues that you run across).

-Mark


Re: [zfs-discuss] zpool import - cannot mount [...] directory is not empty

2006-08-16 Thread Mark Maybee

Robert,

Are you sure that nfs-s5-p0/d5110 and nfs-s5-p0/d5111 are mounted
following the import?  These messages imply that the d5110 and d5111
directories in the top-level filesystem of pool nfs-s5-p0 are not
empty.  Could you verify that 'df /nfs-s5-p0/d5110' displays
nfs-s5-p0/d5110 as the Filesystem (and not just nfs-s5-p0)?

-Mark

Robert Milkowski wrote:

All pools were exported, then I tried to import them one-by-one and got this with
only the first pool.

bash-3.00# zpool export nfs-s5-p4 nfs-s5-s5 nfs-s5-s6 nfs-s5-s7 nfs-s5-s8
bash-3.00# zpool import nfs-s5-p4
cannot mount '/nfs-s5-p4/d5139': directory is not empty
cannot mount '/nfs-s5-p4/d5141': directory is not empty
cannot mount '/nfs-s5-p4/d5138': directory is not empty
cannot mount '/nfs-s5-p4/d5142': directory is not empty
bash-3.00# df -h /nfs-s5-p4/d5139
Filesystem size   used  avail capacity  Mounted on
nfs-s5-p4/d5139600G   556G44G93%/nfs-s5-p4/d5139
bash-3.00# zpool export nfs-s5-p4
bash-3.00# ls -l /nfs-s5-p4/d5139
/nfs-s5-p4/d5139: No such file or directory
bash-3.00# ls -l /nfs-s5-p4/
total 0
bash-3.00# zpool import nfs-s5-p4
bash-3.00# uname -a
SunOS XXX 5.11 snv_43 sun4u sparc SUNW,Sun-Fire-V240
bash-3.00#

No problem with other pools - all other pools imported without any warnings.

The same on another server (all pools were exported first):

bash-3.00# zpool import nfs-s5-p0
cannot mount '/nfs-s5-p0/d5110': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
cannot mount 'nfs-s5-p0/d5112': mountpoint or dataset is busy
cannot mount '/nfs-s5-p0/d5111': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
bash-3.00# zpool export nfs-s5-p0
bash-3.00# zpool import nfs-s5-p0
cannot mount '/nfs-s5-p0/d5110': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
cannot mount '/nfs-s5-p0/d5111': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
bash-3.00# zpool export nfs-s5-p0
bash-3.00# ls -la /nfs-s5-p0/
total 4
drwxr-xr-x   2 root other512 Jun 14 14:37 .
drwxr-xr-x  40 root root1024 Aug  8 11:00 ..
bash-3.00# zpool import nfs-s5-p0
cannot mount '/nfs-s5-p0/d5110': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
cannot mount 'nfs-s5-p0/d5112': mountpoint or dataset is busy
cannot mount '/nfs-s5-p0/d5111': directory is not empty
use legacy mountpoint to allow this behavior, or use the -O flag
bash-3.00#
bash-3.00# uname -a
SunOS X 5.11 snv_39 sun4v sparc SUNW,Sun-Fire-T200
bash-3.00#

All filesystems from that pool were however mounted.

No problem with other pools - all other pools imported without any warnings.


All filesystems in a pool have sharenfs set (actually sharenfs is set on the pool
and then inherited by the filesystems). Additionally, nfs/server was disabled just
before I exported the pools and automatically started when the first pool was imported.



I believe there's already an open bug for this.
 
 



Re: [zfs-discuss] Can't remove corrupt file

2006-08-09 Thread Mark Maybee

Eric Lowe wrote:

Eric Schrock wrote:


Well the fact that it's a level 2 indirect block indicates why it can't
simply be removed.  We don't know what data it refers to, so we can't
free the associated blocks.  The panic on move is quite interesting -
after BFU give it another shot and file a bug if it still happens.



I'm still seeing the panic (build 42) when trying to 'mv' the file with
corrupt indirect blocks. The problem looks like 6424466 and 6440780; the
panic string is 'data after EOF'. Email me offline if you would like to
collect the core from my system.


- Eric


Yup, this is a duplicate of 6424466 (6440780 is also probably a dup of
6424466).  You are seeing this panic on a 'mv' because of some old debug
code in dnode_sync() scanning the dnode contents.  The 'data after EOF'
message is bogus; the real problem is your data corruption.  Anyway,
this is not going to go away until I put back a fix for 6424466.  Sorry
about that.

-Mark


Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID

2006-08-08 Thread Mark Maybee

Luke Lonergan wrote:

Robert,

On 8/8/06 9:11 AM, Robert Milkowski [EMAIL PROTECTED] wrote:



1. UFS, noatime, HW RAID5 6 disks, S10U2
70MB/s
2. ZFS, atime=off, HW RAID5 6 disks, S10U2 (the same lun as in #1)
87MB/s
3. ZFS, atime=off, SW RAID-Z 6 disks, S10U2
130MB/s
4. ZFS, atime=off, SW RAID-Z 6 disks, snv_44
133MB/s



Well, the UFS results are miserable, but the ZFS results aren't good - I'd
expect between 250-350MB/s from a 6-disk RAID5 with read() blocksize from
8kb to 32kb.

Most of my ZFS experiments have been with RAID10, but there were some
massive improvements to seq I/O with the fixes I mentioned - I'd expect that
this shows that they aren't in snv44.


Those fixes went into snv_45

-Mark


Re: [zfs-discuss] snv_46: hangs when using zvol swap and the system is low on free memory ?

2006-08-07 Thread Mark Maybee

Jürgen Keil wrote:

I've tried to run 'dmake lint' on on-src-20060731, and was running out of swap
on my Tecra S1 laptop: 32-bit x86, 768MB main memory, with a 512MB swap slice.
The 'FULL KERNEL: global crosschecks' lint run consumes lots (~800MB) of space
in /tmp, so the system was running out of swap space.

To fix this I've tried to add a 512MB (compressed) zvol device as additional 
swap space.


Now the dmake lint run hangs the OS, sooner or later.


In a crash dump, I found:

[b]


::pgrep lint


SPID   PPID   PGIDSIDUID  FLAGS ADDR NAME
R  10806  10805747712109 0x42004000 e2d158c8 lint
R  10802  10801747712109 0x42004000 e2d1b990 lint


e2d158c8::walk thread|::findstack -v


stack pointer for thread e3806800: dc39adf8
  dc39ae2c swtch+0x168()
  dc39ae64 turnstile_block+0x6a5(da709288, 0, e39d1110, fec045e0, 0, 0)
  dc39aea4 rw_enter_sleep+0x13b(e39d1110, 0)
  dc39aec8 tmp_write+0x2d(e9fc39c0, dc39af3c, 0, daaa3978, 0)
  dc39af04 fop_write+0x2e(e9fc39c0, dc39af3c, 0, daaa3978, 0)
  dc39af84 write+0x2ac()
  dc39afac sys_sysenter+0x104()


e2d1b990::walk thread|::findstack -v


stack pointer for thread e0f11400: d32c47b8
  d32c47e4 swtch+0x168()
  d32c47f4 cv_wait+0x4e(fec1ef42, fec1cf20)
  d32c4820 page_create_throttle+0x123(20, 3)
  d32c488c page_create_va+0x9f(fec20990, da8d1000, 0, 2, 3, d32c48b4)
  d32c48ec segkmem_page_create+0x67(da8d1000, 2, 0, 0)
  d32c4924 segkmem_xalloc+0xa3(da00f690, 0, 2, 0, 0, fe840f08)
  d32c4950 segkmem_alloc+0xa0(da00f690, 2, 0)
  d32c49ec vmem_xalloc+0x405(da01, 2, 1000, 0, 0, 0)
  d32c4a3c vmem_alloc+0x126(da01, 2, 0)
  d32c4a94 kmem_slab_create+0x6e(d3e21030, 0)
  d32c4ac0 kmem_slab_alloc+0x59(d3e21030, 0)
  d32c4af0 kmem_cache_alloc+0x119(d3e21030, 0)
  d32c4b04 zio_buf_alloc+0x1b(2)
  d32c4b40 arc_read+0x332(0, da1f6740, e31f5100, f9a1c2d0, 0, 0)
  d32c4bbc dbuf_prefetch+0x124(d64e1640, 22, 0)
  d32c4bf4 dmu_zfetch_fetch+0x48(d64e1640, 20, 0, 6, 0)
  d32c4c54 dmu_zfetch_dofetch+0x183(d64e179c, ee0ecdb0)
  d32c4ca0 dmu_zfetch_find+0x530(d64e179c, d32c4cc8, 20)
  d32c4d24 dmu_zfetch+0xbf(d64e179c, 18, 0, 2, 0, 20)
  d32c4d5c dbuf_read+0xc9(d89ce4a0, e7d6e500, 32)
  d32c4db4 dmu_buf_hold_array_by_dnode+0x1fe(d64e1640, 18, 0, 2, 0, 1)
  d32c4de4 dmu_buf_hold_array_by_bonus+0x2a(e28686d8, 18, 0, 2, 0, 1)
  d32c4e68 zfs_read+0x17e(ddd27900, d32c4f3c, 0, daaa3978, 0)
  d32c4ea4 fop_read+0x2e(ddd27900, d32c4f3c, 0, daaa3978, 0)
  d32c4f84 read+0x2a1()
  d32c4fac sys_sysenter+0x104()


freemem/D


freemem:
freemem:0
[/b]

arc_read() needs a new buffer, tries to allocate kernel memory with KM_SLEEP.
But there is no more free memory, so the allocation sleeps until resources 
become
available.  It seems that arc_read() is trying to restore a buffer from the arc
ghost cache, and has the arc_buf_hdr_t locked while trying to allocate memory.


At the same time, the pageout daemon seems to be stuck in the zfs code, like
this:

[b]


::pgrep pageout|::walk thread|::findstack -v


stack pointer for thread d386dc00: d38988a8
  d38988dc swtch+0x168()
  d3898914 turnstile_block+0x6a5(d3da3e90, 0, d3dce0cc, fec03b38, 0, 0)
  d3898974 mutex_vector_enter+0x2dc(d3dce0cc)
  d38989b4 buf_hash_find+0x4d(da1f6740, e5476900, bb305, 0, d38989fc)
  d3898a00 arc_read+0x24(0, da1f6740, e5476900, f9a1c2d0, 0, 0)
  d3898a7c dbuf_prefetch+0x124(e0c9e318, 3832, 0)
  d3898ab4 dmu_zfetch_fetch+0x48(e0c9e318, 3832, 0, 1, 0)
  d3898b14 dmu_zfetch_dofetch+0x183(e0c9e474, d8b46c60)
  d3898b60 dmu_zfetch_find+0x530(e0c9e474, d3898b88, 20)
  d3898be4 dmu_zfetch+0xbf(e0c9e474, 6972000, 0, 2000, 0, 20)
  d3898c10 dbuf_read+0xc9(df5679d8, df43eb00, 22)
  d3898c34 dmu_tx_check_ioerr+0x49(df43eb00, e0c9e318, 0, 34b9, 0)
  d3898c94 dmu_tx_count_write+0x114(d3875c98, 6965000, 0, e000, 0)
  d3898cdc dmu_tx_hold_write+0x52(eef5d5a8, 1, 0, 6965000, 0, e000)
  d3898d5c zvol_strategy+0x184(d972e1e8)
  d3898d78 bdev_strategy+0x4d(d972e1e8)
  d3898d94 spec_startio+0x6e(d8b85240, fde36240, 6965000, 0, e000, 8500)
  d3898dc0 spec_pageio+0x2a(d8b85240, fde36240, 6965000, 0, e000, 8500)
  d3898e0c fop_pageio+0x2d(d8b85240, fde36240, 6965000, 0, e000, 8500)
  d3898e80 tmp_putapage+0x177(e9fc39c0, fdbd63b0, d3898eb8, d3898ef0, 8400, 
da710e68)
  d3898ef4 tmp_putpage+0x1c6(e9fc39c0, 11c6b000, 0, 1000, 8400, da710e68)
  d3898f3c fop_putpage+0x27(e9fc39c0, 11c6b000, 0, 1000, 8400, da710e68)
  d3898f94 pageout+0x205(0, 0)
  d3898fa4 thread_start+8()
[/b]

It seems the problem is that arc_read() has part of the buf hash table locked,
then goes to sleep inside some kmem_*alloc(...KM_SLEEP) call.
When the pageout daemon tries to access some zfs backed page that happens
to use the same hash chain that is locked by the previous arc_read() call, the
system is stuck and I have to power cycle it.


I made more tests with uncompressed zvol devices, too. But the problem basically
remains the same.  The pageout daemon becomes stuck,

Re: [zfs-discuss] ZFS on 32bit x86

2006-06-21 Thread Mark Maybee

Yup, you're probably running up against the limitations of 32-bit kernel
addressability.  We are currently very conservative in this environment,
and so tend to end up with a small cache as a result.  It may be
possible to tweak things to get larger cache sizes, but you run the risk
of starving out other processes trying to get memory.

-Mark

Robert Milkowski wrote:

Hello zfs-discuss,

  Simple test 'ptime find /zfs/filesystem > /dev/null' with 2GB RAM.
  After the second, third, etc. run it still reads a lot from disk while
  find is running (atime is off).

  on x64 (Opteron) it doesn't.

  I guess it's due to 512MB heap limit in kernel for its cache.
  ::memstat shows 469MB for kernel and 1524MB on freelist.


  Is there anything that could be done? I guess not, but perhaps



  P.S. Of course, there are a lot of files - around 150K of them.

