Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov

The application I'm working on is a kind of large-scale network-PVR
system for our IPTV services. It records all running TV channels in an
X-hour carousel (typically 24 or 48 hours), retaining only those bits
which users have marked as being interesting to them. The current setup
I'm doing development on is a small 12TB array; future deployment is
planned on several 96TB X4540 machines.

I agree that I kind of misused the term `sequential' - it really is 77
concurrent sequential writes. However, as I explained, I/O is not the
bottleneck here: the array is capable of writing around 600 MBytes/s,
and the write load I'm putting on it is around 55 MBytes/s (430 Mbit/s).

The problem is, as Brent explained, that as soon as the OS decides it
wants to write the transaction group to disk, it totally ignores all
other time-critical activity in the system and focuses on just that,
causing an input poll() stall on all network sockets. What I'd need to
do is force it to commit transactions to disk more often so as to even
the load out over a longer period of time, to bring the CPU usage spikes
down to a more manageable and predictable level.
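
For reference, the stall itself is easy to quantify by timing the poll()
call directly. A minimal sketch of one way to do it (Solaris gethrtime();
the 100 ms threshold is an arbitrary placeholder):

#include <poll.h>
#include <stdio.h>
#include <sys/time.h>   /* gethrtime() */

/* Wrap poll() and log any call that blocks noticeably longer than
 * expected, so the stalls can be correlated with txg commits. */
static int
timed_poll(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
        hrtime_t start = gethrtime();
        int rv = poll(fds, nfds, timeout_ms);
        long long stalled_ms = (long long)(gethrtime() - start) / 1000000;

        if (stalled_ms > 100)
                (void) fprintf(stderr, "poll() stalled %lld ms\n", stalled_ms);
        return (rv);
}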

Regards,
--
Saso

Tim Cook wrote:
 On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:
 


 Hang on... if you've got 77 concurrent threads going, I don't see how
 that's a sequential I/O load.  To the backend storage it's going to look
 like the equivalent of random I/O.  I'd also be surprised to see 12 1TB
 disks supporting 600MB/sec throughput and would be interested in hearing
 where you got those numbers from.

 Is your video capture doing 430MB or 430Mbit?

 --
 --Tim

  

 Think he said 430Mbit/sec, which if these are security cameras, would
 be a good sized installation (30+ cameras).
 We have a similar system, albeit running on Windows. Writing about
 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
 working quite well on our system without any frame loss or much
 latency.

 
 Once again, Mb or MB?  They're two completely different numbers.  As for
 getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
 If you're saying you got 400MB, that's a different story entirely, and while
 possible with sequential I/O and a proper raid setup, it isn't happening
 with random.
 
 
 The write lag is noticeable with ZFS, however, given the behavior of the
 transaction group writes. If you have a big write that needs to land
 on disk, it seems all other I/O, CPU and niceness is thrown out the
 window in favor of getting all that data on disk.
 I was on a watch list for a ZFS I/O scheduler bug with my paid Solaris
 support; I'll try to find that bug number, but I believe some
 improvements were made in builds 129 and 130.



 



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov

Would an upgrade to the development repository of 2010.02 do the same?
I'd like to avoid having to do a complete reinstall, since I've already
got quite a bit of custom software in various places on the system, and
recompiling and fine-tuning it would take me another 1-2 days.

Regards,
--
Saso

Leonid Kogan wrote:
 Try b130.
 http://genunix.org/
 
 Cheers,
 LK
 
 
 On 12/26/2009 12:59 AM, Saso Kiselkov wrote:
 Hi,

 I tried it and I got the following error message:

 # zfs set logbias=throughput content
 cannot set property for 'content': invalid property 'logbias'

 Is it because I'm running some older version which does not have this
 feature? (2009.06)

 Regards,
 -- 
 Saso

 Leonid Kogan wrote:
   
 Hi there,
 Try to:
 zfs set logbias=throughput yourdataset

 Good luck,
 LK

  

 



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Brent Jones
On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote:


 On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:

 
 
 
 
  Hang on... if you've got 77 concurrent threads going, I don't see how
  that's a sequential I/O load.  To the backend storage it's going to look
  like the equivalent of random I/O.  I'd also be surprised to see 12 1TB
  disks supporting 600MB/sec throughput and would be interested in hearing
  where you got those numbers from.
 
  Is your video capture doing 430MB or 430Mbit?
 
  --
  --Tim
 
 

 Think he said 430Mbit/sec, which if these are security cameras, would
 be a good sized installation (30+ cameras).
 We have a similar system, albeit running on Windows. Writing about
 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
 working quite well on our system without any frame loss or much
 latency.

 Once again, Mb or MB?  They're two completely different numbers.  As for
 getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
 If you're saying you got 400MB, that's a different story entirely, and while
 possible with sequential I/O and a proper raid setup, it isn't happening
 with random.


Mb, megabit.
400 megabit is not terribly high, a single SATA drive could write that
24/7 without a sweat. Which is why he is reporting his issue.

Sequential or random, any modern system should be able to perform that
task without causing disruption to other processes running on the
system (if Windows can, Solaris/ZFS most definitely should be able
to).

I have a similar workload on my X4540's, streaming backups from multiple
systems at a time. These are very high-end machines: dual quad-core
Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs.

The write stalls have been a significant problem since ZFS came out,
and haven't really been addressed in an acceptable fashion yet, though
work has been done to improve it.

I'm still trying to find the case number I have open with Sunsolve or
whatever, it was for exactly this issue, and I believe the fix was to
add dozens more classes to the scheduler, to allow more fair disk
I/O and overall niceness on the system when ZFS commits a
transaction group.




-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov

Brent Jones wrote:
 On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote:

 On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:


 Hang on... if you've got 77 concurrent threads going, I don't see how
 that's a sequential I/O load.  To the backend storage it's going to look
 like the equivalent of random I/O.  I'd also be surprised to see 12 1TB
 disks supporting 600MB/sec throughput and would be interested in hearing
 where you got those numbers from.

 Is your video capture doing 430MB or 430Mbit?

 --
 --Tim


 Think he said 430Mbit/sec, which if these are security cameras, would
 be a good sized installation (30+ cameras).
 We have a similar system, albeit running on Windows. Writing about
 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
 working quite well on our system without any frame loss or much
 latency.
 Once again, Mb or MB?  They're two completely different numbers.  As for
 getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
 If you're saying you got 400MB, that's a different story entirely, and while
 possible with sequential I/O and a proper raid setup, it isn't happening
 with random.

 
 Mb, megabit.
 400 megabit is not terribly high, a single SATA drive could write that
 24/7 without a sweat. Which is why he is reporting his issue.
 
 Sequential or random, any modern system should be able to perform that
 task without causing disruption to other processes running on the
 system (if Windows can, Solaris/ZFS most definitely should be able
 to).
 
 I have a similar workload on my X4540's, streaming backups from multiple
 systems at a time. These are very high-end machines: dual quad-core
 Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs.

 The write stalls have been a significant problem since ZFS came out,
 and haven't really been addressed in an acceptable fashion yet, though
 work has been done to improve it.
 
 I'm still trying to find the case number I have open with Sunsolve or
 whatever, it was for exactly this issue, and I believe the fix was to
 add dozens more classes to the scheduler, to allow more fair disk
 I/O and overall niceness on the system when ZFS commits a
 transaction group.

Wow, if there were a production-release solution to the problem, that
would be great! Reading the mailing list I almost gave up hope that I'd
be able to work around this issue without upgrading to the latest
bleeding-edge development version.

Regards,
--
Saso


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Fajar A. Nugraha
On Sat, Dec 26, 2009 at 4:10 PM, Saso Kiselkov skisel...@gmail.com wrote:
 I'm still trying to find the case number I have open with Sunsolve or
 whatever, it was for exactly this issue, and I believe the fix was to
 add dozens more classes to the scheduler, to allow more fair disk
 I/O and overall niceness on the system when ZFS commits a
 transaction group.

 Wow, if there were a production-release solution to the problem, that
 would be great!

Have you checked this thread?
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28704.html

 Reading the mailing list I almost gave up hope that I'd
 be able to work around this issue without upgrading to the latest
 bleeding-edge development version.

Isn't opensolaris already bleeding edge?

-- 
Fajar


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov

Thank you, the post you mentioned helped me move a bit forward. I tried
putting:

set zfs:zfs_txg_timeout = 1

in /etc/system, and now I'm getting a much more even write load (a burst
every 5 seconds), which no longer causes any significant poll()
stalling. So far I have failed to find the timer in the ZFS source code
that causes the 5-second interval instead of the 1 second I asked for.

Another thing that's left on my mind is why I'm still getting a very
slight burst every 60 seconds (causing a poll() delay of around 20-30ms,
instead of the usual 0-2ms). It's not that big a problem, it's just that
I'm curious as to where it's being created. I assume some 60-second
timer is firing, but I don't know where.
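
In case it helps anyone chasing the same thing, here is a rough DTrace
sketch for watching how often the txg sync actually fires and how long
each sync takes (this assumes spa_sync is still the sync entry point on
your build -- adjust the probe name if not):

# dtrace -n '
    fbt::spa_sync:entry  { self->ts = timestamp; printf("%Y txg sync start\n", walltimestamp); }
    fbt::spa_sync:return /self->ts/ {
        @["spa_sync duration (ms)"] = quantize((timestamp - self->ts) / 1000000);
        self->ts = 0;
    }'

The timestamps on the :entry probes show the interval between syncs, and
the aggregation shows how long each one keeps the pool busy.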

Regards,
--
Saso

Fajar A. Nugraha wrote:
 On Sat, Dec 26, 2009 at 4:10 PM, Saso Kiselkov skisel...@gmail.com wrote:
 I'm still trying to find the case number I have open with Sunsolve or
 whatever, it was for exactly this issue, and I believe the fix was to
 add dozens more classes to the scheduler, to allow more fair disk
 I/O and overall niceness on the system when ZFS commits a
 transaction group.
 Wow, if there were a production-release solution to the problem, that
 would be great!
 
 Have you checked this thread?
 http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28704.html
 
 Reading the mailing list I almost gave up hope that I'd
 be able to work around this issue without upgrading to the latest
 bleeding-edge development version.
 
 Isn't opensolaris already bleeding edge?
 



Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)

2009-12-26 Thread Erik Trimble

Richard Elling wrote:

On Dec 25, 2009, at 4:15 PM, Erik Trimble wrote:


I haven't seen this mentioned before, but the OCZ Vertex Turbo is 
still an MLC-based SSD, and is /substantially/ inferior to an Intel 
X25-E in terms of random write performance, which is what a ZIL 
device does almost exclusively in the case of NFS traffic.


ZIL traffic tends to be sequential on separate logs, but the writes may be
of different sizes.

 -- richard
Really?  Now that I think about it, that seems to make sense - I was 
assuming that each NFS write would be relatively small, but that's 
likely not a valid general assumption.  I'd still think that the 
MLC nature of the Vertex isn't optimal for write-heavy applications like 
this one, even with a modest SDRAM cache on the SSD.



In fact, I think that the Vertex's sustained random-write IOPS 
performance is actually inferior to a 15k SAS drive's.


I read a benchmark report yesterday that might be interesting.  It seems
that there is a market for modest-sized SSDs, which would be perfect for
separate logs + OS for servers.
http://benchmarkreviews.com/index.php?option=com_content&task=view&id=392&Itemid=60


 -- richard
I'm still hoping that vendors realize that there definitely is a (very 
large) market for ~20GB high-write-IOPS SSDs.  I like my 18GB Zeus 
SSD, but it sure would be nice to be able to pay $2/GB for it, instead 
of 10x that now...


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Bob Friesenhahn

On Fri, 25 Dec 2009, Saso Kiselkov wrote:

sometimes even longer. I figured that I might be able to resolve this by
lowering the txg timeout to something like 1-2 seconds (I need ZFS to
write as soon as data arrives, since it will likely never be
overwritten), but I couldn't find any tunable parameter for it anywhere
on the net. On FreeBSD, I think this can be done via the


While there are some useful tunable parameters, another approach is to 
consider requesting a synchronous write using fdatasync(3RT) or 
fsync(3C) immediately after the final write() request in one of your 
poll() time quantums.  This will cause the data to be written 
immediately.  System behavior will then seem totally different. 
Unfortunately, it will also be less efficient.
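
For what it's worth, a minimal sketch of that approach (fd, buf and len
are placeholders for the real recording state; error handling trimmed):

#include <unistd.h>

/* Push one channel's segment out as soon as it is written, trading
 * some write efficiency for smoother latency, as described above. */
static int
write_and_flush(int fd, const void *buf, size_t len)
{
        if (write(fd, buf, len) != (ssize_t)len)
                return (-1);
        /* fdatasync() forces the file data to stable storage now,
         * instead of waiting for the next transaction group. */
        return (fdatasync(fd));
}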


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Menno Lageman

On 12/26/09 09:53, Brent Jones wrote:

On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote:



On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:








Hang on... if you've got 77 concurrent threads going, I don't see how
that's a sequential I/O load.  To the backend storage it's going to look
like the equivalent of random I/O.  I'd also be surprised to see 12 1TB
disks supporting 600MB/sec throughput and would be interested in hearing
where you got those numbers from.

Is your video capture doing 430MB or 430Mbit?

--
--Tim




Think he said 430Mbit/sec, which if these are security cameras, would
be a good sized installation (30+ cameras).
We have a similar system, albeit running on Windows. Writing about
400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
working quite well on our system without any frame loss or much
latency.


Once again, Mb or MB?  They're two completely different numbers.  As for
getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
If you're saying you got 400MB, that's a different story entirely, and while
possible with sequential I/O and a proper raid setup, it isn't happening
with random.



Mb, megabit.
400 megabit is not terribly high, a single SATA drive could write that
24/7 without a sweat. Which is why he is reporting his issue.

Sequential or random, any modern system should be able to perform that
task without causing disruption to other processes running on the
system (if Windows can, Solaris/ZFS most definitely should be able
to).

I have a similar workload on my X4540's, streaming backups from multiple
systems at a time. These are very high-end machines: dual quad-core
Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs.

The write stalls have been a significant problem since ZFS came out,
and haven't really been addressed in an acceptable fashion yet, though
work has been done to improve it.

I'm still trying to find the case number I have open with Sunsolve or
whatever, it was for exactly this issue, and I believe the fix was to
add dozens more classes to the scheduler, to allow more fair disk
I/O and overall niceness on the system when ZFS commits a
transaction group.



That would be the new System Duty Cycle Scheduling Class that was 
putback in build 129:


Author: Jonathan Adams jonathan.ad...@sun.com
Repository: /export/onnv-gate
Total changesets: 1

Changeset: 87f3734e64df

Comments:
6881015 ZFS write activity prevents other threads from running in a 
timely manner
6899867 mstate_thread_onproc_time() doesn't account for runnable time 
correctly

PSARC/2009/615 System Duty Cycle Scheduling Class and ZFS IO Observability

See http://arc.opensolaris.org/caselog/PSARC/2009/615/ for more information.

If you're using the dev repository, you can run 'pkg image-update' to get 
this new functionality.
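
From memory, the switch to the dev repository looks roughly like this
(verify the publisher origin URL before relying on it):

# pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
# pkg image-update

The update lands in a new boot environment, so a reboot is needed to
activate it.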


Cheers,

Menno
--
Menno Lageman - Sun Microsystems - http://blogs.sun.com/menno


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Richard Elling


On Dec 26, 2009, at 1:10 AM, Saso Kiselkov wrote:



Brent Jones wrote:

On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote:


On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:

Hang on... if you've got 77 concurrent threads going, I don't see how
that's a sequential I/O load.  To the backend storage it's going to look
like the equivalent of random I/O.  I'd also be surprised to see 12 1TB
disks supporting 600MB/sec throughput and would be interested in hearing
where you got those numbers from.

Is your video capture doing 430MB or 430Mbit?

--
--Tim

Think he said 430Mbit/sec, which if these are security cameras, would
be a good sized installation (30+ cameras).
We have a similar system, albeit running on Windows. Writing about
400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
working quite well on our system without any frame loss or much
latency.

Once again, Mb or MB?  They're two completely different numbers.  As for
getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
If you're saying you got 400MB, that's a different story entirely, and while
possible with sequential I/O and a proper raid setup, it isn't happening
with random.

Mb, megabit.
400 megabit is not terribly high, a single SATA drive could write that
24/7 without a sweat. Which is why he is reporting his issue.

Sequential or random, any modern system should be able to perform that
task without causing disruption to other processes running on the
system (if Windows can, Solaris/ZFS most definitely should be able to).

I have a similar workload on my X4540's, streaming backups from multiple
systems at a time. These are very high-end machines: dual quad-core
Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs.

The write stalls have been a significant problem since ZFS came out,
and haven't really been addressed in an acceptable fashion yet, though
work has been done to improve it.


PSARC case 2009/615: System Duty Cycle Scheduling Class and ZFS IO
Observability was integrated into b129. This creates a scheduling class
for ZFS IO and automatically places the zio threads into that class. This
is not really an earth-shattering change; Solaris has had a very flexible
scheduler for almost 20 years now. Another example is that on a desktop,
the application which has mouse focus runs in the interactive scheduling
class. This is completely transparent to most folks and there is no
tweaking required.

Also fixed in b129 is BUG/RFE 6881015, "ZFS write activity prevents other
threads from running in a timely manner", which is related to the above.



I'm still trying to find the case number I have open with Sunsolve or
whatever, it was for exactly this issue, and I believe the fix was to
add dozens more classes to the scheduler, to allow more fair disk
I/O and overall niceness on the system when ZFS commits a
transaction group.


Wow, if there were a production-release solution to the problem, that
would be great! Reading the mailing list I almost gave up hope that I'd
be able to work around this issue without upgrading to the latest
bleeding-edge development version.


Changes have to occur someplace first.  In the OpenSolaris world,
the changes occur first in the dev train and then are back ported to
Solaris 10 (sometimes, not always).

You should try the latest build first -- be sure to follow the release
notes.  Then, if the problem persists, you might consider tuning
zfs_txg_timeout, which can be done on a live system.
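
For reference, the usual incantations for that tunable (standard mdb and
/etc/system syntax; pick the value to suit your workload):

# echo "zfs_txg_timeout/D" | mdb -k        (show the current value)
# echo "zfs_txg_timeout/W 0t1" | mdb -kw   (set it to 1 second on the live kernel)

and, to make it persistent across reboots, in /etc/system:

set zfs:zfs_txg_timeout = 1
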
 -- richard



Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)

2009-12-26 Thread Richard Elling

On Dec 25, 2009, at 3:01 PM, Jeroen Roodhart wrote:



Hi Freddie, list,


Option 4 is to re-do your pool, using fewer disks per raidz2 vdev,
giving more vdevs to the pool, and thus increasing the IOps for the
whole pool.

14 disks in a single raidz2 vdev is going to give horrible IO,
regardless of how fast the individual disks are.

Redoing it with 6-disk raidz2 vdevs, or even 8-drive raidz2 vdevs
will give you much better throughput.


We are aware that the configuration is possibly suboptimal. However,
before we had the SSDs, we tested 6x7 Z2 and even 2-way mirror setups
earlier. These gave better IOPS, but not a significant enough
improvement (I would expect roughly a bit more than double the
performance in 14x3 vs 6x7). In the end it is indeed a choice
between performance, space and security. Our hope is that the SSD
slogs serialise the data flow enough to make this work. But you
have a fair point and we will also look into the combination of SSDs
and pool configurations.


For your benchmark, there will not be a significant difference for any
combination of HDDs. They all have at least 4 ms of write latency.
Going from 10 ms down to 4 ms will not be nearly as noticeable as
going from 10 ms to 0.01 ms :-)
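
To put rough numbers on that (simple 1/latency arithmetic, assuming one
outstanding synchronous write per stream):

  10 ms per commit    ->  ~100 commits/s per stream
   4 ms per commit    ->  ~250 commits/s per stream
   0.01 ms per commit ->  ~100,000 commits/s per stream

which is why the choice of slog device, rather than the HDDs behind it,
dominates a benchmark like this.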
 -- richard



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov

Thanks for the advice. I did an in-place upgrade to the latest
development b130 release, and it seems that the change in scheduling
classes for the kernel writer threads worked (without even having to
fiddle with logbias) - now I'm just getting small delays every 60
seconds (on the order of 20-30ms). I'm not sure these have anything to
do with ZFS, though... they happen outside of the write bursts.

Thank you all for the valuable advice!

Regards,
--
Saso

Richard Elling wrote:
 
 On Dec 26, 2009, at 1:10 AM, Saso Kiselkov wrote:
 

 Brent Jones wrote:
 On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote:

 On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:


 Hang on... if you've got 77 concurrent threads going, I don't see how
 that's a sequential I/O load.  To the backend storage it's going to look
 like the equivalent of random I/O.  I'd also be surprised to see 12 1TB
 disks supporting 600MB/sec throughput and would be interested in hearing
 where you got those numbers from.

 Is your video capture doing 430MB or 430Mbit?

 -- 
 --Tim


 Think he said 430Mbit/sec, which if these are security cameras, would
 be a good sized installation (30+ cameras).
 We have a similar system, albeit running on Windows. Writing about
 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
 working quite well on our system without any frame loss or much
 latency.
 Once again, Mb or MB?  They're two completely different numbers.  As for
 getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
 If you're saying you got 400MB, that's a different story entirely, and while
 possible with sequential I/O and a proper raid setup, it isn't happening
 with random.


 Mb, megabit.
 400 megabit is not terribly high, a single SATA drive could write that
 24/7 without a sweat. Which is why he is reporting his issue.

 Sequential or random, any modern system should be able to perform that
 task without causing disruption to other processes running on the
 system (if Windows can, Solaris/ZFS most definitely should be able
 to).

 I have a similar workload on my X4540's, streaming backups from multiple
 systems at a time. These are very high-end machines: dual quad-core
 Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs.

 The write stalls have been a significant problem since ZFS came out,
 and haven't really been addressed in an acceptable fashion yet, though
 work has been done to improve it.
 
 PSARC case 2009/615: System Duty Cycle Scheduling Class and ZFS IO
 Observability was integrated into b129. This creates a scheduling class
 for ZFS IO and automatically places the zio threads into that class. This
 is not really an earth-shattering change; Solaris has had a very flexible
 scheduler for almost 20 years now. Another example is that on a desktop,
 the application which has mouse focus runs in the interactive scheduling
 class. This is completely transparent to most folks and there is no
 tweaking required.

 Also fixed in b129 is BUG/RFE 6881015, "ZFS write activity prevents other
 threads from running in a timely manner", which is related to the above.
 
 
 I'm still trying to find the case number I have open with Sunsolve or
 whatever, it was for exactly this issue, and I believe the fix was to
 add dozens more classes to the scheduler, to allow more fair disk
 I/O and overall niceness on the system when ZFS commits a
 transaction group.

 Wow, if there were a production-release solution to the problem, that
 would be great! Reading the mailing list I almost gave up hope that I'd
 be able to work around this issue without upgrading to the latest
 bleeding-edge development version.
 
 Changes have to occur someplace first.  In the OpenSolaris world,
 the changes occur first in the dev train and then are back ported to
 Solaris 10 (sometimes, not always).
 
 You should try the latest build first -- be sure to follow the release
 notes.  Then, if the problem persists, you might consider tuning
 zfs_txg_timeout, which can be done on a live system.
  -- richard
 



Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-26 Thread tom wagner
I am having the exact same problem after destroying a dataset with a few 
gigabytes of data and dedup.  I typed zfs destroy vault/virtualmachines, which 
was a zvol with dedup turned on, and the server hung -- couldn't ping, couldn't 
get on the console.  On the next bootup, the same thing: it just hangs when 
importing the filesystems.  I removed one of the pool disks and also all the 
mirrors so that I can experiment without losing the original data.  But all of 
the behavior matches this thread exactly, and I can't lose this particular 
data as it's been a week since the last backup, so I am freaking out a little.  
I am on build 130 and was going to do another backup, but was troubleshooting 
a different issue regarding poor iscsi performance before I did the backup.  I 
deleted that zvol with dedup on and now I'm in the same boat as the parent: 
the same hangs.  An interesting thing that happens is that when I hit the 
power button, which usually tells the system to start a shutdown, during these 
hangs it says not enough kernel memory.

Perhaps a memory leak during the failed destroy is causing the hangups.
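
If the box will still limp along enough to take a live dump, that theory
is fairly easy to check with mdb against the dump (standard savecore/mdb
usage; adjust the crash directory for your hostname):

# savecore -L
# cd /var/crash/`hostname`
# mdb unix.0 vmcore.0
> ::status
> ::memstat
> ::kmastat

::memstat shows where physical memory went, and ::kmastat breaks down the
kernel memory caches, which should make a leak fairly obvious.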

But literally every description in here matches my symptoms. During an 
import I can see the disks get hit pretty hard for about 3 minutes and then 
stop cold turkey, and the system becomes unresponsive -- just a blinking cursor 
at the console. I can hit enter to generate a newline but everything else is 
blank, so the console is locked pretty hard.

The pool is made up of 3 mirrored vdevs, a cache and a log device.  Everything 
was running great until I destroyed that one little deduped zvol.  I was able 
to destroy other zvols right before that one.

Has anybody had a chance to look at the dump the poster sent up?

Thanks


[zfs-discuss] repost - high read iops

2009-12-26 Thread Brad
repost - Sorry for ccing the other forums.

I'm running into an issue where there seems to be a high number of read IOPS 
hitting the disks, and physical free memory is fluctuating between 200MB and 
450MB out of 16GB total. We have the L2ARC configured on a 32GB Intel X25-E SSD 
and the slog on another 32GB X25-E SSD.

According to our tester, Oracle writes are extremely slow (high latency).

Below is a snippet of iostat:

r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
4898.3 34.2 23.2 1.4 0.1 385.3 0.0 78.1 0 1246 c1
0.0 0.8 0.0 0.0 0.0 0.0 0.0 16.0 0 1 c1t0d0
401.7 0.0 1.9 0.0 0.0 31.5 0.0 78.5 1 100 c1t1d0
421.2 0.0 2.0 0.0 0.0 30.4 0.0 72.3 1 98 c1t2d0
403.9 0.0 1.9 0.0 0.0 32.0 0.0 79.2 1 100 c1t3d0
406.7 0.0 2.0 0.0 0.0 33.0 0.0 81.3 1 100 c1t4d0
414.2 0.0 1.9 0.0 0.0 28.6 0.0 69.1 1 98 c1t5d0
406.3 0.0 1.8 0.0 0.0 32.1 0.0 79.0 1 100 c1t6d0
404.3 0.0 1.9 0.0 0.0 31.9 0.0 78.8 1 100 c1t7d0
404.1 0.0 1.9 0.0 0.0 34.0 0.0 84.1 1 100 c1t8d0
407.1 0.0 1.9 0.0 0.0 31.2 0.0 76.6 1 100 c1t9d0
407.5 0.0 2.0 0.0 0.0 33.2 0.0 81.4 1 100 c1t10d0
402.8 0.0 2.0 0.0 0.0 33.5 0.0 83.2 1 100 c1t11d0
408.9 0.0 2.0 0.0 0.0 32.8 0.0 80.3 1 100 c1t12d0
9.6 10.8 0.1 0.9 0.0 0.4 0.0 20.1 0 17 c1t13d0
0.0 22.7 0.0 0.5 0.0 0.5 0.0 22.8 0 33 c1t14d0

Is this an indicator that we need more physical memory? From 
http://blogs.sun.com/brendan/entry/test, the order in which a read request is 
satisfied is:

1) ARC
2) vdev cache of L2ARC devices
3) L2ARC devices
4) vdev cache of disks
5) disks

Using arc_summary.pl, we determined that prefetch was not helping much, so we 
disabled it (the usual tunable is sketched just below the stats).

CACHE HITS BY DATA TYPE:
Demand Data: 22% 158853174
Prefetch Data: 17% 123009991 ---not helping???
Demand Metadata: 60% 437439104
Prefetch Metadata: 0% 2446824
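
For anyone wanting to do the same, prefetch is normally disabled with the
standard tunable -- persistently in /etc/system:

set zfs:zfs_prefetch_disable = 1

or on a running system:

# echo "zfs_prefetch_disable/W 0t1" | mdb -kw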

The write IOPS started to kick in more and latency was reduced on the spinning disks:

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
1629.0 968.0 17.4 7.3 0.0 35.9 0.0 13.8 0 1088 c1
0.0 1.9 0.0 0.0 0.0 0.0 0.0 1.7 0 0 c1t0d0
126.7 67.3 1.4 0.2 0.0 2.9 0.0 14.8 0 90 c1t1d0
129.7 76.1 1.4 0.2 0.0 2.8 0.0 13.7 0 90 c1t2d0
128.0 73.9 1.4 0.2 0.0 3.2 0.0 16.0 0 91 c1t3d0
128.3 79.1 1.3 0.2 0.0 3.6 0.0 17.2 0 92 c1t4d0
125.8 69.7 1.3 0.2 0.0 2.9 0.0 14.9 0 89 c1t5d0
128.3 81.9 1.4 0.2 0.0 2.8 0.0 13.1 0 89 c1t6d0
128.1 69.2 1.4 0.2 0.0 3.1 0.0 15.7 0 93 c1t7d0
128.3 80.3 1.4 0.2 0.0 3.1 0.0 14.7 0 91 c1t8d0
129.2 69.3 1.4 0.2 0.0 3.0 0.0 15.2 0 90 c1t9d0
130.1 80.0 1.4 0.2 0.0 2.9 0.0 13.6 0 89 c1t10d0
126.2 72.6 1.3 0.2 0.0 2.8 0.0 14.2 0 89 c1t11d0
129.7 81.0 1.4 0.2 0.0 2.7 0.0 12.9 0 88 c1t12d0
90.4 41.3 1.0 4.0 0.0 0.2 0.0 1.2 0 6 c1t13d0
0.0 24.3 0.0 1.2 0.0 0.0 0.0 0.2 0 0 c1t14d0


Is it true that if your MFU stats start to go over 50%, more memory is needed?
CACHE HITS BY CACHE LIST:
Anon: 10% 74845266 [ New Customer, First Cache Hit ]
Most Recently Used: 19% 140478087 (mru) [ Return Customer ]
Most Frequently Used: 65% 475719362 (mfu) [ Frequent Customer ]
Most Recently Used Ghost: 2% 20785604 (mru_ghost) [ Return Customer Evicted, Now Back ]
Most Frequently Used Ghost: 1% 9920089 (mfu_ghost) [ Frequent Customer Evicted, Now Back ]
CACHE HITS BY DATA TYPE:
Demand Data: 22% 158852935
Prefetch Data: 17% 123009991
Demand Metadata: 60% 437438658
Prefetch Metadata: 0% 2446824

My theory is that since there's not enough memory for the ARC to cache data, it 
hits the L2ARC, where it can't find the data either and has to query the disks 
for the request. This causes contention between reads and writes, causing the 
service times to inflate.

uname: 5.10 Generic_141445-09 i86pc i386 i86pc
Sun Fire X4270: 11+1 raidz (SAS)
l2arc Intel X25-E
slog Intel X25-E
Thoughts?
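
In case it helps, the ARC numbers behind that theory can be pulled
directly (standard kstat names; ::memstat shows where the rest of the
16GB is going):

# kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
# echo ::memstat | mdb -k

If arcstats:c has been squeezed down well below c_max, the ARC is under
memory pressure, which would line up with the constant reads from disk.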


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-26 Thread Jack Kielsmeier
I still haven't given up :)

I moved my Virtual Machines to my main rig (which gets rebooted often, so this 
is 'not optimal' to say the least) :)

I have since upgraded to build 129. I noticed that even if time-slider/auto-snapshots 
are disabled, a zpool command still gets generated every 15 minutes. Since all 
zpool/zfs commands freeze during the import, I'd end up with hundreds of hung zpool 
processes.

I stopped this by commenting out all jobs in the zfssnap crontab as well as the 
auto-snap cleanup job in root's crontab. This did nothing to resolve my issue, 
but I figured I should note it.

I'd copy and paste the exact jobs, but my server is once again hung.
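
For what it's worth, the same thing can usually be done without touching
crontabs by disabling the services themselves (service names from memory
of the 2009-era bits -- check with svcs first):

# svcs -a | egrep 'auto-snapshot|time-slider'
# svcadm disable svc:/application/time-slider:default
# svcadm disable svc:/system/filesystem/zfs/auto-snapshot:frequent

(and likewise for the hourly/daily/weekly/monthly auto-snapshot instances.)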

I'm going to upgrade my server (a new motherboard that supports more than 4GB of 
RAM). I'll have double the RAM; perhaps there is some sort of RAM issue going 
on.

I really wanted to get 16GB of RAM, but my own personal budget will not allow it 
:)


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-26 Thread Jack Kielsmeier
Just wondering,

How much RAM is in your system?