Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

If I remember correctly, ESX always uses synchronous writes over NFS. If
so, adding a dedicated log device (such as a DDRdrive) might help you
out here. You should be able to test it by disabling the ZIL for a short
while and see if performance improves
(http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29).
I'm not sure how reliable the DDRdrive is in practice, but in theory it
should be much better than an SSD, since DRAM doesn't wear.

- --
Saso

On 08/27/2010 07:04 AM, Mark wrote:
 We are using a 7210, 44 disks I believe, 11 stripes of RAIDz sets.  When I 
 installed I selected the best bang for the buck on the speed vs capacity 
 chart.
 
 We run about 30 VMs on it, across 3 ESX 4 servers.  Right now it's all 
 running NFS, and it sucks... sooo slow.
 
 iSCSI was no better.   
 
 I am wondering how I can increase the performance, because they want to add 
 more VMs... the good news is most are idle-ish, but even idle VMs create a 
 lot of random chatter to the disks!
 
 So a few options maybe... 
 
 1) Change to iSCSI mounts for ESX, and enable write-cache on the LUNs since 
 the 7210 is on a UPS.
 2) Get a Logzilla SSD mirror.  (Do SSDs fail? Do I really need a mirror?)
 3) Reconfigure the NAS as RAID10 instead of RAIDz.
 3) reconfigure the NAS to a RAID10 instead of RAIDz
 
 Obviously all 3 would be ideal, though with an SSD can I keep using NFS and 
 get the same performance, since the sync (O_SYNC) writes would be satisfied 
 by the SSD?
 
 I dread getting the OK to spend the $$,$$$ on SSDs and then not getting the 
 performance increase we want.
 
 How would you weight these?  I noticed in testing on a 5-disk OpenSolaris box 
 that changing from a single RAIDz pool to RAID10 netted a larger IOPS increase 
 than adding an Intel SSD as a Logzilla.  That's not going to scale the same 
 way, though, with a 44-disk, 11-vdev striped RAIDz set.
 
 Some thoughts?  Would simply moving to write-cache enabled iSCSI LUN's 
 without a SSD speed things up a lot by itself?
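As a back-of-the-envelope model of the RAIDz-vs-mirror IOPS observation above (this is a common rule of thumb, not something measured in the thread, and the 100 IOPS per disk figure is an assumption): each top-level vdev, whether a raidz group or a mirror pair, sustains roughly one disk's worth of random-write IOPS, so the layouts differ mainly in vdev count.

```python
def pool_random_write_iops(vdev_count, per_disk_iops=100):
    """Very rough model: each top-level vdev (a raidz group or a mirror
    pair) delivers about one disk's worth of random-write IOPS."""
    return vdev_count * per_disk_iops

# 44 disks laid out as 11 x 4-disk raidz vs. 22 x 2-disk mirrors:
raidz_pool  = pool_random_write_iops(11)   # 11 vdevs
mirror_pool = pool_random_write_iops(22)   # 22 vdevs
```

Under this model the mirror layout roughly doubles random-write IOPS, which matches the poster's small-scale test; real throughput depends on recordsize, caching, and the sync-write path.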

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx3gMQACgkQRO8UcfzpOHDL7ACfW43C6lkMD389j/vmldqMDK1f
1H0AoNFdhgHfWKCCMaJQ2DJACpkQicU7
=KIyA
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov

If I might add my $0.02: it appears that the ZIL is implemented as a
kind of circular log buffer. As I understand it, when a corrupt checksum
is detected, it is taken to be the end of the log, but this kind of
defeats the checksum's original purpose, which is to detect device
failure. Thus we would first need to change this behavior so that the
checksum is used only for failure detection. This leaves the question of
how to detect the end of the log, which I think could be done by putting
a monotonically incrementing counter on the ZIL entries. Once we find an
entry whose counter is not the previous entry's counter + 1, we know the
previous block was the last one in the sequence.

Now that we can use checksums to detect device failure, it would be
possible to implement ZIL-scrub, allowing an environment to detect ZIL
device degradation before it actually results in a catastrophe.
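The counter scheme above can be sketched in a few lines (illustrative Python, not the actual ZFS implementation; the entry format is an assumption made for the example):

```python
def scan_zil(entries):
    """Walk log entries given as (seq, checksum_ok) tuples in on-disk order.

    Returns (index_past_last_valid_entry, corruption_detected): a bad
    checksum now means device failure, while a sequence gap means the
    genuine end of the log."""
    prev_seq = None
    for i, (seq, checksum_ok) in enumerate(entries):
        if not checksum_ok:
            return i, True              # bad checksum = device failure
        if prev_seq is not None and seq != prev_seq + 1:
            return i, False             # sequence gap = end of log
        prev_seq = seq
    return len(entries), False
```

A ZIL-scrub would then simply be a full pass of this walk that verifies every checksum instead of stopping at the first failure.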

--
Saso

On 08/26/2010 03:22 PM, Eric Schrock wrote:
 
 On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote:

 1) zil needs to report truncated transactions on zilcorruption
 
 As Neil outlined, this isn't possible while preserving current ZIL 
 performance.  There is no way to distinguish the last ZIL block without 
 incurring additional writes for every block.  Even if it were possible to 
 implement this as a paranoid-ZIL tunable, would you be willing to take a 2-5x 
 performance hit to be able to detect this failure mode?
 
 - Eric
 
 --
 Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
 



Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov

I see, thank you for the clarification. So it is possible to have
something equivalent to main storage self-healing on ZIL, with ZIL-scrub
to activate it. Or is that already implemented also? (Sorry for asking
these obvious questions, but I'm not familiar with ZFS source code.)

--
Saso

On 08/26/2010 04:31 PM, Darren J Moffat wrote:
 On 26/08/2010 15:08, Saso Kiselkov wrote:
 If I might add my $0.02: it appears that the ZIL is implemented as a
 kind of circular log buffer. As I understand it, when a corrupt checksum
 
 It is NOT circular, since that would imply a limited number of entries that
 get overwritten.
 
 is detected, it is taken to be the end of the log, but this kind of
 defeats the checksum's original purpose, which is to detect device
 failure. Thus we would first need to change this behavior so that the
 checksum is used only for failure detection. This leaves the question of
 how to detect the end of the log, which I think could be done by putting
 a monotonically incrementing counter on the ZIL entries. Once we find an
 entry whose counter is not the previous entry's counter + 1, we know the
 previous block was the last one in the sequence.
 
 See the comment part way down zil_read_log_block about how we do
 something pretty much like that for checking the chain of log blocks:
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block
 
 
 This is the checksum in the BP checksum field.
 
 But before we even got there we checked the ZILOG2 checksum as part of
 doing the zio (in zio_checksum_verify() stage):
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error
 
 
 A ZILOG2 checksum is a version of fletcher4 embedded in the block (at the
 start; the original ZILOG variant put it at the end).  If that failed -
 i.e. the block was corrupt - we would have returned an error back through
 the dsl_read() of the log block.
 



Re: [zfs-discuss] Disk space on Raidz1 configuration

2010-08-06 Thread Saso Kiselkov

ZFS and du use binary byte multipliers (1 KiB = 1024 B, etc.), while
drive manufacturers use decimal multipliers (1 kB = 1000 B). So your
1.5 TB drives are in fact ~1.36 TiB each:

7 x 1.36 TiB = 9.52 TiB - 1.36 TiB for parity = 8.16 TiB
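The arithmetic can be checked directly (a small Python sketch; the variable names are ours, the constants are the standard decimal/binary definitions):

```python
TB  = 10**12   # decimal terabyte, as used by drive manufacturers
TiB = 2**40    # binary tebibyte, as reported by zfs/zpool/df

drive_tib = 1.5 * TB / TiB   # ~1.364 TiB usable per "1.5 TB" drive
raw_tib   = 7 * drive_tib    # ~9.55 TiB -> matches `zpool list` 9.50T (raw)
data_tib  = 6 * drive_tib    # ~8.19 TiB -> matches `df` 8.0T (one disk of parity)
```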

--
Saso

On 08/06/2010 01:29 PM, Per Jorgensen wrote:
 I have 7 x 1.5 TB disks in a raidz1 configuration, so the system (as I 
 understand it) uses 1.5 TB (1 disk) for parity, but when I use df, the 
 available space in my newly created pool shows as 
 
 Filesystem            Size  Used Avail Use% Mounted on
 bf                    8.0T   36K  8.0T   1% /bf
 
 when I use zpool list it says
 
 NAMESIZE   USED  AVAILCAP  HEALTH  ALTROOT
 bf 9.50T   292K  9.50T 0%  ONLINE  -
 
 the pool is created with the following command, and compression is set to off
 
 zpool create -f bf raidz1 c9t0d0 c9t1d0 c9t2d0 c9t3d0 c9t4d0 c9t5d0 c9t6d0
 
 and when I do some calculations 
 
 7 x 1.5 TB = 10.5 TB - 1.5 TB for parity = 9.5 TB, so to my questions: 
 
 1. why do I only have 8 TB in my bf pool?
 2. why do zpool list and df report different available disk space?
 
 thanks
 Per Jorgensen



Re: [zfs-discuss] CPU requirements for zfs performance

2010-07-22 Thread Saso Kiselkov

I didn't mean to imply that I use it for my media storage, just that I
occasionally encounter situations when it could be useful.

BR,
--
Saso

On 07/22/2010 11:23 AM, Roy Sigurd Karlsbakk wrote:
 ----- Original Message -----
 I do encounter situations when I (or somebody from my family)
 accidentally create multiple copies of photo albums. :-)
 
 I wouldn't recommend using dedup on this system. Dedup requires lots of RAM 
 or L2ARC, and I don't think it is suitable for your needs. You may want to 
 svn co http://svn.karlsbakk.net/svn/roy/deduba and test that script. It's a 
 script that looks through a directory and, using MD5 and SHA256, finds 
 identical files. It's somewhat unfinished, but it works.
 
 Vennlige hilsener / Best regards
 
 roy
 --
 Roy Sigurd Karlsbakk
 (+47) 97542685
 r...@karlsbakk.net
 http://blogg.karlsbakk.net/
 --
 In all pedagogy it is essential that the curriculum be presented 
 intelligibly. It is an elementary imperative for all pedagogues to avoid 
 excessive use of idioms of foreign origin. In most cases, adequate and 
 relevant synonyms exist in Norwegian.



Re: [zfs-discuss] CPU requirements for zfs performance

2010-07-21 Thread Saso Kiselkov

I do encounter situations when I (or somebody from my family)
accidentally create multiple copies of photo albums. :-)

--
Saso

On 07/21/2010 05:20 PM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Saso Kiselkov

 If you plan on using it as a storage server for multimedia data
 (movies), don't even bother considering compression, as most media
 files
 already come heavily compressed. Dedup might still come in handy,
 though.
 
 If you're storing movies, I agree compression is a waste.  But I think dedup
 will also be a waste, unless you have multiple copies of the same movie on
 your disk for some reason.
 



Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-24 Thread Saso Kiselkov

How about running memtest86+ (http://www.memtest.org/) on the machine
for a while? It doesn't exercise the CPU's arithmetic very much, but it
stresses the data paths quite a lot. Just a quick suggestion...

--
Saso

Damon Atkins wrote:
 You could try copying the file to /tmp (i.e. swap/RAM) and doing a continuous 
 loop of checksums, e.g.
 
 while [ ! -f libdlpi.so.1.x ]; do
   sleep 1
   cp libdlpi.so.1 libdlpi.so.1.x
   A=$(sha512sum -b libdlpi.so.1.x)
   [ "$A" = "<what it should be> *libdlpi.so.1.x" ] && rm libdlpi.so.1.x
 done; date
 
 Assuming the file never goes to swap, this would tell you if something on 
 the motherboard is playing up.
 
 I have seen a CPU randomly set a byte to 0 that should not have been 0; I 
 think it was an L1 or L2 cache problem.



[zfs-discuss] Booting OpenSolaris on ZFS root on Sun Netra 240

2010-02-04 Thread Saso Kiselkov

Hi,

I'm kind of stuck trying to get my aging Netra 240 machine to boot
OpenSolaris. The live CD and installation worked perfectly, but when I
reboot and try to boot from the installed disk, I get:

Rebooting with command: boot disk0
Boot device: /p...@1c,60/s...@2/d...@0,0  File and args:
|
The file just loaded does not appear to be executable.


I suspect it's because my OBP version (OpenBoot 4.22.19) can't boot from
a ZFS root. Is there a way to work around this?

Regards,
--
Saso


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2010-01-07 Thread Saso Kiselkov

Just tried and didn't help :-(.

Regards,
--
Saso

Brent Jones wrote:
 On Wed, Jan 6, 2010 at 2:40 PM, Saso Kiselkov skisel...@gmail.com wrote:

 Buffering the writes in the OS would work for me as well - I've got RAM
 to spare. Slowing down rm is perhaps one way to go, but definitely not a
 real solution. On rare occasions I could still get lockups, leading to
 screwed-up recordings, and if there's one thing people don't like about
 IPTV, it's packet loss. Eliminating even the possibility of packet loss
 completely would be the best way to go, I think.

 Regards,
 --
 Saso

 
 I shouldn't dare suggest this, but what about disabling the ZIL? Since
 this sounds like transient data to begin with, any risks would be
 pretty low I'd imagine.
 



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2010-01-06 Thread Saso Kiselkov

I've encountered a new problem on the opposite end of my app - the
write() calls to disk sometimes block for a terribly long time (5-10
seconds) when I start deleting stuff on the filesystem where my recorder
processes are writing. Looking at iostat I can see that the disk load is
strongly uneven - with a lowered zfs_txg_timeout=1 I get normal writes
every second, but when I start deleting stuff (e.g. rm -r *), huge
load spikes appear from time to time, even to the level of blocking all
processes writing to the filesystem and filling up the network input
buffer and starting to drop packets.

Is there a way that I can increase the write I/O priority, or increase
the write buffer in ZFS so that write()s won't block?

Regards,
--
Saso

Saso Kiselkov wrote:
 Ok, I figured out that apparently I was the idiot in this story, again.
 I forgot to set SO_RCVBUF on my network sockets higher, so that's why I
 was dropping input packets.
 
 The zfs_txg_timeout=1 flag is still necessary (or else dropping occurs
 when committing data to disk), but by increasing network input buffer
 sizes it seems I was able to cut input packet loss to zero.
 
 Thanks for all the valuable advice!
 
 Regards,


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2010-01-06 Thread Saso Kiselkov

I'm aware of the theory and realize that deleting stuff requires writes.
I'm also running the latest b130 and write to disk in large 128k chunks.
What I was wondering is whether there is a mechanism to lower the I/O
scheduling priority of a given process (e.g. the rm command) in a manner
similar to CPU scheduling priority. Another solution would be to increase
the maximum size of the ZFS write buffer, so that writes would not block.

What I'd specifically like to avoid doing is buffer writes in the
recorder process. Besides being complicated to do (the process
periodically closes and reopens several output files at specific moments
in time and keeping them in sync is a bit hairy), I need the written
data to appear in the filesystem very soon after being received from the
network. The logic behind this is that this is streaming media data
which a user can immediately start playing back while it's being
recorded. It's crucial that the user be able to follow the real-time
recording with at most a 1-2 second delay (in fact, at the moment I can
get down to 1 second behind live TV). If I buffer writes for up to 10
seconds in user-space, other playback processes can fail due to running
out of data.

Regards,
--
Saso

Bob Friesenhahn wrote:
 On Wed, 6 Jan 2010, Saso Kiselkov wrote:
 
 I've encountered a new problem on the opposite end of my app - the
 write() calls to disk sometimes block for a terribly long time (5-10
 seconds) when I start deleting stuff on the filesystem where my recorder
 processes are writing. Looking at iostat I can see that the disk load is
 strongly uneven - with a lowered zfs_txg_timeout=1 I get normal writes
 every second, but when I start deleting stuff (e.g. rm -r *), huge
 load spikes appear from time to time, even to the level of blocking all
 processes writing to the filesystem and filling up the network input
 buffer and starting to drop packets.

 Is there a way that I can increase the write I/O priority, or increase
 the write buffer in ZFS so that write()s won't block?
 
 Deleting stuff results in many small writes to the pool in order to free
 up blocks and update metadata.  It is one of the most challenging tasks
 that any filesystem will do.
 
 It seems that most recent development OpenSolaris has added use of a new
 scheduling class in order to limit the impact of such load spikes.  I
 am eagerly looking forward to being able to use this.
 
 It is difficult for your application to do much if the network device
 driver fails to work, but your application can do some of its own
 buffering and use multithreading so that even a long delay can be
 handled.  Use of the asynchronous write APIs may also help.  Writes
 should be blocked up to the size of the zfs block (e.g. 128K), and also
 aligned to the zfs block if possible.
 
 Bob
 -- 
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
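Bob's suggestion of app-side buffering with block-sized writes could look roughly like this (an illustrative Python sketch, not the poster's code; names and the shutdown-sentinel convention are assumptions for the example):

```python
import io
import queue
import threading

BLOCK = 128 * 1024  # flush in zfs-recordsize chunks, per the advice above

def writer_loop(q, out):
    """Drain a queue on a background thread, writing in BLOCK-sized chunks
    so the capture thread never blocks on a slow write()."""
    buf = bytearray()
    while True:
        item = q.get()
        if item is None:              # shutdown sentinel: flush the remainder
            if buf:
                out.write(bytes(buf))
            return
        buf += item
        while len(buf) >= BLOCK:      # emit only full, aligned blocks
            out.write(bytes(buf[:BLOCK]))
            del buf[:BLOCK]
```

The capture thread just does `q.put(packet)`, so even a multi-second write stall only grows the queue instead of stalling the network reads.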



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2010-01-02 Thread Saso Kiselkov

Be sure to also update to the latest dev release (b130), as it adds a
smoother scheduling class for the ZFS threads. If the upgrade breaks
anything, you can always boot back into the boot environment from before
the upgrade.

Regards,
--
Saso

Bill Werner wrote:
 Thanks for this thread!  I was just coming here to discuss this very same 
 problem.  I'm running 2009.06 on a Q6600 with 8GB of RAM.  I have a Windows 
 system writing multiple OTA HD video streams via CIFS to the 2009.06 system 
 running Samba.
 
 I then have multiple clients reading back other HD video streams.  The write 
 client never skips a beat, but the read clients have constant problems 
 getting data when the burst writes occur.
 
 I am now going to try the txg_timeout and see if that helps.   It would be 
 nice if these tunables were settable on a per-pool basis though.



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-30 Thread Saso Kiselkov

Ok, I figured out that apparently I was the idiot in this story, again.
I forgot to set SO_RCVBUF on my network sockets higher, so that's why I
was dropping input packets.

The zfs_txg_timeout=1 flag is still necessary (or else dropping occurs
when committing data to disk), but by increasing network input buffer
sizes it seems I was able to cut input packet loss to zero.

Thanks for all the valuable advice!
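For reference, the receive-buffer change looks roughly like this (a Python sketch, not the poster's code; the function name and the 1 MiB size are illustrative, and the kernel may clamp or round the requested value):

```python
import socket

def make_udp_receiver(port=0, rcvbuf=1 << 20):
    """Open a UDP socket with an enlarged kernel receive buffer, so short
    scheduling stalls don't overflow the buffer and drop datagrams."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    s.bind(("0.0.0.0", port))
    # The kernel may clamp or round the request; read back the actual size.
    actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return s, actual
```

Reading the value back matters: system-wide limits (udp_max_buf on Solaris, net.core.rmem_max on Linux) silently cap what setsockopt grants.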

Regards,
--
Saso

Saso Kiselkov wrote:
 I tried removing the flow, and subjectively packet loss occurs a bit less
 often, but it is still happening. Right now I'm trying to figure out if
 it's due to the load on the server or not - I've left only about 15
 concurrent recording instances, producing < 8% load on the system. If
 the packet loss still occurs, I guess I'll have to disregard the loss
 measurements as irrelevant, since at such a load the server should not
 be dropping packets at all... I guess.
 
 Regards,


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-29 Thread Saso Kiselkov

I tried removing the flow, and subjectively packet loss occurs a bit less
often, but it is still happening. Right now I'm trying to figure out if
it's due to the load on the server or not - I've left only about 15
concurrent recording instances, producing < 8% load on the system. If
the packet loss still occurs, I guess I'll have to disregard the loss
measurements as irrelevant, since at such a load the server should not
be dropping packets at all... I guess.

Regards,
--
Saso

Robert Milkowski wrote:
 I included networking-discuss@
 
 
 On 28/12/2009 15:50, Saso Kiselkov wrote:
 Thank you for the advice. After trying flowadm the situation improved
 somewhat, but I'm still getting occasional packet overflow (10-100
 packets about every 10-15 minutes). This is somewhat unnerving, because
 I don't know how to track it down.
 
 Here are the flowadm settings I use:
 
 # flowadm show-flow iptv
 FLOW         LINK        IPADDR           PROTO  LPORT  RPORT  DSFLD
 iptv         e1000g1     LCL:224.0.0.0/4  --     --     --     --
 
 # flowadm show-flowprop iptv
 FLOW         PROPERTY    VALUE   DEFAULT   POSSIBLE
 iptv         maxbw       --      --        ?
 iptv         priority    high    --        high
 
 I also tuned udp_max_buf to 256MB. All recording processes are boosted
 to the RT priority class and zfs_txg_timeout=1 to force the system to
 commit data to disk in smaller and more manageable chunks. Is there any
 further tuning you could recommend?
 
 Regards,



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-28 Thread Saso Kiselkov

I progressed with testing a bit further and found that I was hitting
another scheduling bottleneck - the network. While the write burst was
running and ZFS was committing data to disk, the server was dropping
incoming UDP packets (netstat -s | grep udpInOverflows grew by about
1000-2000 packets during every write burst).

To work around that I had to boost the scheduling priority of recorder
processes to the real-time class and I also had to lower
zfs_txg_timeout=1 (there was still minor packet drop after just doing
priocntl on the processes) to even out the CPU load.

Any ideas on why ZFS should completely thrash the network layer and make
it drop incoming packets?

Regards,
--
Saso

Robert Milkowski wrote:
 On 26/12/2009 12:22, Saso Kiselkov wrote:

 Thank you, the post you mentioned helped me move a bit forward. I tried
 putting:

 zfs:zfs_txg_timeout = 1
 btw: you can tune it on a live system without a need to do reboots.
 
 mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
 zfs_txg_timeout:
 zfs_txg_timeout:30
 mi...@r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
 zfs_txg_timeout:0x1e=   0x1
 mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
 zfs_txg_timeout:
 zfs_txg_timeout:1
 mi...@r600:~# echo zfs_txg_timeout/W0t30 | mdb -kw
 zfs_txg_timeout:0x1 =   0x1e
 mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
 zfs_txg_timeout:
 zfs_txg_timeout:30
 mi...@r600:~#
 



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-28 Thread Saso Kiselkov

Thank you for the advice. After trying flowadm the situation improved
somewhat, but I'm still getting occasional packet overflow (10-100
packets about every 10-15 minutes). This is somewhat unnerving, because
I don't know how to track it down.

Here are the flowadm settings I use:

# flowadm show-flow iptv
FLOW         LINK        IPADDR           PROTO  LPORT  RPORT  DSFLD
iptv         e1000g1     LCL:224.0.0.0/4  --     --     --     --

# flowadm show-flowprop iptv
FLOW         PROPERTY    VALUE   DEFAULT   POSSIBLE
iptv         maxbw       --      --        ?
iptv         priority    high    --        high

I also tuned udp_max_buf to 256MB. All recording processes are boosted
to the RT priority class and zfs_txg_timeout=1 to force the system to
commit data to disk in smaller and more manageable chunks. Is there any
further tuning you could recommend?

Regards,
--
Saso

I need all IP multicast input traffic on e1000g1 to get the highest
possible priority.

Markus Kovero wrote:
 Hi, Try to add flow for traffic you want to get prioritized, I noticed that 
 opensolaris tends to drop network connectivity without priority flows 
 defined, I believe this is a feature presented by crossbow itself. flowadm is 
 your friend that is.
 I found this particularly annoying if you monitor servers with icmp-ping and 
 high load causes checks to fail therefore triggering unnecessary alarms.
 
 Yours
 Markus Kovero
 
 -----Original Message-----
 From: zfs-discuss-boun...@opensolaris.org 
 [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Saso Kiselkov
 Sent: 28. joulukuuta 2009 15:25
 To: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] ZFS write bursts cause short app stalls
 
 I progressed with testing a bit further and found that I was hitting
 another scheduling bottleneck - the network. While the write burst was
 running and ZFS was committing data to disk, the server was dropping
 incoming UDP packets (netstat -s | grep udpInOverflows grew by about
 1000-2000 packets during every write burst).
 
 To work around that I had to boost the scheduling priority of recorder
 processes to the real-time class and I also had to lower
 zfs_txg_timeout=1 (there was still minor packet drop after just doing
 priocntl on the processes) to even out the CPU load.
 
 Any ideas on why ZFS should completely thrash the network layer and make
 it drop incoming packets?
 
 Regards,


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-27 Thread Saso Kiselkov

Thanks for the mdb syntax - I wasn't sure how to set it with mdb at
runtime, which is why I used /etc/system. I was quite intrigued to find
out that the Solaris kernel was in fact designed to be tuned at runtime
through a generic debugging mechanism, rather than, like other
traditional kernels, through a defined kernel-settings interface (sysctl
comes to mind).

Anyway, upgrading to b130 helped my issue and I hope that by the time we
start selling this product, OpenSolaris 2010.02 comes out, so that I can
tell people to just grab the latest stable OpenSolaris release, rather
than having to go to a development branch or tuning kernel parameters to
even get the software working as it should.

Regards,
--
Saso

Robert Milkowski wrote:
 On 26/12/2009 12:22, Saso Kiselkov wrote:

 Thank you, the post you mentioned helped me move a bit forward. I tried
 putting:

 zfs:zfs_txg_timeout = 1
 btw: you can tune it on a live system without a need to do reboots.
 
 mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
 zfs_txg_timeout:
 zfs_txg_timeout:30
 mi...@r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
 zfs_txg_timeout:0x1e=   0x1
 mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
 zfs_txg_timeout:
 zfs_txg_timeout:1
 mi...@r600:~# echo zfs_txg_timeout/W0t30 | mdb -kw
 zfs_txg_timeout:0x1 =   0x1e
 mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
 zfs_txg_timeout:
 zfs_txg_timeout:30
 mi...@r600:~#
 



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov

The application I'm working on is a kind of large-scale network-PVR
system for our IPTV services. It records all running TV channels in an
X-hour carousel (typically 24 or 48 hours), retaining only those bits
which users have marked as interesting to them. The current setup
I'm doing development on is a small 12TB array; future deployment is
planned on several 96TB X4540 machines.

I agree that I kind of misused the term `sequential' - it really is 77
concurrent sequential writes. However, as I explained, I/O is not the
bottleneck here, as the array is capable of writes around 600MBytes/s,
and the write load I'm putting on it is around 55MBytes/s (430Mbit/s).

The problem is, as Brent explained, that as soon as the OS decides it
wants to write the transaction group to disk, it totally ignores all
other time-critical activity in the system and focuses on just that,
causing an input poll() stall on all network sockets. What I'd need to
do is force it to commit transactions to disk more often so as to even
the load out over a longer period of time, to bring the CPU usage spikes
down to a more manageable and predictable level.

Regards,
- --
Saso

Tim Cook wrote:
 On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:
 


 Hang on... if you've got 77 concurrent threads going, I don't see how
 that's
 a sequential I/O load.  To the backend storage it's going to look like
 the
 equivalent of random I/O.  I'd also be surprised to see 12 1TB disks
 supporting 600MB/sec throughput and would be interested in hearing where
 you
 got those numbers from.

 Is your video capture doing 430MB or 430Mbit?

 --
 --Tim

  

 Think he said 430Mbit/sec, which if these are security cameras, would
 be a good sized installation (30+ cameras).
 We have a similar system, albeit running on Windows. Writing about
 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
 working quite well on our system without any frame loss or much
 latency.

 
 Once again, Mb or MB?  They're two completely different numbers.  As for
 getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
 If you're saying you got 400MB, that's a different story entirely, and while
 possible with sequential I/O and a proper raid setup, it isn't happening
 with random.
 
 
 The writes lag is noticeable however with ZFS, and the behavior of the
 transaction group writes. If you have a big write that needs to land
 on disk, it seems all other I/O, CPU and niceness is thrown out the
 window in favor of getting all that data on disk.
 I was on a watch list for a ZFS I/O scheduler bug with my paid Solaris
 support; I'll try to find that bug number, but I believe some
 improvements were done in b129 and b130.



 



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Would an upgrade to the development repository of 2010.02 do the same?
I'd like to avoid having to do a complete reinstall, since I've got
quite a bit of custom software in the system already in various places
and recompiling and fine-tuning would take me another 1-2 days.

Regards,
- --
Saso

Leonid Kogan wrote:
 Try b130.
 http://genunix.org/
 
 Cheers,
 LK
 
 
 On 12/26/2009 12:59 AM, Saso Kiselkov wrote:
 Hi,

 I tried it and I got the following error message:

 # zfs set logbias=throughput content
 cannot set property for 'content': invalid property 'logbias'

 Is it because I'm running some older version which does not have this
 feature? (2009.06)

 Regards,
 -- 
 Saso

 Leonid Kogan wrote:
   
 Hi there,
 Try to:
 zfs set logbias=throughput yourdataset

 Good luck,
 LK

  

 


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Brent Jones wrote:
 On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote:

 On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote:


 Hang on... if you've got 77 concurrent threads going, I don't see how
 that's
 a sequential I/O load.  To the backend storage it's going to look like
 the
 equivalent of random I/O.  I'd also be surprised to see 12 1TB disks
 supporting 600MB/sec throughput and would be interested in hearing where
 you
 got those numbers from.

 Is your video capture doing 430MB or 430Mbit?

 --
 --Tim


 Think he said 430Mbit/sec, which if these are security cameras, would
 be a good sized installation (30+ cameras).
 We have a similar system, albeit running on Windows. Writing about
 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
 working quite well on our system without any frame loss or much
 latency.
 Once again, Mb or MB?  They're two completely different numbers.  As for
 getting 400Mbit out of 6 SATA drives, that's not really impressive at all.
 If you're saying you got 400MB, that's a different story entirely, and while
 possible with sequential I/O and a proper raid setup, it isn't happening
 with random.

 
 Mb, megabit.
 400 megabit is not terribly high, a single SATA drive could write that
 24/7 without a sweat. Which is why he is reporting his issue.
 
 Sequential or random, any modern system should be able to perform that
 task without causing disruption to other processes running on the
 system (if Windows can, Solaris/ZFS most definitely should be able
 to).
 
 I have similar workload on my X4540's, streaming backups from multiple
 systems at a time. These are very high end machines, dual quadcore
 opterons and 64GB RAM, 48x 1TB drives in 5-6 disk RAIDZ vdevs.
 
 The write stalls have been a significant problem since ZFS came out,
 and haven't really been addressed in an acceptable fashion yet, though
 work has been done to improve them.
 
 I'm still trying to find the case number I have open with Sunsolve or
 whatever; it was for exactly this issue, and I believe the fix was to
 add dozens more classes to the scheduler, to allow more fair disk
 I/O and overall niceness on the system when ZFS commits a
 transaction group.

Wow, if there were a production-release solution to the problem, that
would be great! Reading the mailing list I almost gave up hope that I'd
be able to work around this issue without upgrading to the latest
bleeding-edge development version.

Regards,
- --
Saso


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Thank you, the post you mentioned helped me move a bit forward. I tried
putting:

set zfs:zfs_txg_timeout = 1

in /etc/system, and now I'm getting a much more even write load (a burst
every 5 seconds), which no longer causes any significant poll()
stalling. So far I have failed to find the timer in the ZFS source code
that causes the 5-second interval instead of what I want (1 second).

Another thing that's left on my mind is why I'm still getting a very
slight burst every 60 seconds (causing a poll() delay of around 20-30ms,
instead of the usual 0-2ms). It's not that big a problem, it's just that
I'm curious as to where it's being created. I assume some 60-second
timer is firing, but I don't know where.
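If it helps, one way to localize periodic stalls like this is to measure
how far poll() overshoots its requested timeout on an otherwise idle
descriptor, then check whether the worst samples line up with the
60-second period. A minimal sketch of that measurement idea (the
function name and parameters are illustrative, not from any existing
tool):

```python
import os
import select
import time

def probe_poll_latency(duration_s=10.0, timeout_ms=250):
    """Repeatedly poll() an idle descriptor and record how far each
    call overshoots its requested timeout; large overshoots mark the
    moments when the process was stalled."""
    r, w = os.pipe()                 # never written to: poll() always times out
    p = select.poll()
    p.register(r, select.POLLIN)
    overshoots_ms = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        t0 = time.monotonic()
        p.poll(timeout_ms)           # should block ~timeout_ms on an idle fd
        elapsed_ms = (time.monotonic() - t0) * 1000.0
        overshoots_ms.append(elapsed_ms - timeout_ms)
    os.close(r)
    os.close(w)
    return overshoots_ms

if __name__ == "__main__":
    over = probe_poll_latency(duration_s=5.0)
    print("samples: %d, worst overshoot: %.1f ms" % (len(over), max(over)))
```

Logging a wall-clock timestamp next to each large overshoot and running
it for a few minutes should show whether the 20-30ms delays recur on an
exact 60-second schedule.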

Regards,
- --
Saso

Fajar A. Nugraha wrote:
 On Sat, Dec 26, 2009 at 4:10 PM, Saso Kiselkov skisel...@gmail.com wrote:
 I'm still trying to find the case number I have open with Sunsolve or
 whatever, it was for exactly this issue, and I believe the fix was to
 add dozens more classes to the scheduler, to allow more fair disk
 I/O and overall niceness on the system when ZFS commits a
 transaction group.
 Wow, if there were a production-release solution to the problem, that
 would be great!
 
 Have you checked this thread?
 http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28704.html
 
 Reading the mailing list I almost gave up hope that I'd
 be able to work around this issue without upgrading to the latest
 bleeding-edge development version.
 
 Isn't opensolaris already bleeding edge?
 



Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-26 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Thanks for the advice. I did an in-place upgrade to the latest
development b130 release and it seems that the change in scheduling
classes for the kernel writer threads worked (not even having to fiddle
around with logbias) - now I'm just getting small delays every 60
seconds (on the order of 20-30ms). I'm not sure these have anything to do
with ZFS, though... they happen outside of the write bursts.

Thank you all for the valuable advice!

Regards,
- --
Saso

Richard Elling wrote:
 
 On Dec 26, 2009, at 1:10 AM, Saso Kiselkov wrote:
 
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Brent Jones wrote:
 On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote:

 On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net
 wrote:


 Hang on... if you've got 77 concurrent threads going, I don't see how
 that's
 a sequential I/O load.  To the backend storage it's going to
 look like
 the
 equivalent of random I/O.  I'd also be surprised to see 12 1TB disks
 supporting 600MB/sec throughput and would be interested in hearing
 where
 you
 got those numbers from.

 Is your video capture doing 430MB or 430Mbit?

 -- 
 --Tim


 Think he said 430Mbit/sec, which if these are security cameras, would
 be a good sized installation (30+ cameras).
 We have a similar system, albeit running on Windows. Writing about
 400Mbit/sec using just 6, 1TB SATA drives is entirely possible, and
 working quite well on our system without any frame loss or much
 latency.
 Once again, Mb or MB?  They're two completely different numbers.  As
 for
 getting 400Mbit out of 6 SATA drives, that's not really impressive at
 all.
 If you're saying you got 400MB, that's a different story entirely,
 and while
 possible with sequential I/O and a proper raid setup, it isn't
 happening
 with random.


 Mb, megabit.
 400 megabit is not terribly high, a single SATA drive could write that
 24/7 without a sweat. Which is why he is reporting his issue.

 Sequential or random, any modern system should be able to perform that
 task without causing disruption to other processes running on the
 system (if Windows can, Solaris/ZFS most definitely should be able
 to).

 I have similar workload on my X4540's, streaming backups from multiple
 systems at a time. These are very high end machines, dual quadcore
 opterons and 64GB RAM, 48x 1TB drives in 5-6 disk RAIDZ vdevs.

 The write stalls have been a significant problem since ZFS came out,
 and haven't really been addressed in an acceptable fashion yet, though
 work has been done to improve them.
 
 PSARC case 2009/615 : System Duty Cycle Scheduling Class and ZFS IO
 Observability was integrated into b129. This creates a scheduling class
 for ZFS IO and automatically places the zio threads into that class.  This
 is not really an earth-shattering change, Solaris has had a very flexible
 scheduler for almost 20 years now. Another example is that on a desktop,
 the application which has mouse focus runs in the interactive scheduling
 class.  This is completely transparent to most folks and there is no
 tweaking
 required.
 
 Also fixed in b129 is BUG/RFE 6881015, "ZFS write activity prevents other
 threads from running in a timely manner", which is related to the above.
 
 
 I'm still trying to find the case number I have open with Sunsolve or
 whatever, it was for exactly this issue, and I believe the fix was to
 add dozens more classes to the scheduler, to allow more fair disk
 I/O and overall niceness on the system when ZFS commits a
 transaction group.

 Wow, if there were a production-release solution to the problem, that
 would be great! Reading the mailing list I almost gave up hope that I'd
 be able to work around this issue without upgrading to the latest
 bleeding-edge development version.
 
 Changes have to occur someplace first.  In the OpenSolaris world,
 the changes occur first in the dev train and then are back ported to
 Solaris 10 (sometimes, not always).
 
 You should try the latest build first -- be sure to follow the release
 notes.
 Then, if the problem persists, you might consider tuning zfs_txg_timeout,
 which can be done on a live system.
  -- richard
 



[zfs-discuss] ZFS write bursts cause short app stalls

2009-12-25 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I've started porting a video streaming application to opensolaris on
ZFS, and am hitting some pretty weird performance issues. The thing I'm
trying to do is run 77 concurrent video capture processes (roughly
430Mbit/s in total) all writing into separate files on a 12TB J4200
storage array. The disks in the array are arranged into a single RAID-0
ZFS volume (though I've tried different RAID levels, none helped). CPU
performance is not an issue (barely hitting 35% utilization on a single
CPU quad-core X2250). I/O bottlenecks can also be ruled out, since the
storage array's sequential write performance is around 600MB/s.

The problem is the bursty behavior of ZFS writes. All the capture
processes do, in essence, is poll() on a socket and then read() and
write() any available data from it to a file. The poll() call is done
with a timeout of 250ms, expecting that if no data arrives within 0.25
seconds, the input is dead and recording stops (I tried increasing this
value, but the problem still arises, although not as frequently). When
ZFS decides that it wants to commit a transaction group to disk (every
30 seconds), the system stalls for a short amount of time and, depending
on the number of capture processes currently running, the poll() call
(which usually blocks for 1-2ms) takes on the order of hundreds of ms,
sometimes even longer. I figured that I might be able to resolve this by
lowering the txg timeout to something like 1-2 seconds (I need ZFS to
write as soon as data arrives, since it will likely never be
overwritten), but I couldn't find any tunable parameter for it anywhere
on the net. On FreeBSD, I think this can be done via the
vfs.zfs.txg_timeout sysctl. A glimpse into the source at
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/txg.c
on line 40 made me worry that somebody may have hard-coded this value into
the kernel, in which case I'd be pretty much screwed in opensolaris.
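For what it's worth, the two tunables mentioned above would look roughly
like this (an untested sketch; on OpenSolaris, the `set` form in
/etc/system is the usual persistent syntax, and whether the variable is
honored may depend on the build):

```
# FreeBSD, at runtime:
#   sysctl vfs.zfs.txg_timeout=1

# OpenSolaris, persistent across reboots, in /etc/system:
set zfs:zfs_txg_timeout = 1
```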

Any help would be greatly appreciated.

Regards,
- --
Saso


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-25 Thread Saso Kiselkov
Hi,

I'm not sure what b130 means, I'm fairly new to OpenSolaris. How do I
find out?
As for the OS version, it is OpenSolaris 2009.06.

Regards,
--
Saso

Richard Elling wrote:
 On Dec 25, 2009, at 9:57 AM, Saso Kiselkov wrote:

 I've started porting a video streaming application to opensolaris on
 ZFS, and am hitting some pretty weird performance issues. The thing I'm
 trying to do is run 77 concurrent video capture processes (roughly
 430Mbit/s in total) all writing into separate files on a 12TB J4200
 storage array. The disks in the array are arranged into a single RAID-0
 ZFS volume (though I've tried different RAID levels, none helped). CPU
 performance is not an issue (barely hitting 35% utilization on a single
 CPU quad-core X2250). I/O bottlenecks can also be ruled out, since the
 storage array's sequential write performance is around 600MB/s.

 The problem is the bursty behavior of ZFS writes. All the capture
 processes do, in essence, is poll() on a socket and then read() and
 write() any available data from it to a file.

  There have been some changes recently, including one in b130 that
  might apply to this workload.  What version of the OS are you running?
  If not b130, try b130.
   -- richard

 The poll() call is done
 with a timeout of 250ms, expecting that if no data arrives within 0.25
 seconds, the input is dead and recording stops (I tried increasing this
 value, but the problem still arises, although not as frequently). When
 ZFS decides that it wants to commit a transaction group to disk (every
 30 seconds), the system stalls for a short amount of time and depending
 on the number of capture processes currently running, the poll() call
 (which usually blocks for 1-2ms), takes on the order of hundreds of ms,
 sometimes even longer. I figured that I might be able to resolve this by
 lowering the txg timeout to something like 1-2 seconds (I need ZFS to
 write as soon as data arrives, since it will likely never be
 overwritten), but I couldn't find any tunable parameter for it anywhere
 on the net. On FreeBSD, I think this can be done via the
 vfs.zfs.txg_timeout sysctl. A glimpse into the source at
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/txg.c
 on line 40 made me worry that somebody may have hard-coded this value into
 the kernel, in which case I'd be pretty much screwed in opensolaris.

 Any help would be greatly appreciated.

 Regards,


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-25 Thread Saso Kiselkov
Hi,

I tried it and I got the following error message:

# zfs set logbias=throughput content
cannot set property for 'content': invalid property 'logbias'

Is it because I'm running some older version which does not have this
feature? (2009.06)

Regards,
--
Saso

Leonid Kogan wrote:
 Hi there,
 Try to:
 zfs set logbias=throughput yourdataset

 Good luck,
 LK
   
