Re: [zfs-discuss] ZFS write throttling

2008-02-15 Thread Tao Chen
On 2/15/08, Roch Bourbonnais [EMAIL PROTECTED] wrote:

  On Feb 15, 2008, at 11:38, Philip Beevers wrote:

[...]
   Obviously this isn't good behaviour, but it's particularly unfortunate
   given that this checkpoint is stuff that I don't want to retain in any
   kind of cache anyway - in fact, preferably I wouldn't pollute the ARC
   with it in the first place. But it seems directio(3C) doesn't work
   with
   ZFS (unsurprisingly as I guess this is implemented in segmap), and
   madvise(..., MADV_DONTNEED) doesn't drop data from the ARC (again, I
   guess, as it's working on segmap/segvn).
  
   Of course, limiting the ARC size to something fairly small makes it
   behave much better. But this isn't really the answer.
  
   I also tried using O_DSYNC, which stops the pathological behaviour but
   makes things pretty slow - I only get a maximum of about 20MBytes/sec,
   which is obviously much less than the hardware can sustain.
  
   It sounds like we could do with different write throttling behaviour
   to
   head this sort of thing off. Of course, the ideal would be to have
   some
   way of telling ZFS not to bother keeping pages in the ARC.
  
   The latter appears to be bug 6429855. But the underlying behaviour
   doesn't really seem desirable; are there plans afoot to do any work on
   ZFS write throttling to address this kind of thing?
  


 Throttling is being addressed.

 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205


  BTW, the new code will adjust write speed to disk speed very quickly.
  You will not see those ultra-fast initial checkpoints. Is this a
  concern?

I'll wait for more details on how you address this.
Maybe a blog, like this one:
http://blogs.technet.com/markrussinovich/archive/2008/02/04/2826167.aspx

From "Inside Vista SP1 File Copy Improvements":

One of the biggest problems with the engine's implementation is
that for copies involving lots of data, the Cache Manager
write-behind thread on the target system often can't keep up with
the rate at which data is written and cached in memory.
That causes the data to fill up memory, possibly forcing other
useful code and data out, and eventually, the target system's
memory to become a tunnel through which all the copied data
flows at a rate limited by the disk.

Sounds familiar? ;-)
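
(For what it's worth, a quick way to watch this happen on the ZFS side is
to monitor the ARC while such a checkpoint is being written, and the
small-ARC workaround Philip mentions can be tested by capping it. The
commands below are only a sketch of that idea, not anything from the bug
report; the 1GB cap is an arbitrary example value.)

  # print the current ARC size (bytes) once a second while the checkpoint runs
  kstat -p zfs:0:arcstats:size 1

  # optional workaround: cap the ARC, e.g. at 1GB, by adding this line
  # to /etc/system and rebooting
  set zfs:zfs_arc_max = 0x40000000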

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Apple Time Machine

2006-08-07 Thread Tao Chen
I am reading the live coverage of the WWDC keynote here:
http://www.macrumorslive.com/web/

They talked about a new feature in OS X/Leopard: Time Machine.
Does it sound like instant snapshot and rollback to you?
I don't know how else this can be implemented.

10:37 am  with time machine, you can get those files back by entering a date or time
10:35 am  ever had time where you work on a doc and you do a save as and overwrote the wrong one?
10:35 am  coolest part - and reason we call it that - whole new way of backing up files
10:35 am  backup to HD, or server
10:35 am  can restore everything, or just one file at a time
10:34 am  can be right where you were when the HD drive
10:34 am  automatically backs up mac - you change a file, it automatically backs up photos, music, documents, files folder, everything - then you can restore everything
10:34 am  plan to change all of that - Time Machine
10:33 am  how many use automated software to stay always backed up? only 4%

Tao
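
(With ZFS today, the "instant snapshot and rollback" I was thinking of
would look roughly like this; the dataset and file names are just made-up
examples.)

  # take a point-in-time snapshot of a home dataset
  zfs snapshot tank/home@2006-08-07

  # see which points in time are available to go back to
  zfs list -t snapshot

  # pull back a single file from the snapshot ...
  cp /tank/home/.zfs/snapshot/2006-08-07/report.doc /tank/home/

  # ... or roll the whole dataset back to that point
  zfs rollback tank/home@2006-08-07
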
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple Time Machine

2006-08-07 Thread Tao Chen
On 8/7/06, Tim Foster [EMAIL PROTECTED] wrote:
  David Magda wrote:
   Well, they've ported Dtrace:
   "..now built into Mac OS X Leopard. Xray. Because it's 2006."

  Uh right - and they're actually shipping it in 2007. Apple marketing.

  Anyone want to start printing t-shirts:
  "DTrace & Time Machine in OpenSolaris. Because we had it in 2005."

Looks like Time Machine is implemented using HFS+:

  "To make Time Machine work, Mac users will need to use a separate HFS+
  compatible non-bootable hard drive," Croll said.

http://news.com.com/New+Apple+feature+sends+users+back+in+time/2100-1046_3-6103007.html?tag=nefd.top

As Eric said earlier, it's a standard backup, incremental after the
first one, a versioning system of some sort.
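
(The "standard backup, incremental after the first one" model maps fairly
directly onto zfs send/receive, for what it's worth. A minimal sketch,
with made-up pool/dataset names, assuming the backup dataset is left
untouched between runs:)

  # first run: full backup of the dataset to a second pool (e.g. an external drive)
  zfs snapshot tank/home@mon
  zfs send tank/home@mon | zfs receive backup/home

  # later runs: send only the changes since the previous snapshot
  zfs snapshot tank/home@tue
  zfs send -i tank/home@mon tank/home@tue | zfs receive backup/home
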
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple Time Machine

2006-08-07 Thread Tao Chen
On 8/7/06, Eric Schrock [EMAIL PROTECTED] wrote:

  On Mon, Aug 07, 2006 at 01:19:14PM -1000, David J. Orman wrote:
    (actually did they give OpenSolaris a name check at all when they
    mentioned DTrace?)

   Nope, not that I can see. Apple's pretty notorious for that kind of
   oversight. I used to work for them, I know first hand how hat-tipping
   doesn't occur very often.

  Before this progresses much further, it's worth noting that all of team
  DTrace is at WWDC, has met with Apple engineers previously, and will be
  involved in one or more presentations today. So while the marketing
  department may not include OpenSolaris in the high level overview, Apple
  is not ignoring the roots of DTrace, and will not be hiding this fact
  from their developers (not that they could).

  - Eric

Cool. Let's see how it works out in the long run, will we (the
OpenSolaris community) get anything back from Apple and its community,
how well does CDDL work in the real world, etc. (we all know what
happened to Darwin/BSD).

In terms of openness, Sun and Apple are going opposite directions
IMHO, interesting situation :)

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple Time Machine

2006-08-07 Thread Tao Chen
On 8/7/06, Robert Gordon [EMAIL PROTECTED] wrote:

  On Aug 7, 2006, at 7:17 PM, Tao Chen wrote:
   In terms of openness, Sun and Apple are going opposite directions
   IMHO, interesting situation :)
   Tao

  Apple just released the Darwin Kernel code xnu-792-10.96 - the
  equivalent of 10.4.7 for intel machines.

  -- Robert.

You're right, I just saw the announcement:
http://lists.apple.com/archives/Darwin-dev/2006/Aug/msg00067.html

A good move.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple Time Machine

2006-08-07 Thread Tao Chen
On 8/7/06, Bryan Cantrill [EMAIL PROTECTED] wrote:

  We've had a great relationship with Apple at the engineering level -- and
  indeed, Team DTrace just got back from dinner with the Apple engineers
  involved with the port. More details here:
  http://blogs.sun.com/roller/page/bmc?entry=dtrace_on_mac_os_x

Your blog should be digged/slashdotted/osnews'ed/whatever'ed :)

On a different note (sorry, this is already off-topic for zfs-discuss),
your previous blog happens to be "DTrace on FreeBSD, update" - are these
efforts shared at all between OSX and FreeBSD?

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS bechmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-19 Thread Tao Chen
On 7/17/06, Jonathan Wheeler [EMAIL PROTECTED] wrote:
  Hi All,

  I've just built an 8 disk zfs storage box, and I'm in the testing phase
  before I put it into production. I've run into some unusual results, and
  I was hoping the community could offer some suggestions. I've basically
  made the switch to Solaris on the promises of ZFS alone (yes I'm that
  excited about it!), so naturally I'm looking forward to some great
  performance - but it appears I'm going to need some help finding all of it.

One major concern Jonathan has is the 7-raidz write performance.
(I see no big surprise in 'read' results.)
The really interesting numbers happen at 7 disks - it's slower than with 4, in all tests.
I randomly picked 3 results from his several runs:
              -Per Char-    --Block---    -Rewrite--
          MB  K/sec  %CPU   K/sec  %CPU   K/sec  %CPU
 4-disk 8196  57965  67.9  123268  27.6   78712  17.1
 7-disk 8196  49454  57.1   92149  20.1   73013  16.0
 8-disk 8196  61345  70.7  139259  28.5   89545  20.8

I looked at the corresponding dtrace data for the 7- and 8-raidz cases.
(Should have also asked for 4-raidz data. Jonathan, you can still send the 4-raidz data to me offline.)

In 7-raidz, each disk had writes in two sizes,
214 blocks or 85 blocks, in equal counts:

 DEVICE  BLKs  COUNT
 ------  ----  -----
 sd1       85  27855
          214  27882
 sd2       85  27854
          214  27868
 sd3       85  27849
          214  27884
 ...

In 8-raidz, sd1,3,5,7 had either 220- or 221-block writes in equal
counts, while sd2,4,6,8 had 100% 146-block writes:

 DEVICE  BLKs  COUNT
 ------  ----  -----
 sd1      220  16325
          221  16338
 sd2      146  49001
 sd3      220  16335
          221  16333
 sd4      146  49005
 sd5      220  16340
          221  16324
 sd6      146  49001
 sd7      220  16332
          221  16333
 sd8      146  49009
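
(A per-device, per-size count like this can be pulled with a one-line
io-provider DTrace aggregation; this is only a sketch of the idea, not
necessarily how the data above was gathered:)

  # count block-device writes, keyed by device and size in 512-byte blocks
  dtrace -n 'io:::start /!(args[0]->b_flags & B_READ)/
      { @[args[1]->dev_statname, args[0]->b_bcount / 512] = count(); }'
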
In terms of average write response time, in 7-raidz:

 DEVICE  WRITE  AVG.ms
 ------  -----  ------
 sd1     63990   54.03
 sd2     64000   53.65
 sd3     63898   55.48
 sd4     64190   54.14
 sd5     64091   54.81
 sd6     63967   57.83
 sd7     64092   54.19

and in 8-raidz:

 DEVICE  WRITE  AVG.ms
 ------  -----  ------
 sd1     42276    6.64
 sd2     58467   19.66
 sd3     42287    6.24
 sd4     55198   20.01
 sd5     42285    6.64
 sd6     58409   22.90
 sd7     42235    6.88
 sd8     54967   24.46

At the bdev level, 8-raidz shows much better turnaround time than
7-raidz, and within 8-raidz, disks 1,3,5,7 (larger writes) do better
than 2,4,6,8 (smaller writes).

So 8-raidz wins through larger writes and much better response time per
write - but why these two differences? And why the disparity between the
odd- and even-numbered disks within 8-raidz?

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

2006-06-23 Thread Tao Chen
I should copy this to the list.

-- Forwarded message --

On 6/23/06, Joe Little [EMAIL PROTECTED] wrote:

  I can post back to Roch what this latency is. I think the latency is a
  constant regardless of the zil or not. All that I do by disabling the
  zil is that I'm able to submit larger chunks at a time (faster) than
  doing 1k or worse blocks 3 times per file (the NFS fsync penalty).

Please send the script (I attached a modified version) along with the result.
They need to see how it works to trust (or dispute) the result.
Rule #1 in performance tuning: do not trust the report from an unproven tool :)

I have some comments on the output below.

  This is for a bit longer (16 trees of 6250 8k files, again with zil disabled):

  Generating report from biorpt.sh.rec ...

  === Top 5 I/O types ===
  DEVICE  T  BLKs  COUNT
  ------  -  ----  -----
  sd2     W   256   3095
  sd1     W   256   2843
  sd1     W     2    201
  sd2     W     2    197
  sd1     W    32    185

This part tells me the majority of I/Os are 128KB writes on sd2 and sd1.

  === Top 5 worst I/O response time ===
  DEVICE  T  BLKs     OFFSET   TIMESTAMP  TIME.ms
  ------  -  ----  ---------  ----------  -------
  sd2     W   175  529070671   85.933843  3559.55
  sd1     W   256  521097680   47.561918  3097.21
  sd1     W   256  521151968   54.944253  3090.42
  sd1     W   256  521152224   54.944207  3090.23
  sd1     W    64  521152480   54.944241  3090.21

The longest response times are more than 3 seconds, ouch.

  === Top 5 Devices with largest number of I/Os ===
  DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB   IOs  SEEK
  ------  ----  ------  --  -----  ------  ---  ----  ----
  sd1        6    0.34   0   4948  387.88  413  4954    0%
  sd2        6    0.25   0   4230  387.07  405  4236    0%
  cmdk0     23    8.11   0    152    0.84    0   175   10%

An average response time of over 300ms is bad.

I calculate the SEEK rate on a 512-byte-block basis; since the I/Os are
mostly 128K, the seek rate comes out below 1% (shown as 0%), in other
words I consider this mostly sequential I/O. I guess it's debatable
whether a 512-byte-based calculation is meaningful.

  === Top 5 Devices with largest amount of data transfer ===
  DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB  Tot.MB  MB/s
  ------  ----  ------  --  -----  ------  ---  ------  ----
  sd1        6    0.34   0   4948  387.88  413     413     4
  sd2        6    0.25   0   4230  387.07  405     405     4
  cmdk0     23    8.11   0    152    0.84    0       0     0

  === Report saved in biorpt.sh.rec.rpt ===

I calculate the MB/s on a per-second basis, meaning as long as there is
at least one finished I/O on the device within a given second, that
second is counted when calculating throughput.
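
(For anyone curious, the heart of a script like this is just pairing
io:::start with io:::done in DTrace. A stripped-down sketch of the idea -
not the attached biorpt.sh itself - would look something like:)

  #!/usr/sbin/dtrace -s
  /* remember when each buffer was handed to the device */
  io:::start
  {
      start[arg0] = timestamp;
  }

  /* on completion, print device, R/W, offset, size (512-byte blocks)
     and elapsed time in msec.xx */
  io:::done
  /start[arg0]/
  {
      this->ns = timestamp - start[arg0];
      printf("%s %s %d %d %d.%02d\n",
          args[1]->dev_statname,
          args[0]->b_flags & B_READ ? "R" : "W",
          args[0]->b_blkno,
          args[0]->b_bcount / 512,
          this->ns / 1000000,
          (this->ns / 10000) % 100);
      start[arg0] = 0;
  }
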
Tao




biorpt.sh
Description: Bourne shell script
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

2006-06-23 Thread Tao Chen
On 6/23/06, Richard Elling [EMAIL PROTECTED] wrote:
  comment on analysis below...

  Tao Chen wrote:
   === Top 5 Devices with largest number of I/Os ===
   DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB   IOs  SEEK
   ------  ----  ------  --  -----  ------  ---  ----  ----
   sd1        6    0.34   0   4948  387.88  413  4954    0%
   sd2        6    0.25   0   4230  387.07  405  4236    0%
   cmdk0     23    8.11   0    152    0.84    0   175   10%

   An average response time of over 300ms is bad.

  Average is totally useless with this sort of a distribution.
  I'd suggest using a statistical package to explore the distribution.
  Just a few 3-second latencies will skew the average quite a lot.

  -- richard

A summary report is nothing more than an indication of issues, or non-issues.
So I agree that an average is just that, an average. However, a few
3-second latencies will not spoil the result too much when there are
more than 4000 I/Os sampled.

The script saves the raw data in a .rec file, so you can run whatever
statistics tool you have against it. I am currently more worried about
how accurate and useful the raw data is, which is generated from a
DTrace command in it.
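
(As an aside, the kind of distribution Richard asks for can also come
straight from DTrace's quantize() aggregation, without post-processing
the .rec file. Just a sketch, separate from what biorpt.sh does:)

  # per-device distribution of I/O completion times, in ns (power-of-two buckets)
  dtrace -n 'io:::start { s[arg0] = timestamp; }
      io:::done /s[arg0]/
      { @[args[1]->dev_statname] = quantize(timestamp - s[arg0]); s[arg0] = 0; }'
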
The raw records are in this format:
- Timestamp (sec.microsec)
- DeviceName
- W/R
- BLK_NO (offset)
- BLK_CNT (I/O size)
- IO_Time (I/O elapsed time, msec.xx)

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and databases

2006-05-12 Thread Tao Chen

On 5/12/06, Roch Bourbonnais - Performance Engineering
[EMAIL PROTECTED] wrote:


  From: Gregory Shaw [EMAIL PROTECTED]
  Regarding directio and quickio, is there a way with ZFS to skip the
  system buffer cache?  I've seen big benefits for using directio when
  the data files have been segregated from the log files.


Were the benefits coming from extra concurrency (no single writer lock)


Does DIO bypass the writer lock on Solaris?
Not on AIX, which uses CIO (concurrent I/O) to bypass lock management
at the filesystem level:
http://oracle.ittoolbox.com/white-papers/improving-database-performance-with-aix-concurrent-io-2582


or avoiding the extra copy to page cache


Certainly. Also to avoid VM overhead (DB does like raw devices).


or from too much readahead that is not used before pages need to be recycled.


Not sure what you mean (avoid unnecessary readahead?)


ZFS already has the concurrency.


Interesting, would like to find more on this.


The page cache copy is really rather cheap


VM as a whole is certainly not cheap.


and I assert somewhat necessary to insure data integrity.


Not following you.


The extra readahead is somewhat of a bug in UFS (read 2
pages, get a maxcontig chunk (1MB)).


Ouch.



ZFS is new; conventional wisdom may or may not apply.



This (zfs-discuss) is the place where we can be enlightened :-)

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and databases

2006-05-11 Thread Tao Chen

On 5/11/06, Peter Rival [EMAIL PROTECTED] wrote:

Richard Elling wrote:
 Oracle will zero-fill the tablespace with 128kByte iops -- it is not
 sparse.  I've got a scar.  Has this changed in the past few years?

 Multiple parallel tablespace creates are usually a big pain point for
filesystem / cache interaction, and also fragmentation once in a while.  The
latter ZFS should take care of; the former, well, I dunno.



The purpose of a zero-filled tablespace is to prevent fragmentation by
future writes, in the case where multiple tablespaces are being
updated/filled on the same disk, correct?
This becomes pointless on ZFS, since it never overwrites a
pre-allocated block in place, i.e. the tablespace becomes fragmented in
that case no matter what.

Also, in order to write a partial update to a new block, ZFS needs the
rest of the original block, hence Roch's point that partial writes to
blocks that are not in cache are much slower than writes to blocks that
are. Fortunately I think a DB almost always does aligned full-block I/O
- is that right?
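
(If it doesn't, the usual knob - my suggestion, not something from this
thread - is to match the ZFS recordsize to the database block size before
the datafiles are created, so a DB block write never has to pull in the
rest of a larger ZFS record. The dataset name is a made-up example:)

  # e.g. for an 8K database block size; must be set before the datafiles are created
  zfs set recordsize=8k tank/oradata
  zfs get recordsize tank/oradata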

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss