Re: high load, no bottleneck
Hello, on Saturday 28 September 2013 04:54:19 you wrote: Date: Sat, 28 Sep 2013 09:09:02 +0100 From: Roland C. Dowdeswell el...@imrryr.org Message-ID: 20130928080902.gg4...@roofdrak.imrryr.org | I thought quite some time ago that it probably makes sense for us | to make the installer mount everything async to extract the sets | because, after all, if the install fails you are rather likely to | simply re-install rather than manually figure out where it was and | fix it. I have had the same thought - the one downside for this usage is the sync time at umount - for novice users this would appear to be a hang/crash after the install was almost finished (since it might take minutes to complete), to which a natural response would be to push the reset button and try again... So, I think this would need to be an option to be used by those who know to expect that delay, and so default off. Or just throw up a warning - Hey, this can take a few minutes, don't freak out! have fun Michael
Re: high load, no bottleneck
Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? Do you mean all these requests *could* be honored on first flush? If so, then yes, I agree. Last flush, surely? There may be pending operations between the first FSYNC and the last which need to be performed before the last FSYNC is done. Looking at the code in XFS that gathers barrier events, it seems to be drawing a DAG... That's exactly what I'd expect: write--flush dependency _is_ a DAG, with barriers being chokepoints (I forget the graph-theory term for them) in the DAG. Mouse
Re: high load, no bottleneck
Date: Sat, 28 Sep 2013 14:24:32 +1000 From: matthew green m...@eterna.com.au Message-ID: 11701.1380342...@splode.eterna.com.au | -o async is very dangerous. there's not even the vaguest | guarantee that even fsck can help you after a crash in | that case... All true, still it is remarkably useful - I use it all the time (incidentally, while the man page says that -o log and -o async can't be used together, if they are, the result is a panic, rather than a more graceful error message ... I should point out that I saw this on a remount, adding -o async to a filesys that had been mounted -o log - with -o log in fstab ... I haven't been inclined to panic the system more by running more tests, just removed the -o log which wasn't really needed for that filesys, it is mostly either -o async or -o ro).

My strategy is to newfs a filesystem, mount it -o async, extract files into it (extracting a pkgsrc.tgz can be done in a few seconds if the system has sufficient ram for a large buffer cache - the subsequent sync / umount takes ages - but the combined time is still much less than any other strategy for filling a filesys with lots of files, and the filesys can be happily used in parallel with the sync - -o async helps less if the files are relatively big of course, but for pkgsrc it is ideal).

Should the system crash for any reason while all this is happening, I simply start again, from the newfs - that is so rare (NetBSD being mostly stable, and with a UPS to guard against power problems) that the extra delay work that might be required is irrelevant - and in any case, I suspect that I could newfs/mount -oasync/tar x/crash/newfs/mount/tar x faster than a simple tar x on a normally mounted filesystem, even with -o log.

I also mount the filesystem I use for pkg_comp sandboxes with -o async. Again, should the system crash, I don't care, simply newfs and make the sandbox(es) again. This vastly improves compile times (particularly cleanup times - a newfs followed by repopulating the sandbox is quite fast ... even a rm -fr on the sandbox, and repopulate, is MUCH faster than make clean on any sizeable package with many dependencies - I do that between package builds to guarantee no accidental undesired pollution.)

Of course, to do this, one must believe in filesystems as useful objects, and not simply a nuisance created out of the necessity of drives that were too small, which should be avoided wherever possible. Some of my systems have approaching 40 mounted filesystems - filesystems are first-class objects - they're the unit for mount options (like -o ro, -o async, and -o log), they're the unit for exports, they're the unit for dumps. Using them intelligently makes system management much more flexible.

We are still lacking some facilities that would make things even better, including filesystems that could easily grow/shrink as needed, so the argument about running out of space in one filesystem while there is plenty available in another could be ignored - it is the only argument against multiple filesystems with any real merit, and it is true only because we allow it to remain so, it doesn't have to be (Digital Unix's ADVFS proved that decades ago).
There's more that could be done to improve things - including handling fsck better at startup - the system should be able to come up multi-user before all filesystems are checked and mounted; only some subset (of the system with almost 40, I think it needs about 8 to function for 99% of its uses - the rest are specialised) is really needed, the rest should be checked and, when ready, mounted after the system is running -- -o log helps there, but isn't really enough (like for many of the filesystems I have, if they were never to become available, because of hardware failure or something, it should not prevent successful multi-user boot.) kre ps: I had been meaning to rant like this for some time, your message just provided the incentive today!
re: high load, no bottleneck
ps: I had been meaning to rant like this for some time, your message just provided the incentive today! :-) i will note that i'm also a fan of using -o async FFS mounts in the right place. i just wouldn't do it for a file server :-)
Re: high load, no bottleneck
On Sat, Sep 28, 2013 at 05:56:50PM +1000, matthew green wrote: ps: I had been meaning to rant like this for some time, your message just provided the incentive today! :-) i will note that i'm also a fan of using -o async FFS mounts in the right place. i just wouldn't do it for a file server :-) I thought quite some time ago that it probably makes sense for us to make the installer mount everything async to extract the sets because, after all, if the install fails you are rather likely to simply re-install rather than manually figure out where it was and fix it. -- Roland Dowdeswell http://Imrryr.ORG/~elric/
Re: high load, no bottleneck
Robert Elz k...@munnari.oz.au wrote: incidentally, while the man page says that -o log and -o async can't be used together, if they are, the result is a panic, rather than a more graceful error message ... This could be a real problem on a system that allows unprivileged users to mount thumb drives... -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
m...@netbsd.org (Emmanuel Dreyfus) writes: Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? That should be trivial to fix then. Don't flush if it isn't dirty.
Re: high load, no bottleneck
Michael van Elst mlel...@serpens.de wrote: Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? That should be trivial to fix then. Don't flush if it isn't dirty. Here is the backtrace I collected for one of the many stuck processes waiting for I/O:
turnstile_block
rw_vector_enter
wapbl_begin
ffs_write
VOP_WRITE
nfsrv_write
nfssvc_nfsd
sys_nfssvc
syscall
So we would add a we_dirty flag to struct wapbl_entry? When would it be set to 1? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
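For what "don't flush if it isn't dirty" might look like, here is a minimal sketch. Every name in it (wapbl_sketch, wl_gen_dirtied, wl_gen_flushed) is invented rather than taken from the real vfs_wapbl.c, and it is written as standalone userland C with a pthread mutex so it compiles on its own; in the kernel the same bookkeeping would live under the existing WAPBL locks. The assumption it encodes for "when would it be set": the dirty marker advances whenever a transaction queues new log data, and a flush whose generation has already been committed returns without issuing another disk cache flush.

#include <pthread.h>
#include <stdint.h>

/* Invented stand-in for the relevant bits of the WAPBL state. */
struct wapbl_sketch {
	pthread_mutex_t	wl_mtx;
	uint64_t	wl_gen_dirtied;	/* bumped when log data is queued */
	uint64_t	wl_gen_flushed;	/* last generation synced to disk */
};

int wapbl_sketch_write_and_sync(struct wapbl_sketch *wl);	/* the real work */

/* Called wherever a transaction adds entries to the in-memory log. */
void
wapbl_sketch_mark_dirty(struct wapbl_sketch *wl)
{
	pthread_mutex_lock(&wl->wl_mtx);
	wl->wl_gen_dirtied++;
	pthread_mutex_unlock(&wl->wl_mtx);
}

/* Called from an fsync path: skip the disk cache flush if nothing is dirty. */
int
wapbl_sketch_flush(struct wapbl_sketch *wl)
{
	uint64_t gen;
	int error;

	pthread_mutex_lock(&wl->wl_mtx);
	gen = wl->wl_gen_dirtied;
	if (gen == wl->wl_gen_flushed) {
		/* Already clean: no journal write, no cache flush. */
		pthread_mutex_unlock(&wl->wl_mtx);
		return 0;
	}
	pthread_mutex_unlock(&wl->wl_mtx);

	error = wapbl_sketch_write_and_sync(wl);

	pthread_mutex_lock(&wl->wl_mtx);
	if (error == 0 && gen > wl->wl_gen_flushed)
		wl->wl_gen_flushed = gen;	/* everything up to gen is now clean */
	pthread_mutex_unlock(&wl->wl_mtx);
	return error;
}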
Re: high load, no bottleneck
On Sat, Sep 28, 2013 at 06:25:22AM +0200, Emmanuel Dreyfus wrote: Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? Do you mean all these requests *could* be honored on first flush? If so, then yes, I agree. Unfortunately, it may be that we lost some of the framework necessary to do this when we reverted the softdep code. Looking at the code in XFS that gathers barrier events, it seems to be drawing a DAG...
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. Further testing shows that server with -o log / client with -o async has no performance problem. OTOH, the client sometimes complains about write errors. -o async seems dangerous. Too many fsync() calls causing too many WAPBL flushes seems confirmed to be the problem at work here. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Thor Lancelot Simon t...@panix.com wrote: It should be possible to gather those requests and commit many of them at once to disk with a single cache flush operation, rather than issuing a cache flush for each one. This is not unlike the problem with nfs3 in general, that many clients at once may issue WRITE RPCs followed by COMMIT RPCs, and the optimal behavior is to gather the COMMITs, service many at a time, then respond to them all - If I understand correctly, the current situation is that each NFS client fsync() causes a server WAPBL flush, while the other NFS clients' fsync()s are waiting. The situation is obvious with NFS, but it also probably exists with local I/O, when one VOP_FSYNC causes a WAPBL flush and lets other VOP_FSYNCs wait. Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
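The gathering Thor describes is essentially group commit. Below is a sketch of the idea with invented names, written as standalone userland C with pthreads rather than the actual WAPBL or NFS server code: each caller notes the sequence number it needs on stable storage, one caller performs the physical flush for the whole batch, and every waiter covered by that batch returns without issuing a flush of its own.

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t gc_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gc_cv  = PTHREAD_COND_INITIALIZER;
static uint64_t gc_pending;	/* highest sequence asking for stable storage */
static uint64_t gc_done;	/* highest sequence known to be on stable storage */
static int gc_flushing;		/* nonzero while a flusher is running */

int flush_to_stable_storage(void);	/* stands in for the real cache flush */

int
group_fsync(void)
{
	int error = 0;

	pthread_mutex_lock(&gc_mtx);
	uint64_t want = ++gc_pending;

	while (gc_done < want) {
		if (!gc_flushing) {
			/* Become the flusher for everything queued so far. */
			uint64_t batch = gc_pending;

			gc_flushing = 1;
			pthread_mutex_unlock(&gc_mtx);
			error = flush_to_stable_storage();
			pthread_mutex_lock(&gc_mtx);
			gc_flushing = 0;
			if (error == 0 && batch > gc_done)
				gc_done = batch;	/* one flush covered the whole batch */
			pthread_cond_broadcast(&gc_cv);
			if (error != 0)
				break;
		} else {
			/* Someone else is flushing; their flush may already cover us. */
			pthread_cond_wait(&gc_cv, &gc_mtx);
		}
	}
	pthread_mutex_unlock(&gc_mtx);
	return error;
}

With this shape, N concurrent fsync() callers cost one flush instead of N serialized flushes, which is exactly the saving being discussed for both the NFS COMMIT case and the local VOP_FSYNC case.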
re: high load, no bottleneck
I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. Further testing shows that server with -o log / client with -o async has no performance problem. OTOH, the client sometimes complain about write errors. -o async seems dangerous. -o async is very dangerous. there's not even the vaguest guarantee that even fsck can help you after a crash in that case... .mrg.
Re: high load, no bottleneck
I tried moving a client NFS mount to async. [...] Further testing shows that server with -o log / client with -o async has no performance problem. OTOH, the client sometimes complain about write errors. -o async seems dangerous. -o async is very dangerous. there's not even the vaguest guarantee that even fsck can help you after a crash in that case... I think you're confusing -o async on the server-side (FFS, etc) mount with -o async on the client-side (NFS) mount. The former, yes, agreed (though fsck will be able to put things back together just about often enough to lull people into a false sense of security...). The latter, things aren't quite so dismal - depending on what the client is doing and what kind of damage it's prepared to tolerate, -o async on the NFS mount (client with -o async) may actually be a sane thing to do. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: async Assume that unstable write requests have actually been committed to stable storage on the server, and thus will not require resending in the event that the server crashes. Use of this option may improve performance but only at the risk of data loss if the server crashes. Note: this mount option will only be honored if the nfs.client.allow_async option in nfs.conf(5) is also enabled. I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. What are the consequences? I understand that if I use -o log server-side, I will still benefit from regular flushes. I will lose the guarantee that client fsync(2) pushes data to stable storage, but I will not have corrupted files on server crash. Is that right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Thu, 26 Sep 2013, Emmanuel Dreyfus wrote: Emmanuel Dreyfus m...@netbsd.org wrote: async Assume that unstable write requests have actually been committed to stable storage on the server, and thus will not require resending in the event that the server crashes. Use of this option may improve performance but only at the risk of data loss if the server crashes. Note: this mount option will only be honored if the nfs.client.allow_async option in nfs.conf(5) is also enabled. I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. What are the consequences? I understand that if I use -o log server-side, I will still benefit from regular flushes. I will lose the guarantee that client fsync(2) pushes data to stable storage, but I will not have corrupted files on server crash. Is that right? As I understand things, -o log (wapbl) doesn't guarantee file content, only fs metadata.
-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------
Re: high load, no bottleneck
I have no idea whether [several journal flushes per second] is high or low. It should be killing you. So the main question is who is issuing these small sync writes. As already mentioned, per filesync write you get a WAPBL journal flush which ends up in two disc flushes (one before and one after). This is a two-disk RAID 1 Wait, this is a Level 1 RAID? Then you don't have RMW. You may compare write throughput to read throughput.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: I have no idea whether [several journal flushes per second] is high or low. It should be killing you. So the main question is who is issuing these small sync writes. As already mentioned, per filesync write you get a WAPBL journal flush which ends up in two disc flushes (one before and one after). I understand the rationale, but I do not see how that could be fixed. We want fsync to do a disk sync, and clients are unlikely to be fixable. This is a two-disk RAID 1 Wait, this is a Level 1 RAID? Then you don't have RMW. RMW? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
We want fsync to do a disk sync, and client are unlikely to be fixable. In my case, the culprit was SQLite used by browsers and dropbox. As these were not fixable, I ended up writing a system that re-directs these SQLite files to local storage (http://www.math.uni-bonn.de/people/ef/dotcache). RMW? Read-Modify-Write. On a RAID 4/5, writing anything that's not an entire stripe needs either to read the rest of the stripe (to be able to compute the new parity) before writing the modified part and the parity; or it (if you modify less than half the stripe) reads both the old data and old parity to compute the new parity. You don't have that on RAID 1, of course.
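To illustrate the read-modify-write Edgar describes, here is the standard RAID 4/5 small-write parity update (generic parity math, not RAIDframe code): to rewrite one chunk without touching the rest of the stripe, the array reads the old data and the old parity, then folds the change into the parity with XOR before writing both back.

#include <stddef.h>
#include <stdint.h>

static void
rmw_update_parity(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
	for (size_t i = 0; i < len; i++)
		/* new parity = old parity ^ old data ^ new data */
		parity[i] ^= old_data[i] ^ new_data[i];
}

Those two extra reads (old data and old parity) plus the two writes are why every small synchronous write costs a stripe RMW on RAID 4/5; a RAID 1 mirror just writes both copies, so it escapes this penalty.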
Re: high load, no bottleneck
http://www.math.uni-bonn.de/people/ef/dotcache/ has a typo in the first subheading Dotache :) On 24 September 2013 13:38, Edgar Fuß e...@math.uni-bonn.de wrote: We want fsync to do a disk sync, and client are unlikely to be fixable. In my case, the culprit was SQLite used by browsers and dropbox. As these were not fixable, I ended up writing a system that re-directs these SQLite files to local storage (http://www.math.uni-bonn.de/people/ef/dotcache). RMW? Read-Modify-Write. On a RAID 4/5, writing anything that's not an entire stripe needs either to read the rest of the stripe (to be able to compute the new parity) before writing the modified part and the parity; or it (if you modify less than half the stripe) reads both the old data and old parity to compute the new parity. You don't have that on RAID 1, of course.
Re: high load, no bottleneck
crap, apologies for the unchecked return address. In the interest of trying to make a relevant reply - doesn't nfs3 support differing COMMIT sync levels which could be leveraged for this? (assuming your server is stable :) aside I recall using NFS for file storage at Dreamworks in the late '90s and discovering that the reason the SGI file server boxes outperformed everything else is that they lied to the client and indicated data had been synced to disk as soon as it hit memory. Wonderful performance feature... until someone insisted on putting known buggy ATM drivers into production, which could give up to a GB of lost data when the fileservers panicked... /aside
Re: high load, no bottleneck
David Brownlee a...@netbsd.org wrote: In the interest of trying to make a relevant reply - doesn't nfs3 support differing COMMIT sync levels which could be leveraged for this? (assuming your server is stable :) Would it be this mount_nfs option? (from the MacOS X man page) async Assume that unstable write requests have actually been committed to stable storage on the server, and thus will not require resending in the event that the server crashes. Use of this option may improve performance but only at the risk of data loss if the server crashes. Note: this mount option will only be honored if the nfs.client.allow_async option in nfs.conf(5) is also enabled. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
EF However, the amount of filesync writes may still be the problem. EF The missing data (for me) is how often your WAPBL journal gets flushed ED How can that be retrieved? Look at the WAPBL debug output in syslog (which has time stamps). EF How large are your stripes, btw.? ED It is the sectPerSU in raidctl -s output, right? Multiplied by the number of data discs (i.e. discs minus one for level 5).
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: EF However, the amount of filesync writes may still be the problem. EF The missing data (for me) is how often your WAPBL journal gets flushed ED How can that be retrieved? Look at the WAPBL debug output in syslog (which has time stamps). min: 1 flush/s, max: 6 flush/s, mean: 3.2 flush/s, standard deviation: 0.33. I have no idea whether this is high or low. EF How large are your stripes, btw.? ED It is the sectPerSU in raidctl -s output, right? Multiplied by the number of data discs (i.e. discs minus one for level 5). This is a two-disk RAID 1 with this raidctl -s output:
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 972799936
The answer would therefore be 32 * 2 = 64 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
I confirm. Indeed this is weird. Yes. I would try finding out who causes this. But how could small writes kill WAPBL performance? I don't think it would. However, the amount of filesync writes may still be the problem. The missing data (for me) is how often your WAPBL journal gets flushed (because of these filesync writes). If this happens once a second or even more often, it may completely stall your discs. One filesync write causes one journal flush which causes two disc cache flushes. Additionally, every single small sync write will cause one RAID stripe RMW. How large are your stripes, btw.?
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de writes: I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? Do you have physical access to the server during the test? Then you could have a look whether the discs are really idle (as iostat says) or busy (as disabling flushing mitigating the problem suggests). I have a WD Elements external USB drive, and it seems to take most of a second to execute a cache flush (via wapbl). It was basically unusable with rdiff-backup (which calls fsync, which does a log truncation) with cache flush enabled. The real fix was disabling the fsync in rdiff-backup, but disabling the wapbl cache flush helped.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: I don't think it would. However, the amount of filesync writes may still be the problem. The missing data (for me) is how often your WAPBL journal gets flushed (because of these filesync writes). How can that be retrieved? Additionally, every single small sync write will cause one RAID stripe RMW. How large are your stripes, btw.? It is the sectPerSU in raidctl -s output, right?
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 972799936
RAID Level: 1
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
ED 2908 getattr EF During which timeframe? ED 22.9 seconds. So that's 100 getattrs per second. Indeed [lots of 549-byte write requests] is weird. But how could small writes kill WAPBL performance? No idea. I think I'm out of luck now, but maybe it rings a bell with someone else. It would probably help finding out (with WAPBL logging) how often the journal flushes happen. I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? Do you have physical access to the server during the test? Then you could have a look whether the discs are really idle (as iostat says) or busy (as disabling flushing mitigating the problem suggests). Any experts to explain what exactly busy means in the iostat time sense? Maybe you also get some hint by trying to find out whether the problem is NFS or client related. Are you able to reproduce it locally? Can you make it happen (to a lesser extent, of course) with a single process?
Re: high load, no bottleneck
On Sat, Sep 21, 2013 at 11:44:26AM +0200, Edgar Fuß wrote: ED 2908 getattr EF During which timeframe? ED 22.9 seconds. So that's 100 getattrs per second. Indeed [lots of 549-byte write requests] is weird. But how could small writes kill WAPBL performance? No idea. I think I'm out of luck now, but maybe it rings a bell with someone else. It would probably help finding out (with WAPBL logging) how often the journal flushes happen. I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? I suspect that indeed, while a flush cache command is running, the disk is not considered busy. Only read and write commands are tracked. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: high load, no bottleneck
I suspect that indeed, while a flush cache command is running, the disk is not considered busy. Only read and write commands are tracked. Would it a) make sense b) be possible to implement that flushes are counted as busy?
Re: high load, no bottleneck
On Sat, Sep 21, 2013 at 12:01:45PM +0200, Edgar Fuß wrote: I suspect that indeed, while a flush cache command is running, the disk is not considered busy. Only read and write commands are tracked. Would it a) make sense b) be possible to implement that flushes are counted as busy? It would probably make sense. But it requires a bit of work: right now, disk_busy()/disk_unbusy() takes as argument the byte count and the op type (read/write). I think we should separate flushes from writes so that writes don't get counted twice (also the byte count cannot be guessed for flushes; we don't know how much data there is in the disk cache). Maybe we could count flushes as 0-byte writes, but I'm not sure that disk_busy()/disk_unbusy() would handle this properly. Or we could add a third operation type, but then disk_busy()/disk_unbusy() and userland tools would need to be changed to handle it ... -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
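A sketch of the "third operation type" idea, with invented names rather than the actual NetBSD disk(9)/iostat interface: the real disk_unbusy() takes a byte count plus a read/write flag, and since a flush moves no bytes, one possible shape is an explicit op enum and a separate flush counter that accounting tools could then report.

#include <stdint.h>

enum disk_op { DISK_OP_READ, DISK_OP_WRITE, DISK_OP_FLUSH };

struct disk_iostats {
	uint64_t rxfer, wxfer, fxfer;	/* completed reads / writes / flushes */
	uint64_t rbytes, wbytes;	/* flushes contribute no byte count */
};

static void
disk_unbusy_sketch(struct disk_iostats *ds, long bcount, enum disk_op op)
{
	switch (op) {
	case DISK_OP_READ:
		ds->rxfer++;
		ds->rbytes += (uint64_t)bcount;
		break;
	case DISK_OP_WRITE:
		ds->wxfer++;
		ds->wbytes += (uint64_t)bcount;
		break;
	case DISK_OP_FLUSH:
		ds->fxfer++;	/* counted toward busy time, but zero bytes */
		break;
	}
	/* busy-time bookkeeping (timestamps) left out of this sketch */
}

This avoids the double counting Manuel mentions for the "0-byte write" alternative, at the cost of touching every driver call site and the userland tools that read the statistics.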
Re: high load, no bottleneck
On Sat, Sep 21, 2013 at 09:58:30AM +0000, Michael van Elst wrote: e...@math.uni-bonn.de (Edgar Fuß) writes: I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? busy means that the driver is executing I/O requests or waiting for results from such operations. The cache flush operation doesn't count by itself, but it slows down parallel I/O operations, so the disk appears busy as long as there are such I/O operations. At least for wd(4) (and I suspect it's true for sd(4) as well), new I/Os while a flush cache is running are stalled in the buffer queue, and so disk_busy() is only called once the flush cache has completed. So, the disk is effectively counted as unbusy while a cache flush is in progress. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: Maybe also you get some hint by trying to find out whether the problem is NFS or client related. Are you able to reproduce it locally? Can you make it happen (to a lesser extent, of course) with a single process? I tried various ways but I could not obtain the same phenomenon locally: if I get high load, it is because of CPU usage. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Here is an excerpt:
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        13.74    27   0.00    0.36
wd1           0.00     0   0.00    0.00        13.74    27   0.00    0.36
raid1         0.00     0   0.00    0.00        20.61    18   0.00    0.36
Provided raid1 is where the data lives, both the RAID and its components are quiet. Output of your first script: 2908 getattr During which timeframe? Output of your second script: 167 549 unstable 28 549 filesync So you mostly get 549-byte write requests? Could you manually double-check this? It sounds so weird that I'm afraid of an error in my script mis-interpreting your tcpdump's output format. In any case, I'm afraid you are not facing the kind of problems I inadvertently became a sort-of-expert in. Those 549-byte writes, should they prove real, may give (others) a hint of what's going wrong, though.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: Output of your first script: 2908 getattr During which timeframe? 22.9 seconds. Output of your second script: 167 549 unstable 28 549 filesync So you mostly get 549-byte write requests? Could you manually double-check this? I confirm. Indeed this is weird. But how could small writes kill WAPBL performance? The load does not get beyond 1 if mounted without -o log... -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
I re-enabled -o log and did the dd test again on NetBSD 6.0 with the patch you posted and vfs.wapbl.verbose_commit=2 I wouldn't expect anything interesting from this, but maybe hannken@ does. Running my stress test, which drives load to insane values: How often do these log flushes occur? During the stress phase, what does iostat -D -x -w 1 show for the raid and for the components, especially in the time column? During the stress, do you see small synchronous writes in NFS traffic? I attach two small shell scripts I wrote to extract statistics from tcpdump output. Both operate either on a pcap file, e.g. the output of tcpdump ... -s slen -w file port nfs (where I don't remember what the minimum slen is to capture all relevant info) or a textual tcpdump -vvv output (which is usually smaller), e.g. tcpdump ... -s 0 -vvv port nfs > file. As output, you get a list of NFS calls and a count (nfsstat) or a list of write calls ordered by size/sync and their counts.
nfsstat.sh Description: Bourne shell script
nfswritestat.sh Description: Bourne shell script
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote: Emmanuel Dreyfus m...@netbsd.org wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: Switched to AHCI. Here is below how hard disks are discovered (the relevant raid is RAID1 on wd0 and wd1) In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads, on both 6.0 and -current. I assume there must be something bad with WAPBL/RAIDframe There is at least one thing: RAIDframe doesn't allow enough simultaneously pending transactions, so everything *really* backs up behind the cache flush. Fixing that would require allowing RAIDframe to eat more RAM. Last time I proposed that, I got a rather negative response here. Thor
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote: Emmanuel Dreyfus m...@netbsd.org wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: Switched to AHCI. Here is below how hard disks are discovered (the relevant raid is RAID1 on wd0 and wd1) The other thing is, *if* we had support for more modern features (tagged queueing, FUA, etc.) in our ATA code, switching to AHCI mode could potentially have much more benefit. But it doesn't now -- and the development work required to add support for those features to the ATA code and to use them in FFS and WAPBL is not small. Thor
Re: high load, no bottleneck
On Thu, Sep 19, 2013 at 08:13:42AM -0400, Thor Lancelot Simon wrote: There is at least one thing: RAIDframe doesn't allow enough simultaneously pending transactions, so everything *really* backs up behind the cache flush. Fixing that would require allowing RAIDframe to eat more RAM. Last time I proposed that, I got a rather negative response here. It could be optional so that everyone is happy, couldn't it? -- Emmanuel Dreyfus m...@netbsd.org
Re: high load, no bottleneck
On Thu, 19 Sep 2013 10:29:55 -0400 chris...@zoulas.com (Christos Zoulas) wrote: On Sep 19, 8:13am, t...@panix.com (Thor Lancelot Simon) wrote: -- Subject: Re: high load, no bottleneck | On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote: | Emmanuel Dreyfus m...@netbsd.org wrote: | | Thank you for saving my day. But now what happens? | I note the SATA disks are in IDE emulation mode, and not AHCI. This is | something I need to try changing: | | Switched to AHCI. Here is below how hard disks are discovered (the relevant raid | is RAID1 on wd0 and wd1) | | In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads, on both 6.0 | and -current. I assume there must be something bad with WAPBL/RAIDframe | | There is at least one thing: RAIDframe doesn't allow enough simultaneously | pending transactions, so everything *really* backs up behind the cache flush. | | Fixing that would require allowing RAIDframe to eat more RAM. Last time I | proposed that, I got a rather negative response here. sysctl to the rescue. The appropriate 'bit to twiddle' is likely raidPtr->openings. Increasing the value can be done while holding raidPtr->mutex. Decreasing the value can also be done while holding raidPtr->mutex, but will need some care if attempting to decrease it by more than the number of outstanding IOs. I'm happy to review any changes to this, but won't have time to code it myself, unfortunately :( Later... Greg Oster
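Here is a sketch of the adjustment Greg outlines. The names are invented and a pthread mutex stands in for raidPtr->mutex so the snippet is self-contained; raising openings is unconditional, while lowering it below the number of outstanding I/Os is simply refused here (a real implementation might instead record the target and shrink as I/Os complete, which is the "some care" Greg mentions).

#include <errno.h>
#include <pthread.h>

/* Invented, simplified stand-in for the relevant RAIDframe fields. */
struct raid_sketch {
	pthread_mutex_t	mutex;		/* raidPtr->mutex in the real driver */
	int		openings;	/* max simultaneously pending I/Os */
	int		outstanding;	/* I/Os currently in flight */
};

int
raid_set_openings_sketch(struct raid_sketch *r, int new_openings)
{
	int error = 0;

	if (new_openings < 1)
		return EINVAL;

	pthread_mutex_lock(&r->mutex);
	if (new_openings >= r->openings || new_openings >= r->outstanding) {
		/* Growing, or shrinking while still above what is in flight. */
		r->openings = new_openings;
	} else {
		/* Would drop below the I/Os already issued; would need the
		 * deferred-shrink handling described above. */
		error = EBUSY;
	}
	pthread_mutex_unlock(&r->mutex);
	return error;
}

Whether the knob is exposed through raidctl or a sysctl, the locking rule is the same: the value is only ever changed under the per-set mutex.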
Re: high load, no bottleneck
Greg Oster os...@cs.usask.ca wrote: sysctl to the rescue. The appropriate 'bit to twiddle' is likely raidPtr->openings. Increasing the value can be done while holding raidPtr->mutex. Decreasing the value can also be done while holding raidPtr->mutex, but will need some care if attempting to decrease it by more than the number of outstanding IOs. This suggests that in my problem, RAIDframe would be the bottleneck, given too many concurrent I/Os sent by WAPBL. But how is it possible? Aren't WAPBL flushes serialized? The change you suggest would be set by raidctl rather than sysctl, right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Hello. thor's right. The raidframe driver defaults to a ridiculously low number of maximum outstanding transactions for today's environment. This is not a criticism of how the number was chosen initially, but things have changed. In my production kernels around here, I include the following option, which is a number I derived from a bit of empirical testing. I found that for arrays of raid5 disks, I didn't get much benefit with higher numbers, but numbers below this did show a marked decline in performance. For example, on an amd64 machine with 32G of ram, I have a raid5 set with 12 disks running on 2 mpt(4) buses. I get the following read and write numbers written to a filesystem with softdep enabled on top of a dk(4) wedge built on the raid5 set: (This is NetBSD-5.1)
test# dd if=/dev/zero of=testfile bs=64k count=65535
65535+0 records in
65535+0 records out
4294901760 bytes transferred in 125.486 secs (34226142 bytes/sec)
test# dd if=testfile of=/dev/null bs=64k count=65535
65535+0 records in
65535+0 records out
4294901760 bytes transferred in 5.994 secs (716533493 bytes/sec)
The line I include in my config files is:
options RAIDOUTSTANDING=40 #try and enhance raid performance.
Re: high load, no bottleneck
Hello. the worst case scenario is when a raid set is running in degraded mode. Greg sent me some notes on how to calculate the memory utilization in this instance. I'll go dig them out and send them along in a bit. In theory, if all your raid sets are in degraded mode at once, and i/o is busy, you could be highly impacted, since you can have up to 40 i/o's outstanding for each raid set with my configuration option. However, even on machines with multiple raid5 sets, with 2 of them running in degraded mode, I've not seen a memory bottleneck. I don't recommend this, of course, but sometimes stuff happens. In any case, except for the potential memory utilization, there's no down side to setting this number in the kernel and not worrying about it anymore. In fact, this is what I do for all our machines around here regardless of whether the machine is hosting raid1 sets, raid5 sets or a combination of the two. -Brian
Re: high load, no bottleneck
Greg Oster os...@cs.usask.ca wrote: It's probably easier to do by raidctl right now. I'm not opposed to having RAIDframe grow a sysctl interface as well if folks think that makes sense. The 'openings' value is currently set on a per-RAID basis, so a sysctl would need to be able to handle individual RAID sets as well as overall configuration parameters. IMO raidctl makes more sense here, as it is the place where one is looking for RAID stuff. While I am there: fsck takes an infinite time while RAIDframe is rebuilding parity. I need to renice the raidctl process that does it in order to complete fsck. Would raising the outstanding write value also help here? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Thu, 19 Sep 2013 20:53:30 +0200 m...@netbsd.org (Emmanuel Dreyfus) wrote: Greg Oster os...@cs.usask.ca wrote: It's probably easier to do by raidctl right now. I'm not opposed to having RAIDframe grow a sysctl interface as well if folks think that makes sense. The 'openings' value is currently set on a per-RAID basis, so a sysctl would need to be able to handle individual RAID sets as well as overall configuration parameters. IMO raidctl makes more sense here, as it is the place where one is looking for RAID stuff. While I am there: fsck takes an infinite time while RAIDframe is rebuilding parity. I need to renice the raidctl process that does it in order to complete fsck. Would raising the outstanding write value also help here? Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... Later... Greg Oster
Re: high load, no bottleneck
On Thu, 19 Sep 2013 11:26:21 -0700 (PDT) Paul Goyette p...@whooppee.com wrote: On Thu, 19 Sep 2013, Brian Buhrow wrote: The line I include in my config files is: options RAIDOUTSTANDING=40 #try and enhance raid performance. Is this likely to have any impact on a system with multiple raid-1 mirrors? Yes, it would, provided you have more than 6 concurrent IOs to each RAID set.. Later... Greg Oster
Re: high load, no bottleneck
On Sep 19, 11:35am, buh...@nfbcal.org (Brian Buhrow) wrote: -- Subject: Re: high load, no bottleneck | Hello. the worst case scenario is when a raid set is running in | degraded mode. Greg sent me some notes on how to calculate the memory | utilization in this instance. I'll go dig them out and send them along in | a bit. In theory, if all your raid sets are in degraded mode at once, and | i/o is busy, you could be highly impacted, since you can have up to 40 | i/o's outstanding for each raid set with my configuration option. However, | even on machines with multiple raid5 sets, with 2 of them running in | degraded mode, I've not seen a memory bottleneck. I don't recommend this, | of course, but somethimes stuff happens. In any case, except for the | potential memory utilization, there's no down side to setting this number | in the kernel and not worrying about it anymore. In fact, this is what I | do for all our machines around here regardless of whether the machine is | hosting raid1 sets, raid5 sets or a combination of the two. If we are going to add a sysctl, we might also put a different value for the raid-degraded condition? Ideally I prefer if things autotuned, but that is much more difficult. christos
Re: high load, no bottleneck
Greg Oster os...@cs.usask.ca wrote: Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... You mean raidctl -M yes, right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Brian Buhrow buh...@nfbcal.org wrote: options RAIDOUTSTANDING=40 #try and enhance raid performance. I gave it a try, and even with RAIDOUTSTANDING set to 800 on a NetBSD-6.1 kernel, my stress test raises the load over 10 with -o log, whereas it remains below 1 without -o log. Therefore it must be something else. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: How often do these log flushes occur? On a 6.1 kernel with RAIDOUTSTANDING=800 and -o log. Stress test raises load to around 10. During the stress phase, what does iostat -D -x -w 1 show for the raid and for the components, especially in the time column? Here is an excerpt:
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        13.74    27   0.00    0.36
wd1           0.00     0   0.00    0.00        13.74    27   0.00    0.36
raid1         0.00     0   0.00    0.00        20.61    18   0.00    0.36
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.06    0.00        12.70   306   0.06    3.80
wd1           0.00     0   0.14    0.00        12.70   306   0.14    3.80
raid1         0.00     0   0.14    0.00        18.51   210   0.14    3.80
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.04    0.00        13.53   285   0.04    3.77
wd1           0.00     0   0.04    0.00        13.53   285   0.04    3.77
raid1         0.00     0   0.05    0.00        20.18   191   0.05    3.77
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.04    0.00        13.34   304   0.04    3.96
wd1          16.00     1   0.04    0.02        13.34   304   0.04    3.96
raid1        16.00     1   0.05    0.02        20.28   200   0.05    3.96
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.06    0.00        12.97   242   0.06    3.06
wd1           0.00     0   0.13    0.00        12.97   242   0.13    3.06
raid1         0.00     0   0.13    0.00        19.42   161   0.13    3.06
During the stress, do you see small synchronous writes in NFS traffic? Output of your first script:
2908 getattr
1140 lookup
969 access
273 fsstat
195 write
145 commit
110 create
102 setattr
94 remove
23 read
Output of your second script:
167 549 unstable
28 549 filesync
And here is the iostat without -o log (the load barely rises to 1):
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        14.86    48   0.00    0.69
wd1           0.00     0   0.00    0.00        14.86    48   0.00    0.69
raid1         0.00     0   0.00    0.00        22.30    32   0.00    0.69
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.07    0.00         7.75   227   0.07    1.72
wd1           0.00     0   0.06    0.00         7.75   227   0.06    1.72
raid1         0.00     0   0.09    0.00         7.85   224   0.09    1.72
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.01    0.00        14.08    50   0.01    0.69
wd1           2.00     1   0.01    0.00        14.08    50   0.01    0.69
raid1         2.00     1   0.02    0.00        32.64    22   0.02    0.69
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        10.40    20   0.00    0.20
wd1          16.00     1   0.05    0.02        10.40    20   0.05    0.20
raid1        16.00     1   0.05    0.02        14.86    14   0.05    0.20
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Fri, 20 Sep 2013 01:37:20 +0200 m...@netbsd.org (Emmanuel Dreyfus) wrote: Greg Oster os...@cs.usask.ca wrote: Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... You mean raidctl -M yes, right? Correct. Later... Greg Oster
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 03:34:19AM +0200, Emmanuel Dreyfus wrote: Christos Zoulas chris...@zoulas.com wrote: On large filesystems with many files fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL. But you just told me that I will need a fsck after crash now I am running with vfs.wapbl.flush_disk_cache=0 so I wonder if I should not just mount without -o log. What are WAPBL benefits when running with vfs.wapbl.flush_disk_cache=0? For a NFS server, I'm not sure there's any benefit ... -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: high load, no bottleneck
On Tue, Sep 17, 2013 at 09:48:49PM +0200, Emmanuel Dreyfus wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: In AHCI mode, you might be able to use ordered tags or force unit access (does SATA have this concept per command?) to force individual transactions or series of transactions out, rather than flushing out all the data every time just to get the metadata into the journal on-disk. But that would take some work on our ATA subsystem.
Re: high load, no bottleneck
Thor Lancelot Simon t...@panix.com wrote: In AHCI mode, you might be able to use ordered tags or force unit access (does SATA have this concept per command?) to force individual transactions or series of transactions out, rather than flushing out all the data every time just to get the metadata into the journal on-disk. But that would take some work on our ATA subsystem. On another machine, I already had performance problems with SATA controllers in IDE emulation mode (recognized as piixide), which disappeared when selecting AHCI mode in the BIOS (turning it to be recognized as ahcisata). I will report on this as soon as the server is idle enough to be rebooted. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Sep 18, 3:34am, m...@netbsd.org (Emmanuel Dreyfus) wrote: -- Subject: Re: high load, no bottleneck | Christos Zoulas chris...@zoulas.com wrote: | | On large filesystems with many files fsck can take a really long time after | a crash. In my personal experience power outages are much less frequent than | crashes (I crash quite a lot since I always fiddle with things). If you | don't care about fsck time, you don't need WAPBL. | | But you just told me that I will need a fsck after crash now I am | running with vfs.wapbl.flush_disk_cache=0 so I wonder if I should not | just mount without -o log. What are WAPBL benefits when running with | vfs.wapbl.flush_disk_cache=0? You *might* need an fsck after power loss. If you crash and the disk syncs then you should be OK if the disk flushed (which it probably did if you see syncing disks after the panic). christos
Re: high load, no bottleneck
Christos Zoulas chris...@zoulas.com wrote: You *might* need an fsck after power loss. If you crash and the disk syncs then you should be OK if the disk flushed (which it probably did if you see syncing disks after the panic). I am not sure I ever encountered a crash where syncing disks after panic did not lock up the machine forever :-) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: Switched to AHCI. Below is how the hard disks are discovered (the relevant raid is RAID1 on wd0 and wd1). In this setup, with vfs.wapbl.flush_disk_cache=1 I still get high loads, on both 6.0 and -current. I assume there must be something bad with WAPBL/RAIDframe.
ahcisata0 at pci0 dev 31 function 2: vendor 0x8086 product 0x2922 (rev. 0x02)
ahcisata0: interrupting at ioapic0 pin 17
ahcisata0: AHCI revision 1.20, 4 ports, 32 slots, CAP 0xe322ffe3<SXS,EMS,CCCS,PSC,SSC,PMD,SPM,ISS=0x2=Gen2,SCLO,SAL,SSNTF,SNCQ,S64A>
atabus2 at ahcisata0 channel 0
atabus3 at ahcisata0 channel 1
atabus4 at ahcisata0 channel 2
atabus5 at ahcisata0 channel 3
ahcisata0 port 0: device present, speed: 3.0Gb/s
ahcisata0 port 1: device present, speed: 3.0Gb/s
ahcisata0 port 2: device present, speed: 3.0Gb/s
ahcisata0 port 3: device present, speed: 3.0Gb/s
wd0 at atabus2 drive 0
wd0: WDC WD5000AAJS-55A8B0
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd1 at atabus3 drive 0
wd1: WDC WD5000AAJS-55A8B0
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(ahcisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd2 at atabus4 drive 0
wd2: WDC WD5000AAJS-55A8B0
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd3 at atabus5 drive 0
wd3: WDC WD5000AADS-00S9B0
wd3: drive supports 16-sector PIO transfers, LBA48 addressing
wd3: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd3(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads, on both 6.0 and -current. I assume there must be something bad with WAPBL/RAIDframe Everything up to and including 6.0 is broken in this respect. Thanks to hannken@, 6.1 does align journal flushes. How fast can you write to the file system in question? Does your NFS load include a large amount of small synchronous (filesync) write operations? The attached patch (by hannken@) and vfs.wapbl.verbose_commit=2 will tell you how long the journal flushes take. Don't activate (i.e. set verbose_commit) for longer than a few seconds without monitoring syslog size.
Index: vfs_wapbl.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_wapbl.c,v
retrieving revision 1.51.2.2
diff -u -r1.51.2.2 vfs_wapbl.c
--- vfs_wapbl.c 2 Jan 2013 23:23:15 -0000      1.51.2.2
+++ vfs_wapbl.c 18 Sep 2013 16:19:40 -0000
@@ -1456,6 +1456,8 @@
 	size_t flushsize;
 	size_t reserved;
 	int error = 0;
+	struct bintime start_time;
+	flushsize = 0;
 
 	/*
 	 * Do a quick check to see if a full flush can be skipped
@@ -1479,6 +1481,7 @@
 	 * if we want to call flush from inside a transaction
 	 */
 	rw_enter(&wl->wl_rwlock, RW_WRITER);
+	bintime(&start_time);
 	wl->wl_flush(wl->wl_mount, wl->wl_deallocblks,
 	    wl->wl_dealloclens, wl->wl_dealloccnt);
@@ -1712,6 +1715,24 @@
 	}
 #endif
 
+	if (wapbl_verbose_commit)
+	{
+		struct bintime d;
+		struct timespec ts;
+		int kbsec, msec;
+
+		bintime(&d);
+		bintime_sub(&d, &start_time);
+		bintime2timespec(&d, &ts);
+		msec = ts.tv_nsec / 1000000 + 1000 * ts.tv_sec;
+		if (msec == 0)
+			kbsec = 0;
+		else
+			kbsec = flushsize / msec;
+		printf("%s %zu bytes %d.%03d secs %d.%03d MB/sec\n",
+		    wl->wl_mount->mnt_stat.f_mntonname, flushsize,
+		    msec / 1000, msec % 1000, kbsec / 1000, kbsec % 1000);
+	}
 	rw_exit(&wl->wl_rwlock);
 	return error;
 }
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: How fast can you write to the file system in question? What test do you want me to perform? Does your NFS load include a large amount of small synchronous (filesync) write operations? Yes, I run 24 concurrent tar -czf as a test. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
EF How fast can you write to the file system in question? ED What test do you want me to perform? dd if=/dev/zero bs=64k EF Does your NFS load include a large amount of small synchronous (filesync) EF write operations? ED Yes, I run 24 concurrent tar -czf as a test. But those shouldn't do small synchronous writes, should they? Anyway, hannken@'s logging patch should reveal if slow log flushing is indeed the problem. P.S.: With us, the log flush alignment patch helped a lot, but a bunch of NFS clients running Thunderbird still locked up the file server. SQLite loves to do 4k sync writes, which kill WAPBL. I ended up writing a system that relocated the Mozilla profiles to local volatile storage, periodically syncing them to NFS. Oh, and Dropbox also uses SQLite.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: EF How fast can you write to the file system in question? ED What test do you want me to perform? dd if=/dev/zero bs=64k
helvede# dd if=/dev/zero bs=64k of=out count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 18.365 secs (35685270 bytes/sec)
Note I removed -o log. EF Does your NFS load include a large amount of small synchronous (filesync) EF write operations? ED Yes, I run 24 concurrent tar -czf as a test. But those shouldn't do small synchronous writes, should they? Indeed they should not. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Yes, I run 24 concurent tar -czf as a test. But those shouldn't do small synchronous writes, should they? Depends. Is the filesystem mounted noatime (or read-only)? If not, there are going to be atime updates, and don't all inode updates get done synchronously? Or am I misunderstanding something? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: 35685270 bytes/sec That's OK. Note I removed -o log Shouldn't make a difference, I think. I re-enabled -o log and did the dd test again on NetBSD 6.0 with the patch you posted and vfs.wapbl.verbose_commit=2
# dd if=/dev/zero bs=64k of=out count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 17.331 secs (37814321 bytes/sec)
kernel log:
wapbl_flush: 1379547406.828798896 this transaction = 168960 bytes
wapbl_cache_sync: 1: dev 0x1208 0.141346518
wapbl_cache_sync: 2: dev 0x1208 0.052974853
/ 168960 bytes 0.468 secs 0.361 MB/sec
wapbl_flush: 1379547409.299850220 this transaction = 323072 bytes
wapbl_cache_sync: 1: dev 0x120c 0.253761121
wapbl_cache_sync: 2: dev 0x120c 0.022943043
/home 323072 bytes 0.719 secs 0.449 MB/sec
wapbl_flush: 1379547417.023140298 this transaction = 136192 bytes
wapbl_cache_sync: 1: dev 0x1208 0.239226048
wapbl_cache_sync: 2: dev 0x1208 0.058130346
/ 136192 bytes 0.480 secs 0.283 MB/sec
wapbl_flush: 1379547420.514618291 this transaction = 338944 bytes
wapbl_cache_sync: 1: dev 0x120c 0.207321357
wapbl_cache_sync: 2: dev 0x120c 0.022987705
/home 338944 bytes 0.563 secs 0.602 MB/sec
Running my stress test, which drives load to insane values:
wapbl_flush: 1379547625.571507421 this transaction = 373760 bytes
wapbl_cache_sync: 1: dev 0x120c 0.099539954
wapbl_cache_sync: 2: dev 0x120c 0.009327683
/home 373760 bytes 0.132 secs 2.831 MB/sec
wapbl_flush: 1379547625.741582309 this transaction = 341504 bytes
wapbl_cache_sync: 1: dev 0x120c 0.136495561
wapbl_cache_sync: 2: dev 0x120c 0.009992682
/home 341504 bytes 0.168 secs 2.032 MB/sec
wapbl_flush: 1379547625.921656139 this transaction = 273408 bytes
wapbl_cache_sync: 1: dev 0x120c 0.147418701
wapbl_cache_sync: 2: dev 0x120c 0.010158968
/home 273408 bytes 0.178 secs 1.536 MB/sec
wapbl_flush: 1379547626.111737705 this transaction = 375808 bytes
wapbl_cache_sync: 1: dev 0x120c 0.116060286
wapbl_cache_sync: 2: dev 0x120c 0.006006430
/home 375808 bytes 0.159 secs 2.363 MB/sec
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Sep 17, 2013, at 5:39 PM, Emmanuel Dreyfus m...@netbsd.org wrote: It was suggested to me that this would be better posted on tech-kern. It happened on NetBSD-6.0, and I tried to upgrade the kernel to -current, with the same result. On Tue, Sep 17, 2013 at 12:54:59PM +0000, Emmanuel Dreyfus wrote: I have a NFS server that exhibits a high load (20-30) when supporting about 30 clients, while there is no apparent bottleneck: low disk activity, CPU idle most of the time, plenty of available RAM. Of course service is crappy, with many timeouts. Any hint of what can be going on? I found the bottleneck. ps does not show it because it happens within the different threads of nfsd. DDB tells me that almost all nfsd threads are waiting on tstile with this backtrace:
turnstile_block
rw_vector_enter
genfs_lock
VOP_LOCK
vn_lock
vget
ufs_ihashget
ffs_vget
ufs_fhtovp
VFS_FHTOVP
nfsrv_fhtovp
nfsrv_write
nfssvc_nfsd
sys_nfssvc
What are your clients doing? Which vnode(s) are your nfsd threads waiting on (first arg to vn_lock)? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: high load, no bottleneck
J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote: What are your clients doing? MacOS X machines opening sessions kill the server. I can reproduce the problem with just concurrent tar -czf on the NFS volume. Which vnode(s) are your nfsd threads waiting on (first arg to vn_lock)? Here is an example:
vn_lock(c5a24b08,2,0,c03a238e,4,c4ce9ed4,6,2ec4bcfb,3d90d5,0) at netbsd:vn_lock+0x7c
db{0}> show vnode c5a24b08
OBJECT 0xc5a24b08: locked=0, pgops=0xc0b185a8, npages=1720, refs=16
VNODE flags 0x4030<MPSAFE,LOCKSWORK,ONWORKLST>
mp 0xc4a14000 numoutput 0 size 0x6f writesize 0x6f
data 0xc5a25d74 writecount 0 holdcnt 2
tag VT_UFS(1) type VREG(1) mount 0xc4a14000 typedata 0xc4fe5480
v_lock 0xc5a24bac
db{0}> show ncache c5a24b08
name .nfs.2005108f.1174 parent user22
After some experiments, it seems -current is much more resistant: I cannot raise the load to 40 on it. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: db{0}> show vnode c5a24b08 OBJECT 0xc5a24b08: locked=0, pgops=0xc0b185a8, npages=1720, refs=16 VNODE flags 0x4030<MPSAFE,LOCKSWORK,ONWORKLST> mp 0xc4a14000 numoutput 0 size 0x6f writesize 0x6f data 0xc5a25d74 writecount 0 holdcnt 2 tag VT_UFS(1) type VREG(1) mount 0xc4a14000 typedata 0xc4fe5480 v_lock 0xc5a24bac While many threads are waiting, another nfsd thread holds the lock with this backtrace:
turnstile_block
rw_vector_enter
wapbl_begin
ffs_write
VOP_WRITE
nfsrv_write
nfssvc_nfsd
sys_nfssvc
syscall
I understand it is waiting for another process to complete I/O before it can enter the rwlock in wapbl_begin. I have a first-class suspect with this other nfsd thread, which is engaged in I/O:
sleepq_block
wdc_exec_command
wd_flushcache
wdioctl
bdev_ioctl
spec_ioctl
VOP_IOCTL
rf_sync_component_caches
raidioctl
bdev_ioctl
spec_ioctl
VOP_IOCTL
wapbl_cache_sync
Is it a nasty interaction between RAIDframe, NFS and WAPBL? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
In article 1l9czcn.y6kr35aruvzvm%m...@netbsd.org, Emmanuel Dreyfus m...@netbsd.org wrote:

> Emmanuel Dreyfus m...@netbsd.org wrote:
> > db{0} show vnode c5a24b08
> > OBJECT 0xc5a24b08: locked=0, pgops=0xc0b185a8, npages=1720, refs=16
> > VNODE flags 0x4030<MPSAFE,LOCKSWORK,ONWORKLST>
> > mp 0xc4a14000 numoutput 0 size 0x6f writesize 0x6f
> > data 0xc5a25d74 writecount 0 holdcnt 2
> > tag VT_UFS(1) type VREG(1) mount 0xc4a14000 typedata 0xc4fe5480
> > v_lock 0xc5a24bac
>
> While many threads are waiting, another nfsd thread holds that vnode lock and is itself blocked with this backtrace:
>
> turnstile_block rw_vector_enter wapbl_begin ffs_write VOP_WRITE nfsrv_write nfssvc_nfsd sys_nfssvc syscall
>
> I understand it is waiting for another process to complete I/O before it can acquire the rwlock in wapbl_begin.
>
> I have a first-class suspect in this other nfsd thread, which is engaged in I/O:
>
> sleepq_block wdc_exec_command wd_flushcache wdioctl bdev_ioctl spec_ioctl VOP_IOCTL rf_sync_component_caches raidioctl bdev_ioctl spec_ioctl VOP_IOCTL wapbl_cache_sync
>
> Is it a nasty interaction between RAIDframe, NFS and WAPBL?

My suggestion is to try:

sysctl -w vfs.wapbl.flush_disk_cache=0

for now...

christos
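If that workaround does the trick, it can be applied at runtime and made permanent; a small sketch, assuming the stock /etc/sysctl.conf mechanism that is processed at boot:

# Apply the workaround now and read the value back.
sysctl -w vfs.wapbl.flush_disk_cache=0
sysctl vfs.wapbl.flush_disk_cache
# Persist it across reboots.
echo vfs.wapbl.flush_disk_cache=0 >> /etc/sysctl.conf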
Re: high load, no bottleneck
It was suggested that this would be better posted on tech-kern. It happened on NetBSD-6.0, and I tried upgrading the kernel to -current, with the same result.

On Tue, Sep 17, 2013 at 12:54:59PM +, Emmanuel Dreyfus wrote:

> I have an NFS server that exhibits a high load (20-30) when serving about 30 clients, while there is no apparent bottleneck: low disk activity, CPU idle most of the time, plenty of available RAM. Of course service is crappy, with many timeouts. Any hint about what could be going on?

I found the bottleneck. ps does not show it because it happens within the different threads of nfsd. DDB tells me that almost all nfsd threads are waiting on tstile with this backtrace:

turnstile_block
rw_vector_enter
genfs_lock
VOP_LOCK
vn_lock
vget
ufs_ihashget
ffs_vget
ufs_fhtovp
VFS_FHTOVP
nfsrv_fhtovp
nfsrv_write
nfssvc_nfsd
sys_nfssvc

-- Emmanuel Dreyfus m...@netbsd.org
Re: high load, no bottleneck
On Sep 17, 9:48pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: high load, no bottleneck

| Excellent: the load does not go over 2 now (compared to 50).
|
| Thank you for saving my day. But now what happens?
| I note the SATA disks are in IDE emulation mode, and not AHCI. This is
| something I need to try changing:

What happens highly depends on the drive (how frequently it flushes its cache to disk internally and how long it keeps data in the cache), but it is never good. The best-case scenario would be that WAPBL writes are ordered properly and that a cache flush is only sent occasionally, between transactionally safe metadata commit points, but it seems that this is not happening (because we are getting too many flushes).

The case to worry about is the scenario where the machine suddenly loses power and the data never makes it to the physical media, getting lost from the cache. In this case you might end up with a filesystem that has inconsistent metadata, so the next reboot might end up causing a panic when the filesystem is used. The solution there is to reboot and force an fsck. If you have a UPS I would not worry too much about it; even if your system panics, the kernel should still issue the flush commands to the disk.

BTW, I hope that everyone realizes that WAPBL deals only with metadata and not the actual file data, so if you crash/lose power you typically end up with garbage in the active files (usually bits and pieces from other files, or NULs).

christos
Re: high load, no bottleneck
Christos Zoulas chris...@astron.com wrote:

> My suggestion is to try:
> sysctl -w vfs.wapbl.flush_disk_cache=0
> for now...

Excellent: the load does not go over 2 now (compared to 50).

Thank you for saving my day. But now what happens?

I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing:

piixide0 at pci0 dev 31 function 2: Intel 82801I Serial ATA Controller (ICH9) (rev. 0x02)
piixide0: bus-master DMA support present
piixide0: primary channel configured to compatibility mode
piixide0: primary channel interrupting at ioapic0 pin 14
piixide0: secondary channel configured to compatibility mode
piixide0: secondary channel interrupting at ioapic0 pin 15
atabus2 at piixide0 channel 0
wd0 at atabus2 drive 0
atabus3 at piixide0 channel 1
wd2 at atabus3 drive 0

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Christos Zoulas chris...@zoulas.com wrote:

> The case to worry about is the scenario where the machine suddenly loses power and the data never makes it to the physical media, getting lost from the cache. In this case you might end up with a filesystem that has inconsistent metadata, so the next reboot might end up causing a panic when the filesystem is used. The solution there is to reboot and force an fsck.

It seems the system would be better off without WAPBL enabled in this case. Is there any benefit left?

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Sep 18, 2:22am, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: high load, no bottleneck

| The case to worry about is the scenario where the machine
| suddenly loses power and the data never makes it to the physical media,
| getting lost from the cache. In this case you might end up with a
| filesystem that has inconsistent metadata, so the next reboot might
| end up causing a panic when the filesystem is used. The solution there
| is to reboot and force an fsck.
|
| It seems the system would be better off without WAPBL enabled in this case.
| Is there any benefit left?

On large filesystems with many files, fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL. Another easy thing you could try is to put the WAPBL log on a flash drive and re-enable the cache flushes.

christos
Re: high load, no bottleneck
hello. How do you move the WAPBL log to a drive other than the one on which the filesystem that's being logged resides? In other words, I thought the log lived on the same media as the filesystem. Is that not the case?

On Sep 17, 8:34pm, Christos Zoulas wrote:
} Subject: Re: high load, no bottleneck
} On Sep 18, 2:22am, m...@netbsd.org (Emmanuel Dreyfus) wrote:
} -- Subject: Re: high load, no bottleneck
}
} | The case to worry about is the scenario where the machine
} | suddenly loses power and the data never makes it to the physical media,
} | getting lost from the cache. In this case you might end up with a
} | filesystem that has inconsistent metadata, so the next reboot might
} | end up causing a panic when the filesystem is used. The solution there
} | is to reboot and force an fsck.
} |
} | It seems the system would be better off without WAPBL enabled in this case.
} | Is there any benefit left?
}
} On large filesystems with many files, fsck can take a really long time after
} a crash. In my personal experience power outages are much less frequent than
} crashes (I crash quite a lot since I always fiddle with things). If you
} don't care about fsck time, you don't need WAPBL. Another easy thing you could
} try is to put the WAPBL log on a flash drive and re-enable the cache flushes.
}
} christos
-- End of excerpt from Christos Zoulas
Re: high load, no bottleneck
Christos Zoulas chris...@zoulas.com wrote:

> On large filesystems with many files, fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL.

But you just told me that I will need an fsck after a crash now that I am running with vfs.wapbl.flush_disk_cache=0, so I wonder whether I should not just mount without -o log. What are WAPBL's benefits when running with vfs.wapbl.flush_disk_cache=0?

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 03:34:19AM +0200, Emmanuel Dreyfus wrote:
> Christos Zoulas chris...@zoulas.com wrote:
> > On large filesystems with many files, fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL.
>
> But you just told me that I will need an fsck after a crash now that I am running with vfs.wapbl.flush_disk_cache=0, so I wonder whether I should not just mount without -o log. What are WAPBL's benefits when running with vfs.wapbl.flush_disk_cache=0?

To the extent it's correct (which may vary), it's much faster. The downside is that without the cache flushing there's some chance that fsck won't be able to repair things afterwards. The only real solution is to figure out why it's being slow.

(As far as I know WAPBL doesn't support an external journal, and even if it did, putting one on an SSD that doesn't have power-failure recovery is worse than useless.)

-- David A. Holland dholl...@netbsd.org
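Given that risk, the conservative habit after any unclean shutdown while running with flushes disabled is to force a full check rather than trust the quick journal replay; a sketch, with /dev/rraid0e as a made-up name for the partition behind /home (on a logged FFS, fsck normally just replays the log, so -f is what forces the real pass):

# Boot single-user or unmount the filesystem first, then force a full
# check: -f checks even if the log replay says clean, -y auto-answers
# the repair questions.
umount /home
fsck_ffs -f -y /dev/rraid0e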
Re: high load, no bottleneck
David Holland dholland-t...@netbsd.org wrote:

> The downside is that without the cache flushing there's some chance that fsck won't be able to repair things afterwards.

This is scary. If this is a WAPBL/NFS/RAIDframe interaction, I think I'd rather dump the RAID than give up the insurance of a working fsck after a crash.

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org