Re: Distributed storage. Move away from char device ioctls.
Returning to this, since the block-based storage, which can act as a shared storage/transport layer, is ready with the fifth release of DST. A couple of notes on the proposed data distribution algorithm in the FS. On Sun, Sep 16, 2007 at 03:07:11AM -0400, Kyle Moffett ([EMAIL PROTECTED]) wrote: I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre-level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers). I really think that to get proper non-block-device-level filesystem redundancy you need to base it on something similar to the GIT model. Data replication is done in specific-sized chunks indexed by SHA-1 sum and you actually have a sort of merge algorithm for when local and remote changes differ. The OS would only implement a very limited list of merge algorithms, IE one of: (A) Don't merge, each client gets its own branch and merges are manual (B) Most recent changed version is made the master every X-seconds/open/close/write/other-event. (C) The tree at X (usually a particular client/server) is always used as the master when there are conflicts. This looks like a good way to work with offline clients (Coda comes to mind): after an offline node has modified data, it should be merged back into the cluster with different algorithms. Data that was supposed to be written to the failed node during its offline time will be resynced from the other nodes when the failed one comes back online; there are no problems and/or special algorithms needed here. Filesystem replication is not 100% the 'git way' - a git tree contains already-combined objects, i.e. the last blob for a given path does not contain its history, only ready-to-use data, while a filesystem, especially one which must accept simultaneous writes from different threads/nodes, should implement copy-on-write semantics, essentially putting all new data (a git commit) into a new location and then collecting it from different extents to present a ready file. At least that is how I see the filesystem I'm working on. ... There's a lot of other technical details which would need resolution in an actual implementation, but this is enough of a summary to give you the gist of the concept. Most likely there will be some major flaw which makes it impossible to produce reliably, but the concept contains the things I would be interested in for a real networked filesystem. Git semantics and copy-on-write actually have quite a lot in common (at a high enough level of abstraction), but a plain SHA-1 index is a problem in a filesystem - even besides the amount of data to be hashed before the key is ready, the key should also contain enough information about what the underlying data is - git does not store that information (tree, blob or whatever) in its keys, since it does not require it. At least that is how I see it being implemented. Overall I see this new project as a true copy-on-write FS. Thanks. Cheers, Kyle Moffett -- Evgeniy Polyakov
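As a rough sketch of the point about keys carrying more than a content hash (the structure and names below are purely illustrative assumptions, not taken from DST, git, or any real filesystem), a filesystem key might combine the hash with an object type and a logical location, so the lookup path knows what it found before fetching and parsing the object:

    #include <stddef.h>
    #include <stdint.h>
    #include <openssl/sha.h>    /* any SHA-1 implementation would do; OpenSSL's is assumed here */

    /* Hypothetical object classes a filesystem key could distinguish. */
    enum fs_obj_type {
            FS_OBJ_EXTENT,  /* raw file data extent */
            FS_OBJ_INODE,   /* inode/metadata block */
            FS_OBJ_DIR,     /* directory block */
    };

    /*
     * Unlike a bare git object id, this key carries the object type and the
     * logical offset of the extent, so a reader can tell what it is looking
     * at before the object body is fetched from a remote node.
     */
    struct fs_key {
            uint8_t  sha1[SHA_DIGEST_LENGTH];   /* content hash of the extent */
            uint32_t type;                      /* enum fs_obj_type */
            uint64_t offset;                    /* logical offset in the file */
    };

    static void fs_key_fill(struct fs_key *key, enum fs_obj_type type,
                            uint64_t offset, const void *data, size_t len)
    {
            SHA1(data, len, key->sha1);  /* hash only the payload; type/offset stay outside */
            key->type = type;
            key->offset = offset;
    }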
Re: Distributed storage. Security attributes and documentation update.
Hi! I'm pleased to announce third release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. How is this different from raid0/1 over nbd? Or raid0/1 over ata-over-ethernet? +| DST storate ---| storage? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Security attributes and documentation update.
Hi Pavel. On Mon, Sep 17, 2007 at 06:22:30PM +, Pavel Machek ([EMAIL PROTECTED]) wrote: I'm pleased to announce third release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. How is this different from raid0/1 over nbd? Or raid0/1 over ata-over-ethernet? I will repeat a quote I made for the previous release: It has a number of advantages, outlined in the first release and on the project homepage, namely: * non-blocking processing without busy loops (compared to iSCSI and NBD) * small, pluggable architecture * failover recovery (reconnect to remote target) * autoconfiguration * no additional allocations (not including the network part) - there are at least two in device mapper on the fast path * very simple - try to compare with iSCSI * works with different network protocols * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper, is synchronous and wants a special userspace thread) DST allows removing any node and then turning it back into the storage without breaking the dataflow; the DST core will reconnect automatically to failed remote nodes, and it allows working with detached devices just like with usual filesystems (as long as the device was not formed as part of a linear storage, since in that case meta information is spread between nodes). It does not require special processes for the network connection, everything is performed automatically by the DST core workers, and it allows exporting a new device, created on top of a mirror or linear combination of the others, which in turn can be formed on top of another, and so on... This was designed to allow creating a distributed storage with completely transparent failover recovery, with the ability to detach remote nodes from a mirror array to become standalone realtime backups (or snapshots) and turn them back into the storage without stopping the main device node. +| DST storate ---| storage? Yep, thanks. -- Evgeniy Polyakov
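To make the failover point concrete, here is a minimal userspace-style sketch of the kind of automatic reconnection described above (DST does this inside its kernel worker threads; the function name, backoff policy and use of plain sockets here are illustrative assumptions, not DST code):

    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Keep retrying a failed remote node with exponential backoff, capped at 30s. */
    static int node_reconnect(const char *addr, unsigned short port)
    {
            unsigned int delay = 1;

            for (;;) {
                    struct sockaddr_in sa = {
                            .sin_family = AF_INET,
                            .sin_port = htons(port),
                    };
                    int fd = socket(AF_INET, SOCK_STREAM, 0);

                    if (fd >= 0 && inet_pton(AF_INET, addr, &sa.sin_addr) == 1 &&
                        connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0)
                            return fd;      /* node is back: resume sending requests to it */

                    if (fd >= 0)
                            close(fd);
                    fprintf(stderr, "node %s:%u down, retrying in %us\n",
                            addr, (unsigned)port, delay);
                    sleep(delay);
                    if (delay < 30)
                            delay *= 2;     /* back off so a dead node is not hammered */
            }
    }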
Re: Distributed storage. Move away from char device ioctls.
On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote: On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote: Yes, block device itself is not able to scale well, but it is the place for redundancy, since filesystem will just fail if underlying device does not work correctly and FS actually does not know about where it should place redundancy bits - it might happen to be the same broken disk, so I created a low-level device which distribute requests itself. I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre-level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers). I really think that to get proper non-block-device-level filesystem redundancy you need to base it on something similar to the GIT model. Data replication is done in specific-sized chunks indexed by SHA-1 sum and you actually have a sort of merge algorithm for when local and remote changes differ. The OS would only implement a very limited list of merge algorithms, IE one of: (A) Don't merge, each client gets its own branch and merges are manual (B) Most recent changed version is made the master every X-seconds/open/close/write/other-event. (C) The tree at X (usually a particular client/server) is always used as the master when there are conflicts. This lets you implement whatever replication policy you want: You can require that some files are replicated (cached) on *EVERY* system, you can require that other files are cached on at least X systems. You can say this needs to be replicated on at least X% of the online systems, or at most Y. Moreover, the replication could be done pretty easily from userspace via a couple syscalls. You also automatically keep track of history with some default purge policy. The main point is that for efficiency and speed things are *not* always replicated; this also allows for offline operation. You would of course have userspace merge drivers which notice that the tree on your laptop is not a subset/superset of the tree on your desktop and do various merges based on per-file metadata. My address-book, for example, would have a custom little merge program which knows about how to merge changes between two address book files, asking me useful questions along the way. Since a lot of this merging is mechanical, some of the code from GIT could easily be made into a merge library which knows how to do such things. Moreover, this would allow me to have a shared root filesystem on my laptop and desktop. It would have 'sub-project'-type trees, so that / would be an independent branch on each system. /etc would be separate branches but manually merged git-style as I make changes. /home/* folders would be auto-created as separate subtrees so each user can version their own individually. Specific subfolders (like address-book, email, etc) would be adjusted by the GUI programs that manage them to be separate subtrees with manual-merging controlled by that GUI program. Backups/dumps/archival of such a system would be easy. You would just need to clone the significant commits/trees/etc to a DVD and replace the old SHA-1-indexed objects to tiny object-deleted stubs; to rollback to an archived version you insert the DVD, mount it into the existing kernel SHA-1 index, and then mount the appropriate commit as a read-only volume somewhere to access. 
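As a way of restating options (A)-(C) and the replication constraints above as data, here is a small hypothetical sketch (none of these names come from an existing filesystem; a per-subtree policy object of roughly this shape is what the userspace tools would manipulate):

    #include <stdint.h>

    /* The three conflict-resolution modes described above. */
    enum merge_policy {
            MERGE_MANUAL,           /* (A) per-client branches, merged by hand          */
            MERGE_NEWEST_WINS,      /* (B) most recently changed version becomes master */
            MERGE_MASTER_NODE,      /* (C) a designated node always wins on conflict    */
    };

    /* Per-subtree replication policy: "cached on at least X systems", etc. */
    struct subtree_policy {
            enum merge_policy merge;
            uint32_t min_replicas;          /* replicate on at least this many nodes      */
            uint32_t max_replicas;          /* and on at most this many (0 = unlimited)   */
            uint8_t  min_online_percent;    /* or: on at least this % of the online nodes */
            uint32_t master_node;           /* only meaningful for MERGE_MASTER_NODE      */
            uint32_t merge_interval_sec;    /* how often MERGE_NEWEST_WINS reconciles     */
    };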
The same procedure would also work for wide-area-network backups and such. The effective result would be the ability to do things like the following: (A) Have my homedir synced between both systems mostly-automatically as I make changes to different files on both systems (B) Easily have 2 copies of all my files, so if one system's disk goes kaput I can just re-clone from the other. (C) Keep archived copies of the last 5 years' worth of work, including change history, on a stack of DVDs. (D) Synchronize work between locations over a relatively slow link without much work. As long as files were indirectly indexed by sub-block SHA1 (with the index depth based on the size of the file), and each individually-SHA1-ed object could have references, you could trivially have a 4TB-sized file where you modify 4 bytes at a thousand random locations throughout the file and only have to update about 5MB worth of on-disk data. The actual overhead for that kind of operation under any existing filesystem would be 100% seek-dominated regardless, whereas with this mechanism you would not directly be overwriting data and so you could append all the updates as a single 5MB chunk. Data reads would be much more seek-y, but you could trivially have an on-line defragmenter tool which notices fragmented
Re: Distributed storage. Move away from char device ioctls.
On Sat, Sep 15, 2007 at 11:24:46AM -0600, Andreas Dilger ([EMAIL PROTECTED]) wrote: When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed the project (although the direction of btrfs development seems to move towards the zfs side with some, imho, questionable points, so I think I can jump on the wagon of new filesystems right now). This is an area I'm always a bit sad about in OSS development - the need everyone has to make a new {fs, editor, gui, etc} themselves instead of spending more time improving the work we already have. Imagine where the internet would be (or not) if there were 50 different network protocols instead of TCP/IP? If you don't like some things about btrfs, maybe you can fix them? If that were true, we would still be in the stone age. Or not - actually I think the first cell in the universe would not have bothered dividing into two if it could instead spend infinite time trying to make itself better. When some idea is implemented it is virtually impossible to change it, you can only recreate a new one with the issues fixed. So we have multiple ext, reiser and many others. I do not say btrfs is broken or has design problems, it is a really interesting filesystem, but we all have our own opinions about how things should be done, that's it. Btw, we do have so many network protocols for different purposes that the number of (storage) filesystems is negligibly small compared to it. The internet as it is popular today is just a subset of where networking is used. And we do invent new protocols each time we need something new which does not fit into existing models (for example, TCP by design can not work with very long-distance links with too long an RTT). We have sctp to fix some tcp issues. The number of IP-layer 'neighbours' is even larger. The physical media layer has many different protocols too. And that is just what exists in the linux tree... To be honest, developing a new filesystem that is actually widely useful and used is a very time consuming task (see Reiserfs and Reiser4). It takes many years before the code is reliable enough for people to trust it, so most likely any effort you put into this would be wasted unless you can come up with something that is dramatically better than something existing. Yep, I know. Wasting my time is one of the most pleasant things I ever tried in my life. The part that bothers me is that this same effort could have been used to improve something that more people would use (btrfs in this case). Of course, sometimes the new code is substantially better than what currently exists, and I think btrfs may have laid claim to the current generation of filesystems. Call me a greedy bastard, but I do not care about world happiness, it is just impossible to achieve. So I like what I do right now. If it ends up resting under a layer of dust I do not care, I like the process of creating, so if it fails, I will just gain new knowledge. :) Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -- Evgeniy Polyakov
Re: Distributed storage. Move away from char device ioctls.
Hi Jeff. On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik ([EMAIL PROTECTED]) wrote: Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far plans so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... :) I question the value of distributed block services (DBS), whether it's your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. Yes, the block device itself is not able to scale well, but it is the place for redundancy, since the filesystem will just fail if the underlying device does not work correctly, and the FS actually does not know where it should place its redundancy bits - they might happen to land on the same broken disk - so I created a low-level device which distributes requests itself. It is not allowed to mount it via multiple points; that is where a distributed filesystem must enter the show - multiple remote nodes export their devices via the network, each client gets the address of the remote node to work with, connects to it and processes requests. All those bits are already in DST; the next logical step is to connect it with a higher-layer filesystem. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. Well, originally (about half a year ago) I started to draft a generic filesystem which would be just superior to existing designs, not overbloated like zfs, and just faster. I do believe it can be implemented. Further, I added network capabilities into the design drafts, since I did not like what I saw at the time (AFS was proposed) - I'm not saying it is bad or anything like that at all, but I would implement things differently. When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed the project (although the direction of btrfs development seems to move towards the zfs side with some, imho, questionable points, so I think I can jump on the wagon of new filesystems right now). 
DST is the low level for my (so far theoretical) filesystem (actually its network part), like kevent was a low-level system for network AIO (originally). No matter which filesystem works over the network, it implements some kind of logic already present in DST. Sometimes it is very simple, sometimes a bit more complex, but eventually it is a network entity with parts of the stuff I put into DST. Since I postponed the project (looking at btrfs and its results), I completed DST as a standalone block device. So, essentially, a filesystem with simple distributed facilities is on (my) radar, but so far you are the first who requested it :) -- Evgeniy Polyakov
Re: Distributed storage. Move away from char device ioctls.
Hi Mike. On Fri, Sep 14, 2007 at 10:54:56PM -0400, Mike Snitzer ([EMAIL PROTECTED]) wrote: This distributed storage is very much needed; even if it were to act as a more capable/performant replacement for NBD (or MD+NBD) in the near term. Many high availability applications don't _need_ all the additional complexity of a full distributed filesystem. So given that, its discouraging to see you trying to gently push Evgeniy away from all the promising work he has published. Evgeniy, please continue your current work. Thanks Mike, I work on this and will until I feel it is completed. A distributed filesystem is a logical continuation of the whole idea of storing data on several remote nodes - DST and FS must exist together for maximum performance, but that does not mean the block layer itself should be abandoned. As you probably noticed from my mail to Jeff, distributed storage was originally part of the overall filesystem design. -- Evgeniy Polyakov
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. it's called Lustre. works well, scales well, is widely used, is GPL. sadly it's not in mainline. cheers, robin - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
Robin Humble wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. it's called Lustre. works well, scales well, is widely used, is GPL. sadly it's not in mainline. Lustre is tilted far too much towards high-priced storage, and needs improvement before it could be considered for mainline. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote: Robin Humble wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. it's called Lustre. works well, scales well, is widely used, is GPL. sadly it's not in mainline. Lustre is tilted far too much towards high-priced storage, many (most?) Lustre deployments are with SATA and md raid5 and GigE - can't get much cheaper than that. if you want storage node failover capabilities (which larger sites often do) or want to saturate an IB link then the price of the storage goes up but this is a consequence of wanting more reliability or performance, not anything to do with lustre. interestingly, one of the ways to provide dual-attached storage behind a failover pair of lustre servers (apart from buying SAS) would be via a networked-raid-1 device like Evgeniy's, so I don't see distributed block devices and distributed filesystems as being mutually exclusive. iSER (almost in http://stgt.berlios.de/) is also intriguing. and needs improvement before it could be considered for mainline. quite likely. from what I understand (hopefully I am mistaken) they consider a merge task to be too daunting as the number of kernel subsystems that any scalable distributed filesystem touches is necessarily large. roadmaps indicate that parts of lustre are likely to move to userspace (partly to ease solaris and ZFS ports) so perhaps those performance critical parts that remain kernel space will be easier to merge. cheers, robin - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote: Yes, block device itself is not able to scale well, but it is the place for redundancy, since filesystem will just fail if underlying device does not work correctly and FS actually does not know about where it should place redundancy bits - it might happen to be the same broken disk, so I created a low-level device which distribute requests itself. I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre-level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers). When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed project (although direction of the developement of the btrfs seems to move to the zfs side with some questionable imho points, so I think I can jump to the wagon of new filesystems right now). This is an area I'm always a bit sad about in OSS development - the need everyone has to make a new {fs, editor, gui, etc} themselves instead of spending more time improving the work we already have. Imagine where the internet would be (or not) if there were 50 different network protocols instead of TCP/IP? If you don't like some things about btrfs, maybe you can fix them? To be honest, developing a new filesystem that is actually widely useful and used is a very time consuming task (see Reiserfs and Reiser4). It takes many years before the code is reliable enough for people to trust it, so most likely any effort you put into this would be wasted unless you can come up with something that is dramatically better than something existing. The part that bothers me is that this same effort could have been used to improve something that more people would use (btrfs in this case). Of course, sometimes the new code is substantially better than what currently exists, and I think btrfs may have laid claim to the current generation of filesystems. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sep 15, 2007 12:20 -0400, Robin Humble wrote: On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote: Lustre is tilted far too much towards high-priced storage, many (most?) Lustre deployments are with SATA and md raid5 and GigE - can't get much cheaper than that. I have to agree - while Lustre CAN scale up to huge servers and fat pipes, it can definitely also scale down (which is a LOT easier to do :-). I can run a client + MDS + 5 OSTs in a single UML instance using loop devices for testing w/o problems. interestingly, one of the ways to provide dual-attached storage behind a failover pair of lustre servers (apart from buying SAS) would be via a networked-raid-1 device like Evgeniy's, so I don't see distributed block devices and distributed filesystems as being mutually exclusive. That is definitely true, and there are a number of users who run in this mode. We're also working to make Lustre handle the replication internally (RAID5/6+ at the OST level) so you wouldn't need any kind of block-level redundancy at all. I suspect some sites may still use RAID5/6 back-ends anyways to avoid the performance loss from taking out a whole OST due to a single disk failure, but that would definitely not be required. and needs improvement before it could be considered for mainline. It's definitely true, and we are always working at improving it. It used to be in the past that one of the reasons we DIDN'T want to go into mainline was because this would restrict our ability to make network protocol changes. Because our install base is large enough, and many of the large sites have multiple supercomputers mounting multiple global filesystems, we aren't at liberty to change the network protocol at will anymore. That said, we also have network protocol versioning that is akin to the ext3 COMPAT/INCOMPAT feature flags, so we are able to add/change features without breaking old clients. from what I understand (hopefully I am mistaken) they consider a merge task to be too daunting as the number of kernel subsystems that any scalable distributed filesystem touches is necessarily large. That's partly true - Lustre has its own RDMA RPC mechanism, but it does not need kernel patches anymore (we removed the zero-copy callback and do this at the protocol level because there was too much resistance to it). We are now also able to run a client filesystem that doesn't require any kernel patches, since we've given up on trying to get the intents and raw operations into the VFS, and have worked out other ways to improve the performance to compensate. Likewise with parallel directory operations. It's a bit sad, in a way, because these are features that other filesystems (especially network fs) could also have benefited from. roadmaps indicate that parts of lustre are likely to move to userspace (partly to ease solaris and ZFS ports) so perhaps those performance critical parts that remain kernel space will be easier to merge. This is also true - when that is done the only parts that will remain in the kernel are the network drivers. With some network stacks there is even direct userspace acceleration. We'll use RDMA and direct IO to avoid doing any user-kernel data copies. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
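For readers unfamiliar with the ext2/3-style COMPAT/INCOMPAT scheme Andreas refers to, the idea is roughly the following (a generic sketch, not Lustre's actual wire format or flag values): an implementation may ignore unknown COMPAT features, must refuse to proceed on unknown INCOMPAT features, and may fall back to read-only on unknown RO_COMPAT features.

    #include <stdint.h>
    #include <stdbool.h>

    /* Feature masks exchanged in a superblock or protocol handshake. */
    struct feature_set {
            uint32_t compat;        /* unknown bits are safe to ignore           */
            uint32_t ro_compat;     /* unknown bits: allow read-only access only */
            uint32_t incompat;      /* unknown bits: refuse to mount/connect     */
    };

    /* Bits this implementation understands (values are purely illustrative). */
    #define SUPP_COMPAT     0x0007u
    #define SUPP_RO_COMPAT  0x0003u
    #define SUPP_INCOMPAT   0x0001u

    static bool features_allow_use(const struct feature_set *fs, bool *read_only)
    {
            if (fs->incompat & ~SUPP_INCOMPAT)
                    return false;   /* the peer relies on features we do not have */

            *read_only = (fs->ro_compat & ~SUPP_RO_COMPAT) != 0;
            return true;            /* unknown COMPAT bits are harmless by definition */
    }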
Re: Distributed storage. Move away from char device ioctls.
Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far plans so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether it's your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. Jeff
Re: Distributed storage. Move away from char device ioctls.
Jeff Garzik wrote: Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far planes so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. This http://lkml.org/lkml/2007/8/12/159 may provide a fast-path to reaching that goal. Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic Cutting those out may still leave you with something pretty complicated, though. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic Cutting those out may still leave you with something pretty complicated, though. Far less complicated than NFSv4.1 though (which is easy :)) NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk another NFSv4.1 implementation. Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13 anyway), there is no guarantee at all that Linux will be able to store and retrieve data, since it's entirely possible that a proprietary protocol is required to access your data. NFSv4.1 is no longer a completely open architecture. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic Cutting those out may still leave you with something pretty complicated, though. Far less complicated than NFSv4.1 though (which is easy :)) One would hope so. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk another NFSv4.1 implementation. No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On 9/14/07, Jeff Garzik [EMAIL PROTECTED] wrote: Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far planes so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). This distributed storage is very much needed; even if it were to act as a more capable/performant replacement for NBD (or MD+NBD) in the near term. Many high availability applications don't _need_ all the additional complexity of a full distributed filesystem. So given that, its discouraging to see you trying to gently push Evgeniy away from all the promising work he has published. Evgeniy, please continue your current work. Mike - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk another NFSv4.1 implementation. No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? I just sorta assumed you could fall back to the NFSv4.0 mode of operation, going through the metadata server for all data accesses. But look at that choice in practice: you can either ditch pNFS completely, or use a proprietary solution. The market incentives are CLEARLY tilted in favor of makers of proprietary solutions. But it's a poor choice (really little choice at all). Overall, my main concern is that NFSv4.1 is no longer an open architecture solution. The no-pNFS or proprietary platform choice merely illustrate one of many negative aspects of this architecture. One of NFS's biggest value propositions is its interoperability. To quote some Wall Street guys, NFS is like crack. It Just Works. We love it. Now, for the first time in NFS's history (AFAIK), the protocol is no longer completely specified, completely known. No longer a closed loop. Private layout types mean that it is _highly_ unlikely that any OS or appliance or implementation will be able to claim full NFS compatibility. And when the proprietary portion of the spec involves something as basic as accessing one's own data, I consider that a fundamental flaw. NFS is no longer completely open. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sat, Sep 15, 2007 at 12:08:42AM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? I just sorta assumed you could fall back to the NFSv4.0 mode of operation, going through the metadata server for all data accesses. Right. So any two pNFS implementations *will* be able to talk to each other; they just may not be able to use the (possibly higher-bandwidth) read/write path that pNFS gives them. But look at that choice in practice: you can either ditch pNFS completely, or use a proprietary solution. The market incentives are CLEARLY tilted in favor of makers of proprietary solutions. I doubt somebody would go to all the trouble to implement pNFS and then present their customers with that kind of choice. But maybe I'm missing something. What market incentives do you see that would make that more attractive than either 1) using a standard fully-specified layout type, or 2) just implementing your own proprietary protocol instead of pNFS? Overall, my main concern is that NFSv4.1 is no longer an open architecture solution. The no-pNFS or proprietary platform choice merely illustrate one of many negative aspects of this architecture. It's always been possible to extend NFS in various ways if you want. You could use sideband protocols with v2 and v3, for example. People have done that. Some of them have been standardized and widely implemented, some haven't. You could probably add your own compound ops to v4 if you wanted, I guess. And there's advantages to experimenting with extensions first and then standardizing when you figure out what works. I wish it happened that way more often. Now, for the first time in NFS's history (AFAIK), the protocol is no longer completely specified, completely known. No longer a closed loop. Private layout types mean that it is _highly_ unlikely that any OS or appliance or implementation will be able to claim full NFS compatibility. Do you know of any such private layout types? This is kind of a boring argument, isn't it? I'd rather hear whatever ideas you have for a new distributed filesystem protocol. --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Security attributes and documentation update.
Hi Paul. On Mon, Sep 10, 2007 at 03:14:45PM -0700, Paul E. McKenney ([EMAIL PROTECTED]) wrote: Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * implement netlink based setup (simple) * new redundancy algorithm (complex) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst A couple questions below, but otherwise looks good from an RCU viewpoint. Thanx, Paul Thanks for your comments, and sorry for the late reply - I was on the KS/London trip.
+	if (--num) {
+		list_for_each_entry_rcu(n, &node->shared, shared) {
This function is called under rcu_read_lock() or similar, right? (Can't tell from this patch.) It is also OK to call it from under the update-side mutex, of course. Actually not, but it does not require it, since an entry can not be removed during this operation while the appropriate reference counter for the given node is being held. It should not be RCU at all.
+static int dst_mirror_read(struct dst_request *req)
+{
+	struct dst_node *node = req->node, *n, *min_dist_node;
+	struct dst_mirror_priv *priv = node->priv;
+	u64 dist, d;
+	int err;
+
+	req->bio_endio = dst_mirror_read_endio;
+
+	do {
+		err = -ENODEV;
+		min_dist_node = NULL;
+		dist = -1ULL;
+
+		/*
+		 * Reading is never performed from the node under resync.
+		 * If this will cause any troubles (like all nodes must be
+		 * resynced between each other), this check can be removed
+		 * and per-chunk dirty bit can be tested instead.
+		 */
+
+		if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) {
+			priv = node->priv;
+			if (req->start > priv->last_start)
+				dist = req->start - priv->last_start;
+			else
+				dist = priv->last_start - req->start;
+			min_dist_node = req->node;
+		}
+
+		list_for_each_entry_rcu(n, &node->shared, shared) {
I see one call to this function that appears to be under the update-side mutex, but I cannot tell if the other calls are safe. (Safe as in either under the update-side mutex or under rcu_read_lock() and friends.) The same here - those processing functions are called from generic_make_request() without any lock on top of them. Each node is linked into the list of the first added node, whose reference counter is increased in a higher layer. Right now there is no way to add or remove nodes after the array was started; such functionality requires the storage tree lock to be taken, and RCU can not be used there (since it requires sleeping, and I did not investigate sleepable RCU for this purpose). So, essentially, RCU is not used in DST :) Thanks for the review, Paul. -- Evgeniy Polyakov
Re: Distributed storage. Security attributes and documentation update.
On Thu, Sep 13, 2007 at 04:22:59PM +0400, Evgeniy Polyakov wrote: Hi Paul. On Mon, Sep 10, 2007 at 03:14:45PM -0700, Paul E. McKenney ([EMAIL PROTECTED]) wrote: Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * implement netlink based setup (simple) * new redundancy algorithm (complex) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst A couple questions below, but otherwise looks good from an RCU viewpoint. Thanx, Paul Thanks for your comments, and sorry for the late reply - I was on the KS/London trip.
+	if (--num) {
+		list_for_each_entry_rcu(n, &node->shared, shared) {
This function is called under rcu_read_lock() or similar, right? (Can't tell from this patch.) It is also OK to call it from under the update-side mutex, of course. Actually not, but it does not require it, since an entry can not be removed during this operation while the appropriate reference counter for the given node is being held. It should not be RCU at all. Ah! Yes, it is OK to use _rcu in this case, but should be avoided unless doing so eliminates duplicate code or some such. So, agree with dropping _rcu in this case.
+static int dst_mirror_read(struct dst_request *req)
+{
+	struct dst_node *node = req->node, *n, *min_dist_node;
+	struct dst_mirror_priv *priv = node->priv;
+	u64 dist, d;
+	int err;
+
+	req->bio_endio = dst_mirror_read_endio;
+
+	do {
+		err = -ENODEV;
+		min_dist_node = NULL;
+		dist = -1ULL;
+
+		/*
+		 * Reading is never performed from the node under resync.
+		 * If this will cause any troubles (like all nodes must be
+		 * resynced between each other), this check can be removed
+		 * and per-chunk dirty bit can be tested instead.
+		 */
+
+		if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) {
+			priv = node->priv;
+			if (req->start > priv->last_start)
+				dist = req->start - priv->last_start;
+			else
+				dist = priv->last_start - req->start;
+			min_dist_node = req->node;
+		}
+
+		list_for_each_entry_rcu(n, &node->shared, shared) {
I see one call to this function that appears to be under the update-side mutex, but I cannot tell if the other calls are safe. (Safe as in either under the update-side mutex or under rcu_read_lock() and friends.) The same here - those processing functions are called from generic_make_request() without any lock on top of them. Each node is linked into the list of the first added node, whose reference counter is increased in a higher layer. Right now there is no way to add or remove nodes after the array was started; such functionality requires the storage tree lock to be taken, and RCU can not be used there (since it requires sleeping, and I did not investigate sleepable RCU for this purpose). So, essentially, RCU is not used in DST :) Works for me! Use the right tool for the job! Thanks for the review, Paul. Thanx, Paul
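To spell out the distinction being agreed on above: list_for_each_entry_rcu() plus rcu_read_lock() is for readers that may race with concurrent removal, while a plain list walk is enough when a held reference guarantees the list cannot change under the walker, which is the DST situation. A schematic kernel-style sketch (the node structure and field names below are made up, not DST's actual types):

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/types.h>

    struct mirror_node {                    /* hypothetical node, loosely modelled on DST's */
            struct list_head shared;
            u64 last_start;
    };

    /* Readers that can race with concurrent removal: RCU protects the walk. */
    static u64 sum_starts_rcu(struct list_head *head)
    {
            struct mirror_node *n;
            u64 sum = 0;

            rcu_read_lock();
            list_for_each_entry_rcu(n, head, shared)
                    sum += n->last_start;
            rcu_read_unlock();
            return sum;
    }

    /*
     * Readers that hold a reference pinning the whole list (as DST does):
     * nothing can be removed meanwhile, so the plain primitive is the right tool.
     */
    static u64 sum_starts_refcounted(struct list_head *head)
    {
            struct mirror_node *n;
            u64 sum = 0;

            list_for_each_entry(n, head, shared)
                    sum += n->last_start;
            return sum;
    }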
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote: On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote: Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. I'm toying with the idea of keeping track of the maximum device stack depth for each stacked device, and only permitting it to increase in controlled circumstances. Hi Alasdair, What kind of circumstances did you have in mind? Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Daniel. On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote: Then, if of course you will want, which I doubt, you can reread previous mails and find that it was pointed to that race and possibilities to solve it way too long ago. What still bothers me about your response is that, while you know the race exists and do not disagree with my example, you don't seem to see that that race can eventually lock up the block device by repeatedly losing throttle counts which are never recovered. What prevents that? I posted a trivial hack with the possible errors pointed out and a question about whether it should be further extended (with the race fixed by any of the possible methods, and so on) or a new one should be developed (like in your approach, where only the high-level device is charged); instead I got replies that it contains bugs which will stop the system and kill the gene pool of mankind. I know how it works and where the problems are. And if we are going with this approach I will fix the pointed-out issues.
--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
In case you have a stacked device, this entry will be rewritten and you will lose all your accounting data. It is a weakness all right. Well,
-	if (q && q->metric) {
+	if (q && q->metric && !bio->queue) {
which fixes that problem. Maybe there is a better fix possible. Thanks for the catch! Yes, it should. The original conception was that this block throttling would apply only to the highest level submission of the bio, the one that crosses the boundary between filesystem (or direct block device application) and block layer. Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. We still have not come to a conclusion, but I do not want to start a flamewar; you believe that throttling must be done on the top-level device, so you need to extend the bio and convince others that the idea is worth it. The other principle we are trying to satisfy is that the throttling should not be released until bio->endio, which I am not completely sure about with the patch as modified above. Your earlier idea of having the throttle protection only cover the actual bio submission is interesting and may be effective in some cases, in fact it may cover the specific case of ddsnap. But we don't have to look any further than ddraid (distributed raid) to find a case it doesn't cover - the additional memory allocated to hold parity data has to be reserved until parity data is deallocated, long after the submission completes. So while you manage to avoid some logistical difficulties, it also looks like you didn't solve the general problem. The block layer does not know and should not be bothered with the underlying device's nature - if you think that the limit should not be recharged in the endio callback, then provide your own layer on top of the bio and call the endio callback only when you think it is ready to be completed. Hopefully I will be able to report on whether my patch actually works soon, when I get back from vacation. 
The mechanism in ddsnap that this is supposed to replace is effective; it is just ugly and tricky to verify. Regards, Daniel -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote: Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. I'm toying with the idea of keeping track of the maximum device stack depth for each stacked device, and only permitting it to increase in controlled circumstances. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote: Then, if of course you will want, which I doubt, you can reread previous mails and find that it was pointed to that race and possibilities to solve it way too long ago. What still bothers me about your response is that, while you know the race exists and do not disagree with my example, you don't seem to see that that race can eventually lock up the block device by repeatedly losing throttle counts which are never recovered. What prevents that?

--- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;

In case you have a stacked device, this entry will be rewritten and you will lose all your accounting data. It is a weakness all right. Well,

-	if (q && q->metric) {
+	if (q && q->metric && !bio->queue) {

which fixes that problem. Maybe there is a better fix possible. Thanks for the catch! The original conception was that this block throttling would apply only to the highest level submission of the bio, the one that crosses the boundary between filesystem (or direct block device application) and block layer. Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. The other principle we are trying to satisfy is that the throttling should not be released until bio->endio, which I am not completely sure about with the patch as modified above. Your earlier idea of having the throttle protection only cover the actual bio submission is interesting and may be effective in some cases, in fact it may cover the specific case of ddsnap. But we don't have to look any further than ddraid (distributed raid) to find a case it doesn't cover - the additional memory allocated to hold parity data has to be reserved until the parity data is deallocated, long after the submission completes. So while you manage to avoid some logistical difficulties, it also looks like you didn't solve the general problem. Hopefully I will be able to report on whether my patch actually works soon, when I get back from vacation. The mechanism in ddsnap that this is supposed to replace is effective; it is just ugly and tricky to verify. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 02:08:04PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote: On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! First, because the number of increase and decrease operations is the same, it will dance around the limit in both directions. No. Please go and read the description of the race again. A count gets irretrievably lost because the write operation of the first decrement is overwritten by the second. Data gets lost. Atomic operations exist to prevent that sort of thing. You either need to use them or have a deep understanding of SMP read and write ordering in order to preserve data integrity by some equivalent algorithm. I think you should complete your emotional email with a description of how atomic types are operated on and how processors access data. Just to give a lesson to those who never knew how SMP works, but create patches and have the conscience to send them and even discuss them. Then, if of course you will want, which I doubt, you can reread previous mails and find that it was pointed to that race and possibilities to solve it way too long ago. Anyway, I prefer to look like I do not know how SMP and atomic operations work and thus stay away from this discussion.

--- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;

In case you have a stacked device, this entry will be rewritten and you will lose all your accounting data.

+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}

-- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Say Evgeniy, something I was curious about but forgot to ask you earlier... On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: ...All operations are not atomic, since we do not care about the precise number of bios, but the fact that we are close or close enough to the limit. ... in bio->endio + q->bio_queued--; In your proposed patch, what prevents the race:

cpu1                          cpu2

read q->bio_queued
                              q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count. We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. Another possibility is to create a queue/device pointer in the bio structure to hold the original device and then add a callback to its backing dev structure to recalculate the limit, but that increases the size of the bio. Do we need this? Regards, Daniel -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote: On 7/31/07, Evgeniy Polyakov [EMAIL PROTECTED] wrote: TODO list currently includes the following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon based code will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) LDPC codes[1][2] have been replacing Turbo codes[3] with regards to communication links and we have been seeing that transition. (maybe helpful, came to mind seeing the mention of Turbo codes) Don't know how Weaver compares to LDPC, though I found some comparisons [4][5]. But looking at the fault tolerance figures, I guess Weaver is much better.

[1] http://www.ldpc-codes.com/
[2] http://portal.acm.org/citation.cfm?id=1240497
[3] http://en.wikipedia.org/wiki/Turbo_code
[4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf
[5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf

I've studied and implemented an LDPC encoder/decoder (hard-decoding belief propagation algo only, though) in userspace and found that such probabilistic codes are generally not suitable for redundant or distributed data storage, because of their per-bit nature and probabilistic error recovery. Interested readers can find a presentation on iterative decoding similar to Dr. Plank's, some of my analysis of the codes, and all sources at the project homepage and in my blog: http://tservice.net.ru/~s0mbre/old/?section=projects&item=ldpc So I consider Weaver codes the superior choice for distributed storage. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! First, because the number of increase and decrease operations is the same, it will dance around the limit in both directions. Second, I wrote about this race, and there are a number of ways to deal with it, from atomic operations to separate counters for in-flight and completed bios (which can be racy too, but from a different angle). Third, if people can not agree even on the much higher-level detail of whether the bio structure should be increased or not, how can we discuss details of a preliminary implementation with known issues? So I can not agree that the issue is fatal, but of course it exists, and it was highlighted. Let's solve problems in the order of their appearance. If the bio structure is allowed to grow, then the whole patch can be done better; if not, then there are issues with performance (although the more I think about it, the more I become sure that since a bio itself is very rarely shared, and thus requires cloning and allocation/freeing, which is itself a much more costly operation than atomic_sub/dec, it can safely host an additional atomic operation). -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
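The second alternative Evgeniy mentions - separate counters for submitted and completed bios - can be sketched roughly as below. This is only an illustration with made-up names, not code from any posted patch; as he notes it is racy too, only from a different angle, and the submission side would still need the queue lock or an atomic of its own if several submitters run concurrently.

/* Illustrative sketch: separate submission/completion counters instead of
 * one shared read-modify-write counter.  Completion only ever touches its
 * own counter, so it can no longer overwrite a submitter's store. */
struct bio_throttle {
	unsigned long	submitted;	/* advanced by submitters */
	unsigned long	completed;	/* advanced by the endio path */
	unsigned long	limit;
};

/* May be off by a few bios when read concurrently, which is the same kind
 * of imprecision Evgeniy already accepts for the non-atomic counter. */
static inline int bio_throttle_room(struct bio_throttle *t)
{
	return t->submitted - t->completed < t->limit;
}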
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote: On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Say Evgeniy, something I was curious about but forgot to ask you earlier... On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: ...All operations are not atomic, since we do not care about the precise number of bios, but the fact that we are close or close enough to the limit. ... in bio->endio + q->bio_queued--; In your proposed patch, what prevents the race:

cpu1                          cpu2

read q->bio_queued
                              q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count. We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! Another possibility is to create a queue/device pointer in the bio structure to hold the original device and then add a callback to its backing dev structure to recalculate the limit, but it increases the size of the bio. Do we need this? Different issue. Yes, I think we need a nice simple approach like that, and prove it is stable before worrying about the size cost. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote: On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! First, because the number of increase and decrease operations is the same, it will dance around the limit in both directions. No. Please go and read the description of the race again. A count gets irretrievably lost because the write operation of the first decrement is overwritten by the second. Data gets lost. Atomic operations exist to prevent that sort of thing. You either need to use them or have a deep understanding of SMP read and write ordering in order to preserve data integrity by some equivalent algorithm. Let's solve problems in the order of their appearance. If the bio structure is allowed to grow, then the whole patch can be done better. How about the patch below. This throttles any block driver by implementing a throttle metric method so that each block driver can keep track of its own resource consumption in units of its choosing. As an (important) example, it implements a simple metric for device mapper devices. Other block devices will work as before, because they do not define any metric. Short, sweet and untested, which is why I have not posted it until now. This patch originally kept its accounting info in backing_dev_info, however that structure seems to be in some flux and it is just a part of struct queue anyway, so I lifted the throttle accounting up into struct queue. We should be able to report on the efficacy of this patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/drivers/md/dm.c 2007-08-24 12:14:23.0 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }

+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *-----------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;

 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/fs/bio.c 2007-08-24 12:10:41.0 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}

-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);

 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/include/linux/bio.h 2007-08-24 11:53:51.0 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
+	struct request_queue	*queue;		/* for throttling */
+	unsigned		bi_reserved;
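The diff ends at the struct bio additions; the struct request_queue side (presumably an include/linux/blkdev.h hunk) is not among the quoted text. A minimal sketch of the fields the code above assumes, reconstructed from usage rather than copied from the original posting:

/* Sketch only: struct request_queue members that generic_make_request()
 * and bio_endio() above expect to exist.  Placement and exact layout in
 * the original patch are unknown. */
struct request_queue {
	/* ... existing fields ... */
	unsigned (*metric)(struct bio *bio);	/* bio cost in driver-chosen units; NULL = unthrottled */
	unsigned		capacity;	/* configured throttle limit */
	atomic_t		available;	/* units still available for new bios */
	wait_queue_head_t	throttle_wait;	/* submitters sleep here until units are released */
};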
Re: [1/1] Block device throttling [Re: Distributed storage.]
Say Evgeniy, something I was curious about but forgot to ask you earlier... On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: ...All operations are not atomic, since we do not care about the precise number of bios, but the fact that we are close or close enough to the limit. ... in bio->endio + q->bio_queued--; In your proposed patch, what prevents the race:

cpu1                          cpu2

read q->bio_queued
                              q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
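For readers following along, the race above is the classic lost update, and the minimal illustration below (standalone, field names made up, not from either poster's patch) shows both the racy counter and the atomic_t variant that closes the window:

/* Illustration only: the lost-update race Daniel describes. */
struct throttle_example {
	unsigned	bio_queued;		/* plain counter: racy */
	atomic_t	bio_queued_atomic;	/* atomic counter: safe */
};

/* Racy: two CPUs can both read the same value, each write value - 1,
 * and one decrement is silently lost, so the in-flight count stays
 * permanently too high and the device eventually stops admitting IO. */
static void endio_racy(struct throttle_example *t)
{
	t->bio_queued--;	/* separate read, modify, write */
}

/* Safe: the read-modify-write is one indivisible operation, so no
 * decrement can be overwritten by a concurrent one. */
static void endio_atomic(struct throttle_example *t)
{
	atomic_dec(&t->bio_queued_atomic);
}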
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. This is not a bug, this is a special kind of calculation - the total limit is the number of physical devices multiplied by their limits. It was done _on purpose_ to allow different devices to have different limits (for example in the distributed storage project it is possible to have both a remote and a local node in the same device, but the local device should not have _any_ limit at all, while the network one should). The virtual device essentially has _no_ limit. And that was done on purpose. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. This is not a bug, this is a special kind of calculation - the total limit is the number of physical devices multiplied by their limits. It was done _on purpose_ to allow different devices to have different limits (for example in the distributed storage project it is possible to have both a remote and a local node in the same device, but the local device should not have _any_ limit at all, while the network one should). The virtual device essentially has _no_ limit. And that was done on purpose. And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 04:13:10AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. This is not a bug, this is a special kind of calculation - the total limit is the number of physical devices multiplied by their limits. It was done _on purpose_ to allow different devices to have different limits (for example in the distributed storage project it is possible to have both a remote and a local node in the same device, but the local device should not have _any_ limit at all, while the network one should). The virtual device essentially has _no_ limit. And that was done on purpose. And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request(); if it queues the bio to a different device, then the same logic applies there. If the virtual device does not process the bio, its limit will always be recharged to the underlying devices, and the overall limit is equal to the number of physical devices (or devices which do process bios) multiplied by their limits. This does _work_, and I showed an example of how limits are processed and who will sleep and where. This solution is not a narrow fix; please check the examples I showed before. Regards, Daniel -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? generic_make_request() for the given block device. Example:

virt_device -> do_smth_with_bio -> bio_endio()
      |
     / \
 phys0   phys1

Each of the three devices above works with bios, each one eventually calls bio_endio(), and bio->bi_bdev will be one of the three devices above. Thus, when the system calls generic_make_request(bio->bi_bdev == virt_device), one of the three limits will be charged, depending on whether the virtual device forwards the bio to the physical devices or not. Actually the virtual device's limit will be charged first too, but if the bio is forwarded, its portion will be removed from the virtual device's limit. Now, if the virtual device allocates a bio itself (like device mapper does), then this new bio will be forwarded to the physical devices via generic_make_request() and thus it will sleep in the physical device's queue, if that queue is filled. So, if each of the three devices has a limit of 10 bios, then the actual number of bios in flight is at most 3 * 10, since each device will be charged up to _its own_ maximum limit, not the limit of the first device in the chain. So, if you set 10 for the virtual device and it can process bios itself (like sending them to the network), then that is the number of bios in flight which are processed by _this_ device and not forwarded further. The actual number of bios you can flush into the virtual device is its own limit plus the limits of all physical devices attached to it. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
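To make the charging scheme above concrete, here is a rough sketch - illustrative names only (dst_queue, dst_charge, ...), not code from DST or the posted patches - of a per-device in-flight limit in a stack: a bio is charged to whichever queue it is currently submitted to, so each device in the chain sleeps against its own limit.

struct dst_queue {
	atomic_t		inflight;	/* bios currently charged to this device */
	int			limit;		/* per-device limit, e.g. 10 */
	wait_queue_head_t	wait;
};

static void dst_charge(struct dst_queue *q)
{
	/* sleep until this particular device has room, then take one slot;
	 * like the patches in this thread, a small overshoot is tolerated
	 * when several submitters race past the check together */
	wait_event(q->wait, atomic_read(&q->inflight) < q->limit);
	atomic_inc(&q->inflight);
}

static void dst_release(struct dst_queue *q)
{
	atomic_dec(&q->inflight);
	wake_up(&q->wait);
}

/* A purely remapping virtual device calls dst_release() on its own queue
 * when it forwards the bio; the nested generic_make_request() then calls
 * dst_charge() on the physical device's queue.  With three devices of
 * limit 10 each, at most 3 * 10 bios are in flight - the arithmetic
 * Evgeniy uses above. */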
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote: On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? generic_make_request() for the given block device. Not good enough, that only makes one thread wait. Look here: http://lkml.org/lkml/2007/8/13/788 An unlimited number of threads can come in, each consuming resources of the virtual device, and violating the throttling rules. The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not see why that is a bad thing. It is a very bad thing, roughly like leaving some shared data outside a spin_lock/unlock. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 05:32:29AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote: On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? generic_make_request() for the given block device. Not good enough, that only makes one thread wait. Look here: http://lkml.org/lkml/2007/8/13/788 An unlimited number of threads can come in, each consuming resources of the virtual device, and violating the throttling rules. The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not Because it is charged to another device. No matter how many of them are chained, the limit is applied to the last device being used. So, if you have an unlimited number of threads, each one allocates a request and forwards it down to the low-level devices; each one will eventually sleep, but yes, each one _can_ allocate _one_ request before it goes to sleep. It is done to allow fine-grained limits, since some devices (like locally attached disks) do not require throttling. Here is an example with the threads you mentioned: http://article.gmane.org/gmane.linux.file-systems/17644 -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote: The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not Because it is charged to another device. Great. You charged the resource to another device, but you did not limit the amount of resources that the first device can consume. Which misses the whole point. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Mirroring to any number of devices.
On Aug 14 2007 20:29, Evgeniy Polyakov wrote: I'm pleased to announce second release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. I'll be quick: what is it good for, are there any users, and what could it have to do with DRBD and all the other distribution storage talk that has come up lately (namely NBD w/Raid1)? Jan -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Mirroring to any number of devices.
On Tue, Aug 14, 2007 at 07:20:49PM +0200, Jan Engelhardt ([EMAIL PROTECTED]) wrote: I'm pleased to announce second release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. I'll be quick: what is it good for, are there any users, and what could it have to do with DRBD and all the other distribution storage talk that has come up lately (namely NBD w/Raid1)? It has a number of advantages, outlined in the first release and on the project homepage, namely:

* non-blocking processing without busy loops (compared to iSCSI and NBD)
* small, pluggable architecture
* failover recovery (reconnect to remote target)
* autoconfiguration
* no additional allocations (not including the network part) - at least two in device mapper for the fast path
* very simple - try to compare with iSCSI
* works with different network protocols
* storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper, is synchronous and wants a special userspace thread)

Compared to DRBD, which is a mirroring of local requests to a remote node, and RAID on top of NBD, DST supports multiple remote nodes; it allows removing any of them and then turning them back into the storage without breaking the dataflow; the DST core will reconnect automatically to failed remote nodes; and it allows working with detached devices just like with usual filesystems (in case the device was not formed as a part of a linear storage, since in that case meta information is spread between nodes). It does not require special processes on behalf of the network connection - everything is performed automatically by the DST core workers - and it allows exporting a new device, created on top of a mirror or linear combination of the others, which in turn can be formed on top of another and so on... This was designed to allow creating a distributed storage with completely transparent failover recovery, with the ability to detach remote nodes from a mirror array to become standalone realtime backups (or snapshots) and turn them back into the storage without stopping the main device node. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Sunday 12 August 2007 22:36, I wrote: Note! There are two more issues I forgot to mention earlier. Oops, and there is also: 3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.

* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you can see the danger. In ddsnap we kludge around this problem by not throttling any bio submitted in PF_MEMALLOC mode, which effectively increases our reserve requirement by the amount of IO that the mm will submit to a given block device before deciding the device is congested and should be left alone. This works, but is sloppy and disgusting. The right thing to do is to make sure that the mm knows about our throttle accounting in backing_dev_info so it will not push IO to our device when it knows that the IO will just block on congestion. Instead, shrink_caches will find some other less congested block device or give up, causing alloc_pages to draw from the memalloc reserve to satisfy the sk_alloc request. The mm already uses backing_dev_info this way, we just need to set the right bits in the backing_dev_info state flags. I think Peter posted a patch set that included this feature at some point. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
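The ddsnap kludge described above can be sketched as a small variation of the throttling hunk quoted earlier in the thread. Field names (metric, available, throttle_wait, bi_reserved) follow that experimental patch; the whole thing is an illustration of the idea, not the actual ddsnap code.

/* Sketch only: bypass the throttle for bios submitted from memory
 * reclaim (PF_MEMALLOC), so writeout issued by shrink_caches cannot
 * block on the throttle and starve the allocation it is serving.
 * The price is that the reserve must cover the extra unthrottled IO. */
static void throttle_bio(struct request_queue *q, struct bio *bio)
{
	int need;

	if (!q->metric || (current->flags & PF_MEMALLOC))
		return;			/* reclaim path: never block here */

	need = bio->bi_reserved = q->metric(bio);
	bio->queue = q;
	wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
	atomic_sub(need, &q->available);
}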
Re: Distributed storage.
On Sun, Aug 12 2007, Daniel Phillips wrote: On Tuesday 07 August 2007 13:55, Jens Axboe wrote: I don't like structure bloat, but I do like nice design. Overloading is a necessary evil sometimes, though. Even today, there isn't enough room to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that concern can be scratched. If you read bio.h, that much is obvious. Sixteen bits in bi_rw are consumed by queue priority. Is there a reason this lives in struct bio instead of struct request? If you don't, you have to pass them down. You can make that very statement about basically any member of struct bio, until we end up with a submit_bio() path and down taking 16 arguments. If you check up on the iommu virtual merging, you'll understand the front and back size members. They may smell dubious to you, but please take the time to understand why it looks the way it does. Virtual merging is only needed at the physical device, so why do these fields live in struct bio instead of struct request? A bio does exist outside of a struct request, and bio buildup also happens before it gets attached to such. Changing the number of bvecs is integral to how bio buildup current works. Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. That would only be feasible, if we ruled that any bio in the system must originate from the standard pools. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Please suggest how to do such a cleanup. Incidentally, why does the bvl need to be memset to zero on allocation? bi_vcnt already tells you which bvecs are valid and the only field in a bvec that can reasonably default to zero is the offset, which ought to be set set every time a bvec is initialized anyway. We could probably skip that, but that's an unrelated subject. bi_destructor could be combined. I don't see a lot of users of bi_idx, bi_idx is integral to partial io completions. Struct request has a remaining submission sector count so what does bi_idx do that is different? Struct request has remaining IO count. You still need to know where to start in the bio. that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. So show some concrete patches and examples, hand waving and assumptions is just a waste of everyones time. Average struct bio memory footprint ranks near the bottom of the list of things that suck most about Linux storage. At idle I see 8K in use (reserves); during updatedb it spikes occasionally to 50K; under a heavy load generated by ddsnap on a storage box it sometimes goes to 100K with bio throttling in place. Really not moving the needle. Then, again, stop wasting time on this subject. Just because struct bio isn't a huge bloat is absolutely no justification for adding extra members to it. It's not just about system wide bloat. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the forseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. in short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. Then fix the damn vm writeout. 
I always thought it was silly to depend on the block layer for any sort of throttling. If it's not a system wide problem, then throttle the io count in the make_request_fn handler of that problematic driver. That said, I did a little analysis to get an idea of where the soft targets are in struct bio, and to get to know the bio layer a little better. Maybe these few hints will get somebody interested enough to look further. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) It's a trade off, otherwise you'd have to pass the block device around a lot. Which costs very little, probably less than trashing an extra field's worth of cache. Again, you can make that argument for most of the members. It's a non-starter. And it's, again, a design issue. A bio contains destination information, that means device/offset/size information. I'm all for shaving structure bytes where it matters, but not for the sake of sacrificing code
Re: Distributed storage.
On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Sun, Aug 12, 2007 at 11:44:00PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Sunday 12 August 2007 22:36, I wrote: Note! There are two more issues I forgot to mention earlier. Oops, and there is also: 3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.

* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged

If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Daniel. On Sun, Aug 12, 2007 at 04:16:10PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Your patch is close to the truth, but it needs to throttle at the top (virtual) end of each block device stack instead of the bottom (physical) end. It does head in the direction of eliminating your own deadlock risk indeed, however there are block devices it does not cover. I decided to limit physical devices just because any limit on top of a virtual one is not correct. When the system recharges a bio from a virtual device to a physical one, and the latter is full, the virtual device will not accept any new blocks for that physical device, but can accept them for other ones. That was done specially to allow fair use of network and physical storage. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 00:28, Jens Axboe wrote: On Sun, Aug 12 2007, Daniel Phillips wrote: Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. That would only be feasible, if we ruled that any bio in the system must originate from the standard pools. Not at all. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Please suggest how to do such a cleanup. Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a bio from the standard pool. Just put the power of two size in the bits and map that number to the standard pool arrangement with a table lookup. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the foreseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. In short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. Then fix the damn vm writeout. I always thought it was silly to depend on the block layer for any sort of throttling. If it's not a system wide problem, then throttle the io count in the make_request_fn handler of that problematic driver. It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. OK, we got one and another is close to cracking, enough of that. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Which, as written above and stated many times over the years on lkml, is not a block layer issue imho. Whoever stated that was wrong, but this should be no surprise. There have been many wrong things said about this particular bug over the years. The one thing that remains constant is, Linux continues to deadlock under a variety of loads both with and without network involvement, making it effectively useless as a storage platform. These deadlocks are first and foremost block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
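The table-lookup idea Daniel floats near the top of this message would look roughly like the sketch below. It assumes the bio really did come from one of the standard bvec pools; the sizes mirror the bvec_slabs[] arrangement in 2.6.22 fs/bio.c, and none of this is from an actual patch.

/* Sketch: derive the maximum vector count from the BIO_POOL bits in
 * bi_flags instead of storing bi_max_vecs in every bio.  Bios with
 * privately allocated vectors would still need separate handling. */
static const unsigned bvec_pool_size[] = { 1, 4, 16, 64, 128, BIO_MAX_PAGES };

static inline unsigned bio_max_vecs(struct bio *bio)
{
	return bvec_pool_size[BIO_POOL_IDX(bio)];
}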
Re: Distributed storage.
On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
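Daniel's scheme, as far as it can be read from this exchange, would look something like the sketch below: bio_put() makes a second ->bi_end_io call once the reference count reaches zero, and the handler tells a completion apart from a destruction by looking at bi_cnt. The handler and pool names are invented for illustration; this is not how mainline bio_put() works, and Jens's objections follow in the next messages.

static struct bio_set *my_bio_set;	/* hypothetical private pool */

/* Sketch: bio_put() reuses ->bi_end_io as the destructor. */
static void example_bio_put(struct bio *bio)
{
	if (atomic_dec_and_test(&bio->bi_cnt)) {
		bio->bi_next = NULL;
		bio->bi_end_io(bio, 0, 0);	/* count is now zero: "destroy" call */
	}
}

static int example_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
	if (!atomic_read(&bio->bi_cnt)) {
		/* the destroy call from bio_put(): do the destructor's job,
		 * e.g. give the bio back to its pool */
		bio_free(bio, my_bio_set);
		return 0;
	}
	if (bio->bi_size)
		return 1;			/* partial completion, not done yet */
	/* normal completion handling for the bio's owner goes here */
	return 0;
}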
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:28, Jens Axboe wrote: On Sun, Aug 12 2007, Daniel Phillips wrote: Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. That would only be feasible, if we ruled that any bio in the system must originate from the standard pools. Not at all. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Please suggest how to do such a cleanup. Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a bio from the standard pool. Just put the power of two size in the bits and map that number to the standard pool arrangement with a table lookup. So reserve a bit that tells you how to interpret the (now) 3 remaining bits. Doesn't sound very pretty, does it? On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the forseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. in short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. Then fix the damn vm writeout. I always thought it was silly to depend on the block layer for any sort of throttling. If it's not a system wide problem, then throttle the io count in the make_request_fn handler of that problematic driver. It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. Is it? Consider just 10 standard sata disks. The next kernel revision will have sg chaining support, so that allows 32MiB per request. Even if we disregard reads (not so interesting in this discussion) and just look at potentially pinned dirty data in a single queue, that number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch. So I still think that this throttling needs to happen elsewhere, you cannot rely the block layer throttling globally or for a single device. It just doesn't make sense. You did not comment on the one about putting the bio destructor in the -endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. OK, we got one and another is close to cracking, enough of that. No we did not, I already failed this one in the next mail. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Which, as written above and stated many times over the years on lkml, is not a block layer issue imho. 
Whoever stated that was wrong, but this should be no surprise. There have been many wrong things said about this particular bug over the years. The one thing that remains constant is, Linux continues to deadlock under a variety of loads both with and without network involvement, making it effectively useless as a storage platform. These deadlocks are first and foremost, block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. The block layer has NEVER guaranteed throttling, so it can - by definition - not be a block layer deficiency. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
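A quick worked check of the numbers Jens uses earlier in this message (32 MiB per request with sg chaining, 4 GiB pinned per disk, 40 GiB for ten disks), under the assumption - not stated in the mail - that each queue holds the default 128 requests (nr_requests):

\[ 128 \times 32\,\mathrm{MiB} = 4\,\mathrm{GiB}\ \text{per queue}, \qquad 10 \times 4\,\mathrm{GiB} = 40\,\mathrm{GiB}. \]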
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. You can't be serious? You'd stall end io completion notification because someone holds a reference to a bio. Surely you jest. Needless to say, that will never go in. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. This is not a very good solution, since it requires all users of the bios to know how to free them. Right now it is hidden. And it adds an additional atomic check (although reading is quite fast) in the end_io path. And for what purpose? To save 8 bytes on a 64-bit platform? That will not reduce its size noticeably, so the same number of bios will fit into the cache's page, so what is the gain? All these cleanups and logic complications should only be performed if, after the size shrink, an increased number of bios can fit into the cache's page - will that be the case after such cleanups? -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 02:13, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. You can't be serious? You'd stall end io completion notification because someone holds a reference to a bio. Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy; this is trivial. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 02:13, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the -endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely seperate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls -endio instead of the destructor. The -endio sees that the count is zero and destroys the bio. You can't be serious? You'd stall end io completion notification because someone holds a reference to a bio. Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy, this is trivial. Sorry Daniel, but your suggestions would do nothing more than uglify the code and design. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 03:06, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy, this is trivial. Sorry Daniel, but your suggestions would do nothing more than uglify the code and design. Pretty much exactly what was said about shrinking struct page, ask Bill. The difference was, shrinking struct page actually mattered whereas shrinking struct bio does not, and neither does expanding it by a few bytes. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 02:18, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: But that idea fails as well, since reference counts and IO completion are two completely seperate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls -endio instead of the destructor. The -endio sees that the count is zero and destroys the bio. This is not a very good solution, since it requires all users of the bios to know how to free it. No, only the specific -endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. Right now it is hidden. And adds additional atomic check (although reading is quite fast) in the end_io. Actual endio happens once in the lifetime of the transfer, this read will be entirely lost in the noise. And for what purpose? To eat 8 bytes on 64bit platform? This will not reduce its size noticebly, so the same number of bios will be in the cache's page, so what is a gain? All this cleanups and logic complicatins should be performed only if after size shring increased number of bios can fit into cache's page, will it be done after such cleanups? Well, exactly, My point from the beginning was that the size of struct bio is not even close to being a problem and adding a few bytes to it in the interest of doing the cleanest fix to a core kernel bug is just not a dominant issue. I suppose that leaving out the word bloated and skipping straight to the doesn't matter proof would have saved some bandwidth. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 03:06, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy, this is trivial. Sorry Daniel, but your suggestions would do nothing more than uglify the code and design. Pretty much exactly what was said about shrinking struct page, ask Bill. The difference was, shrinking struct page actually mattered whereas shrinking struct bio does not, and neither does expanding it by a few bytes. Lets back this up a bit - this whole thing began with you saying that struct bio was bloated already, which I said wasn't true. You then continued to hand wave your wave through various suggestions to trim the obvious fat from that structure, none of which were nice or feasible. I never compared the bio to struct page, I'd obviously agree that shrinking struct page was a worthy goal and that it'd be ok to uglify some code to do that. The same isn't true for struct bio. And we can expand struct bio if we have to, naturally. And I've done it before, which I wrote in the initial mail. I just don't want to do it casually, then it WILL be bloated all of a sudden. Your laissez faire attitude towards adding members to struct bio oh I'll just add it and someone less lazy than me will fix it up in the future makes me happy that you are not maintaining anything that I use. I'll stop replying to your mails until something interesting surfaces. I've already made my points clear about both the above and the throttling. And I'd advise you to let Evgeniy take this forward, he seems a lot more adept to actually getting CODE done and - at least from my current and past perspective - is someone you can actually have a fruitful conversation with. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 03:22, Jens Axboe wrote: I never compared the bio to struct page, I'd obviously agree that shrinking struct page was a worthy goal and that it'd be ok to uglify some code to do that. The same isn't true for struct bio. I thought I just said that. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: Oops, and there is also:
3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.
* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged
If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. This is not theoretical, but a testable, repeatable result. Instructions to reproduce should show up tomorrow. This bug is now solved in a kludgy way. Now, Peter's patch set offers a much cleaner way to fix this little problem, along with at least one other nasty that it already fixed. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: This is not a very good solution, since it requires all users of the bios to know how to free it. No, only the specific -endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. That is what I meant - there will be no way to just alloc a bio and put it, helpers for generic bio sets must be exported and each and every bi_end_io() must be changed to check reference counter and they must know how they were allocated. Right now it is hidden. And adds additional atomic check (although reading is quite fast) in the end_io. Actual endio happens once in the lifetime of the transfer, this read will be entirely lost in the noise. Not always. Sometimes it is called multiple times, but all bi_end_io() callbacks I checked (I believe all in mainline tree) tests if bi_size is zero or not. Endio callback is of course quite rare and additional atomic reading will not kill the system, but why introduce another read? It is possible to provide a flag for endio callback that it is last, but it still requires to change every single callback - why do we want this? And for what purpose? To eat 8 bytes on 64bit platform? This will not reduce its size noticebly, so the same number of bios will be in the cache's page, so what is a gain? All this cleanups and logic complicatins should be performed only if after size shring increased number of bios can fit into cache's page, will it be done after such cleanups? Well, exactly, My point from the beginning was that the size of struct bio is not even close to being a problem and adding a few bytes to it in the interest of doing the cleanest fix to a core kernel bug is just not a dominant issue. So, I'm a bit lost... You say it is too big and some parts can be removed or combined, and then that size does not matter. Last/not-last checks in the code is not clear design, so I do not see why it is needed at all if not for size shrinking. I suppose that leaving out the word bloated and skipping straight to the doesn't matter proof would have saved some bandwidth. :) Likely it will. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote: On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: (previous incomplete message sent accidentally) On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires to replace all occurence of bio-bi_bdev = something_new with blk_set_bdev(bio, somthing_new), where queue limits will be appropriately charged. So far I'm testing second case, but I only changed DST for testing, can change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) No. Since all requests for virtual device end up in physical devices, which have limits, this mechanism works. Virtual device will essentially call either generic_make_request() for new physical device (and thus will sleep is limit is over), or will process bios directly, but in that case it will sleep in generic_make_request() for virutal device. What can happen is, as soon as you unthrottle the previous queue, another thread can come in and put another request on it. Sure, that thread will likely block on the physical throttle and so will the rest of the incoming threads, but it still allows the higher level queue to grow past any given limit, with the help of lots of threads. JVM for example? Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Anyway, your approach is awfully close to being airtight, there is just a small hole. I would be more than happy to be proved wrong about that, but the more I look, the more I see that hole. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only It does not matter - we can count bytes, pages, bio vectors or whatever we like, its just a matter of counter and can be changed without problem. Quite true. In some cases the simple inc/dec per bio works just fine. But the general case where finer granularity is required comes up in existing code, so there needs to be a plan. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 04:03, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: This is not a very good solution, since it requires all users of the bios to know how to free it. No, only the specific -endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. That is what I meant - there will be no way to just alloc a bio and put it, helpers for generic bio sets must be exported and each and every bi_end_io() must be changed to check reference counter and they must know how they were allocated. There are fewer non-generic bio allocators than you think. Endio callback is of course quite rare and additional atomic reading will not kill the system, but why introduce another read? It is possible to provide a flag for endio callback that it is last, but it still requires to change every single callback - why do we want this? We don't. Struct bio does not need to be shrunk. Jens wanted to talk about what fields could be eliminated if we wanted to shrink it. It is about time to let that lie, don't you think? So, I'm a bit lost... You say it is too big Did not say that. and some parts can be removed or combined True. and then that size does not matter. Also true, backed up by numbers on real systems. Last/not-last checks in the code is not clear design, so I do not see why it is needed at all if not for size shrinking. Not needed, indeed. Accurate throttling is needed. If the best way to throttle requires expanding struct bio a little then we should not let concerns about the cost of an int or two stand in the way. Like Jens, I am more concerned about the complexity cost, and that is minimized in my opinion by throttling in the generic code rather than with custom code in each specialized block driver. Your patch does throttle in the generic code, great. Next thing is to be sure that it completely closes the window for reserve leakage, which is not yet clear. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: Oops, and there is also:
3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.
* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged
If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. How is that ever possible? The whole idea of throttling is to remove such situations, and now you say it cannot be solved. If the limit is 1GB of pending block IO, and the system has for example 2GB of RAM (or any other reasonable parameters), then there is no way we can deadlock in allocation, since it will not force the page reclaim mechanism. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: Oops, and there is also:
3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.
* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged
If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. How is that ever possible? The whole idea of throttling is to remove such situations, and now you say it cannot be solved. It was solved, by not throttling writeout that comes from shrink_caches. Ugly. If the limit is 1GB of pending block IO, and the system has for example 2GB of RAM (or any other reasonable parameters), then there is no way we can deadlock in allocation, since it will not force the page reclaim mechanism. The problem is that sk_alloc (called from our block driver via socket->write) would recurse into shrink_pages, which recursively submits IO to our block driver and blocks on the throttle. Subtle indeed, and yet another demonstration of why vm recursion is a Bad Thing. I will find a traceback for you tomorrow, which makes this deadlock much clearer. Regards - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
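For reference, one common shape of the kludge being referred to (exempting reclaim-driven writeout from the throttle so the recursion above cannot block on it) is sketched below. The helper name and its placement are assumptions for illustration; this is not the actual ddsnap fix.

/*
 * Hypothetical sketch: never throttle IO submitted from the memory reclaim
 * path itself, since that IO is what eventually frees memory and unwedges
 * the system.
 */
static int bio_should_throttle(request_queue_t *q)
{
	if (current->flags & (PF_MEMALLOC | PF_KSWAPD))
		return 0;		/* reclaim writeout: let it through */

	return q->bio_limit != -1U;	/* otherwise obey the configured limit */
}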
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 04:18:03AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: No. Since all requests for a virtual device end up in physical devices, which have limits, this mechanism works. The virtual device will essentially call either generic_make_request() for the new physical device (and thus will sleep if the limit is exceeded), or will process bios directly, but in that case it will sleep in generic_make_request() for the virtual device. What can happen is, as soon as you unthrottle the previous queue, another thread can come in and put another request on it. Sure, that thread will likely block on the physical throttle and so will the rest of the incoming threads, but it still allows the higher level queue to grow past any given limit, with the help of lots of threads. JVM for example? No. You get one slot, and one thread will not be blocked, all others will. If the lucky thread wants to put two requests it will be blocked on the second request, since the underlying physical device does not accept requests anymore and thus the caller will sleep. Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Each thread will sleep in generic_make_request(); if the limit is specified correctly, the allocated number of bios will be enough to make progress. Here is an example: let's say the system has 20,000 pages in RAM and 20,000 in swap, and we have 8,000 threads, each of which allocates a page, then the next page and so on. The system has one virtual device with two physical devices under it, and each device gets half of the requests. We set the limit to 4,000 per physical device. All threads allocate a page and queue it to the devices, so all threads succeed in their first allocation, and each device has its queue full. The virtual device does not have a limit (or has it at 4,000 too, but since it was recharged each time, it has zero blocks in flight). A new thread tries to allocate a page; it is allocated and queued to one of the devices, but since that queue is full, the thread sleeps. So will every other one. Thus we end up with 8,000 requests allocated and queued, and 8,000 in flight, 16,000 in total, which is smaller than the amount of pages in RAM, so we are happy. Consider the above as a special kind of calculation, i.e. the number of _allocated_ pages is always the number of physical devices multiplied by each one's in-flight limit. By adjusting the in-flight limit and knowing the number of devices it is completely possible to eliminate the vm deadlock. If you do not like such a calculation, the solution is trivial: we can sleep _after_ ->make_request_fn() in generic_make_request() until the number of in-flight bios is reduced by bio_endio(). -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
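The arithmetic in the example above can be checked mechanically; the small standalone program below simply restates it, using the numbers assumed in the mail (20,000 pages of RAM, 8,000 threads, two physical devices, 4,000 in-flight bios per device).

/* Quick check of the worst-case pinned-page bound from the example. */
#include <stdio.h>

int main(void)
{
	unsigned int ram_pages  = 20000;
	unsigned int threads    = 8000;
	unsigned int phys_devs  = 2;
	unsigned int dev_limit  = 4000;		/* in-flight bios per physical device */

	/* Pages pinned by bios actually in flight at the physical devices. */
	unsigned int in_flight  = phys_devs * dev_limit;

	/* Worst case: every thread has also allocated one page that is queued
	 * or about to be queued before the submitter goes to sleep. */
	unsigned int worst_case = in_flight + threads;

	printf("in flight: %u, worst case pinned: %u, RAM: %u -> %s\n",
	       in_flight, worst_case, ram_pages,
	       worst_case < ram_pages ? "no reclaim pressure" : "can deadlock");
	return 0;
}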
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 05:18:14AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: If the limit is 1GB of pending block IO, and the system has for example 2GB of RAM (or any other reasonable parameters), then there is no way we can deadlock in allocation, since it will not force the page reclaim mechanism. The problem is that sk_alloc (called from our block driver via socket->write) would recurse into shrink_pages, which recursively submits IO to our block driver and blocks on the throttle. Subtle indeed, and yet another demonstration of why vm recursion is a Bad Thing. I will find a traceback for you tomorrow, which makes this deadlock much clearer. I see how it can happen, but device throttling is a solution we are trying to complete, whose main aim _is_ to remove this problem. Lower the per-device limit, so that the rest of the RAM is available for all the data structures needed in the network path. The above example just has 1GB of RAM, which should be enough for skbs; if it is not, decrease the limit to 500MB and so on, until the weighted load of the system always allows forward progress. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote: Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Each thread will sleep in generic_make_request(); if the limit is specified correctly, the allocated number of bios will be enough to make progress. The problem is, the sleep does not occur before the virtual device mapping function is called. Let's consider two devices, a physical device named pdev and a virtual device sitting on top of it called vdev. vdev's throttle limit is just one element, but we will see that in spite of this, two bios can be handled by the vdev's mapping method before any IO completes, which violates the throttling rules. According to your patch it works like this:

    Thread 1                                    Thread 2
    no wait because vdev->bio_queued is zero
    vdev->q->bio_queued++
    enter devmapper map method
    blk_set_bdev(bio, pdev)
        vdev->bio_queued--
                                                no wait because vdev->bio_queued is zero
                                                vdev->q->bio_queued++
                                                enter devmapper map method
    whoops!

Our virtual device mapping function has now allocated resources for two in-flight bios in spite of having its throttle limit set to 1. Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close the window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
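One way to close this window, sketched below, is to charge the queue the bio was originally submitted to, remember that queue in the bio, and release the charge only at final completion, so that blk_set_bdev() no longer recharges anything. The bi_throttle_q field and helper names are assumptions used only to illustrate the shape of such a fix, not a proposed patch.

/* Sketch: charge the top-level queue once and release only at completion. */
static void throttle_charge(struct bio *bio, request_queue_t *q)
{
	wait_event(q->wait, q->bio_queued < q->bio_limit);
	q->bio_queued++;
	bio->bi_throttle_q = q;		/* hypothetical new per-bio field */
}

static void throttle_release(struct bio *bio)
{
	request_queue_t *q = bio->bi_throttle_q;

	if (!q)
		return;
	bio->bi_throttle_q = NULL;
	q->bio_queued--;
	wake_up(&q->wait);
}

/* generic_make_request() would call throttle_charge() before invoking the
 * virtual device's ->make_request_fn(), bio_endio() would call
 * throttle_release() once bi_size reaches zero, and blk_set_bdev() would
 * no longer touch the counters at all. */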
Re: Distributed storage.
On Monday 13 August 2007 02:12, Jens Axboe wrote: It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. Is it? Consider just 10 standard sata disks. The next kernel revision will have sg chaining support, so that allows 32MiB per request. Even if we disregard reads (not so interesting in this discussion) and just look at potentially pinned dirty data in a single queue, that number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch. So I still think that this throttling needs to happen elsewhere, you cannot rely the block layer throttling globally or for a single device. It just doesn't make sense. You are right, so long as the unit of throttle accounting remains one request. This is not what we do in ddsnap. Instead we inc/dec the throttle counter by the number of bvecs in each bio, which produces a nice steady data flow to the disk under a wide variety of loads, and provides the memory resource bound we require. One throttle count per bvec will not be the right throttling metric for every driver. To customize this accounting metric for a given driver we already have the backing_dev_info structure, which provides per-device-instance accounting functions and instance data. Perfect! This allows us to factor the throttling mechanism out of the driver, so the only thing the driver has to do is define the throttle accounting if it needs a custom one. We can avoid affecting the traditional behavior quite easily, for example if backing_dev_info-throttle_fn (new method) is null then either not throttle at all (and rely on the struct request in-flight limit) or we can move the in-flight request throttling logic into core as the default throttling method, simplifying the request library and not changing its behavior. These deadlocks are first and foremost, block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. The block layer has NEVER guaranteed throttling, so it can - by definition - not be a block layer deficiency. The block layer has always been deficient by not providing accurate throttling, or any throttling at all for some devices. We have practical proof that this causes deadlock and a good theoretical basis for describing exactly how it happens. To be sure, vm and net are co-conspirators, however the block layer really is the main actor in this little drama. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
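A minimal sketch of what the per-device accounting hook described above could look like, assuming a new throttle_fn member in backing_dev_info (it does not exist in mainline). It only illustrates the idea of letting each driver pick its own throttle unit, with one unit per bvec as the default metric.

/* Hypothetical per-device throttle metric hook; names are assumptions. */
struct backing_dev_info;

typedef unsigned int (*bdi_throttle_fn)(struct backing_dev_info *bdi,
					struct bio *bio);

/* Default metric: one throttle unit per bvec carried by the bio. */
static unsigned int bdi_throttle_bvecs(struct backing_dev_info *bdi,
				       struct bio *bio)
{
	return bio->bi_vcnt;
}

/* A driver with different resource needs (ddsnap, a network block device)
 * could install its own function and account bytes, pages, or anything
 * else proportional to what it actually allocates per bio.  A NULL hook
 * would fall back to the existing in-flight struct request limit. */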
Re: Distributed storage.
On Tuesday 07 August 2007 13:55, Jens Axboe wrote: I don't like structure bloat, but I do like nice design. Overloading is a necessary evil sometimes, though. Even today, there isn't enough room to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that concern can be scratched. If you read bio.h, that much is obvious. Sixteen bits in bi_rw are consumed by queue priority. Is there a reason this lives in struct bio instead of struct request? If you check up on the iommu virtual merging, you'll understand the front and back size members. They may smell dubious to you, but please take the time to understand why it looks the way it does. Virtual merging is only needed at the physical device, so why do these fields live in struct bio instead of struct request? Changing the number of bvecs is integral to how bio buildup current works. Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Incidentally, why does the bvl need to be memset to zero on allocation? bi_vcnt already tells you which bvecs are valid and the only field in a bvec that can reasonably default to zero is the offset, which ought to be set set every time a bvec is initialized anyway. bi_destructor could be combined. I don't see a lot of users of bi_idx, bi_idx is integral to partial io completions. Struct request has a remaining submission sector count so what does bi_idx do that is different? that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. So show some concrete patches and examples, hand waving and assumptions is just a waste of everyones time. Average struct bio memory footprint ranks near the bottom of the list of things that suck most about Linux storage. At idle I see 8K in use (reserves); during updatedb it spikes occasionally to 50K; under a heavy load generated by ddsnap on a storage box it sometimes goes to 100K with bio throttling in place. Really not moving the needle. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the forseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. in short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. That said, I did a little analysis to get an idea of where the soft targets are in struct bio, and to get to know the bio layer a little better. Maybe these few hints will get somebody interested enough to look further. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) It's a trade off, otherwise you'd have to pass the block device around a lot. Which costs very little, probably less than trashing an extra field's worth of cache. And it's, again, a design issue. 
A bio contains destination information, that means device/offset/size information. I'm all for shaving structure bytes where it matters, but not for the sake of sacrificing code stability or design. I consider struct bio quite lean and have worked hard to keep it that way. In fact, iirc, the only addition to struct bio since 2001 is the iommu front/back size members. And I resisted those for quite a while. You did not comment on the one about putting the bio destructor in the -endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Nice to see you jumping in Jens. Now it is over to the other side of the thread where Evgeniy has posted a patch that a) grants your wish to add no new
Re: Block device throttling [Re: Distributed storage.]
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires to replace all occurence of bio-bi_bdev = something_new with blk_set_bdev(bio, somthing_new), where queue limits will be appropriately charged. So far I'm testing second case, but I only changed DST for testing, can change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate might have changed during the lifetime of the bio. This would translate into one more per-bio field. the throttling performs another function: keeping a reasonable amount of IO in flight for the device. The definition of reasonable is complex. For a hard disk it depends on the physical distance between sector addresses of the bios in flight. In ddsnap we make a crude but workable approximation that In general, a per block device The throttle count needs to cover Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
(previous incomplete message sent accidentally) On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires to replace all occurence of bio-bi_bdev = something_new with blk_set_bdev(bio, somthing_new), where queue limits will be appropriately charged. So far I'm testing second case, but I only changed DST for testing, can change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate might have changed during the lifetime of the bio. This would translate into one more per-bio field. 2) Exposing the per-block device throttle limits via sysfs or similar is really not a good long term solution for system administration. Imagine our help text: just keep trying smaller numbers until your system deadlocks. We really need to figure this out internally and get it correct. I can see putting in a temporary userspace interface just for experimentation, to help determine what really is safe, and what size the numbers should be to approach optimal throughput in a fully loaded memory state. Regards, Daniel Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
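A rough sketch of the variable-size throttle draw described above, under the assumption of new throttle fields in backing_dev_info and one extra per-bio field to remember the amount drawn. Like the patch posted later in the thread, it is deliberately loose about atomicity, and every name here is invented for illustration.

/* Sketch only: draw a per-bio amount at submit time, remember it in the
 * bio, and give exactly that amount back at completion, so later changes
 * to the bio fields cannot unbalance the accounting. */
static void throttle_down(struct backing_dev_info *bdi, struct bio *bio)
{
	unsigned int draw = bio->bi_vcnt;	/* or bytes, or a driver metric */

	wait_event(bdi->throttle_wait,
		   bdi->throttle_used + draw <= bdi->throttle_limit);
	bdi->throttle_used += draw;		/* approximate, as in the posted patch */
	bio->bi_throttle = draw;		/* hypothetical per-bio field */
}

static void throttle_up(struct backing_dev_info *bdi, struct bio *bio)
{
	bdi->throttle_used -= bio->bi_throttle;
	wake_up(&bdi->throttle_wait);
}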
Block device throttling [Re: Distributed storage.]
On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: I don't like structure bloat, but I do like nice design. Overloading is So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires replacing all occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio, something_new), where queue limits will be appropriately charged. So far I'm testing the second case, but I only changed DST for testing; I can change all other users if needed though. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[1/1] Block device throttling [Re: Distributed storage.]
This throttling mechanism allows limiting the maximum number of queued bios per physical device. By default it is turned off and the old block layer behaviour with an unlimited number of bios is used. When turned on (the queue limit is set to something different from -1U via blk_queue_set_limit()), generic_make_request() will sleep until there is room in the queue. The number of bios is increased in generic_make_request() and either reduced in bio_endio(), when the bio is completely processed (bi_size is zero), or recharged from the original queue when a new device is assigned to the bio via blk_set_bdev(). All operations are not atomic, since we do not care about the precise number of bios, only the fact that we are close, or close enough, to the limit. Tested on the distributed storage device - with a limit of 2 bios it works slowly :)

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index c99b463..1882c9b 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1851,6 +1851,10 @@ request_queue_t *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
 
+	q->bio_limit = -1U;
+	q->bio_queued = 0;
+	init_waitqueue_head(&q->wait);
+
 	mutex_init(&q->sysfs_lock);
 
 	return q;
@@ -3237,6 +3241,16 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	request_queue_t *q;
+
+	BUG_ON(!bio->bi_bdev);
+
+	q = bdev_get_queue(bio->bi_bdev);
+	if (q && q->bio_limit != -1U) {
+		wait_event_interruptible(q->wait, q->bio_queued + 1 <= q->bio_limit);
+		q->bio_queued++;
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
diff --git a/fs/bio.c b/fs/bio.c
index 093345f..0a33958 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1028,6 +1028,16 @@ void bio_endio(struct bio *bio, unsigned int bytes_done, int error)
 	bio->bi_size -= bytes_done;
 	bio->bi_sector += (bytes_done >> 9);
 
+	if (!bio->bi_size && bio->bi_bdev) {
+		request_queue_t *q;
+
+		q = bdev_get_queue(bio->bi_bdev);
+		if (q) {
+			q->bio_queued--;
+			wake_up(&q->wait);
+		}
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, bytes_done, error);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db5b00a..7ce0cd7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -467,6 +467,9 @@ struct request_queue
 	struct request		*orig_bar_rq;
 	unsigned int		bi_size;
 
+	wait_queue_head_t	wait;
+	unsigned int		bio_limit, bio_queued;
+
 	struct mutex		sysfs_lock;
 };
@@ -764,6 +767,30 @@ extern long nr_blockdev_pages(void);
 int blk_get_queue(request_queue_t *);
 request_queue_t *blk_alloc_queue(gfp_t);
 request_queue_t *blk_alloc_queue_node(gfp_t, int);
+
+static inline void blk_queue_set_limit(request_queue_t *q, unsigned int limit)
+{
+	q->bio_limit = limit;
+}
+
+static inline void blk_set_bdev(struct bio *bio, struct block_device *bdev)
+{
+	request_queue_t *q;
+
+	if (!bio->bi_bdev) {
+		bio->bi_bdev = bdev;
+		return;
+	}
+
+	q = bdev_get_queue(bio->bi_bdev);
+	if (q) {
+		q->bio_queued--;
+		wake_up(&q->wait);
+	}
+
+	bio->bi_bdev = bdev;
+}
+
 extern void blk_put_queue(request_queue_t *);
 
 /*

-- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
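Assuming the patch above is applied, a driver would use the new interface roughly as follows. The dst_* wrappers, the limit of 128, and remap_sector() are made up for illustration; only blk_queue_set_limit() and blk_set_bdev() come from the patch.

/* Illustration only: cap a physical device at 128 in-flight bios and use
 * blk_set_bdev() instead of writing bi_bdev directly when redirecting. */
static void dst_setup_throttle(struct block_device *phys_bdev)
{
	request_queue_t *q = bdev_get_queue(phys_bdev);

	blk_queue_set_limit(q, 128);		/* -1U (the default) means unlimited */
}

static void dst_redirect_bio(struct bio *bio, struct block_device *new_bdev)
{
	/* The helper drops the old queue's in-flight count and wakes waiters. */
	blk_set_bdev(bio, new_bdev);
	bio->bi_sector = remap_sector(bio);	/* hypothetical remapping */
}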
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wed, Aug 08, 2007 at 02:17:09PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: This throttling mechanism allows limiting the maximum number of queued bios per physical device. By default it is turned off and the old block layer behaviour with an unlimited number of bios is used. When turned on (the queue limit is set to something different from -1U via blk_queue_set_limit()), generic_make_request() will sleep until there is room in the queue. The number of bios is increased in generic_make_request() and either reduced in bio_endio(), when the bio is completely processed (bi_size is zero), or recharged from the original queue when a new device is assigned to the bio via blk_set_bdev(). All operations are not atomic, since we do not care about the precise number of bios, only the fact that we are close, or close enough, to the limit. Tested on the distributed storage device - with a limit of 2 bios it works slowly :) As an add-on I can cook up a patch to configure this via sysfs if needed. Thoughts? -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
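If that sysfs add-on were written, it could look roughly like the sketch below, modelled on the queue sysfs attributes of that era (exposing bio_limit next to nr_requests). This is a guess at the shape, not a posted patch; the attribute name and helpers are assumptions.

/* Sketch: /sys/block/<dev>/queue/bio_limit, 0 meaning "unlimited". */
static ssize_t queue_bio_limit_show(struct request_queue *q, char *page)
{
	return sprintf(page, "%u\n", q->bio_limit);
}

static ssize_t queue_bio_limit_store(struct request_queue *q,
				     const char *page, size_t count)
{
	unsigned long limit = simple_strtoul(page, NULL, 10);

	blk_queue_set_limit(q, limit ? limit : -1U);	/* 0 turns throttling off */
	return count;
}

static struct queue_sysfs_entry queue_bio_limit_entry = {
	.attr  = { .name = "bio_limit", .mode = S_IRUGO | S_IWUSR },
	.show  = queue_bio_limit_show,
	.store = queue_bio_limit_store,
};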
Re: Distributed storage.
On Sun, Aug 05 2007, Daniel Phillips wrote: A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Thanks for your insights. Care to detail what bloat and dubious fields struct bio has? And we don't add temporary fields out of laziness, hoping that someone will later kill it again and rewrite it in a nicer fashion. Hint: that never happens, bloat sticks. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Tuesday 07 August 2007 05:05, Jens Axboe wrote: On Sun, Aug 05 2007, Daniel Phillips wrote: A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Thanks for your insights. Care to detail what bloat and dubious fields struct bio has? First obvious one I see is bi_rw separate from bi_flags. Front_size and back_size smell dubious. Is max_vecs really necessary? You could reasonably assume bi_vcnt rounded up to a power of two and bury the details of making that work behind wrapper functions to change the number of bvecs, if anybody actually needs that. Bi_endio and bi_destructor could be combined. I don't see a lot of users of bi_idx, that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
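On the max_vecs point, a sketch of how the capacity could be derived instead of stored, for bios allocated from the standard bvec pools, assuming the usual 1/4/16/64/128/256 pool sizes. Bios with privately allocated bvec arrays would still need their own answer, which is the cleanup work alluded to above.

/* Sketch: recover the bvec capacity from the pool index kept in bi_flags. */
static const unsigned short bvec_pool_size[] = { 1, 4, 16, 64, 128, 256 };

static inline unsigned short bio_max_vecs(struct bio *bio)
{
	return bvec_pool_size[BIO_POOL_IDX(bio)];
}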
Re: Distributed storage.
On Tue, Aug 07 2007, Daniel Phillips wrote: On Tuesday 07 August 2007 05:05, Jens Axboe wrote: On Sun, Aug 05 2007, Daniel Phillips wrote: A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Thanks for your insights. Care to detail what bloat and dubious fields struct bio has? First obvious one I see is bi_rw separate from bi_flags. Front_size and back_size smell dubious. Is max_vecs really necessary? You could I don't like structure bloat, but I do like nice design. Overloading is a necessary evil sometimes, though. Even today, there isn't enough room to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that concern can be scratched. If you read bio.h, that much is obvious. If you check up on the iommu virtual merging, you'll understand the front and back size members. They may smell dubious to you, but please take the time to understand why it looks the way it does. reasonably assume bi_vcnt rounded up to a power of two and bury the details of making that work behind wrapper functions to change the number of bvecs, if anybody actually needs that. Bi_endio and Changing the number of bvecs is integral to how bio buildup current works. bi_destructor could be combined. I don't see a lot of users of bi_idx, bi_idx is integral to partial io completions. that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. So show some concrete patches and examples, hand waving and assumptions is just a waste of everyones time. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) It's a trade off, otherwise you'd have to pass the block device around a lot. And it's, again, a design issue. A bio contains destination information, that means device/offset/size information. I'm all for shaving structure bytes where it matters, but not for the sake of sacrificing code stability or design. I consider struct bio quite lean and have worked hard to keep it that way. In fact, iirc, the only addition to struct bio since 2001 is the iommu front/back size members. And I resisted those for quite a while. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Sun, Aug 05, 2007 at 02:23:45PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote: If we are sleeping in memory pool, then we already do not have memory to complete previous requests, so we are in trouble. Not at all. Any requests in flight are guaranteed to get the resources they need to complete. This is guaranteed by the combination of memory reserve management and request queue throttling. In logical terms, reserve management plus queue throttling is necessary and sufficient to prevent these deadlocks. Conversely, the absence of either one allows deadlock. Only if you have two, which must be closely related to each other (i.e. each request must have network reserve big enough to store data). This can work for devices which do not require additional allocations (like usual local storage), but not for network connected ones. It works for network devices too, and also for a fancy device like ddsnap, which is the moral equivalent of a filesystem implemented in user space. With or without vm deadlock patches? I can not see how it can work, if network does not have a reserve and there is not free memory completely. If all systems have reserve then yes, it works good. By default things will be like they are now, except additional non-atomic increment and branch in generic_make_request() and decrement and wake in bio_end_io()? -endio is called in interrupt context, so the accounting needs to be atomic as far as I can see. Actually we only care about if there is a place in the queue or not - so it can be a flag. Actually non-atomic operations are ok, since having plus/minus couple of requests in flight does not change the picture, but allows not to introduce slow atomic operations in the fast path. We actually account the total number of bio pages in flight, otherwise you would need to assume the largest possible bio and waste a huge amount of reserve memory. A counting semaphore works fine for this purpose, with some slight inefficiency that is nigh on unmeasurable in the block IO path. What the semaphore does is make the patch small and easy to understand, which is important at this point. Yes, it can be bio vectors. I can cook up such a patch if idea worth efforts. It is. There are some messy details... You need a place to store the accounting variable/semaphore and need to be able to find that place again in -endio. Trickier than it sounds, because of the unstructured way drivers rewrite -bi_bdev. Peterz has already poked at this in a number of different ways, typically involving backing_dev_info, which seems like a good idea to me. We can demand that reserve is not per virtual device, but per real one - for example in case of distributed storage locally connected node should have much higher limit than network one, but having a per-virtual device reserve might end up with situation, when local node can proceed data, but no requests will be queued sine all requests below limit are in network node. In case of per real device limit there is no need to increase bio. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
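The ddsnap-style accounting mentioned here, counting bio vectors against a counting semaphore rather than counting whole bios, looks roughly like the sketch below. The names are invented and this is not the ddsnap source; it only illustrates the technique.

/*
 * Sketch: a per-device counting semaphore initialized to the device's
 * throttle limit, drawn down by one count per bvec at submit time and
 * released on completion.
 */
#include <asm/semaphore.h>	/* struct semaphore of that kernel era */

struct throttled_dev {
	struct semaphore throttle;	/* counts bvec "slots" still free */
};

static void dev_submit(struct throttled_dev *dev, struct bio *bio)
{
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		down(&dev->throttle);	/* sleeps when the device is saturated */

	generic_make_request(bio);
}

static void dev_bio_done(struct throttled_dev *dev, struct bio *bio)
{
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		up(&dev->throttle);	/* wakes one queued submitter per bvec */
}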
Re: Distributed storage.
On Sun, Aug 05, 2007 at 02:35:04PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote: On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote: DST original code worked as a device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? At least in terms of efficiency. Device mapper lives in a happy world where memory does not end and allocations are fast. Are you saying that things are different for a network block device because it needs to do GFP_ATOMIC allocations? If so then that is just a misunderstanding. The global page reserve Peter and I use is available in interrupt context just like GFP_ATOMIC. No, neither device needs atomic allocations. I just said that device mapper is too expensive, since it performs a lot of additional allocations in the fast path and is not designed for cases when allocation fails, since there is no recovery path and (maybe because of this) mempool allocation waits forever until there is free memory and cannot fail. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote: On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote: To be sure, I am not very proud of this throttling mechanism for various reasons, but the thing is, _any_ throttling mechanism no matter how sucky solves the deadlock problem. Over time I want to move the make_request_fn is always called in process context, Yes, as is submit_bio which calls it. The decision re where it is best to throttle, in submit_bio or in make_request_fn, has more to do with system factoring, that is, is throttling something that _every_ block device should have (yes I think) or is it a delicate, optional thing that needs a tweakable algorithm per block device type (no I think). The big worry I had was that by blocking on congestion in the submit_bio/make_request_fn I might stuff up system-wide mm writeout. But a while ago that part of the mm was tweaked (by Andrew if I recall correctly) to use a pool of writeout threads and understand the concept of one of them blocking on some block device, and not submit more writeout to the same block device until the first thread finishes its submission. Meanwhile, other mm writeout threads carry on with other block devices. we can wait in it for memory in mempool. Although that means we already in trouble. Not at all. This whole block writeout path needs to be written to run efficiently even when normal system memory is completely gone. All it means when we wait on a mempool is that the block device queue is as full as we are ever going to let it become, and that means the block device is working as hard as it can (subject to a small caveat: for some loads a device can work more efficiently if it can queue up larger numbers of requests down at the physical elevators). By the way, ddsnap waits on a counting semaphore, not a mempool. That is because we draw our reserve memory from the global memalloc reserve, not from a mempool. And that is not only because it takes less code to do so, but mainly because global pools as opposed to lots of little special purpose pools seem like a good idea to me. Though I will admit that with our current scheme we need to allow for the total of the maximum reserve requirements for all memalloc users in the memalloc pool, so it does not actually save any memory vs dedicated pools. We could improve that if we wanted to, by having hard and soft reserve requirements: the global reserve actually only needs to be as big as the total of the hard requirements. With this idea, if by some unlucky accident every single pool user got itself maxed out at the same time, we would still not exceed our share of the global reserve. Under normal low memory situations, a block device would typically be free to grab reserve memory up to its soft limit, allowing it to optimize over a wider range of queued transactions. My little idea here is: allocating specific pages to a pool is kind of dumb, all we really want to do is account precisely for the number of pages we are allowed to draw from the global reserve. OK, I kind of digressed, but this all counts as explaining the details of what Peter and I have been up to for the last year (longer for me). At this point, we don't need to do the reserve accounting in the most absolutely perfect way possible, we just need to get something minimal in place to fix the current deadlock problems, then we can iteratively improve it. 
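The hard/soft reserve idea above is easier to see in a toy accounting model. The program below is a pure userspace illustration of the bookkeeping, with arbitrarily chosen numbers; it does not correspond to any existing kernel interface.

/* Toy model: each user declares a hard requirement (always honoured and
 * summed into the global reserve) and a soft one (granted only while the
 * global pool has slack). */
#include <stdio.h>

struct reserve_user {
	unsigned int hard;	/* pages this user must always be able to get */
	unsigned int soft;	/* extra pages it may use opportunistically   */
	unsigned int used;
};

static unsigned int global_reserve;	/* sized as the sum of hard limits only */
static unsigned int global_used;

static int reserve_get(struct reserve_user *u, unsigned int pages)
{
	unsigned int limit = u->hard;

	/* Allow up to the soft limit while the global pool is not under pressure. */
	if (global_used + pages <= global_reserve / 2)
		limit += u->soft;

	if (u->used + pages > limit)
		return -1;			/* caller must wait or throttle */
	u->used += pages;
	global_used += pages;
	return 0;
}

static void reserve_put(struct reserve_user *u, unsigned int pages)
{
	u->used -= pages;
	global_used -= pages;
}

int main(void)
{
	struct reserve_user dev = { .hard = 64, .soft = 192 };

	global_reserve = 64;			/* only the hard total is reserved */
	printf("%d\n", reserve_get(&dev, 32));	/* 0: within limits      */
	printf("%d\n", reserve_get(&dev, 64));	/* -1: would exceed them */
	reserve_put(&dev, 32);
	return 0;
}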
I agree, any kind of high-boundary leveling must be implemented in the device itself, since the block layer does not know what device is at the end and what it will need to process a given block request. I did not say the throttling has to be implemented in the device, only that we did it there because it was easiest to code that up and try it out (it worked). This throttling really wants to live at a higher level, possibly submit_bio()...bio_endio(). Someone at OLS (James Bottomley?) suggested it would be better done at the request queue layer, but I do not immediately see why that should be. I guess this is going to come down to somebody throwing out a patch for interested folks to poke at. But this detail is a fine point. The big point is to have _some_ throttling mechanism in place on the block IO path, always. Device mapper in particular does not have any throttling itself: calling submit_bio on a device mapper device directly calls the device mapper bio dispatcher. A default-initialized block device queue does provide a crude form of throttling based on limiting the number of requests. This is insufficiently precise to do a good job in the long run, but it works for now because the current gaggle of low-level block drivers do not have a lot of resource requirements and tend to behave fairly predictably (except for some irritating issues re very slow devices working in parallel with very fast devices, but... worry about that later). Network block drivers - for example
Re: Distributed storage.
On Saturday 04 August 2007 09:44, Evgeniy Polyakov wrote: On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper and is synchronous) In fact, NBD has nothing to do with device mapper. I use it as a physical target underneath ddraid (a device mapper plugin) just like I would use your DST if it proves out. I meant that to create a storage on top of several nodes one needs device mapper or something like it on top of NBD itself. To further export the resulting device one needs another userspace NBD application and so on. DST simplifies that greatly. DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? Regards, Daniel
Re: Distributed storage.
On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? At least in terms of efficiency. Device mapper lives in a happy world where memory never runs out and allocations are fast. -- Evgeniy Polyakov
Re: Distributed storage.
Hi Daniel. On Sun, Aug 05, 2007 at 01:04:19AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: we can wait in it for memory in mempool. Although that means we are already in trouble. Not at all. This whole block writeout path needs to be written to run efficiently even when normal system memory is completely gone. All it means when we wait on a mempool is that the block device queue is as full as we are ever going to let it become, and that means the block device is working as hard as it can (subject to a small caveat: for some loads a device can work more efficiently if it can queue up larger numbers of requests down at the physical elevators). If we are sleeping in a memory pool, then we already do not have enough memory to complete previous requests, so we are in trouble. This can work for devices which do not require additional allocations (like usual local storage), but not for network-connected ones. I agree, any kind of high-boundary leveling must be implemented in the device itself, since the block layer does not know what device is at the end and what it will need to process a given block request. I did not say the throttling has to be implemented in the device, only that we did it there because it was easiest to code that up and try it out (it worked). This throttling really wants to live at a higher level, possibly submit_bio()...bio_endio(). Someone at OLS (James Bottomley?) suggested it would be better done at the request queue layer, but I do not immediately see why that should be. I guess this is going to come down to somebody throwing out a patch for interested folks to poke at. But this detail is a fine point. The big point is to have _some_ throttling mechanism in place on the block IO path, always. If not in the device, then it should at least tell the block layer about its limits. What about a new function to register a queue with a maximum number of bios in flight, sleeping in generic_make_request() when a new bio is about to be submitted and would exceed the limit? By default things would stay as they are now, except for an additional non-atomic increment and branch in generic_make_request() and a decrement and wake in bio_end_io(). I can cook up such a patch if the idea is worth the effort. -- Evgeniy Polyakov
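[For illustration only, here is roughly what such a per-queue limit could look like. The structure and helper names are invented for this sketch, the check is deliberately crude (the limit can be overshot slightly under contention), and the completion-side counter is made atomic since completion may run in interrupt context - this is a sketch of the proposal, not a patch.]

    #include <linux/wait.h>
    #include <asm/atomic.h>

    struct bio_throttle {
            unsigned int      limit;    /* maximum bios allowed in flight */
            atomic_t          inflight; /* bios currently in flight */
            wait_queue_head_t wait;     /* submitters sleep here when full */
    };

    static void throttle_init(struct bio_throttle *t, unsigned int limit)
    {
            t->limit = limit;
            atomic_set(&t->inflight, 0);
            init_waitqueue_head(&t->wait);
    }

    /* called from generic_make_request() before a bio is dispatched */
    static void throttle_bio_submit(struct bio_throttle *t)
    {
            wait_event(t->wait, atomic_read(&t->inflight) < t->limit);
            atomic_inc(&t->inflight);
    }

    /* called from the bio completion path (bio_end_io time) */
    static void throttle_bio_end(struct bio_throttle *t)
    {
            if (atomic_dec_return(&t->inflight) < t->limit)
                    wake_up(&t->wait);
    }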
Re: Distributed storage.
On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote: If we are sleeping in a memory pool, then we already do not have enough memory to complete previous requests, so we are in trouble. Not at all. Any requests in flight are guaranteed to get the resources they need to complete. This is guaranteed by the combination of memory reserve management and request queue throttling. In logical terms, reserve management plus queue throttling is necessary and sufficient to prevent these deadlocks. Conversely, the absence of either one allows deadlock. This can work for devices which do not require additional allocations (like usual local storage), but not for network-connected ones. It works for network devices too, and also for a fancy device like ddsnap, which is the moral equivalent of a filesystem implemented in user space. If not in the device, then it should at least tell the block layer about its limits. What about a new function to register a queue... Yes, a new internal API is needed eventually. However, no new API is needed right at the moment because we can just hard-code the reserve sizes and queue limits and audit them by hand, which is not any more sloppy than several other kernel subsystems. The thing is, we need to keep any obfuscating detail out of the initial patches because these principles are hard enough to explain already without burying them in hundreds of lines of API fluff. That said, the new improved API should probably not be a new way to register, but a set of function calls you can use after the queue is created, which follows the pattern of the existing queue API. ...with a maximum number of bios in flight, sleeping in generic_make_request() when a new bio is about to be submitted and would exceed the limit? Exactly. This is what ddsnap currently does and it works. But we did not change generic_make_request for this driver; instead we throttled the driver from the time it makes a request to its user space server, until the reply comes back. We did it that way because it was easy and was the only segment of the request lifeline that could not be fixed by other means. A proper solution for all block devices will move the throttling up into generic_make_request, as you say below. By default things would stay as they are now, except for an additional non-atomic increment and branch in generic_make_request() and a decrement and wake in bio_end_io()? ->endio is called in interrupt context, so the accounting needs to be atomic as far as I can see. We actually account the total number of bio pages in flight, otherwise you would need to assume the largest possible bio and waste a huge amount of reserve memory. A counting semaphore works fine for this purpose, with some slight inefficiency that is nigh on unmeasurable in the block IO path. What the semaphore does is make the patch small and easy to understand, which is important at this point. I can cook up such a patch if the idea is worth the effort. It is. There are some messy details... You need a place to store the accounting variable/semaphore and need to be able to find that place again in ->endio. Trickier than it sounds, because of the unstructured way drivers rewrite ->bi_bdev. Peterz has already poked at this in a number of different ways, typically involving backing_dev_info, which seems like a good idea to me. A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission.
Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Regards, Daniel
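[A sketch of that last idea, purely illustrative: bi_throttle is not a real field of struct bio, and the helpers reuse the names from the earlier sketch. The point is only that the pointer is set once at the top level and therefore survives ->bi_bdev rewriting by stacked drivers.]

    /* hypothetical addition to struct bio, owned by the top-level submitter:
     *         struct bio_throttle *bi_throttle;
     * stacked drivers never touch it, so the completion path can always
     * find the right accounting, no matter how ->bi_bdev was rewritten */

    static void submit_throttled(struct bio_throttle *t, struct bio *bio)
    {
            bio->bi_throttle = t;      /* set once, before any recursion */
            throttle_bio_submit(t);    /* may sleep until below the limit */
            generic_make_request(bio);
    }

    /* called from the completion handler, possibly in interrupt context */
    static void throttled_bio_done(struct bio *bio)
    {
            if (bio->bi_throttle)
                    throttle_bio_end(bio->bi_throttle);
    }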
Re: Distributed storage.
On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote: On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote: DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? At least in terms of efficiency. Device mapper lives in a happy world where memory never runs out and allocations are fast. Are you saying that things are different for a network block device because it needs to do GFP_ATOMIC allocations? If so then that is just a misunderstanding. The global page reserve Peter and I use is available in interrupt context just like GFP_ATOMIC. Regards, Daniel
Re: Distributed storage.
On Fri, Aug 03, 2007 at 06:19:16PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: It depends on the characteristics of the physical and virtual block devices involved. Slow block devices can produce surprising effects. Ddsnap still qualifies as slow under certain circumstances (big linear write immediately following a new snapshot). Before we added throttling we would see as many as 800,000 bios in flight. Nice to know the system can actually survive this... mostly. Mmm, sounds tasty to work with such a system :) But memory deadlock is a clear and present danger under those conditions and we did hit it (not to mention that read latency sucked beyond belief). Anyway, we added a simple counting semaphore to throttle the bio traffic to a reasonable number and behavior became much nicer, but most importantly, this satisfies one of the primary requirements for avoiding block device memory deadlock: a strictly bounded amount of bio traffic in flight. In fact, we allow some bounded number of non-memalloc bios *plus* however much traffic the mm wants to throw at us in memalloc mode, on the assumption that the mm knows what it is doing and imposes its own bound of in-flight bios per device. This needs auditing obviously, but the mm either does that or is buggy. In practice, with this throttling in place we never saw more than 2,000 in flight no matter how hard we hit it, which is about the number we were aiming at. Since we draw our reserve from the main memalloc pool, we can easily handle 2,000 bios in flight, even under extreme conditions. See: http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c down(info->throttle_sem); To be sure, I am not very proud of this throttling mechanism for various reasons, but the thing is, _any_ throttling mechanism no matter how sucky solves the deadlock problem. Over time I want to move the make_request_fn is always called in process context, we can wait in it for memory in mempool. Although that means we are already in trouble. I agree, any kind of high-boundary leveling must be implemented in the device itself, since the block layer does not know what device is at the end and what it will need to process a given block request. -- Evgeniy Polyakov
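[For reference, the shape of that counting-semaphore throttle is roughly the following. This is illustrative only, not the actual dm-ddsnap.c code: the names and the 1000-bio bound are made up, and error handling is omitted.]

    #include <asm/semaphore.h>

    #define MAX_INFLIGHT_BIOS 1000     /* arbitrary bound for the sketch */

    struct devinfo {
            struct semaphore throttle_sem;
    };

    static void devinfo_init(struct devinfo *info)
    {
            sema_init(&info->throttle_sem, MAX_INFLIGHT_BIOS);
    }

    /* before handing a bio to the user-space server */
    static void throttle(struct devinfo *info)
    {
            down(&info->throttle_sem); /* sleeps once the bound is reached */
    }

    /* when the reply for that bio comes back */
    static void unthrottle(struct devinfo *info)
    {
            up(&info->throttle_sem);
    }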
Re: Distributed storage.
Hi Daniel. On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper and is synchronous) In fact, NBD has nothing to do with device mapper. I use it as a physical target underneath ddraid (a device mapper plugin) just like I would use your DST if it proves out. I meant that to create a storage on top of several nodes one needs device mapper or something like it on top of NBD itself. To further export the resulting device one needs another userspace NBD application and so on. DST simplifies that greatly. DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. -- Evgeniy Polyakov
Re: Distributed storage.
On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote: On 7/31/07, Evgeniy Polyakov [EMAIL PROTECTED] wrote: TODO list currently includes following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon-based one will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) LDPC codes[1][2] have been replacing Turbo code[3] with regards to communication links and we have been seeing that transition. (maybe helpful, came to mind seeing the mention of Turbo code) Don't know how Weaver compares to LDPC, though I found some comparisons [4][5]. But looking at fault tolerance figures, I guess Weaver is much better. [1] http://www.ldpc-codes.com/ [2] http://portal.acm.org/citation.cfm?id=1240497 [3] http://en.wikipedia.org/wiki/Turbo_code [4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf [5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf LDPC codes require solving an N-order matrix over a finite field - exactly the reason I do not want to use Reed-Solomon codes, even with an optimized non-Vandermonde matrix. I will investigate LDPC further though. Turbo codes are like stream ciphers compared to RS codes being block ciphers. The transport medium in data storage is reliable, otherwise such storages would not even exist. -- Evgeniy Polyakov
Re: Distributed storage.
On Fri, Aug 03, 2007 at 02:26:29PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: Memory deadlock is a concern of course. From a cursory glance through, it looks like this code is pretty vm-friendly and you have thought quite a lot about it, however I respectfully invite peterz (obsessive/compulsive memory deadlock hunter) to help give it a good going over with me. Another major issue is network allocations. Your initial work and the subsequent releases made by Peter were originally opposed on my side, but now I think the right way is to combine the positive points of your approach with a specialized allocator - essentially what I proposed (in the blog only, though) is to bind an independent reserve to any socket. Such a reserve can be stolen from the socket buffer itself (each socket has a limited socket buffer that packets are allocated from; it accounts both data and control (skb) lengths), so when the main allocation via the common path fails, it would be possible to get memory from the socket's own reserve. This allows sending sockets to make progress in case of deadlock. For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. It allows separating sockets and effectively making them fair - a system administrator or programmer can limit a socket's buffer a bit and request a reserve for special communication channels, which will then be guaranteed to make both sending and receiving progress, no matter how many of them were set up. And it does not require any changes outside the network side. -- Evgeniy Polyakov
Re: Distributed storage.
On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote: For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. This is not true. I have a global reserve which is set up a priori. You cannot allocate a reserve when under pressure, that does not make sense. Let me explain my approach once again. At swapon(8) time we allocate a global reserve, and associate the needed sockets with it. The size of this global reserve is made up of two parts: TX and RX. The RX pool is the most interesting part. It again is made up of two parts: skb and auxiliary data. The skb part is scaled such that it can overflow the IP fragment reassembly, the aux pool such that it can overflow the route cache (that was the largest other allocator in the RX path). All (reserve) RX skb allocations are accounted, so as to never allocate more than we reserved. All packets are received (given the limit) and are processed up to socket demux. At that point all packets not targeted at an associated socket are dropped and the skb memory freed - ready for another packet. All packets targeted for associated sockets get processed. This requires that this packet processing happens in-kernel. Since we are swapping, user-space might be waiting for this data, and we'd deadlock. I'm not quite sure why you need per-socket reservations.
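[A rough model of the RX accounting and demux policy described above, with invented names (rx_reserve, the charge/release helpers); this is a sketch of the idea, not code from the actual patches.]

    #include <asm/atomic.h>

    struct rx_reserve {
            atomic_t used;   /* reserve skb allocations currently outstanding */
            int      limit;  /* sized to cover fragment reassembly plus aux data */
    };

    /* every reserve-backed RX skb allocation is charged against the limit */
    static int rx_reserve_charge(struct rx_reserve *r)
    {
            if (atomic_add_return(1, &r->used) > r->limit) {
                    atomic_dec(&r->used);
                    return 0;   /* over budget: caller must drop the packet */
            }
            return 1;
    }

    /* called when the skb is freed at demux time (packet not for an
     * associated socket) or after in-kernel processing completes, so the
     * reserve slot becomes available for the next packet */
    static void rx_reserve_release(struct rx_reserve *r)
    {
            atomic_dec(&r->used);
    }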
Re: Distributed storage.
On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote: For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. This is not true. I have a global reserve which is set up a priori. You cannot allocate a reserve when under pressure, that does not make sense. I probably did not give enough details - my main position is to allocate a per-socket reserve from the socket's queue, and copy data there from the main reserve, all of which is allocated either in advance (the global one) or per socket option, so that there would be no fairness issue about what to mark as special and what not. Say we have a page per socket: each socket can assign a reserve for itself from its own memory, and this accounts for both the tx and rx sides. Tx is not interesting, it is simple. Rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... -- Evgeniy Polyakov
Re: Distributed storage.
Hi Mike. On Fri, Aug 03, 2007 at 12:09:02AM -0400, Mike Snitzer ([EMAIL PROTECTED]) wrote: * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper and is synchronous) Having the in-kernel export is a great improvement over NBD's userspace nbd-server (extra copy, etc). But NBD's synchronous nature is actually an asset when coupled with MD raid1, as it provides guarantees that the data has _really_ been mirrored remotely. I believe that the right answer to this is a barrier, not synchronous sending/receiving, which might slow things down noticeably. A barrier must wait until the remote side has received the data and sent back a notice. Until the acknowledgement is received, no one can say whether the data was mirrored, or even received by the remote node, or not. TODO list currently includes following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon-based one will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) I'd like to better understand where you see DST heading in the area of redundancy. Based on your blog entries: http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_07_24_1.html http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_07_31_2.html (and your todo above) implementing a mirroring algorithm appears to be a near-term goal for you. Can you comment on how your intended implementation would compare, in terms of correctness and efficiency, to say MD (raid1) + NBD? MD raid1 has a write-intent bitmap that is useful to speed resyncs; what if any mechanisms do you see DST embracing to provide similar and/or better reconstruction infrastructure? Do you intend to embrace any existing MD or DM infrastructure? It depends on what algorithm will be preferred - I do not want mirroring, it is _too_ wasteful in terms of used storage, but it is the simplest. Right now I still consider WEAVER codes the fastest in a distributed environment from what I checked before, but they are quite complex and the spec is (at least for me) not clear in all aspects right now. I have not even started a userspace implementation of those codes. (Hint: the spec sucks, kidding :) For simple mirroring each node must be split into chunks, each of which has a representation bit in the main node mask; when a chunk is dirty, the full chunk is resynced. The chunk size varies depending on the node size and the amount of memory. Setup is performed during node initialization. Having a checksum for each chunk is a good step. All the interfaces are already there, although they require cleanup and moving around, but I decided to keep the initial release small. BTW, you have definitely published some very compelling work and it's sad that you're predisposed to think DST won't be received well if you pushed for inclusion (for others, as much was said in the 7.31.2007 blog post I referenced above). Clearly others need to embrace DST to help inclusion become a reality. To that end, it's great to see that Daniel Phillips and the other zumastor folks will be putting DST through its paces. In that blog entry I misspelled Zen as Xen - that was an error; as for the prognosis, time will judge :) regards, Mike -- Evgeniy Polyakov
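[A minimal sketch of that per-chunk dirty mask, assuming 1 MiB chunks purely for illustration; none of this is DST code and all names are invented.]

    #include <stdint.h>
    #include <stdlib.h>

    #define CHUNK_SHIFT 20                  /* assume 1 MiB chunks for the example */
    #define CHUNK_SIZE  (1ULL << CHUNK_SHIFT)

    struct node_mask {
            uint64_t  nr_chunks;
            uint8_t  *dirty;                /* one bit per chunk */
    };

    static int node_mask_init(struct node_mask *m, uint64_t node_bytes)
    {
            m->nr_chunks = (node_bytes + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
            m->dirty = calloc((m->nr_chunks + 7) / 8, 1);
            return m->dirty ? 0 : -1;
    }

    /* mark every chunk touched by a write while the peer node is offline */
    static void mark_dirty(struct node_mask *m, uint64_t offset, uint64_t len)
    {
            uint64_t first = offset >> CHUNK_SHIFT;
            uint64_t last  = (offset + len - 1) >> CHUNK_SHIFT;

            for (; first <= last; first++)
                    m->dirty[first >> 3] |= 1 << (first & 7);
    }

    /* resync walks the mask and copies whole dirty chunks to the returning node */
    static int chunk_is_dirty(const struct node_mask *m, uint64_t chunk)
    {
            return m->dirty[chunk >> 3] & (1 << (chunk & 7));
    }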
Re: Distributed storage.
Hi. On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote: On 7/31/07, Evgeniy Polyakov [EMAIL PROTECTED] wrote: TODO list currently includes following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon-based one will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) LDPC codes[1][2] have been replacing Turbo code[3] with regards to communication links and we have been seeing that transition. (maybe helpful, came to mind seeing the mention of Turbo code) Don't know how Weaver compares to LDPC, though I found some comparisons [4][5]. But looking at fault tolerance figures, I guess Weaver is much better. [1] http://www.ldpc-codes.com/ [2] http://portal.acm.org/citation.cfm?id=1240497 [3] http://en.wikipedia.org/wiki/Turbo_code [4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf [5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf Great, thanks for these links, I will definitely study them. -- Evgeniy Polyakov
Re: Distributed storage.
On Thu, Aug 02, 2007 at 02:08:24PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: Hi. I'm pleased to announce first release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. Excellent! This is precisely what the doctor ordered for the OCFS2-based distributed storage system I have been mumbling about for some time. In fact the dd in ddsnap and ddraid stands for distributed data. The ddsnap/raid devices do not include an actual network transport; that is expected to be provided by a specialized block device, which up till now has been NBD. But NBD has various deficiencies as you note, in addition to its tendency to deadlock when accessed locally. Your new code base may be just the thing we always wanted. We (zumastor et al) will take it for a drive and see if anything breaks. That would be great. Memory deadlock is a concern of course. From a cursory glance through, it looks like this code is pretty vm-friendly and you have thought quite a lot about it, however I respectfully invite peterz (obsessive/compulsive memory deadlock hunter) to help give it a good going over with me. I see bits that worry me, e.g.: + req = mempool_alloc(st->w->req_pool, GFP_NOIO); which seems to be callable in response to a local request, just the case where NBD deadlocks. Your mempool strategy can work reliably only if you can prove that the pool allocations of the maximum number of requests you can have in flight do not exceed the size of the pool. In other words, if you ever take the pool's fallback path to normal allocation, you risk deadlock. The mempool should be allocated to be able to cover the maximum number of in-flight requests; in my tests I was unable to force the block layer to put more than 31 pages into sync writeout, and those were in one bio. Each request is essentially delayed bio processing, so this must handle the maximum number of in-flight bios (if they do not cover multiple nodes; if they do, then each node requires its own request). Sync has one bio in flight on my machines (from tiny VIA nodes to low-end amd64); the number of normal requests *usually* does not exceed several dozen (always less than a hundred), but that might be just my small systems, so the request size was selected as small as possible and the number of allocations decreased to the absolute healthcare minimum. Anyway, if this is as grand as it seems then I would think we ought to factor out a common transfer core that can be used by all of NBD, iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own code those things have now. Regards, Daniel Thanks. -- Evgeniy Polyakov
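[To make the sizing rule concrete, a hedged sketch: the pool's minimum element count must cover the worst-case number of requests in flight, so a GFP_NOIO allocation from it never has to fall through to the regular allocator. The struct name, its fields and the 128-request bound are placeholders, not DST code.]

    #include <linux/mempool.h>
    #include <linux/bio.h>

    #define MAX_INFLIGHT_REQS 128      /* must bound what can ever be queued */

    struct dst_request {               /* placeholder per-bio bookkeeping */
            struct bio *bio;
            void       *node;
    };

    static mempool_t *req_pool;

    static int req_pool_init(void)
    {
            req_pool = mempool_create_kmalloc_pool(MAX_INFLIGHT_REQS,
                                                   sizeof(struct dst_request));
            return req_pool ? 0 : -ENOMEM;
    }

    static struct dst_request *req_get(void)
    {
            /* deadlock-safe only if in-flight requests never exceed
             * MAX_INFLIGHT_REQS; beyond that the allocation falls back to
             * the page allocator and can block under memory pressure */
            return mempool_alloc(req_pool, GFP_NOIO);
    }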
Re: Distributed storage.
On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote: On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote: For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. This is not true. I have a global reserve which is set up a priori. You cannot allocate a reserve when under pressure, that does not make sense. I probably did not give enough details - my main position is to allocate a per-socket reserve from the socket's queue, and copy data there from the main reserve, all of which is allocated either in advance (the global one) or per socket option, so that there would be no fairness issue about what to mark as special and what not. Say we have a page per socket: each socket can assign a reserve for itself from its own memory, and this accounts for both the tx and rx sides. Tx is not interesting, it is simple. Rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... Ah, I think I understand now. Yes this is indeed a good idea! It would be quite doable to implement this on top of what I already have. We would need to extend the socket with a sock_opt that would reserve a specified amount of data for that specific socket. And then on socket demux check if the socket has a non-zero reserve and has not yet exceeded said reserve. If so, process the packet. This would also quite neatly work for -rt where we would not want incoming packet processing to be delayed by memory allocations.
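[What such a knob might look like from user space, for illustration only: SO_MEM_RESERVE is a made-up option name, no such sockopt exists, and the kernel-side behaviour is described only in the comment.]

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <stdio.h>

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int reserve = 4096;        /* e.g. one page reserved for this socket */

            if (fd < 0) {
                    perror("socket");
                    return 1;
            }

    #ifdef SO_MEM_RESERVE              /* hypothetical option, does not exist */
            if (setsockopt(fd, SOL_SOCKET, SO_MEM_RESERVE,
                           &reserve, sizeof(reserve)) < 0)
                    perror("setsockopt");
    #endif
            /* kernel side (sketch): at demux, a packet for this socket is kept
             * as long as the socket's accounted usage stays below 'reserve';
             * packets for sockets without a reserve are dropped and their
             * memory immediately recycled into the global pool. */
            return 0;
    }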
Re: Distributed storage.
On Friday 03 August 2007 06:49, Evgeniy Polyakov wrote: ...rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... And when the global reserve is entirely used up your system goes back to dropping vm writeout acknowledgements, not so good. I like your approach, and specifically the copying idea cuts out considerable complexity. But I believe the per-socket flag to mark a socket as part of the vm writeout path is not optional, and in this case it will be a better world if it is a slightly unfair world in favor of vm writeout traffic. Ssh will still work fine even with vm getting priority access to the pool. During memory crunches, non-vm ssh traffic may get bumped till after the crunch, but vm writeout is never supposed to hog the whole machine. If vm writeout hogs your machine long enough to delay an ssh login then that is a vm bug and should be fixed at that level. Regards, Daniel
Re: Distributed storage.
On Friday 03 August 2007 07:53, Peter Zijlstra wrote: On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote: On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra wrote: ...my main position is to allocate a per-socket reserve from the socket's queue, and copy data there from the main reserve, all of which is allocated either in advance (the global one) or per socket option, so that there would be no fairness issue about what to mark as special and what not. Say we have a page per socket: each socket can assign a reserve for itself from its own memory, and this accounts for both the tx and rx sides. Tx is not interesting, it is simple. Rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... Ah, I think I understand now. Yes this is indeed a good idea! It would be quite doable to implement this on top of what I already have. We would need to extend the socket with a sock_opt that would reserve a specified amount of data for that specific socket. And then on socket demux check if the socket has a non-zero reserve and has not yet exceeded said reserve. If so, process the packet. This would also quite neatly work for -rt where we would not want incoming packet processing to be delayed by memory allocations. At this point we need anything that works in mainline as a starting point. By erring on the side of simplicity we can make this understandable for folks who haven't spent the last two years wallowing in it. The page-per-socket approach is about as simple as it gets. I therefore propose we save our premature optimizations for later. It will also help our cause if we keep any new internal APIs to strictly what is needed to make deadlock go away. Not a whole lot more than just the flag to mark a socket as part of the vm writeout path when you get right down to essentials. Regards, Daniel