Re: Distributed storage. Move away from char device ioctls.
Returning to this, since the block-based storage, which can act as a shared storage/transport layer, is ready with the fifth release of DST. A couple of notes on the proposed data distribution algorithm in the FS. On Sun, Sep 16, 2007 at 03:07:11AM -0400, Kyle Moffett ([EMAIL PROTECTED]) wrote: I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre-level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers). I really think that to get proper non-block-device-level filesystem redundancy you need to base it on something similar to the GIT model. Data replication is done in specific-sized chunks indexed by SHA-1 sum and you actually have a sort of merge algorithm for when local and remote changes differ. The OS would only implement a very limited list of merge algorithms, IE one of: (A) Don't merge, each client gets its own branch and merges are manual (B) Most recent changed version is made the master every X-seconds/open/close/write/other-event. (C) The tree at X (usually a particular client/server) is always used as the master when there are conflicts. This looks like a good way to work with offline clients (Coda comes to mind): after an offline node has modified data, it should be merged back into the cluster with different algorithms. Data that was supposed to be written to the failed node during its offline time will be resynced from the other nodes when the failed one comes back online; there are no problems and/or special algorithms needed here. Filesystem replication is not 100% the 'git way' - a git tree contains already-combined objects, i.e. the last blob for a given path does not contain its history, only ready-to-use data, while a filesystem, especially one which must accept simultaneous writes from different threads/nodes, should implement copy-on-write semantics, essentially putting all new data (a git commit) into a new location and then collecting it from different extents to present a ready file. At least that is how I see the filesystem I'm working on. ... There's a lot of other technical details which would need resolution in an actual implementation, but this is enough of a summary to give you the gist of the concept. Most likely there will be some major flaw which makes it impossible to produce reliably, but the concept contains the things I would be interested in for a real networked filesystem. Git semantics and copy-on-write actually have quite a lot in common (at a high enough level of abstraction), but a plain SHA-1 index is a problem in a filesystem - even besides the amount of data to be hashed before the key is ready, the key should also contain enough information about what the underlying data is - git does not store that information (tree, blob or whatever) in its keys, since it does not require it. At least that is how I see it being implemented. Overall I see this new project as a true copy-on-write FS. Thanks. Cheers, Kyle Moffett -- Evgeniy Polyakov
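As a rough sketch of the point about keys carrying more than a content hash (the structure and names below are purely illustrative assumptions, not taken from DST, git, or any real filesystem), a filesystem key might combine the hash with an object type and a logical location, so the lookup path knows what it found before fetching and parsing the object:

    #include <stddef.h>
    #include <stdint.h>
    #include <openssl/sha.h>    /* any SHA-1 implementation would do; OpenSSL's is assumed here */

    /* Hypothetical object classes a filesystem key could distinguish. */
    enum fs_obj_type {
            FS_OBJ_EXTENT,  /* raw file data extent */
            FS_OBJ_INODE,   /* inode/metadata block */
            FS_OBJ_DIR,     /* directory block */
    };

    /*
     * Unlike a bare git object id, this key carries the object type and the
     * logical offset of the extent, so a reader can tell what it is looking
     * at before the object body is fetched from a remote node.
     */
    struct fs_key {
            uint8_t  sha1[SHA_DIGEST_LENGTH];   /* content hash of the extent */
            uint32_t type;                      /* enum fs_obj_type */
            uint64_t offset;                    /* logical offset in the file */
    };

    static void fs_key_fill(struct fs_key *key, enum fs_obj_type type,
                            uint64_t offset, const void *data, size_t len)
    {
            SHA1(data, len, key->sha1);  /* hash only the payload; type/offset stay outside */
            key->type = type;
            key->offset = offset;
    }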
Re: Distributed storage. Security attributes and documentation update.
Hi! I'm pleased to announce third release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. How is this different from raid0/1 over nbd? Or raid0/1 over ata-over-ethernet? +| DST storate ---| storage? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Security attributes and documentation update.
Hi Pavel. On Mon, Sep 17, 2007 at 06:22:30PM +, Pavel Machek ([EMAIL PROTECTED]) wrote: I'm pleased to announce third release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. How is this different from raid0/1 over nbd? Or raid0/1 over ata-over-ethernet? I will repeat a quote I made for the previous release: It has a number of advantages, outlined in the first release and on the project homepage, namely: * non-blocking processing without busy loops (compared to iSCSI and NBD) * small, pluggable architecture * failover recovery (reconnect to remote target) * autoconfiguration * no additional allocations (not including the network part) - there are at least two in device mapper on the fast path * very simple - try to compare with iSCSI * works with different network protocols * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper, is synchronous and wants a special userspace thread) DST allows removing any node and then turning it back into the storage without breaking the dataflow; the DST core will reconnect automatically to failed remote nodes, and it allows working with detached devices just like with usual filesystems (as long as the device was not formed as part of a linear storage, since in that case meta information is spread between nodes). It does not require special processes for the network connection, everything is performed automatically by the DST core workers, and it allows exporting a new device, created on top of a mirror or linear combination of the others, which in turn can be formed on top of another, and so on... This was designed to allow creating a distributed storage with completely transparent failover recovery, with the ability to detach remote nodes from a mirror array to become standalone realtime backups (or snapshots) and turn them back into the storage without stopping the main device node. +| DST storate ---| storage? Yep, thanks. -- Evgeniy Polyakov
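To make the failover point concrete, here is a minimal userspace-style sketch of the kind of automatic reconnection described above (DST does this inside its kernel worker threads; the function name, backoff policy and use of plain sockets here are illustrative assumptions, not DST code):

    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Keep retrying a failed remote node with exponential backoff, capped at 30s. */
    static int node_reconnect(const char *addr, unsigned short port)
    {
            unsigned int delay = 1;

            for (;;) {
                    struct sockaddr_in sa = {
                            .sin_family = AF_INET,
                            .sin_port = htons(port),
                    };
                    int fd = socket(AF_INET, SOCK_STREAM, 0);

                    if (fd >= 0 && inet_pton(AF_INET, addr, &sa.sin_addr) == 1 &&
                        connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0)
                            return fd;      /* node is back: resume sending requests to it */

                    if (fd >= 0)
                            close(fd);
                    fprintf(stderr, "node %s:%u down, retrying in %us\n",
                            addr, (unsigned)port, delay);
                    sleep(delay);
                    if (delay < 30)
                            delay *= 2;     /* back off so a dead node is not hammered */
            }
    }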
Re: Distributed storage. Move away from char device ioctls.
On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote: On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote: Yes, block device itself is not able to scale well, but it is the place for redundancy, since filesystem will just fail if underlying device does not work correctly and FS actually does not know about where it should place redundancy bits - it might happen to be the same broken disk, so I created a low-level device which distribute requests itself. I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre-level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers). I really think that to get proper non-block-device-level filesystem redundancy you need to base it on something similar to the GIT model. Data replication is done in specific-sized chunks indexed by SHA-1 sum and you actually have a sort of merge algorithm for when local and remote changes differ. The OS would only implement a very limited list of merge algorithms, IE one of: (A) Don't merge, each client gets its own branch and merges are manual (B) Most recent changed version is made the master every X-seconds/open/close/write/other-event. (C) The tree at X (usually a particular client/server) is always used as the master when there are conflicts. This lets you implement whatever replication policy you want: You can require that some files are replicated (cached) on *EVERY* system, you can require that other files are cached on at least X systems. You can say this needs to be replicated on at least X% of the online systems, or at most Y. Moreover, the replication could be done pretty easily from userspace via a couple syscalls. You also automatically keep track of history with some default purge policy. The main point is that for efficiency and speed things are *not* always replicated; this also allows for offline operation. You would of course have userspace merge drivers which notice that the tree on your laptop is not a subset/superset of the tree on your desktop and do various merges based on per-file metadata. My address-book, for example, would have a custom little merge program which knows about how to merge changes between two address book files, asking me useful questions along the way. Since a lot of this merging is mechanical, some of the code from GIT could easily be made into a merge library which knows how to do such things. Moreover, this would allow me to have a shared root filesystem on my laptop and desktop. It would have 'sub-project'-type trees, so that / would be an independent branch on each system. /etc would be separate branches but manually merged git-style as I make changes. /home/* folders would be auto-created as separate subtrees so each user can version their own individually. Specific subfolders (like address-book, email, etc) would be adjusted by the GUI programs that manage them to be separate subtrees with manual-merging controlled by that GUI program. Backups/dumps/archival of such a system would be easy. You would just need to clone the significant commits/trees/etc to a DVD and replace the old SHA-1-indexed objects to tiny object-deleted stubs; to rollback to an archived version you insert the DVD, mount it into the existing kernel SHA-1 index, and then mount the appropriate commit as a read-only volume somewhere to access. 
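As a way of restating options (A)-(C) and the replication constraints above as data, here is a small hypothetical sketch (none of these names come from an existing filesystem; a per-subtree policy object of roughly this shape is what the userspace tools would manipulate):

    #include <stdint.h>

    /* The three conflict-resolution modes described above. */
    enum merge_policy {
            MERGE_MANUAL,           /* (A) per-client branches, merged by hand          */
            MERGE_NEWEST_WINS,      /* (B) most recently changed version becomes master */
            MERGE_MASTER_NODE,      /* (C) a designated node always wins on conflict    */
    };

    /* Per-subtree replication policy: "cached on at least X systems", etc. */
    struct subtree_policy {
            enum merge_policy merge;
            uint32_t min_replicas;          /* replicate on at least this many nodes      */
            uint32_t max_replicas;          /* and on at most this many (0 = unlimited)   */
            uint8_t  min_online_percent;    /* or: on at least this % of the online nodes */
            uint32_t master_node;           /* only meaningful for MERGE_MASTER_NODE      */
            uint32_t merge_interval_sec;    /* how often MERGE_NEWEST_WINS reconciles     */
    };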
The same procedure would also work for wide-area-network backups and such. The effective result would be the ability to do things like the following: (A) Have my homedir synced between both systems mostly-automatically as I make changes to different files on both systems (B) Easily have 2 copies of all my files, so if one system's disk goes kaput I can just re-clone from the other. (C) Keep archived copies of the last 5 years' worth of work, including change history, on a stack of DVDs. (D) Synchronize work between locations over a relatively slow link without much work. As long as files were indirectly indexed by sub-block SHA1 (with the index depth based on the size of the file), and each individually-SHA1-ed object could have references, you could trivially have a 4TB-sized file where you modify 4 bytes at a thousand random locations throughout the file and only have to update about 5MB worth of on-disk data. The actual overhead for that kind of operation under any existing filesystem would be 100% seek-dominated regardless, whereas with this mechanism you would not directly be overwriting data and so you could append all the updates as a single 5MB chunk. Data reads would be much more seek-y, but you could trivially have an on-line defragmenter tool which notices fragmented
Re: Distributed storage. Move away from char device ioctls.
On Sat, Sep 15, 2007 at 11:24:46AM -0600, Andreas Dilger ([EMAIL PROTECTED]) wrote: When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed the project (although the direction of btrfs development seems to move towards the zfs side with some, imho, questionable points, so I think I can jump on the wagon of new filesystems right now). This is an area I'm always a bit sad about in OSS development - the need everyone has to make a new {fs, editor, gui, etc} themselves instead of spending more time improving the work we already have. Imagine where the internet would be (or not) if there were 50 different network protocols instead of TCP/IP? If you don't like some things about btrfs, maybe you can fix them? If that were true, we would still be in the stone age. Or not - actually I think the first cell in the universe would not have bothered dividing into two if it could instead spend infinite time trying to make itself better. When some idea is implemented it is virtually impossible to change it, you can only recreate a new one with the issues fixed. So we have multiple ext, reiser and many others. I do not say btrfs is broken or has design problems, it is a really interesting filesystem, but we all have our own opinions about how things should be done, that's it. Btw, we do have so many network protocols for different purposes that the number of (storage) filesystems is negligibly small compared to it. The internet as it is popular today is just a subset of where networking is used. And we do invent new protocols each time we need something new which does not fit into existing models (for example, TCP by design can not work with very long-distance links with too long an RTT). We have sctp to fix some tcp issues. The number of IP-layer 'neighbours' is even larger. The physical media layer has many different protocols too. And that is just what exists in the linux tree... To be honest, developing a new filesystem that is actually widely useful and used is a very time consuming task (see Reiserfs and Reiser4). It takes many years before the code is reliable enough for people to trust it, so most likely any effort you put into this would be wasted unless you can come up with something that is dramatically better than something existing. Yep, I know. Wasting my time is one of the most pleasant things I ever tried in my life. The part that bothers me is that this same effort could have been used to improve something that more people would use (btrfs in this case). Of course, sometimes the new code is substantially better than what currently exists, and I think btrfs may have laid claim to the current generation of filesystems. Call me a greedy bastard, but I do not care about world happiness, it is just impossible to achieve. So I like what I do right now. If it ends up resting under a layer of dust I do not care, I like the process of creating, so if it fails, I will just gain new knowledge. :) Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -- Evgeniy Polyakov
Re: Distributed storage. Move away from char device ioctls.
Hi Jeff. On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik ([EMAIL PROTECTED]) wrote: Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far plans so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... :) I question the value of distributed block services (DBS), whether it's your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. Yes, the block device itself is not able to scale well, but it is the place for redundancy, since the filesystem will just fail if the underlying device does not work correctly, and the FS actually does not know where it should place its redundancy bits - they might happen to land on the same broken disk - so I created a low-level device which distributes requests itself. It is not allowed to mount it via multiple points; that is where a distributed filesystem must enter the show - multiple remote nodes export their devices via the network, each client gets the address of the remote node to work with, connects to it and processes requests. All those bits are already in DST; the next logical step is to connect it with a higher-layer filesystem. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. Well, originally (about half a year ago) I started to draft a generic filesystem which would be just superior to existing designs, not overbloated like zfs, and just faster. I do believe it can be implemented. Further, I added network capabilities into the design drafts, since I did not like what I saw at the time (AFS was proposed) - I'm not saying it is bad or anything like that at all, but I would implement things differently. When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed the project (although the direction of btrfs development seems to move towards the zfs side with some, imho, questionable points, so I think I can jump on the wagon of new filesystems right now). 
DST is the low level for my (so far theoretical) filesystem (actually its network part), like kevent was a low-level system for network AIO (originally). No matter which filesystem works over the network, it implements some kind of logic already present in DST. Sometimes it is very simple, sometimes a bit more complex, but eventually it is a network entity with parts of the stuff I put into DST. Since I postponed the project (looking at btrfs and its results), I completed DST as a standalone block device. So, essentially, a filesystem with simple distributed facilities is on (my) radar, but so far you are the first who requested it :) -- Evgeniy Polyakov
Re: Distributed storage. Move away from char device ioctls.
Hi Mike. On Fri, Sep 14, 2007 at 10:54:56PM -0400, Mike Snitzer ([EMAIL PROTECTED]) wrote: This distributed storage is very much needed; even if it were to act as a more capable/performant replacement for NBD (or MD+NBD) in the near term. Many high availability applications don't _need_ all the additional complexity of a full distributed filesystem. So given that, its discouraging to see you trying to gently push Evgeniy away from all the promising work he has published. Evgeniy, please continue your current work. Thanks Mike, I work on this and will until I feel it is completed. A distributed filesystem is a logical continuation of the whole idea of storing data on several remote nodes - DST and FS must exist together for maximum performance, but that does not mean the block layer itself should be abandoned. As you probably noticed from my mail to Jeff, distributed storage was originally part of the overall filesystem design. -- Evgeniy Polyakov
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. it's called Lustre. works well, scales well, is widely used, is GPL. sadly it's not in mainline. cheers, robin - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
Robin Humble wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. it's called Lustre. works well, scales well, is widely used, is GPL. sadly it's not in mainline. Lustre is tilted far too much towards high-priced storage, and needs improvement before it could be considered for mainline. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote: Robin Humble wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. it's called Lustre. works well, scales well, is widely used, is GPL. sadly it's not in mainline. Lustre is tilted far too much towards high-priced storage, many (most?) Lustre deployments are with SATA and md raid5 and GigE - can't get much cheaper than that. if you want storage node failover capabilities (which larger sites often do) or want to saturate an IB link then the price of the storage goes up but this is a consequence of wanting more reliability or performance, not anything to do with lustre. interestingly, one of the ways to provide dual-attached storage behind a failover pair of lustre servers (apart from buying SAS) would be via a networked-raid-1 device like Evgeniy's, so I don't see distributed block devices and distributed filesystems as being mutually exclusive. iSER (almost in http://stgt.berlios.de/) is also intriguing. and needs improvement before it could be considered for mainline. quite likely. from what I understand (hopefully I am mistaken) they consider a merge task to be too daunting as the number of kernel subsystems that any scalable distributed filesystem touches is necessarily large. roadmaps indicate that parts of lustre are likely to move to userspace (partly to ease solaris and ZFS ports) so perhaps those performance critical parts that remain kernel space will be easier to merge. cheers, robin - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote: Yes, block device itself is not able to scale well, but it is the place for redundancy, since filesystem will just fail if underlying device does not work correctly and FS actually does not know about where it should place redundancy bits - it might happen to be the same broken disk, so I created a low-level device which distribute requests itself. I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre-level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers). When Chris Mason announced btrfs, I found that quite a few new ideas are already implemented there, so I postponed project (although direction of the developement of the btrfs seems to move to the zfs side with some questionable imho points, so I think I can jump to the wagon of new filesystems right now). This is an area I'm always a bit sad about in OSS development - the need everyone has to make a new {fs, editor, gui, etc} themselves instead of spending more time improving the work we already have. Imagine where the internet would be (or not) if there were 50 different network protocols instead of TCP/IP? If you don't like some things about btrfs, maybe you can fix them? To be honest, developing a new filesystem that is actually widely useful and used is a very time consuming task (see Reiserfs and Reiser4). It takes many years before the code is reliable enough for people to trust it, so most likely any effort you put into this would be wasted unless you can come up with something that is dramatically better than something existing. The part that bothers me is that this same effort could have been used to improve something that more people would use (btrfs in this case). Of course, sometimes the new code is substantially better than what currently exists, and I think btrfs may have laid claim to the current generation of filesystems. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sep 15, 2007 12:20 -0400, Robin Humble wrote: On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote: Lustre is tilted far too much towards high-priced storage, many (most?) Lustre deployments are with SATA and md raid5 and GigE - can't get much cheaper than that. I have to agree - while Lustre CAN scale up to huge servers and fat pipes, it can definitely also scale down (which is a LOT easier to do :-). I can run a client + MDS + 5 OSTs in a single UML instance using loop devices for testing w/o problems. interestingly, one of the ways to provide dual-attached storage behind a failover pair of lustre servers (apart from buying SAS) would be via a networked-raid-1 device like Evgeniy's, so I don't see distributed block devices and distributed filesystems as being mutually exclusive. That is definitely true, and there are a number of users who run in this mode. We're also working to make Lustre handle the replication internally (RAID5/6+ at the OST level) so you wouldn't need any kind of block-level redundancy at all. I suspect some sites may still use RAID5/6 back-ends anyways to avoid the performance loss from taking out a whole OST due to a single disk failure, but that would definitely not be required. and needs improvement before it could be considered for mainline. It's definitely true, and we are always working at improving it. It used to be in the past that one of the reasons we DIDN'T want to go into mainline was because this would restrict our ability to make network protocol changes. Because our install base is large enough, and many of the large sites have multiple supercomputers mounting multiple global filesystems, we aren't at liberty to change the network protocol at will anymore. That said, we also have network protocol versioning that is akin to the ext3 COMPAT/INCOMPAT feature flags, so we are able to add/change features without breaking old clients. from what I understand (hopefully I am mistaken) they consider a merge task to be too daunting as the number of kernel subsystems that any scalable distributed filesystem touches is necessarily large. That's partly true - Lustre has its own RDMA RPC mechanism, but it does not need kernel patches anymore (we removed the zero-copy callback and do this at the protocol level because there was too much resistance to it). We are now also able to run a client filesystem that doesn't require any kernel patches, since we've given up on trying to get the intents and raw operations into the VFS, and have worked out other ways to improve the performance to compensate. Likewise with parallel directory operations. It's a bit sad, in a way, because these are features that other filesystems (especially network fs) could also have benefited from. roadmaps indicate that parts of lustre are likely to move to userspace (partly to ease solaris and ZFS ports) so perhaps those performance critical parts that remain kernel space will be easier to merge. This is also true - when that is done the only parts that will remain in the kernel are the network drivers. With some network stacks there is even direct userspace acceleration. We'll use RDMA and direct IO to avoid doing any user-kernel data copies. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
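For readers unfamiliar with the ext2/3-style COMPAT/INCOMPAT scheme Andreas refers to, the idea is roughly the following (a generic sketch, not Lustre's actual wire format or flag values): an implementation may ignore unknown COMPAT features, must refuse to proceed on unknown INCOMPAT features, and may fall back to read-only on unknown RO_COMPAT features.

    #include <stdint.h>
    #include <stdbool.h>

    /* Feature masks exchanged in a superblock or protocol handshake. */
    struct feature_set {
            uint32_t compat;        /* unknown bits are safe to ignore           */
            uint32_t ro_compat;     /* unknown bits: allow read-only access only */
            uint32_t incompat;      /* unknown bits: refuse to mount/connect     */
    };

    /* Bits this implementation understands (values are purely illustrative). */
    #define SUPP_COMPAT     0x0007u
    #define SUPP_RO_COMPAT  0x0003u
    #define SUPP_INCOMPAT   0x0001u

    static bool features_allow_use(const struct feature_set *fs, bool *read_only)
    {
            if (fs->incompat & ~SUPP_INCOMPAT)
                    return false;   /* the peer relies on features we do not have */

            *read_only = (fs->ro_compat & ~SUPP_RO_COMPAT) != 0;
            return true;            /* unknown COMPAT bits are harmless by definition */
    }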
Re: Distributed storage. Move away from char device ioctls.
Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far plans so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether it's your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. Jeff
Re: Distributed storage. Move away from char device ioctls.
Jeff Garzik wrote: Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far planes so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. This http://lkml.org/lkml/2007/8/12/159 may provide a fast-path to reaching that goal. Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic Cutting those out may still leave you with something pretty complicated, though. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic Cutting those out may still leave you with something pretty complicated, though. Far less complicated than NFSv4.1 though (which is easy :)) NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk another NFSv4.1 implementation. Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13 anyway), there is no guarantee at all that Linux will be able to store and retrieve data, since it's entirely possible that a proprietary protocol is required to access your data. NFSv4.1 is no longer a completely open architecture. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic Cutting those out may still leave you with something pretty complicated, though. Far less complicated than NFSv4.1 though (which is easy :)) One would hope so. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk another NFSv4.1 implementation. No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On 9/14/07, Jeff Garzik [EMAIL PROTECTED] wrote: Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far planes so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). This distributed storage is very much needed; even if it were to act as a more capable/performant replacement for NBD (or MD+NBD) in the near term. Many high availability applications don't _need_ all the additional complexity of a full distributed filesystem. So given that, its discouraging to see you trying to gently push Evgeniy away from all the promising work he has published. Evgeniy, please continue your current work. Mike - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk another NFSv4.1 implementation. No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? I just sorta assumed you could fall back to the NFSv4.0 mode of operation, going through the metadata server for all data accesses. But look at that choice in practice: you can either ditch pNFS completely, or use a proprietary solution. The market incentives are CLEARLY tilted in favor of makers of proprietary solutions. But it's a poor choice (really little choice at all). Overall, my main concern is that NFSv4.1 is no longer an open architecture solution. The no-pNFS or proprietary platform choice merely illustrate one of many negative aspects of this architecture. One of NFS's biggest value propositions is its interoperability. To quote some Wall Street guys, NFS is like crack. It Just Works. We love it. Now, for the first time in NFS's history (AFAIK), the protocol is no longer completely specified, completely known. No longer a closed loop. Private layout types mean that it is _highly_ unlikely that any OS or appliance or implementation will be able to claim full NFS compatibility. And when the proprietary portion of the spec involves something as basic as accessing one's own data, I consider that a fundamental flaw. NFS is no longer completely open. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Sat, Sep 15, 2007 at 12:08:42AM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? I just sorta assumed you could fall back to the NFSv4.0 mode of operation, going through the metadata server for all data accesses. Right. So any two pNFS implementations *will* be able to talk to each other; they just may not be able to use the (possibly higher-bandwidth) read/write path that pNFS gives them. But look at that choice in practice: you can either ditch pNFS completely, or use a proprietary solution. The market incentives are CLEARLY tilted in favor of makers of proprietary solutions. I doubt somebody would go to all the trouble to implement pNFS and then present their customers with that kind of choice. But maybe I'm missing something. What market incentives do you see that would make that more attractive than either 1) using a standard fully-specified layout type, or 2) just implementing your own proprietary protocol instead of pNFS? Overall, my main concern is that NFSv4.1 is no longer an open architecture solution. The no-pNFS or proprietary platform choice merely illustrate one of many negative aspects of this architecture. It's always been possible to extend NFS in various ways if you want. You could use sideband protocols with v2 and v3, for example. People have done that. Some of them have been standardized and widely implemented, some haven't. You could probably add your own compound ops to v4 if you wanted, I guess. And there's advantages to experimenting with extensions first and then standardizing when you figure out what works. I wish it happened that way more often. Now, for the first time in NFS's history (AFAIK), the protocol is no longer completely specified, completely known. No longer a closed loop. Private layout types mean that it is _highly_ unlikely that any OS or appliance or implementation will be able to claim full NFS compatibility. Do you know of any such private layout types? This is kind of a boring argument, isn't it? I'd rather hear whatever ideas you have for a new distributed filesystem protocol. --b. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Security attributes and documentation update.
Hi Paul. On Mon, Sep 10, 2007 at 03:14:45PM -0700, Paul E. McKenney ([EMAIL PROTECTED]) wrote: Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * implement netlink based setup (simple) * new redundancy algorithm (complex) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst A couple questions below, but otherwise looks good from an RCU viewpoint. Thanx, Paul Thanks for your comments, and sorry for the late reply - I was on the KS/London trip.
+	if (--num) {
+		list_for_each_entry_rcu(n, &node->shared, shared) {
This function is called under rcu_read_lock() or similar, right? (Can't tell from this patch.) It is also OK to call it from under the update-side mutex, of course. Actually not, but it does not require it, since an entry can not be removed during this operation while the appropriate reference counter for the given node is being held. It should not be RCU at all.
+static int dst_mirror_read(struct dst_request *req)
+{
+	struct dst_node *node = req->node, *n, *min_dist_node;
+	struct dst_mirror_priv *priv = node->priv;
+	u64 dist, d;
+	int err;
+
+	req->bio_endio = dst_mirror_read_endio;
+
+	do {
+		err = -ENODEV;
+		min_dist_node = NULL;
+		dist = -1ULL;
+
+		/*
+		 * Reading is never performed from the node under resync.
+		 * If this will cause any troubles (like all nodes must be
+		 * resynced between each other), this check can be removed
+		 * and per-chunk dirty bit can be tested instead.
+		 */
+
+		if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) {
+			priv = node->priv;
+			if (req->start > priv->last_start)
+				dist = req->start - priv->last_start;
+			else
+				dist = priv->last_start - req->start;
+			min_dist_node = req->node;
+		}
+
+		list_for_each_entry_rcu(n, &node->shared, shared) {
I see one call to this function that appears to be under the update-side mutex, but I cannot tell if the other calls are safe. (Safe as in either under the update-side mutex or under rcu_read_lock() and friends.) The same here - those processing functions are called from generic_make_request() without any lock on top of them. Each node is linked into the list of the first added node, whose reference counter is increased in a higher layer. Right now there is no way to add or remove nodes after the array was started; such functionality requires the storage tree lock to be taken, and RCU can not be used there (since it requires sleeping, and I did not investigate sleepable RCU for this purpose). So, essentially, RCU is not used in DST :) Thanks for the review, Paul. -- Evgeniy Polyakov
Re: Distributed storage. Security attributes and documentation update.
On Thu, Sep 13, 2007 at 04:22:59PM +0400, Evgeniy Polyakov wrote: Hi Paul. On Mon, Sep 10, 2007 at 03:14:45PM -0700, Paul E. McKenney ([EMAIL PROTECTED]) wrote: Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * implement netlink based setup (simple) * new redundancy algorithm (complex) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst A couple questions below, but otherwise looks good from an RCU viewpoint. Thanx, Paul Thanks for your comments, and sorry for the late reply - I was on the KS/London trip.
+	if (--num) {
+		list_for_each_entry_rcu(n, &node->shared, shared) {
This function is called under rcu_read_lock() or similar, right? (Can't tell from this patch.) It is also OK to call it from under the update-side mutex, of course. Actually not, but it does not require it, since an entry can not be removed during this operation while the appropriate reference counter for the given node is being held. It should not be RCU at all. Ah! Yes, it is OK to use _rcu in this case, but should be avoided unless doing so eliminates duplicate code or some such. So, agree with dropping _rcu in this case.
+static int dst_mirror_read(struct dst_request *req)
+{
+	struct dst_node *node = req->node, *n, *min_dist_node;
+	struct dst_mirror_priv *priv = node->priv;
+	u64 dist, d;
+	int err;
+
+	req->bio_endio = dst_mirror_read_endio;
+
+	do {
+		err = -ENODEV;
+		min_dist_node = NULL;
+		dist = -1ULL;
+
+		/*
+		 * Reading is never performed from the node under resync.
+		 * If this will cause any troubles (like all nodes must be
+		 * resynced between each other), this check can be removed
+		 * and per-chunk dirty bit can be tested instead.
+		 */
+
+		if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) {
+			priv = node->priv;
+			if (req->start > priv->last_start)
+				dist = req->start - priv->last_start;
+			else
+				dist = priv->last_start - req->start;
+			min_dist_node = req->node;
+		}
+
+		list_for_each_entry_rcu(n, &node->shared, shared) {
I see one call to this function that appears to be under the update-side mutex, but I cannot tell if the other calls are safe. (Safe as in either under the update-side mutex or under rcu_read_lock() and friends.) The same here - those processing functions are called from generic_make_request() without any lock on top of them. Each node is linked into the list of the first added node, whose reference counter is increased in a higher layer. Right now there is no way to add or remove nodes after the array was started; such functionality requires the storage tree lock to be taken, and RCU can not be used there (since it requires sleeping, and I did not investigate sleepable RCU for this purpose). So, essentially, RCU is not used in DST :) Works for me! Use the right tool for the job! Thanks for the review, Paul. Thanx, Paul
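To spell out the distinction being agreed on above: list_for_each_entry_rcu() plus rcu_read_lock() is for readers that may race with concurrent removal, while a plain list walk is enough when a held reference guarantees the list cannot change under the walker, which is the DST situation. A schematic kernel-style sketch (the node structure and field names below are made up, not DST's actual types):

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/types.h>

    struct mirror_node {                    /* hypothetical node, loosely modelled on DST's */
            struct list_head shared;
            u64 last_start;
    };

    /* Readers that can race with concurrent removal: RCU protects the walk. */
    static u64 sum_starts_rcu(struct list_head *head)
    {
            struct mirror_node *n;
            u64 sum = 0;

            rcu_read_lock();
            list_for_each_entry_rcu(n, head, shared)
                    sum += n->last_start;
            rcu_read_unlock();
            return sum;
    }

    /*
     * Readers that hold a reference pinning the whole list (as DST does):
     * nothing can be removed meanwhile, so the plain primitive is the right tool.
     */
    static u64 sum_starts_refcounted(struct list_head *head)
    {
            struct mirror_node *n;
            u64 sum = 0;

            list_for_each_entry(n, head, shared)
                    sum += n->last_start;
            return sum;
    }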
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote: On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote: Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. I'm toying with the idea of keeping track of the maximum device stack depth for each stacked device, and only permitting it to increase in controlled circumstances. Hi Alasdair, What kind of circumstances did you have in mind? Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Daniel. On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote: Then, if of course you will want, which I doubt, you can reread previous mails and find that it was pointed to that race and possibilities to solve it way too long ago. What still bothers me about your response is that, while you know the race exists and do not disagree with my example, you don't seem to see that that race can eventually lock up the block device by repeatedly losing throttle counts which are never recovered. What prevents that? I posted a trivial hack with the possible errors pointed out and a question about whether it should be further extended (with the race fixed by any of the possible methods, and so on) or a new one should be developed (like in your approach, where only the high-level device is charged); instead I got replies that it contains bugs which will stop the system and kill the gene pool of mankind. I know how it works and where the problems are. And if we are going with this approach I will fix the pointed-out issues.
--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
In case you have a stacked device, this entry will be rewritten and you will lose all your accounting data. It is a weakness all right. Well,
-	if (q && q->metric) {
+	if (q && q->metric && !bio->queue) {
which fixes that problem. Maybe there is a better fix possible. Thanks for the catch! Yes, it should. The original conception was that this block throttling would apply only to the highest level submission of the bio, the one that crosses the boundary between filesystem (or direct block device application) and block layer. Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. We still have not come to a conclusion, but I do not want to start a flamewar; you believe that throttling must be done on the top-level device, so you need to extend the bio and convince others that the idea is worth it. The other principle we are trying to satisfy is that the throttling should not be released until bio->endio, which I am not completely sure about with the patch as modified above. Your earlier idea of having the throttle protection only cover the actual bio submission is interesting and may be effective in some cases, in fact it may cover the specific case of ddsnap. But we don't have to look any further than ddraid (distributed raid) to find a case it doesn't cover - the additional memory allocated to hold parity data has to be reserved until parity data is deallocated, long after the submission completes. So while you manage to avoid some logistical difficulties, it also looks like you didn't solve the general problem. The block layer does not know and should not be bothered with the underlying device's nature - if you think that the limit should not be recharged in the endio callback, then provide your own layer on top of the bio and call the endio callback only when you think it is ready to be completed. Hopefully I will be able to report on whether my patch actually works soon, when I get back from vacation. 
The mechanism in ddsnap that this is supposed to replace is effective; it is just ugly and tricky to verify. Regards, Daniel -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote: Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. I'm toying with the idea of keeping track of the maximum device stack depth for each stacked device, and only permitting it to increase in controlled circumstances. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote: Then, if of course you will want, which I doubt, you can reread previous mails and find that it was pointed to that race and possibilities to solve it way too long ago. What still bothers me about your response is that, while you know the race exists and do not disagree with my example, you don't seem to see that that race can eventually lock up the block device by repeatedly losing throttle counts which are never recovered. What prevents that?

--- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;

In case you have a stacked device, this entry will be rewritten and you will lose all your accounting data. It is a weakness all right. Well,

-	if (q && q->metric) {
+	if (q && q->metric && !bio->queue) {

which fixes that problem. Maybe there is a better fix possible. Thanks for the catch! The original conception was that this block throttling would apply only to the highest level submission of the bio, the one that crosses the boundary between filesystem (or direct block device application) and block layer. Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. The other principle we are trying to satisfy is that the throttling should not be released until bio->endio, which I am not completely sure about with the patch as modified above. Your earlier idea of having the throttle protection only cover the actual bio submission is interesting and may be effective in some cases, in fact it may cover the specific case of ddsnap. But we don't have to look any further than ddraid (distributed raid) to find a case it doesn't cover - the additional memory allocated to hold parity data has to be reserved until the parity data is deallocated, long after the submission completes. So while you manage to avoid some logistical difficulties, it also looks like you didn't solve the general problem. Hopefully I will be able to report on whether my patch actually works soon, when I get back from vacation. The mechanism in ddsnap that this is supposed to replace is effective; it is just ugly and tricky to verify. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 02:08:04PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote: On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! First, because the number of increase and decrease operations is the same, it will dance around the limit in both directions. No. Please go and read the description of the race again. A count gets irretrievably lost because the write operation of the first decrement is overwritten by the second. Data gets lost. Atomic operations exist to prevent that sort of thing. You either need to use them or have a deep understanding of SMP read and write ordering in order to preserve data integrity by some equivalent algorithm. I think you should complete your emotional email with a description of how atomic types are operated on and how processors access data. Just to give a lesson to those who never knew how SMP works, but create patches and have the conscience to send them and even discuss them. Then, if of course you will want, which I doubt, you can reread previous mails and find that it was pointed to that race and possibilities to solve it way too long ago. Anyway, I prefer to look like I do not know how SMP and atomic operations work and thus stay away from this discussion.

--- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;

In case you have a stacked device, this entry will be rewritten and you will lose all your accounting data.

+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}

-- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Say Evgeniy, something I was curious about but forgot to ask you earlier... On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: ...All operations are not atomic, since we do not care about the precise number of bios, but the fact that we are close or close enough to the limit. ... in bio->endio + q->bio_queued--; In your proposed patch, what prevents the race:

cpu1                          cpu2

read q->bio_queued
                              q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count. We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. Another possibility is to create a queue/device pointer in the bio structure to hold the original device and then add a callback to its backing dev structure to recalculate the limit, but that increases the size of the bio. Do we need this? Regards, Daniel -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote: On 7/31/07, Evgeniy Polyakov [EMAIL PROTECTED] wrote: TODO list currently includes the following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon based code will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) LDPC codes[1][2] have been replacing Turbo codes[3] with regards to communication links and we have been seeing that transition. (maybe helpful, came to mind seeing the mention of Turbo codes) Don't know how Weaver compares to LDPC, though I found some comparisons [4][5]. But looking at the fault tolerance figures, I guess Weaver is much better.

[1] http://www.ldpc-codes.com/
[2] http://portal.acm.org/citation.cfm?id=1240497
[3] http://en.wikipedia.org/wiki/Turbo_code
[4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf
[5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf

I've studied and implemented an LDPC encoder/decoder (hard-decoding belief propagation algo only, though) in userspace and found that such probabilistic codes are generally not suitable for redundant or distributed data storage, because of their per-bit nature and probabilistic error recovery. Interested readers can find a presentation on iterative decoding similar to Dr. Plank's, some of my analysis of the codes, and all sources at the project homepage and in my blog: http://tservice.net.ru/~s0mbre/old/?section=projects&item=ldpc So I consider Weaver codes the superior choice for distributed storage. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! First, because the number of increase and decrease operations is the same, it will dance around the limit in both directions. Second, I wrote about this race, and there are a number of ways to deal with it, from atomic operations to separate counters for in-flight and completed bios (which can be racy too, but from a different angle). Third, if people can not agree even on the much higher-level detail of whether the bio structure should be increased or not, how can we discuss details of a preliminary implementation with known issues? So I can not agree that the issue is fatal, but of course it exists, and it was highlighted. Let's solve problems in the order of their appearance. If the bio structure is allowed to grow, then the whole patch can be done better; if not, then there are issues with performance (although the more I think about it, the more I become sure that since a bio itself is very rarely shared, and thus requires cloning and allocation/freeing, which is itself a much more costly operation than atomic_sub/dec, it can safely host an additional atomic operation). -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
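The second alternative Evgeniy mentions - separate counters for submitted and completed bios - can be sketched roughly as below. This is only an illustration with made-up names, not code from any posted patch; as he notes it is racy too, only from a different angle, and the submission side would still need the queue lock or an atomic of its own if several submitters run concurrently.

/* Illustrative sketch: separate submission/completion counters instead of
 * one shared read-modify-write counter.  Completion only ever touches its
 * own counter, so it can no longer overwrite a submitter's store. */
struct bio_throttle {
	unsigned long	submitted;	/* advanced by submitters */
	unsigned long	completed;	/* advanced by the endio path */
	unsigned long	limit;
};

/* May be off by a few bios when read concurrently, which is the same kind
 * of imprecision Evgeniy already accepts for the non-atomic counter. */
static inline int bio_throttle_room(struct bio_throttle *t)
{
	return t->submitted - t->completed < t->limit;
}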
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote: On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Say Evgeniy, something I was curious about but forgot to ask you earlier... On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: ...All operations are not atomic, since we do not care about the precise number of bios, but the fact that we are close or close enough to the limit. ... in bio->endio + q->bio_queued--; In your proposed patch, what prevents the race:

cpu1                          cpu2

read q->bio_queued
                              q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count. We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! Another possibility is to create a queue/device pointer in the bio structure to hold the original device and then add a callback to its backing dev structure to recalculate the limit, but it increases the size of the bio. Do we need this? Different issue. Yes, I think we need a nice simple approach like that, and prove it is stable before worrying about the size cost. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote: On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: We do not care about one cpu being able to increase its counter higher than the limit, such inaccuracy (the maximum number of bios in flight can thus be more than the limit, the difference being equal to the number of CPUs - 1) is the price for removing the atomic operation. I thought I pointed it out in the original description, but might have forgotten, that if it becomes an issue, atomic operations can be introduced there. Any uber-precise measurements in the case when we are close to the edge will not give us any benefit at all, since we are already in the grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! First, because the number of increase and decrease operations is the same, it will dance around the limit in both directions. No. Please go and read the description of the race again. A count gets irretrievably lost because the write operation of the first decrement is overwritten by the second. Data gets lost. Atomic operations exist to prevent that sort of thing. You either need to use them or have a deep understanding of SMP read and write ordering in order to preserve data integrity by some equivalent algorithm. Let's solve problems in the order of their appearance. If the bio structure is allowed to grow, then the whole patch can be done better. How about the patch below. This throttles any block driver by implementing a throttle metric method so that each block driver can keep track of its own resource consumption in units of its choosing. As an (important) example, it implements a simple metric for device mapper devices. Other block devices will work as before, because they do not define any metric. Short, sweet and untested, which is why I have not posted it until now. This patch originally kept its accounting info in backing_dev_info, however that structure seems to be in some flux and it is just a part of struct queue anyway, so I lifted the throttle accounting up into struct queue. We should be able to report on the efficacy of this patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/drivers/md/dm.c 2007-08-24 12:14:23.0 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }

+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *-----------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;

 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/fs/bio.c 2007-08-24 12:10:41.0 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}

-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);

 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h 2007-07-08 16:32:17.0 -0700
+++ 2.6.22/include/linux/bio.h 2007-08-24 11:53:51.0 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
+	struct request_queue	*queue;		/* for throttling */
+	unsigned		bi_reserved;
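The diff ends at the struct bio additions; the struct request_queue side (presumably an include/linux/blkdev.h hunk) is not among the quoted text. A minimal sketch of the fields the code above assumes, reconstructed from usage rather than copied from the original posting:

/* Sketch only: struct request_queue members that generic_make_request()
 * and bio_endio() above expect to exist.  Placement and exact layout in
 * the original patch are unknown. */
struct request_queue {
	/* ... existing fields ... */
	unsigned (*metric)(struct bio *bio);	/* bio cost in driver-chosen units; NULL = unthrottled */
	unsigned		capacity;	/* configured throttle limit */
	atomic_t		available;	/* units still available for new bios */
	wait_queue_head_t	throttle_wait;	/* submitters sleep here until units are released */
};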
Re: [1/1] Block device throttling [Re: Distributed storage.]
Say Evgeniy, something I was curious about but forgot to ask you earlier... On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: ...All operations are not atomic, since we do not care about the precise number of bios, but the fact that we are close or close enough to the limit. ... in bio->endio + q->bio_queued--; In your proposed patch, what prevents the race:

cpu1                          cpu2

read q->bio_queued
                              q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
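For readers following along, the race above is the classic lost update, and the minimal illustration below (standalone, field names made up, not from either poster's patch) shows both the racy counter and the atomic_t variant that closes the window:

/* Illustration only: the lost-update race Daniel describes. */
struct throttle_example {
	unsigned	bio_queued;		/* plain counter: racy */
	atomic_t	bio_queued_atomic;	/* atomic counter: safe */
};

/* Racy: two CPUs can both read the same value, each write value - 1,
 * and one decrement is silently lost, so the in-flight count stays
 * permanently too high and the device eventually stops admitting IO. */
static void endio_racy(struct throttle_example *t)
{
	t->bio_queued--;	/* separate read, modify, write */
}

/* Safe: the read-modify-write is one indivisible operation, so no
 * decrement can be overwritten by a concurrent one. */
static void endio_atomic(struct throttle_example *t)
{
	atomic_dec(&t->bio_queued_atomic);
}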
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. This is not a bug, this is a special kind of calculation - the total limit is the number of physical devices multiplied by their limits. It was done _on purpose_ to allow different devices to have different limits (for example in the distributed storage project it is possible to have both a remote and a local node in the same device, but the local device should not have _any_ limit at all, while the network one should). The virtual device essentially has _no_ limit. And that was done on purpose. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. This is not a bug, this is a special kind of calculation - the total limit is the number of physical devices multiplied by their limits. It was done _on purpose_ to allow different devices to have different limits (for example in the distributed storage project it is possible to have both a remote and a local node in the same device, but the local device should not have _any_ limit at all, while the network one should). The virtual device essentially has _no_ limit. And that was done on purpose. And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 04:13:10AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. This is not a bug, this is a special kind of calculation - the total limit is the number of physical devices multiplied by their limits. It was done _on purpose_ to allow different devices to have different limits (for example in the distributed storage project it is possible to have both a remote and a local node in the same device, but the local device should not have _any_ limit at all, while the network one should). The virtual device essentially has _no_ limit. And that was done on purpose. And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request(); if it queues the bio to a different device, then the same logic applies there. If the virtual device does not process the bio, its limit will always be recharged to the underlying devices, and the overall limit is equal to the number of physical devices (or devices which do process bios) multiplied by their limits. This does _work_, and I showed an example of how limits are processed and who will sleep and where. This solution is not a narrow fix; please check the examples I showed before. Regards, Daniel -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? generic_make_request() for the given block device. Example:

virt_device -> do_smth_with_bio -> bio_endio()
      |
     / \
 phys0   phys1

Each of the three devices above works with bios, each one eventually calls bio_endio(), and bio->bi_bdev will be one of the three devices above. Thus, when the system calls generic_make_request(bio->bi_bdev == virt_device), one of the three limits will be charged, depending on whether the virtual device forwards the bio to the physical devices or not. Actually the virtual device's limit will be charged first too, but if the bio is forwarded, its portion will be removed from the virtual device's limit. Now, if the virtual device allocates a bio itself (like device mapper does), then this new bio will be forwarded to the physical devices via generic_make_request() and thus it will sleep in the physical device's queue, if that queue is filled. So, if each of the three devices has a limit of 10 bios, then the actual number of bios in flight is at most 3 * 10, since each device will be charged up to _its own_ maximum limit, not the limit of the first device in the chain. So, if you set 10 for the virtual device and it can process bios itself (like sending them to the network), then that is the number of bios in flight which are processed by _this_ device and not forwarded further. The actual number of bios you can flush into the virtual device is its own limit plus the limits of all physical devices attached to it. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
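To make the charging scheme above concrete, here is a rough sketch - illustrative names only (dst_queue, dst_charge, ...), not code from DST or the posted patches - of a per-device in-flight limit in a stack: a bio is charged to whichever queue it is currently submitted to, so each device in the chain sleeps against its own limit.

struct dst_queue {
	atomic_t		inflight;	/* bios currently charged to this device */
	int			limit;		/* per-device limit, e.g. 10 */
	wait_queue_head_t	wait;
};

static void dst_charge(struct dst_queue *q)
{
	/* sleep until this particular device has room, then take one slot;
	 * like the patches in this thread, a small overshoot is tolerated
	 * when several submitters race past the check together */
	wait_event(q->wait, atomic_read(&q->inflight) < q->limit);
	atomic_inc(&q->inflight);
}

static void dst_release(struct dst_queue *q)
{
	atomic_dec(&q->inflight);
	wake_up(&q->wait);
}

/* A purely remapping virtual device calls dst_release() on its own queue
 * when it forwards the bio; the nested generic_make_request() then calls
 * dst_charge() on the physical device's queue.  With three devices of
 * limit 10 each, at most 3 * 10 bios are in flight - the arithmetic
 * Evgeniy uses above. */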
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote: On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? generic_make_request() for the given block device. Not good enough, that only makes one thread wait. Look here: http://lkml.org/lkml/2007/8/13/788 An unlimited number of threads can come in, each consuming resources of the virtual device, and violating the throttling rules. The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not see why that is a bad thing. It is a very bad thing, roughly like leaving some shared data outside a spin_lock/unlock. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 05:32:29AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote: On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request() What will make it wait? generic_make_request() for the given block device. Not good enough, that only makes one thread wait. Look here: http://lkml.org/lkml/2007/8/13/788 An unlimited number of threads can come in, each consuming resources of the virtual device, and violating the throttling rules. The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not Because it is charged to another device. No matter how many of them are chained, the limit is applied to the last device being used. So, if you have an unlimited number of threads, each one allocates a request and forwards it down to the low-level devices; each one will eventually sleep, but yes, each one _can_ allocate _one_ request before it goes to sleep. It is done to allow fine-grained limits, since some devices (like locally attached disks) do not require throttling. Here is an example with the threads you mentioned: http://article.gmane.org/gmane.linux.file-systems/17644 -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote: The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not Because it is charged to another device. Great. You charged the resource to another device, but you did not limit the amount of resources that the first device can consume. Which misses the whole point. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Mirroring to any number of devices.
On Aug 14 2007 20:29, Evgeniy Polyakov wrote: I'm pleased to announce second release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. I'll be quick: what is it good for, are there any users, and what could it have to do with DRBD and all the other distribution storage talk that has come up lately (namely NBD w/Raid1)? Jan -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Mirroring to any number of devices.
On Tue, Aug 14, 2007 at 07:20:49PM +0200, Jan Engelhardt ([EMAIL PROTECTED]) wrote: I'm pleased to announce second release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. I'll be quick: what is it good for, are there any users, and what could it have to do with DRBD and all the other distribution storage talk that has come up lately (namely NBD w/Raid1)? It has a number of advantages, outlined in the first release and on the project homepage, namely:

* non-blocking processing without busy loops (compared to iSCSI and NBD)
* small, pluggable architecture
* failover recovery (reconnect to remote target)
* autoconfiguration
* no additional allocations (not including the network part) - at least two in device mapper for the fast path
* very simple - try to compare with iSCSI
* works with different network protocols
* storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper, is synchronous and wants a special userspace thread)

Compared to DRBD, which is a mirroring of local requests to a remote node, and RAID on top of NBD, DST supports multiple remote nodes; it allows removing any of them and then turning them back into the storage without breaking the dataflow; the DST core will reconnect automatically to failed remote nodes; and it allows working with detached devices just like with usual filesystems (in case the device was not formed as a part of a linear storage, since in that case meta information is spread between nodes). It does not require special processes on behalf of the network connection - everything is performed automatically by the DST core workers - and it allows exporting a new device, created on top of a mirror or linear combination of the others, which in turn can be formed on top of another and so on... This was designed to allow creating a distributed storage with completely transparent failover recovery, with the ability to detach remote nodes from a mirror array to become standalone realtime backups (or snapshots) and turn them back into the storage without stopping the main device node. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Sunday 12 August 2007 22:36, I wrote: Note! There are two more issues I forgot to mention earlier. Oops, and there is also: 3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.

* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you can see the danger. In ddsnap we kludge around this problem by not throttling any bio submitted in PF_MEMALLOC mode, which effectively increases our reserve requirement by the amount of IO that the mm will submit to a given block device before deciding the device is congested and should be left alone. This works, but is sloppy and disgusting. The right thing to do is to make sure that the mm knows about our throttle accounting in backing_dev_info so it will not push IO to our device when it knows that the IO will just block on congestion. Instead, shrink_caches will find some other less congested block device or give up, causing alloc_pages to draw from the memalloc reserve to satisfy the sk_alloc request. The mm already uses backing_dev_info this way, we just need to set the right bits in the backing_dev_info state flags. I think Peter posted a patch set that included this feature at some point. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
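The ddsnap kludge described above can be sketched as a small variation of the throttling hunk quoted earlier in the thread. Field names (metric, available, throttle_wait, bi_reserved) follow that experimental patch; the whole thing is an illustration of the idea, not the actual ddsnap code.

/* Sketch only: bypass the throttle for bios submitted from memory
 * reclaim (PF_MEMALLOC), so writeout issued by shrink_caches cannot
 * block on the throttle and starve the allocation it is serving.
 * The price is that the reserve must cover the extra unthrottled IO. */
static void throttle_bio(struct request_queue *q, struct bio *bio)
{
	int need;

	if (!q->metric || (current->flags & PF_MEMALLOC))
		return;			/* reclaim path: never block here */

	need = bio->bi_reserved = q->metric(bio);
	bio->queue = q;
	wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
	atomic_sub(need, &q->available);
}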
Re: Distributed storage.
On Sun, Aug 12 2007, Daniel Phillips wrote: On Tuesday 07 August 2007 13:55, Jens Axboe wrote: I don't like structure bloat, but I do like nice design. Overloading is a necessary evil sometimes, though. Even today, there isn't enough room to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that concern can be scratched. If you read bio.h, that much is obvious. Sixteen bits in bi_rw are consumed by queue priority. Is there a reason this lives in struct bio instead of struct request? If you don't, you have to pass them down. You can make that very statement about basically any member of struct bio, until we end up with a submit_bio() path and down taking 16 arguments. If you check up on the iommu virtual merging, you'll understand the front and back size members. They may smell dubious to you, but please take the time to understand why it looks the way it does. Virtual merging is only needed at the physical device, so why do these fields live in struct bio instead of struct request? A bio does exist outside of a struct request, and bio buildup also happens before it gets attached to such. Changing the number of bvecs is integral to how bio buildup current works. Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. That would only be feasible, if we ruled that any bio in the system must originate from the standard pools. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Please suggest how to do such a cleanup. Incidentally, why does the bvl need to be memset to zero on allocation? bi_vcnt already tells you which bvecs are valid and the only field in a bvec that can reasonably default to zero is the offset, which ought to be set set every time a bvec is initialized anyway. We could probably skip that, but that's an unrelated subject. bi_destructor could be combined. I don't see a lot of users of bi_idx, bi_idx is integral to partial io completions. Struct request has a remaining submission sector count so what does bi_idx do that is different? Struct request has remaining IO count. You still need to know where to start in the bio. that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. So show some concrete patches and examples, hand waving and assumptions is just a waste of everyones time. Average struct bio memory footprint ranks near the bottom of the list of things that suck most about Linux storage. At idle I see 8K in use (reserves); during updatedb it spikes occasionally to 50K; under a heavy load generated by ddsnap on a storage box it sometimes goes to 100K with bio throttling in place. Really not moving the needle. Then, again, stop wasting time on this subject. Just because struct bio isn't a huge bloat is absolutely no justification for adding extra members to it. It's not just about system wide bloat. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the forseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. in short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. Then fix the damn vm writeout. 
I always thought it was silly to depend on the block layer for any sort of throttling. If it's not a system wide problem, then throttle the io count in the make_request_fn handler of that problematic driver. That said, I did a little analysis to get an idea of where the soft targets are in struct bio, and to get to know the bio layer a little better. Maybe these few hints will get somebody interested enough to look further. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) It's a trade off, otherwise you'd have to pass the block device around a lot. Which costs very little, probably less than trashing an extra field's worth of cache. Again, you can make that argument for most of the members. It's a non-starter. And it's, again, a design issue. A bio contains destination information, that means device/offset/size information. I'm all for shaving structure bytes where it matters, but not for the sake of sacrificing code
Re: Distributed storage.
On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Sun, Aug 12, 2007 at 11:44:00PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Sunday 12 August 2007 22:36, I wrote: Note! There are two more issues I forgot to mention earlier. Oops, and there is also: 3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.

* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged

If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Daniel. On Sun, Aug 12, 2007 at 04:16:10PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Your patch is close to the truth, but it needs to throttle at the top (virtual) end of each block device stack instead of the bottom (physical) end. It does head in the direction of eliminating your own deadlock risk indeed, however there are block devices it does not cover. I decided to limit physical devices just because any limit on top of a virtual one is not correct. When the system recharges a bio from a virtual device to a physical one, and the latter is full, the virtual device will not accept any new blocks for that physical device, but can accept them for other ones. That was done specially to allow fair use of network and physical storage. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 00:28, Jens Axboe wrote: On Sun, Aug 12 2007, Daniel Phillips wrote: Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. That would only be feasible, if we ruled that any bio in the system must originate from the standard pools. Not at all. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Please suggest how to do such a cleanup. Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a bio from the standard pool. Just put the power of two size in the bits and map that number to the standard pool arrangement with a table lookup. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the foreseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. In short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. Then fix the damn vm writeout. I always thought it was silly to depend on the block layer for any sort of throttling. If it's not a system wide problem, then throttle the io count in the make_request_fn handler of that problematic driver. It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. OK, we got one and another is close to cracking, enough of that. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Which, as written above and stated many times over the years on lkml, is not a block layer issue imho. Whoever stated that was wrong, but this should be no surprise. There have been many wrong things said about this particular bug over the years. The one thing that remains constant is, Linux continues to deadlock under a variety of loads both with and without network involvement, making it effectively useless as a storage platform. These deadlocks are first and foremost block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
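The table-lookup idea Daniel floats near the top of this message would look roughly like the sketch below. It assumes the bio really did come from one of the standard bvec pools; the sizes mirror the bvec_slabs[] arrangement in 2.6.22 fs/bio.c, and none of this is from an actual patch.

/* Sketch: derive the maximum vector count from the BIO_POOL bits in
 * bi_flags instead of storing bi_max_vecs in every bio.  Bios with
 * privately allocated vectors would still need separate handling. */
static const unsigned bvec_pool_size[] = { 1, 4, 16, 64, 128, BIO_MAX_PAGES };

static inline unsigned bio_max_vecs(struct bio *bio)
{
	return bvec_pool_size[BIO_POOL_IDX(bio)];
}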
Re: Distributed storage.
On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
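Daniel's scheme, as far as it can be read from this exchange, would look something like the sketch below: bio_put() makes a second ->bi_end_io call once the reference count reaches zero, and the handler tells a completion apart from a destruction by looking at bi_cnt. The handler and pool names are invented for illustration; this is not how mainline bio_put() works, and Jens's objections follow in the next messages.

static struct bio_set *my_bio_set;	/* hypothetical private pool */

/* Sketch: bio_put() reuses ->bi_end_io as the destructor. */
static void example_bio_put(struct bio *bio)
{
	if (atomic_dec_and_test(&bio->bi_cnt)) {
		bio->bi_next = NULL;
		bio->bi_end_io(bio, 0, 0);	/* count is now zero: "destroy" call */
	}
}

static int example_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
	if (!atomic_read(&bio->bi_cnt)) {
		/* the destroy call from bio_put(): do the destructor's job,
		 * e.g. give the bio back to its pool */
		bio_free(bio, my_bio_set);
		return 0;
	}
	if (bio->bi_size)
		return 1;			/* partial completion, not done yet */
	/* normal completion handling for the bio's owner goes here */
	return 0;
}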
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:28, Jens Axboe wrote: On Sun, Aug 12 2007, Daniel Phillips wrote: Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. That would only be feasible, if we ruled that any bio in the system must originate from the standard pools. Not at all. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Please suggest how to do such a cleanup. Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a bio from the standard pool. Just put the power of two size in the bits and map that number to the standard pool arrangement with a table lookup. So reserve a bit that tells you how to interpret the (now) 3 remaining bits. Doesn't sound very pretty, does it? On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the forseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. in short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. Then fix the damn vm writeout. I always thought it was silly to depend on the block layer for any sort of throttling. If it's not a system wide problem, then throttle the io count in the make_request_fn handler of that problematic driver. It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. Is it? Consider just 10 standard sata disks. The next kernel revision will have sg chaining support, so that allows 32MiB per request. Even if we disregard reads (not so interesting in this discussion) and just look at potentially pinned dirty data in a single queue, that number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch. So I still think that this throttling needs to happen elsewhere, you cannot rely the block layer throttling globally or for a single device. It just doesn't make sense. You did not comment on the one about putting the bio destructor in the -endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. OK, we got one and another is close to cracking, enough of that. No we did not, I already failed this one in the next mail. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Which, as written above and stated many times over the years on lkml, is not a block layer issue imho. 
Whoever stated that was wrong, but this should be no surprise. There have been many wrong things said about this particular bug over the years. The one thing that remains constant is, Linux continues to deadlock under a variety of loads both with and without network involvement, making it effectively useless as a storage platform. These deadlocks are first and foremost, block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. The block layer has NEVER guaranteed throttling, so it can - by definition - not be a block layer deficiency. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
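A quick worked check of the numbers Jens uses earlier in this message (32 MiB per request with sg chaining, 4 GiB pinned per disk, 40 GiB for ten disks), under the assumption - not stated in the mail - that each queue holds the default 128 requests (nr_requests):

\[ 128 \times 32\,\mathrm{MiB} = 4\,\mathrm{GiB}\ \text{per queue}, \qquad 10 \times 4\,\mathrm{GiB} = 40\,\mathrm{GiB}. \]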
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. You can't be serious? You'd stall end io completion notification because someone holds a reference to a bio. Surely you jest. Needless to say, that will never go in. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. This is not a very good solution, since it requires all users of the bios to know how to free them. Right now it is hidden. And it adds an additional atomic check (although reading is quite fast) in the end_io path. And for what purpose? To save 8 bytes on a 64-bit platform? That will not reduce its size noticeably, so the same number of bios will fit into the cache's page, so what is the gain? All these cleanups and logic complications should only be performed if, after the size shrink, an increased number of bios can fit into the cache's page - will that be the case after such cleanups? -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 02:13, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely separate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. You can't be serious? You'd stall end io completion notification because someone holds a reference to a bio. Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy; this is trivial. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 02:13, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 00:45, Jens Axboe wrote: On Mon, Aug 13 2007, Jens Axboe wrote: You did not comment on the one about putting the bio destructor in the -endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. We could do that without too much work, I agree. But that idea fails as well, since reference counts and IO completion are two completely seperate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls -endio instead of the destructor. The -endio sees that the count is zero and destroys the bio. You can't be serious? You'd stall end io completion notification because someone holds a reference to a bio. Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy, this is trivial. Sorry Daniel, but your suggestions would do nothing more than uglify the code and design. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 03:06, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy, this is trivial. Sorry Daniel, but your suggestions would do nothing more than uglify the code and design. Pretty much exactly what was said about shrinking struct page, ask Bill. The difference was, shrinking struct page actually mattered whereas shrinking struct bio does not, and neither does expanding it by a few bytes. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 02:18, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: But that idea fails as well, since reference counts and IO completion are two completely seperate entities. So unless end IO just happens to be the last user holding a reference to the bio, you cannot free it. That is not a problem. When bio_put hits zero it calls -endio instead of the destructor. The -endio sees that the count is zero and destroys the bio. This is not a very good solution, since it requires all users of the bios to know how to free it. No, only the specific -endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. Right now it is hidden. And adds additional atomic check (although reading is quite fast) in the end_io. Actual endio happens once in the lifetime of the transfer, this read will be entirely lost in the noise. And for what purpose? To eat 8 bytes on 64bit platform? This will not reduce its size noticebly, so the same number of bios will be in the cache's page, so what is a gain? All this cleanups and logic complicatins should be performed only if after size shring increased number of bios can fit into cache's page, will it be done after such cleanups? Well, exactly, My point from the beginning was that the size of struct bio is not even close to being a problem and adding a few bytes to it in the interest of doing the cleanest fix to a core kernel bug is just not a dominant issue. I suppose that leaving out the word bloated and skipping straight to the doesn't matter proof would have saved some bandwidth. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13 2007, Daniel Phillips wrote: On Monday 13 August 2007 03:06, Jens Axboe wrote: On Mon, Aug 13 2007, Daniel Phillips wrote: Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means end and the other means destroy, this is trivial. Sorry Daniel, but your suggestions would do nothing more than uglify the code and design. Pretty much exactly what was said about shrinking struct page, ask Bill. The difference was, shrinking struct page actually mattered whereas shrinking struct bio does not, and neither does expanding it by a few bytes. Lets back this up a bit - this whole thing began with you saying that struct bio was bloated already, which I said wasn't true. You then continued to hand wave your wave through various suggestions to trim the obvious fat from that structure, none of which were nice or feasible. I never compared the bio to struct page, I'd obviously agree that shrinking struct page was a worthy goal and that it'd be ok to uglify some code to do that. The same isn't true for struct bio. And we can expand struct bio if we have to, naturally. And I've done it before, which I wrote in the initial mail. I just don't want to do it casually, then it WILL be bloated all of a sudden. Your laissez faire attitude towards adding members to struct bio oh I'll just add it and someone less lazy than me will fix it up in the future makes me happy that you are not maintaining anything that I use. I'll stop replying to your mails until something interesting surfaces. I've already made my points clear about both the above and the throttling. And I'd advise you to let Evgeniy take this forward, he seems a lot more adept to actually getting CODE done and - at least from my current and past perspective - is someone you can actually have a fruitful conversation with. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 03:22, Jens Axboe wrote: I never compared the bio to struct page, I'd obviously agree that shrinking struct page was a worthy goal and that it'd be ok to uglify some code to do that. The same isn't true for struct bio. I thought I just said that. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: Oops, and there is also:
3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.
* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged
If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. This is not theoretical, but a testable, repeatable result. Instructions to reproduce should show up tomorrow. This bug is now solved in a kludgy way. Now, Peter's patch set offers a much cleaner way to fix this little problem, along with at least one other nasty that it already fixed. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: This is not a very good solution, since it requires all users of the bios to know how to free it. No, only the specific -endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. That is what I meant - there will be no way to just alloc a bio and put it, helpers for generic bio sets must be exported and each and every bi_end_io() must be changed to check reference counter and they must know how they were allocated. Right now it is hidden. And adds additional atomic check (although reading is quite fast) in the end_io. Actual endio happens once in the lifetime of the transfer, this read will be entirely lost in the noise. Not always. Sometimes it is called multiple times, but all bi_end_io() callbacks I checked (I believe all in mainline tree) tests if bi_size is zero or not. Endio callback is of course quite rare and additional atomic reading will not kill the system, but why introduce another read? It is possible to provide a flag for endio callback that it is last, but it still requires to change every single callback - why do we want this? And for what purpose? To eat 8 bytes on 64bit platform? This will not reduce its size noticebly, so the same number of bios will be in the cache's page, so what is a gain? All this cleanups and logic complicatins should be performed only if after size shring increased number of bios can fit into cache's page, will it be done after such cleanups? Well, exactly, My point from the beginning was that the size of struct bio is not even close to being a problem and adding a few bytes to it in the interest of doing the cleanest fix to a core kernel bug is just not a dominant issue. So, I'm a bit lost... You say it is too big and some parts can be removed or combined, and then that size does not matter. Last/not-last checks in the code is not clear design, so I do not see why it is needed at all if not for size shrinking. I suppose that leaving out the word bloated and skipping straight to the doesn't matter proof would have saved some bandwidth. :) Likely it will. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote: On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: (previous incomplete message sent accidentally) On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires to replace all occurence of bio-bi_bdev = something_new with blk_set_bdev(bio, somthing_new), where queue limits will be appropriately charged. So far I'm testing second case, but I only changed DST for testing, can change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) No. Since all requests for virtual device end up in physical devices, which have limits, this mechanism works. Virtual device will essentially call either generic_make_request() for new physical device (and thus will sleep is limit is over), or will process bios directly, but in that case it will sleep in generic_make_request() for virutal device. What can happen is, as soon as you unthrottle the previous queue, another thread can come in and put another request on it. Sure, that thread will likely block on the physical throttle and so will the rest of the incoming threads, but it still allows the higher level queue to grow past any given limit, with the help of lots of threads. JVM for example? Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Anyway, your approach is awfully close to being airtight, there is just a small hole. I would be more than happy to be proved wrong about that, but the more I look, the more I see that hole. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only It does not matter - we can count bytes, pages, bio vectors or whatever we like, its just a matter of counter and can be changed without problem. Quite true. In some cases the simple inc/dec per bio works just fine. But the general case where finer granularity is required comes up in existing code, so there needs to be a plan. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 04:03, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: This is not a very good solution, since it requires all users of the bios to know how to free it. No, only the specific -endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. That is what I meant - there will be no way to just alloc a bio and put it, helpers for generic bio sets must be exported and each and every bi_end_io() must be changed to check reference counter and they must know how they were allocated. There are fewer non-generic bio allocators than you think. Endio callback is of course quite rare and additional atomic reading will not kill the system, but why introduce another read? It is possible to provide a flag for endio callback that it is last, but it still requires to change every single callback - why do we want this? We don't. Struct bio does not need to be shrunk. Jens wanted to talk about what fields could be eliminated if we wanted to shrink it. It is about time to let that lie, don't you think? So, I'm a bit lost... You say it is too big Did not say that. and some parts can be removed or combined True. and then that size does not matter. Also true, backed up by numbers on real systems. Last/not-last checks in the code is not clear design, so I do not see why it is needed at all if not for size shrinking. Not needed, indeed. Accurate throttling is needed. If the best way to throttle requires expanding struct bio a little then we should not let concerns about the cost of an int or two stand in the way. Like Jens, I am more concerned about the complexity cost, and that is minimized in my opinion by throttling in the generic code rather than with custom code in each specialized block driver. Your patch does throttle in the generic code, great. Next thing is to be sure that it completely closes the window for reserve leakage, which is not yet clear. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: Oops, and there is also:
3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.
* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged
If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. How is that ever possible? The whole idea of throttling is to remove such situations, and now you say it cannot be solved. If the limit is 1GB of pending block IO, and the system has for example 2GB of RAM (or any other reasonable parameters), then there is no way we can deadlock in allocation, since it will not force the page reclaim mechanism. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote: On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: Oops, and there is also:
3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.
* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged
If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. How is that ever possible? The whole idea of throttling is to remove such situations, and now you say it cannot be solved. It was solved, by not throttling writeout that comes from shrink_caches. Ugly. If the limit is 1GB of pending block IO, and the system has for example 2GB of RAM (or any other reasonable parameters), then there is no way we can deadlock in allocation, since it will not force the page reclaim mechanism. The problem is that sk_alloc (called from our block driver via socket->write) would recurse into shrink_pages, which recursively submits IO to our block driver and blocks on the throttle. Subtle indeed, and yet another demonstration of why vm recursion is a Bad Thing. I will find a traceback for you tomorrow, which makes this deadlock much clearer. Regards - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
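For reference, one common shape of the kludge being referred to (exempting reclaim-driven writeout from the throttle so the recursion above cannot block on it) is sketched below. The helper name and its placement are assumptions for illustration; this is not the actual ddsnap fix.

/*
 * Hypothetical sketch: never throttle IO submitted from the memory reclaim
 * path itself, since that IO is what eventually frees memory and unwedges
 * the system.
 */
static int bio_should_throttle(request_queue_t *q)
{
	if (current->flags & (PF_MEMALLOC | PF_KSWAPD))
		return 0;		/* reclaim writeout: let it through */

	return q->bio_limit != -1U;	/* otherwise obey the configured limit */
}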
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 04:18:03AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: No. Since all requests for a virtual device end up in physical devices, which have limits, this mechanism works. The virtual device will essentially call either generic_make_request() for the new physical device (and thus will sleep if the limit is exceeded), or will process bios directly, but in that case it will sleep in generic_make_request() for the virtual device. What can happen is, as soon as you unthrottle the previous queue, another thread can come in and put another request on it. Sure, that thread will likely block on the physical throttle and so will the rest of the incoming threads, but it still allows the higher level queue to grow past any given limit, with the help of lots of threads. JVM for example? No. You get one slot, and one thread will not be blocked, all others will. If the lucky thread wants to put two requests it will be blocked on the second request, since the underlying physical device does not accept requests anymore and thus the caller will sleep. Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Each thread will sleep in generic_make_request(); if the limit is specified correctly, the allocated number of bios will be enough to make progress. Here is an example: let's say the system has 20,000 pages in RAM and 20,000 in swap, and we have 8,000 threads, each of which allocates a page, then the next page and so on. The system has one virtual device with two physical devices under it, and each device gets half of the requests. We set the limit to 4,000 per physical device. All threads allocate a page and queue it to the devices, so all threads succeed in their first allocation, and each device has its queue full. The virtual device does not have a limit (or has it at 4,000 too, but since it was recharged each time, it has zero blocks in flight). A new thread tries to allocate a page; it is allocated and queued to one of the devices, but since that queue is full, the thread sleeps. So will every other one. Thus we end up with 8,000 requests allocated and queued, and 8,000 in flight, 16,000 in total, which is smaller than the amount of pages in RAM, so we are happy. Consider the above as a special kind of calculation, i.e. the number of _allocated_ pages is always the number of physical devices multiplied by each one's in-flight limit. By adjusting the in-flight limit and knowing the number of devices it is completely possible to eliminate the vm deadlock. If you do not like such a calculation, the solution is trivial: we can sleep _after_ ->make_request_fn() in generic_make_request() until the number of in-flight bios is reduced by bio_endio(). -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
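The arithmetic in the example above can be checked mechanically; the small standalone program below simply restates it, using the numbers assumed in the mail (20,000 pages of RAM, 8,000 threads, two physical devices, 4,000 in-flight bios per device).

/* Quick check of the worst-case pinned-page bound from the example. */
#include <stdio.h>

int main(void)
{
	unsigned int ram_pages  = 20000;
	unsigned int threads    = 8000;
	unsigned int phys_devs  = 2;
	unsigned int dev_limit  = 4000;		/* in-flight bios per physical device */

	/* Pages pinned by bios actually in flight at the physical devices. */
	unsigned int in_flight  = phys_devs * dev_limit;

	/* Worst case: every thread has also allocated one page that is queued
	 * or about to be queued before the submitter goes to sleep. */
	unsigned int worst_case = in_flight + threads;

	printf("in flight: %u, worst case pinned: %u, RAM: %u -> %s\n",
	       in_flight, worst_case, ram_pages,
	       worst_case < ram_pages ? "no reclaim pressure" : "can deadlock");
	return 0;
}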
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 05:18:14AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: If the limit is 1GB of pending block IO, and the system has for example 2GB of RAM (or any other reasonable parameters), then there is no way we can deadlock in allocation, since it will not force the page reclaim mechanism. The problem is that sk_alloc (called from our block driver via socket->write) would recurse into shrink_pages, which recursively submits IO to our block driver and blocks on the throttle. Subtle indeed, and yet another demonstration of why vm recursion is a Bad Thing. I will find a traceback for you tomorrow, which makes this deadlock much clearer. I see how it can happen, but device throttling is a solution we are trying to complete, whose main aim _is_ to remove this problem. Lower the per-device limit, so that the rest of the RAM is available for all the data structures needed in the network path. The above example just has 1GB of RAM, which should be enough for skbs; if it is not, decrease the limit to 500MB and so on, until the weighted load of the system always allows forward progress. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote: Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Each thread will sleep in generic_make_request(); if the limit is specified correctly, the allocated number of bios will be enough to make progress. The problem is, the sleep does not occur before the virtual device mapping function is called. Let's consider two devices, a physical device named pdev and a virtual device sitting on top of it called vdev. vdev's throttle limit is just one element, but we will see that in spite of this, two bios can be handled by the vdev's mapping method before any IO completes, which violates the throttling rules. According to your patch it works like this:

    Thread 1                                    Thread 2
    no wait because vdev->bio_queued is zero
    vdev->q->bio_queued++
    enter devmapper map method
    blk_set_bdev(bio, pdev)
        vdev->bio_queued--
                                                no wait because vdev->bio_queued is zero
                                                vdev->q->bio_queued++
                                                enter devmapper map method
    whoops!

Our virtual device mapping function has now allocated resources for two in-flight bios in spite of having its throttle limit set to 1. Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close the window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
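One way to close this window, sketched below, is to charge the queue the bio was originally submitted to, remember that queue in the bio, and release the charge only at final completion, so that blk_set_bdev() no longer recharges anything. The bi_throttle_q field and helper names are assumptions used only to illustrate the shape of such a fix, not a proposed patch.

/* Sketch: charge the top-level queue once and release only at completion. */
static void throttle_charge(struct bio *bio, request_queue_t *q)
{
	wait_event(q->wait, q->bio_queued < q->bio_limit);
	q->bio_queued++;
	bio->bi_throttle_q = q;		/* hypothetical new per-bio field */
}

static void throttle_release(struct bio *bio)
{
	request_queue_t *q = bio->bi_throttle_q;

	if (!q)
		return;
	bio->bi_throttle_q = NULL;
	q->bio_queued--;
	wake_up(&q->wait);
}

/* generic_make_request() would call throttle_charge() before invoking the
 * virtual device's ->make_request_fn(), bio_endio() would call
 * throttle_release() once bi_size reaches zero, and blk_set_bdev() would
 * no longer touch the counters at all. */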
Re: Distributed storage.
On Monday 13 August 2007 02:12, Jens Axboe wrote: It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. Is it? Consider just 10 standard sata disks. The next kernel revision will have sg chaining support, so that allows 32MiB per request. Even if we disregard reads (not so interesting in this discussion) and just look at potentially pinned dirty data in a single queue, that number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch. So I still think that this throttling needs to happen elsewhere, you cannot rely the block layer throttling globally or for a single device. It just doesn't make sense. You are right, so long as the unit of throttle accounting remains one request. This is not what we do in ddsnap. Instead we inc/dec the throttle counter by the number of bvecs in each bio, which produces a nice steady data flow to the disk under a wide variety of loads, and provides the memory resource bound we require. One throttle count per bvec will not be the right throttling metric for every driver. To customize this accounting metric for a given driver we already have the backing_dev_info structure, which provides per-device-instance accounting functions and instance data. Perfect! This allows us to factor the throttling mechanism out of the driver, so the only thing the driver has to do is define the throttle accounting if it needs a custom one. We can avoid affecting the traditional behavior quite easily, for example if backing_dev_info-throttle_fn (new method) is null then either not throttle at all (and rely on the struct request in-flight limit) or we can move the in-flight request throttling logic into core as the default throttling method, simplifying the request library and not changing its behavior. These deadlocks are first and foremost, block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. The block layer has NEVER guaranteed throttling, so it can - by definition - not be a block layer deficiency. The block layer has always been deficient by not providing accurate throttling, or any throttling at all for some devices. We have practical proof that this causes deadlock and a good theoretical basis for describing exactly how it happens. To be sure, vm and net are co-conspirators, however the block layer really is the main actor in this little drama. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
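A minimal sketch of what the per-device accounting hook described above could look like, assuming a new throttle_fn member in backing_dev_info (it does not exist in mainline). It only illustrates the idea of letting each driver pick its own throttle unit, with one unit per bvec as the default metric.

/* Hypothetical per-device throttle metric hook; names are assumptions. */
struct backing_dev_info;

typedef unsigned int (*bdi_throttle_fn)(struct backing_dev_info *bdi,
					struct bio *bio);

/* Default metric: one throttle unit per bvec carried by the bio. */
static unsigned int bdi_throttle_bvecs(struct backing_dev_info *bdi,
				       struct bio *bio)
{
	return bio->bi_vcnt;
}

/* A driver with different resource needs (ddsnap, a network block device)
 * could install its own function and account bytes, pages, or anything
 * else proportional to what it actually allocates per bio.  A NULL hook
 * would fall back to the existing in-flight struct request limit. */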
Re: Distributed storage.
On Tuesday 07 August 2007 13:55, Jens Axboe wrote: I don't like structure bloat, but I do like nice design. Overloading is a necessary evil sometimes, though. Even today, there isn't enough room to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that concern can be scratched. If you read bio.h, that much is obvious. Sixteen bits in bi_rw are consumed by queue priority. Is there a reason this lives in struct bio instead of struct request? If you check up on the iommu virtual merging, you'll understand the front and back size members. They may smell dubious to you, but please take the time to understand why it looks the way it does. Virtual merging is only needed at the physical device, so why do these fields live in struct bio instead of struct request? Changing the number of bvecs is integral to how bio buildup current works. Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Incidentally, why does the bvl need to be memset to zero on allocation? bi_vcnt already tells you which bvecs are valid and the only field in a bvec that can reasonably default to zero is the offset, which ought to be set set every time a bvec is initialized anyway. bi_destructor could be combined. I don't see a lot of users of bi_idx, bi_idx is integral to partial io completions. Struct request has a remaining submission sector count so what does bi_idx do that is different? that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. So show some concrete patches and examples, hand waving and assumptions is just a waste of everyones time. Average struct bio memory footprint ranks near the bottom of the list of things that suck most about Linux storage. At idle I see 8K in use (reserves); during updatedb it spikes occasionally to 50K; under a heavy load generated by ddsnap on a storage box it sometimes goes to 100K with bio throttling in place. Really not moving the needle. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the forseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. in short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. That said, I did a little analysis to get an idea of where the soft targets are in struct bio, and to get to know the bio layer a little better. Maybe these few hints will get somebody interested enough to look further. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) It's a trade off, otherwise you'd have to pass the block device around a lot. Which costs very little, probably less than trashing an extra field's worth of cache. And it's, again, a design issue. 
A bio contains destination information, that means device/offset/size information. I'm all for shaving structure bytes where it matters, but not for the sake of sacrificing code stability or design. I consider struct bio quite lean and have worked hard to keep it that way. In fact, iirc, the only addition to struct bio since 2001 is the iommu front/back size members. And I resisted those for quite a while. You did not comment on the one about putting the bio destructor in the -endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Nice to see you jumping in Jens. Now it is over to the other side of the thread where Evgeniy has posted a patch that a) grants your wish to add no new
Re: Block device throttling [Re: Distributed storage.]
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires to replace all occurence of bio-bi_bdev = something_new with blk_set_bdev(bio, somthing_new), where queue limits will be appropriately charged. So far I'm testing second case, but I only changed DST for testing, can change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate might have changed during the lifetime of the bio. This would translate into one more per-bio field. the throttling performs another function: keeping a reasonable amount of IO in flight for the device. The definition of reasonable is complex. For a hard disk it depends on the physical distance between sector addresses of the bios in flight. In ddsnap we make a crude but workable approximation that In general, a per block device The throttle count needs to cover Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
(previous incomplete message sent accidentally) On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires to replace all occurence of bio-bi_bdev = something_new with blk_set_bdev(bio, somthing_new), where queue limits will be appropriately charged. So far I'm testing second case, but I only changed DST for testing, can change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate might have changed during the lifetime of the bio. This would translate into one more per-bio field. 2) Exposing the per-block device throttle limits via sysfs or similar is really not a good long term solution for system administration. Imagine our help text: just keep trying smaller numbers until your system deadlocks. We really need to figure this out internally and get it correct. I can see putting in a temporary userspace interface just for experimentation, to help determine what really is safe, and what size the numbers should be to approach optimal throughput in a fully loaded memory state. Regards, Daniel Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
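A rough sketch of the variable-size throttle draw described above, under the assumption of new throttle fields in backing_dev_info and one extra per-bio field to remember the amount drawn. Like the patch posted later in the thread, it is deliberately loose about atomicity, and every name here is invented for illustration.

/* Sketch only: draw a per-bio amount at submit time, remember it in the
 * bio, and give exactly that amount back at completion, so later changes
 * to the bio fields cannot unbalance the accounting. */
static void throttle_down(struct backing_dev_info *bdi, struct bio *bio)
{
	unsigned int draw = bio->bi_vcnt;	/* or bytes, or a driver metric */

	wait_event(bdi->throttle_wait,
		   bdi->throttle_used + draw <= bdi->throttle_limit);
	bdi->throttle_used += draw;		/* approximate, as in the posted patch */
	bio->bi_throttle = draw;		/* hypothetical per-bio field */
}

static void throttle_up(struct backing_dev_info *bdi, struct bio *bio)
{
	bdi->throttle_used -= bio->bi_throttle;
	wake_up(&bdi->throttle_wait);
}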
Block device throttling [Re: Distributed storage.]
On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: I don't like structure bloat, but I do like nice design. Overloading is So, what did we decide? To bloat bio a bit (add a queue pointer) or to use physical device limits? The latter requires replacing all occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio, something_new), where queue limits will be appropriately charged. So far I'm testing the second case, but I only changed DST for testing; I can change all other users if needed though. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[1/1] Block device throttling [Re: Distributed storage.]
This throttling mechanism allows limiting the maximum number of queued bios per physical device. By default it is turned off and the old block layer behaviour with an unlimited number of bios is used. When turned on (the queue limit is set to something different from -1U via blk_queue_set_limit()), generic_make_request() will sleep until there is room in the queue. The number of bios is increased in generic_make_request() and either reduced in bio_endio(), when the bio is completely processed (bi_size is zero), or recharged from the original queue when a new device is assigned to the bio via blk_set_bdev(). All operations are not atomic, since we do not care about the precise number of bios, only the fact that we are close, or close enough, to the limit. Tested on the distributed storage device - with a limit of 2 bios it works slowly :)

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index c99b463..1882c9b 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1851,6 +1851,10 @@ request_queue_t *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
 
+	q->bio_limit = -1U;
+	q->bio_queued = 0;
+	init_waitqueue_head(&q->wait);
+
 	mutex_init(&q->sysfs_lock);
 
 	return q;
@@ -3237,6 +3241,16 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	request_queue_t *q;
+
+	BUG_ON(!bio->bi_bdev);
+
+	q = bdev_get_queue(bio->bi_bdev);
+	if (q && q->bio_limit != -1U) {
+		wait_event_interruptible(q->wait, q->bio_queued + 1 <= q->bio_limit);
+		q->bio_queued++;
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
diff --git a/fs/bio.c b/fs/bio.c
index 093345f..0a33958 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1028,6 +1028,16 @@ void bio_endio(struct bio *bio, unsigned int bytes_done, int error)
 	bio->bi_size -= bytes_done;
 	bio->bi_sector += (bytes_done >> 9);
 
+	if (!bio->bi_size && bio->bi_bdev) {
+		request_queue_t *q;
+
+		q = bdev_get_queue(bio->bi_bdev);
+		if (q) {
+			q->bio_queued--;
+			wake_up(&q->wait);
+		}
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, bytes_done, error);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db5b00a..7ce0cd7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -467,6 +467,9 @@ struct request_queue
 	struct request		*orig_bar_rq;
 	unsigned int		bi_size;
 
+	wait_queue_head_t	wait;
+	unsigned int		bio_limit, bio_queued;
+
 	struct mutex		sysfs_lock;
 };
@@ -764,6 +767,30 @@ extern long nr_blockdev_pages(void);
 int blk_get_queue(request_queue_t *);
 request_queue_t *blk_alloc_queue(gfp_t);
 request_queue_t *blk_alloc_queue_node(gfp_t, int);
+
+static inline void blk_queue_set_limit(request_queue_t *q, unsigned int limit)
+{
+	q->bio_limit = limit;
+}
+
+static inline void blk_set_bdev(struct bio *bio, struct block_device *bdev)
+{
+	request_queue_t *q;
+
+	if (!bio->bi_bdev) {
+		bio->bi_bdev = bdev;
+		return;
+	}
+
+	q = bdev_get_queue(bio->bi_bdev);
+	if (q) {
+		q->bio_queued--;
+		wake_up(&q->wait);
+	}
+
+	bio->bi_bdev = bdev;
+}
+
 extern void blk_put_queue(request_queue_t *);
 
 /*

-- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
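Assuming the patch above is applied, a driver would use the new interface roughly as follows. The dst_* wrappers, the limit of 128, and remap_sector() are made up for illustration; only blk_queue_set_limit() and blk_set_bdev() come from the patch.

/* Illustration only: cap a physical device at 128 in-flight bios and use
 * blk_set_bdev() instead of writing bi_bdev directly when redirecting. */
static void dst_setup_throttle(struct block_device *phys_bdev)
{
	request_queue_t *q = bdev_get_queue(phys_bdev);

	blk_queue_set_limit(q, 128);		/* -1U (the default) means unlimited */
}

static void dst_redirect_bio(struct bio *bio, struct block_device *new_bdev)
{
	/* The helper drops the old queue's in-flight count and wakes waiters. */
	blk_set_bdev(bio, new_bdev);
	bio->bi_sector = remap_sector(bio);	/* hypothetical remapping */
}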
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wed, Aug 08, 2007 at 02:17:09PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: This throttling mechanism allows limiting the maximum number of queued bios per physical device. By default it is turned off and the old block layer behaviour with an unlimited number of bios is used. When turned on (the queue limit is set to something different from -1U via blk_queue_set_limit()), generic_make_request() will sleep until there is room in the queue. The number of bios is increased in generic_make_request() and either reduced in bio_endio(), when the bio is completely processed (bi_size is zero), or recharged from the original queue when a new device is assigned to the bio via blk_set_bdev(). All operations are not atomic, since we do not care about the precise number of bios, only the fact that we are close, or close enough, to the limit. Tested on the distributed storage device - with a limit of 2 bios it works slowly :) As an add-on I can cook up a patch to configure this via sysfs if needed. Thoughts? -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
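If that sysfs add-on were written, it could look roughly like the sketch below, modelled on the queue sysfs attributes of that era (exposing bio_limit next to nr_requests). This is a guess at the shape, not a posted patch; the attribute name and helpers are assumptions.

/* Sketch: /sys/block/<dev>/queue/bio_limit, 0 meaning "unlimited". */
static ssize_t queue_bio_limit_show(struct request_queue *q, char *page)
{
	return sprintf(page, "%u\n", q->bio_limit);
}

static ssize_t queue_bio_limit_store(struct request_queue *q,
				     const char *page, size_t count)
{
	unsigned long limit = simple_strtoul(page, NULL, 10);

	blk_queue_set_limit(q, limit ? limit : -1U);	/* 0 turns throttling off */
	return count;
}

static struct queue_sysfs_entry queue_bio_limit_entry = {
	.attr  = { .name = "bio_limit", .mode = S_IRUGO | S_IWUSR },
	.show  = queue_bio_limit_show,
	.store = queue_bio_limit_store,
};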
Re: Distributed storage.
On Sun, Aug 05 2007, Daniel Phillips wrote: A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Thanks for your insights. Care to detail what bloat and dubious fields struct bio has? And we don't add temporary fields out of laziness, hoping that someone will later kill it again and rewrite it in a nicer fashion. Hint: that never happens, bloat sticks. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Tuesday 07 August 2007 05:05, Jens Axboe wrote: On Sun, Aug 05 2007, Daniel Phillips wrote: A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Thanks for your insights. Care to detail what bloat and dubious fields struct bio has? First obvious one I see is bi_rw separate from bi_flags. Front_size and back_size smell dubious. Is max_vecs really necessary? You could reasonably assume bi_vcnt rounded up to a power of two and bury the details of making that work behind wrapper functions to change the number of bvecs, if anybody actually needs that. Bi_endio and bi_destructor could be combined. I don't see a lot of users of bi_idx, that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) Regards, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
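On the max_vecs point, a sketch of how the capacity could be derived instead of stored, for bios allocated from the standard bvec pools, assuming the usual 1/4/16/64/128/256 pool sizes. Bios with privately allocated bvec arrays would still need their own answer, which is the cleanup work alluded to above.

/* Sketch: recover the bvec capacity from the pool index kept in bi_flags. */
static const unsigned short bvec_pool_size[] = { 1, 4, 16, 64, 128, 256 };

static inline unsigned short bio_max_vecs(struct bio *bio)
{
	return bvec_pool_size[BIO_POOL_IDX(bio)];
}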
Re: Distributed storage.
On Tue, Aug 07 2007, Daniel Phillips wrote: On Tuesday 07 August 2007 05:05, Jens Axboe wrote: On Sun, Aug 05 2007, Daniel Phillips wrote: A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Thanks for your insights. Care to detail what bloat and dubious fields struct bio has? First obvious one I see is bi_rw separate from bi_flags. Front_size and back_size smell dubious. Is max_vecs really necessary? You could I don't like structure bloat, but I do like nice design. Overloading is a necessary evil sometimes, though. Even today, there isn't enough room to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that concern can be scratched. If you read bio.h, that much is obvious. If you check up on the iommu virtual merging, you'll understand the front and back size members. They may smell dubious to you, but please take the time to understand why it looks the way it does. reasonably assume bi_vcnt rounded up to a power of two and bury the details of making that work behind wrapper functions to change the number of bvecs, if anybody actually needs that. Bi_endio and Changing the number of bvecs is integral to how bio buildup current works. bi_destructor could be combined. I don't see a lot of users of bi_idx, bi_idx is integral to partial io completions. that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. So show some concrete patches and examples, hand waving and assumptions is just a waste of everyones time. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) It's a trade off, otherwise you'd have to pass the block device around a lot. And it's, again, a design issue. A bio contains destination information, that means device/offset/size information. I'm all for shaving structure bytes where it matters, but not for the sake of sacrificing code stability or design. I consider struct bio quite lean and have worked hard to keep it that way. In fact, iirc, the only addition to struct bio since 2001 is the iommu front/back size members. And I resisted those for quite a while. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Sun, Aug 05, 2007 at 02:23:45PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote: If we are sleeping in memory pool, then we already do not have memory to complete previous requests, so we are in trouble. Not at all. Any requests in flight are guaranteed to get the resources they need to complete. This is guaranteed by the combination of memory reserve management and request queue throttling. In logical terms, reserve management plus queue throttling is necessary and sufficient to prevent these deadlocks. Conversely, the absence of either one allows deadlock. Only if you have two, which must be closely related to each other (i.e. each request must have network reserve big enough to store data). This can work for devices which do not require additional allocations (like usual local storage), but not for network connected ones. It works for network devices too, and also for a fancy device like ddsnap, which is the moral equivalent of a filesystem implemented in user space. With or without vm deadlock patches? I can not see how it can work, if network does not have a reserve and there is not free memory completely. If all systems have reserve then yes, it works good. By default things will be like they are now, except additional non-atomic increment and branch in generic_make_request() and decrement and wake in bio_end_io()? -endio is called in interrupt context, so the accounting needs to be atomic as far as I can see. Actually we only care about if there is a place in the queue or not - so it can be a flag. Actually non-atomic operations are ok, since having plus/minus couple of requests in flight does not change the picture, but allows not to introduce slow atomic operations in the fast path. We actually account the total number of bio pages in flight, otherwise you would need to assume the largest possible bio and waste a huge amount of reserve memory. A counting semaphore works fine for this purpose, with some slight inefficiency that is nigh on unmeasurable in the block IO path. What the semaphore does is make the patch small and easy to understand, which is important at this point. Yes, it can be bio vectors. I can cook up such a patch if idea worth efforts. It is. There are some messy details... You need a place to store the accounting variable/semaphore and need to be able to find that place again in -endio. Trickier than it sounds, because of the unstructured way drivers rewrite -bi_bdev. Peterz has already poked at this in a number of different ways, typically involving backing_dev_info, which seems like a good idea to me. We can demand that reserve is not per virtual device, but per real one - for example in case of distributed storage locally connected node should have much higher limit than network one, but having a per-virtual device reserve might end up with situation, when local node can proceed data, but no requests will be queued sine all requests below limit are in network node. In case of per real device limit there is no need to increase bio. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
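The ddsnap-style accounting mentioned here, counting bio vectors against a counting semaphore rather than counting whole bios, looks roughly like the sketch below. The names are invented and this is not the ddsnap source; it only illustrates the technique.

/*
 * Sketch: a per-device counting semaphore initialized to the device's
 * throttle limit, drawn down by one count per bvec at submit time and
 * released on completion.
 */
#include <asm/semaphore.h>	/* struct semaphore of that kernel era */

struct throttled_dev {
	struct semaphore throttle;	/* counts bvec "slots" still free */
};

static void dev_submit(struct throttled_dev *dev, struct bio *bio)
{
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		down(&dev->throttle);	/* sleeps when the device is saturated */

	generic_make_request(bio);
}

static void dev_bio_done(struct throttled_dev *dev, struct bio *bio)
{
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		up(&dev->throttle);	/* wakes one queued submitter per bvec */
}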
Re: Distributed storage.
On Sun, Aug 05, 2007 at 02:35:04PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote: On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote: DST original code worked as a device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? At least in terms of efficiency. Device mapper lives in a happy world where memory does not end and allocations are fast. Are you saying that things are different for a network block device because it needs to do GFP_ATOMIC allocations? If so then that is just a misunderstanding. The global page reserve Peter and I use is available in interrupt context just like GFP_ATOMIC. No, neither device needs atomic allocations. I just said that device mapper is too expensive, since it performs a lot of additional allocations in the fast path and is not designed for cases when allocation fails, since there is no recovery path and (maybe because of this) mempool allocation waits forever until there is free memory and cannot fail. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote: On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote: To be sure, I am not very proud of this throttling mechanism for various reasons, but the thing is, _any_ throttling mechanism no matter how sucky solves the deadlock problem. Over time I want to move the make_request_fn is always called in process context, Yes, as is submit_bio which calls it. The decision re where it is best to throttle, in submit_bio or in make_request_fn, has more to do with system factoring, that is, is throttling something that _every_ block device should have (yes I think) or is it a delicate, optional thing that needs a tweakable algorithm per block device type (no I think). The big worry I had was that by blocking on congestion in the submit_bio/make_request_fn I might stuff up system-wide mm writeout. But a while ago that part of the mm was tweaked (by Andrew if I recall correctly) to use a pool of writeout threads and understand the concept of one of them blocking on some block device, and not submit more writeout to the same block device until the first thread finishes its submission. Meanwhile, other mm writeout threads carry on with other block devices. we can wait in it for memory in mempool. Although that means we already in trouble. Not at all. This whole block writeout path needs to be written to run efficiently even when normal system memory is completely gone. All it means when we wait on a mempool is that the block device queue is as full as we are ever going to let it become, and that means the block device is working as hard as it can (subject to a small caveat: for some loads a device can work more efficiently if it can queue up larger numbers of requests down at the physical elevators). By the way, ddsnap waits on a counting semaphore, not a mempool. That is because we draw our reserve memory from the global memalloc reserve, not from a mempool. And that is not only because it takes less code to do so, but mainly because global pools as opposed to lots of little special purpose pools seem like a good idea to me. Though I will admit that with our current scheme we need to allow for the total of the maximum reserve requirements for all memalloc users in the memalloc pool, so it does not actually save any memory vs dedicated pools. We could improve that if we wanted to, by having hard and soft reserve requirements: the global reserve actually only needs to be as big as the total of the hard requirements. With this idea, if by some unlucky accident every single pool user got itself maxed out at the same time, we would still not exceed our share of the global reserve. Under normal low memory situations, a block device would typically be free to grab reserve memory up to its soft limit, allowing it to optimize over a wider range of queued transactions. My little idea here is: allocating specific pages to a pool is kind of dumb, all we really want to do is account precisely for the number of pages we are allowed to draw from the global reserve. OK, I kind of digressed, but this all counts as explaining the details of what Peter and I have been up to for the last year (longer for me). At this point, we don't need to do the reserve accounting in the most absolutely perfect way possible, we just need to get something minimal in place to fix the current deadlock problems, then we can iteratively improve it. 
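The hard/soft reserve idea above is easier to see in a toy accounting model. The program below is a pure userspace illustration of the bookkeeping, with arbitrarily chosen numbers; it does not correspond to any existing kernel interface.

/* Toy model: each user declares a hard requirement (always honoured and
 * summed into the global reserve) and a soft one (granted only while the
 * global pool has slack). */
#include <stdio.h>

struct reserve_user {
	unsigned int hard;	/* pages this user must always be able to get */
	unsigned int soft;	/* extra pages it may use opportunistically   */
	unsigned int used;
};

static unsigned int global_reserve;	/* sized as the sum of hard limits only */
static unsigned int global_used;

static int reserve_get(struct reserve_user *u, unsigned int pages)
{
	unsigned int limit = u->hard;

	/* Allow up to the soft limit while the global pool is not under pressure. */
	if (global_used + pages <= global_reserve / 2)
		limit += u->soft;

	if (u->used + pages > limit)
		return -1;			/* caller must wait or throttle */
	u->used += pages;
	global_used += pages;
	return 0;
}

static void reserve_put(struct reserve_user *u, unsigned int pages)
{
	u->used -= pages;
	global_used -= pages;
}

int main(void)
{
	struct reserve_user dev = { .hard = 64, .soft = 192 };

	global_reserve = 64;			/* only the hard total is reserved */
	printf("%d\n", reserve_get(&dev, 32));	/* 0: within limits      */
	printf("%d\n", reserve_get(&dev, 64));	/* -1: would exceed them */
	reserve_put(&dev, 32);
	return 0;
}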
I agree, any kind of high-boundary leveling must be implemented in the device itself, since the block layer does not know what device is at the end and what it will need to process a given block request. I did not say the throttling has to be implemented in the device, only that we did it there because it was easiest to code that up and try it out (it worked). This throttling really wants to live at a higher level, possibly submit_bio()...bio_endio(). Someone at OLS (James Bottomley?) suggested it would be better done at the request queue layer, but I do not immediately see why that should be. I guess this is going to come down to somebody throwing out a patch for interested folks to poke at. But this detail is a fine point. The big point is to have _some_ throttling mechanism in place on the block IO path, always. Device mapper in particular does not have any throttling itself: calling submit_bio on a device mapper device directly calls the device mapper bio dispatcher. A default-initialized block device queue does provide a crude form of throttling based on limiting the number of requests. This is insufficiently precise to do a good job in the long run, but it works for now because the current gaggle of low-level block drivers do not have a lot of resource requirements and tend to behave fairly predictably (except for some irritating issues re very slow devices working in parallel with very fast devices, but... worry about that later). Network block drivers - for example
Re: Distributed storage.
On Saturday 04 August 2007 09:44, Evgeniy Polyakov wrote: On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper and is synchronous) In fact, NBD has nothing to do with device mapper. I use it as a physical target underneath ddraid (a device mapper plugin) just like I would use your DST if it proves out. I meant that to create a storage on top of several nodes one needs device mapper or something like it on top of NBD itself. To further export the resulting device one needs another userspace NBD application and so on. DST simplifies that greatly. DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? Regards, Daniel
Re: Distributed storage.
On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? At least in terms of efficiency. Device mapper lives in a happy world where memory never runs out and allocations are fast. -- Evgeniy Polyakov
Re: Distributed storage.
Hi Daniel. On Sun, Aug 05, 2007 at 01:04:19AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: we can wait in it for memory in mempool. Although that means we are already in trouble. Not at all. This whole block writeout path needs to be written to run efficiently even when normal system memory is completely gone. All it means when we wait on a mempool is that the block device queue is as full as we are ever going to let it become, and that means the block device is working as hard as it can (subject to a small caveat: for some loads a device can work more efficiently if it can queue up larger numbers of requests down at the physical elevators). If we are sleeping in a memory pool, then we already do not have enough memory to complete previous requests, so we are in trouble. This can work for devices which do not require additional allocations (like usual local storage), but not for network-connected ones. I agree, any kind of high-boundary leveling must be implemented in the device itself, since the block layer does not know what device is at the end and what it will need to process a given block request. I did not say the throttling has to be implemented in the device, only that we did it there because it was easiest to code that up and try it out (it worked). This throttling really wants to live at a higher level, possibly submit_bio()...bio_endio(). Someone at OLS (James Bottomley?) suggested it would be better done at the request queue layer, but I do not immediately see why that should be. I guess this is going to come down to somebody throwing out a patch for interested folks to poke at. But this detail is a fine point. The big point is to have _some_ throttling mechanism in place on the block IO path, always. If not in the device, then it should at least tell the block layer about its limits. What about a new function to register a queue with a maximum number of bios in flight, sleeping in generic_make_request() when a new bio is about to be submitted and would exceed the limit? By default things would stay as they are now, except for an additional non-atomic increment and branch in generic_make_request() and a decrement and wake in bio_end_io(). I can cook up such a patch if the idea is worth the effort. -- Evgeniy Polyakov
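[For illustration only, here is roughly what such a per-queue limit could look like. The structure and helper names are invented for this sketch, the check is deliberately crude (the limit can be overshot slightly under contention), and the completion-side counter is made atomic since completion may run in interrupt context - this is a sketch of the proposal, not a patch.]

    #include <linux/wait.h>
    #include <asm/atomic.h>

    struct bio_throttle {
            unsigned int      limit;    /* maximum bios allowed in flight */
            atomic_t          inflight; /* bios currently in flight */
            wait_queue_head_t wait;     /* submitters sleep here when full */
    };

    static void throttle_init(struct bio_throttle *t, unsigned int limit)
    {
            t->limit = limit;
            atomic_set(&t->inflight, 0);
            init_waitqueue_head(&t->wait);
    }

    /* called from generic_make_request() before a bio is dispatched */
    static void throttle_bio_submit(struct bio_throttle *t)
    {
            wait_event(t->wait, atomic_read(&t->inflight) < t->limit);
            atomic_inc(&t->inflight);
    }

    /* called from the bio completion path (bio_end_io time) */
    static void throttle_bio_end(struct bio_throttle *t)
    {
            if (atomic_dec_return(&t->inflight) < t->limit)
                    wake_up(&t->wait);
    }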
Re: Distributed storage.
On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote: If we are sleeping in a memory pool, then we already do not have enough memory to complete previous requests, so we are in trouble. Not at all. Any requests in flight are guaranteed to get the resources they need to complete. This is guaranteed by the combination of memory reserve management and request queue throttling. In logical terms, reserve management plus queue throttling is necessary and sufficient to prevent these deadlocks. Conversely, the absence of either one allows deadlock. This can work for devices which do not require additional allocations (like usual local storage), but not for network-connected ones. It works for network devices too, and also for a fancy device like ddsnap, which is the moral equivalent of a filesystem implemented in user space. If not in the device, then it should at least tell the block layer about its limits. What about a new function to register a queue... Yes, a new internal API is needed eventually. However, no new API is needed right at the moment because we can just hard-code the reserve sizes and queue limits and audit them by hand, which is not any more sloppy than several other kernel subsystems. The thing is, we need to keep any obfuscating detail out of the initial patches because these principles are hard enough to explain already without burying them in hundreds of lines of API fluff. That said, the new improved API should probably not be a new way to register, but a set of function calls you can use after the queue is created, which follows the pattern of the existing queue API. ...with a maximum number of bios in flight, sleeping in generic_make_request() when a new bio is about to be submitted and would exceed the limit? Exactly. This is what ddsnap currently does and it works. But we did not change generic_make_request for this driver; instead we throttled the driver from the time it makes a request to its user space server, until the reply comes back. We did it that way because it was easy and was the only segment of the request lifeline that could not be fixed by other means. A proper solution for all block devices will move the throttling up into generic_make_request, as you say below. By default things would stay as they are now, except for an additional non-atomic increment and branch in generic_make_request() and a decrement and wake in bio_end_io()? ->endio is called in interrupt context, so the accounting needs to be atomic as far as I can see. We actually account the total number of bio pages in flight, otherwise you would need to assume the largest possible bio and waste a huge amount of reserve memory. A counting semaphore works fine for this purpose, with some slight inefficiency that is nigh on unmeasurable in the block IO path. What the semaphore does is make the patch small and easy to understand, which is important at this point. I can cook up such a patch if the idea is worth the effort. It is. There are some messy details... You need a place to store the accounting variable/semaphore and need to be able to find that place again in ->endio. Trickier than it sounds, because of the unstructured way drivers rewrite ->bi_bdev. Peterz has already poked at this in a number of different ways, typically involving backing_dev_info, which seems like a good idea to me. A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission.
Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. Regards, Daniel
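[A sketch of that last idea, purely illustrative: bi_throttle is not a real field of struct bio, and the helpers reuse the names from the earlier sketch. The point is only that the pointer is set once at the top level and therefore survives ->bi_bdev rewriting by stacked drivers.]

    /* hypothetical addition to struct bio, owned by the top-level submitter:
     *         struct bio_throttle *bi_throttle;
     * stacked drivers never touch it, so the completion path can always
     * find the right accounting, no matter how ->bi_bdev was rewritten */

    static void submit_throttled(struct bio_throttle *t, struct bio *bio)
    {
            bio->bi_throttle = t;      /* set once, before any recursion */
            throttle_bio_submit(t);    /* may sleep until below the limit */
            generic_make_request(bio);
    }

    /* called from the completion handler, possibly in interrupt context */
    static void throttled_bio_done(struct bio *bio)
    {
            if (bio->bi_throttle)
                    throttle_bio_end(bio->bi_throttle);
    }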
Re: Distributed storage.
On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote: On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote: DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? At least in terms of efficiency. Device mapper lives in a happy world where memory never runs out and allocations are fast. Are you saying that things are different for a network block device because it needs to do GFP_ATOMIC allocations? If so then that is just a misunderstanding. The global page reserve Peter and I use is available in interrupt context just like GFP_ATOMIC. Regards, Daniel
Re: Distributed storage.
On Fri, Aug 03, 2007 at 06:19:16PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: It depends on the characteristics of the physical and virtual block devices involved. Slow block devices can produce surprising effects. Ddsnap still qualifies as slow under certain circumstances (big linear write immediately following a new snapshot). Before we added throttling we would see as many as 800,000 bios in flight. Nice to know the system can actually survive this... mostly. Mmm, sounds tasty to work with such a system :) But memory deadlock is a clear and present danger under those conditions and we did hit it (not to mention that read latency sucked beyond belief). Anyway, we added a simple counting semaphore to throttle the bio traffic to a reasonable number and behavior became much nicer, but most importantly, this satisfies one of the primary requirements for avoiding block device memory deadlock: a strictly bounded amount of bio traffic in flight. In fact, we allow some bounded number of non-memalloc bios *plus* however much traffic the mm wants to throw at us in memalloc mode, on the assumption that the mm knows what it is doing and imposes its own bound of in-flight bios per device. This needs auditing obviously, but the mm either does that or is buggy. In practice, with this throttling in place we never saw more than 2,000 in flight no matter how hard we hit it, which is about the number we were aiming at. Since we draw our reserve from the main memalloc pool, we can easily handle 2,000 bios in flight, even under extreme conditions. See: http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c down(info->throttle_sem); To be sure, I am not very proud of this throttling mechanism for various reasons, but the thing is, _any_ throttling mechanism no matter how sucky solves the deadlock problem. Over time I want to move the make_request_fn is always called in process context, we can wait in it for memory in mempool. Although that means we are already in trouble. I agree, any kind of high-boundary leveling must be implemented in the device itself, since the block layer does not know what device is at the end and what it will need to process a given block request. -- Evgeniy Polyakov
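[For reference, the shape of that counting-semaphore throttle is roughly the following. This is illustrative only, not the actual dm-ddsnap.c code: the names and the 1000-bio bound are made up, and error handling is omitted.]

    #include <asm/semaphore.h>

    #define MAX_INFLIGHT_BIOS 1000     /* arbitrary bound for the sketch */

    struct devinfo {
            struct semaphore throttle_sem;
    };

    static void devinfo_init(struct devinfo *info)
    {
            sema_init(&info->throttle_sem, MAX_INFLIGHT_BIOS);
    }

    /* before handing a bio to the user-space server */
    static void throttle(struct devinfo *info)
    {
            down(&info->throttle_sem); /* sleeps once the bound is reached */
    }

    /* when the reply for that bio comes back */
    static void unthrottle(struct devinfo *info)
    {
            up(&info->throttle_sem);
    }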
Re: Distributed storage.
Hi Daniel. On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper and is synchronous) In fact, NBD has nothing to do with device mapper. I use it as a physical target underneath ddraid (a device mapper plugin) just like I would use your DST if it proves out. I meant that to create a storage on top of several nodes one needs device mapper or something like it on top of NBD itself. To further export the resulting device one needs another userspace NBD application and so on. DST simplifies that greatly. DST original code worked as device mapper plugin too, but its two additional allocations (io and clone) per block request ended up for me as a show stopper. -- Evgeniy Polyakov
Re: Distributed storage.
On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote: On 7/31/07, Evgeniy Polyakov [EMAIL PROTECTED] wrote: TODO list currently includes following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon-based one will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) LDPC codes[1][2] have been replacing Turbo code[3] with regards to communication links and we have been seeing that transition. (maybe helpful, came to mind seeing the mention of Turbo code) Don't know how Weaver compares to LDPC, though I found some comparisons [4][5]. But looking at fault tolerance figures, I guess Weaver is much better. [1] http://www.ldpc-codes.com/ [2] http://portal.acm.org/citation.cfm?id=1240497 [3] http://en.wikipedia.org/wiki/Turbo_code [4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf [5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf LDPC codes require solving an N-order matrix over a finite field - exactly the reason I do not want to use Reed-Solomon codes, even with an optimized non-Vandermonde matrix. I will investigate LDPC further though. Turbo codes are like stream ciphers compared to RS codes being block ciphers. The transport medium in data storage is reliable, otherwise such storages would not even exist. -- Evgeniy Polyakov
Re: Distributed storage.
On Fri, Aug 03, 2007 at 02:26:29PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: Memory deadlock is a concern of course. From a cursory glance through, it looks like this code is pretty vm-friendly and you have thought quite a lot about it, however I respectfully invite peterz (obsessive/compulsive memory deadlock hunter) to help give it a good going over with me. Another major issue is network allocations. Your initial work and the subsequent releases made by Peter were originally opposed on my side, but now I think the right way is to combine the positive points of your approach with a specialized allocator - essentially what I proposed (in the blog only, though) is to bind an independent reserve to any socket. Such a reserve can be stolen from the socket buffer itself (each socket has a limited socket buffer that packets are allocated from; it accounts both data and control (skb) lengths), so when the main allocation via the common path fails, it would be possible to get memory from the socket's own reserve. This allows sending sockets to make progress in case of deadlock. For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. It allows separating sockets and effectively making them fair - a system administrator or programmer can limit a socket's buffer a bit and request a reserve for special communication channels, which will then be guaranteed to make both sending and receiving progress, no matter how many of them were set up. And it does not require any changes outside the network side. -- Evgeniy Polyakov
Re: Distributed storage.
On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote: For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. This is not true. I have a global reserve which is set up a priori. You cannot allocate a reserve when under pressure, that does not make sense. Let me explain my approach once again. At swapon(8) time we allocate a global reserve, and associate the needed sockets with it. The size of this global reserve is made up of two parts: TX and RX. The RX pool is the most interesting part. It again is made up of two parts: skb and auxiliary data. The skb part is scaled such that it can overflow the IP fragment reassembly, the aux pool such that it can overflow the route cache (that was the largest other allocator in the RX path). All (reserve) RX skb allocations are accounted, so as to never allocate more than we reserved. All packets are received (given the limit) and are processed up to socket demux. At that point all packets not targeted at an associated socket are dropped and the skb memory freed - ready for another packet. All packets targeted for associated sockets get processed. This requires that this packet processing happens in-kernel. Since we are swapping, user-space might be waiting for this data, and we'd deadlock. I'm not quite sure why you need per-socket reservations.
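[A rough model of the RX accounting and demux policy described above, with invented names (rx_reserve, the charge/release helpers); this is a sketch of the idea, not code from the actual patches.]

    #include <asm/atomic.h>

    struct rx_reserve {
            atomic_t used;   /* reserve skb allocations currently outstanding */
            int      limit;  /* sized to cover fragment reassembly plus aux data */
    };

    /* every reserve-backed RX skb allocation is charged against the limit */
    static int rx_reserve_charge(struct rx_reserve *r)
    {
            if (atomic_add_return(1, &r->used) > r->limit) {
                    atomic_dec(&r->used);
                    return 0;   /* over budget: caller must drop the packet */
            }
            return 1;
    }

    /* called when the skb is freed at demux time (packet not for an
     * associated socket) or after in-kernel processing completes, so the
     * reserve slot becomes available for the next packet */
    static void rx_reserve_release(struct rx_reserve *r)
    {
            atomic_dec(&r->used);
    }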
Re: Distributed storage.
On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote: For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. This is not true. I have a global reserve which is set up a priori. You cannot allocate a reserve when under pressure, that does not make sense. I probably did not give enough details - my main position is to allocate a per-socket reserve from the socket's queue, and copy data there from the main reserve, all of which is allocated either in advance (the global one) or per socket option, so that there would be no fairness issue about what to mark as special and what not. Say we have a page per socket: each socket can assign a reserve for itself from its own memory, and this accounts for both the tx and rx sides. Tx is not interesting, it is simple. Rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... -- Evgeniy Polyakov
Re: Distributed storage.
Hi Mike. On Fri, Aug 03, 2007 at 12:09:02AM -0400, Mike Snitzer ([EMAIL PROTECTED]) wrote: * storage can be formed on top of remote nodes and be exported simultaneously (iSCSI is peer-to-peer only, NBD requires device mapper and is synchronous) Having the in-kernel export is a great improvement over NBD's userspace nbd-server (extra copy, etc). But NBD's synchronous nature is actually an asset when coupled with MD raid1, as it provides guarantees that the data has _really_ been mirrored remotely. I believe that the right answer to this is a barrier, not synchronous sending/receiving, which might slow things down noticeably. A barrier must wait until the remote side has received the data and sent back a notice. Until the acknowledgement is received, no one can say whether the data was mirrored, or even received by the remote node, or not. TODO list currently includes following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon-based one will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) I'd like to better understand where you see DST heading in the area of redundancy. Based on your blog entries: http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_07_24_1.html http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_07_31_2.html (and your todo above) implementing a mirroring algorithm appears to be a near-term goal for you. Can you comment on how your intended implementation would compare, in terms of correctness and efficiency, to say MD (raid1) + NBD? MD raid1 has a write-intent bitmap that is useful to speed resyncs; what if any mechanisms do you see DST embracing to provide similar and/or better reconstruction infrastructure? Do you intend to embrace any existing MD or DM infrastructure? It depends on what algorithm will be preferred - I do not want mirroring, it is _too_ wasteful in terms of used storage, but it is the simplest. Right now I still consider WEAVER codes the fastest in a distributed environment from what I checked before, but they are quite complex and the spec is (at least for me) not clear in all aspects right now. I have not even started a userspace implementation of those codes. (Hint: the spec sucks, kidding :) For simple mirroring each node must be split into chunks, each of which has a representation bit in the main node mask; when a chunk is dirty, the full chunk is resynced. The chunk size varies depending on the node size and the amount of memory. Setup is performed during node initialization. Having a checksum for each chunk is a good step. All the interfaces are already there, although they require cleanup and moving around, but I decided to keep the initial release small. BTW, you have definitely published some very compelling work and it's sad that you're predisposed to think DST won't be received well if you pushed for inclusion (for others, as much was said in the 7.31.2007 blog post I referenced above). Clearly others need to embrace DST to help inclusion become a reality. To that end, it's great to see that Daniel Phillips and the other zumastor folks will be putting DST through its paces. In that blog entry I misspelled Zen as Xen - that was an error; as for the prognosis, time will judge :) regards, Mike -- Evgeniy Polyakov
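[A minimal sketch of that per-chunk dirty mask, assuming 1 MiB chunks purely for illustration; none of this is DST code and all names are invented.]

    #include <stdint.h>
    #include <stdlib.h>

    #define CHUNK_SHIFT 20                  /* assume 1 MiB chunks for the example */
    #define CHUNK_SIZE  (1ULL << CHUNK_SHIFT)

    struct node_mask {
            uint64_t  nr_chunks;
            uint8_t  *dirty;                /* one bit per chunk */
    };

    static int node_mask_init(struct node_mask *m, uint64_t node_bytes)
    {
            m->nr_chunks = (node_bytes + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
            m->dirty = calloc((m->nr_chunks + 7) / 8, 1);
            return m->dirty ? 0 : -1;
    }

    /* mark every chunk touched by a write while the peer node is offline */
    static void mark_dirty(struct node_mask *m, uint64_t offset, uint64_t len)
    {
            uint64_t first = offset >> CHUNK_SHIFT;
            uint64_t last  = (offset + len - 1) >> CHUNK_SHIFT;

            for (; first <= last; first++)
                    m->dirty[first >> 3] |= 1 << (first & 7);
    }

    /* resync walks the mask and copies whole dirty chunks to the returning node */
    static int chunk_is_dirty(const struct node_mask *m, uint64_t chunk)
    {
            return m->dirty[chunk >> 3] & (1 << (chunk & 7));
    }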
Re: Distributed storage.
Hi. On Fri, Aug 03, 2007 at 09:04:51AM +0400, Manu Abraham ([EMAIL PROTECTED]) wrote: On 7/31/07, Evgeniy Polyakov [EMAIL PROTECTED] wrote: TODO list currently includes following main items: * redundancy algorithm (drop me a request of your own, but it is highly unlikely that a Reed-Solomon-based one will ever be used - it is too slow for distributed RAID, I consider WEAVER codes) LDPC codes[1][2] have been replacing Turbo code[3] with regards to communication links and we have been seeing that transition. (maybe helpful, came to mind seeing the mention of Turbo code) Don't know how Weaver compares to LDPC, though I found some comparisons [4][5]. But looking at fault tolerance figures, I guess Weaver is much better. [1] http://www.ldpc-codes.com/ [2] http://portal.acm.org/citation.cfm?id=1240497 [3] http://en.wikipedia.org/wiki/Turbo_code [4] http://domino.research.ibm.com/library/cyberdig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf [5] http://hplabs.hp.com/personal/Jay_Wylie/publications/wylie_dsn2007.pdf Great, thanks for these links, I will definitely study them. -- Evgeniy Polyakov
Re: Distributed storage.
On Thu, Aug 02, 2007 at 02:08:24PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: Hi. I'm pleased to announce first release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. Excellent! This is precisely what the doctor ordered for the OCFS2-based distributed storage system I have been mumbling about for some time. In fact the dd in ddsnap and ddraid stands for distributed data. The ddsnap/raid devices do not include an actual network transport; that is expected to be provided by a specialized block device, which up till now has been NBD. But NBD has various deficiencies as you note, in addition to its tendency to deadlock when accessed locally. Your new code base may be just the thing we always wanted. We (zumastor et al) will take it for a drive and see if anything breaks. That would be great. Memory deadlock is a concern of course. From a cursory glance through, it looks like this code is pretty vm-friendly and you have thought quite a lot about it, however I respectfully invite peterz (obsessive/compulsive memory deadlock hunter) to help give it a good going over with me. I see bits that worry me, e.g.: + req = mempool_alloc(st->w->req_pool, GFP_NOIO); which seems to be callable in response to a local request, just the case where NBD deadlocks. Your mempool strategy can work reliably only if you can prove that the pool allocations of the maximum number of requests you can have in flight do not exceed the size of the pool. In other words, if you ever take the pool's fallback path to normal allocation, you risk deadlock. The mempool should be allocated to be able to cover the maximum number of in-flight requests; in my tests I was unable to force the block layer to put more than 31 pages into sync writeout, and those were in one bio. Each request is essentially delayed bio processing, so this must handle the maximum number of in-flight bios (if they do not cover multiple nodes; if they do, then each node requires its own request). Sync has one bio in flight on my machines (from tiny VIA nodes to low-end amd64); the number of normal requests *usually* does not exceed several dozen (always less than a hundred), but that might be just my small systems, so the request size was selected as small as possible and the number of allocations decreased to the absolute healthcare minimum. Anyway, if this is as grand as it seems then I would think we ought to factor out a common transfer core that can be used by all of NBD, iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own code those things have now. Regards, Daniel Thanks. -- Evgeniy Polyakov
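[To make the sizing rule concrete, a hedged sketch: the pool's minimum element count must cover the worst-case number of requests in flight, so a GFP_NOIO allocation from it never has to fall through to the regular allocator. The struct name, its fields and the 128-request bound are placeholders, not DST code.]

    #include <linux/mempool.h>
    #include <linux/bio.h>

    #define MAX_INFLIGHT_REQS 128      /* must bound what can ever be queued */

    struct dst_request {               /* placeholder per-bio bookkeeping */
            struct bio *bio;
            void       *node;
    };

    static mempool_t *req_pool;

    static int req_pool_init(void)
    {
            req_pool = mempool_create_kmalloc_pool(MAX_INFLIGHT_REQS,
                                                   sizeof(struct dst_request));
            return req_pool ? 0 : -ENOMEM;
    }

    static struct dst_request *req_get(void)
    {
            /* deadlock-safe only if in-flight requests never exceed
             * MAX_INFLIGHT_REQS; beyond that the allocation falls back to
             * the page allocator and can block under memory pressure */
            return mempool_alloc(req_pool, GFP_NOIO);
    }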
Re: Distributed storage.
On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote: On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote: For receiving, the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve with the global one (or just copy the packet into a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems it was not taken into account, and in Peter's patches the reserve was always allocated only when the system was under serious memory pressure, with no notion of per-socket reservation. This is not true. I have a global reserve which is set up a priori. You cannot allocate a reserve when under pressure, that does not make sense. I probably did not give enough details - my main position is to allocate a per-socket reserve from the socket's queue, and copy data there from the main reserve, all of which is allocated either in advance (the global one) or per socket option, so that there would be no fairness issue about what to mark as special and what not. Say we have a page per socket: each socket can assign a reserve for itself from its own memory, and this accounts for both the tx and rx sides. Tx is not interesting, it is simple. Rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... Ah, I think I understand now. Yes this is indeed a good idea! It would be quite doable to implement this on top of what I already have. We would need to extend the socket with a sock_opt that would reserve a specified amount of data for that specific socket. And then on socket demux check if the socket has a non-zero reserve and has not yet exceeded said reserve. If so, process the packet. This would also quite neatly work for -rt where we would not want incoming packet processing to be delayed by memory allocations.
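[What such a knob might look like from user space, for illustration only: SO_MEM_RESERVE is a made-up option name, no such sockopt exists, and the kernel-side behaviour is described only in the comment.]

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <stdio.h>

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int reserve = 4096;        /* e.g. one page reserved for this socket */

            if (fd < 0) {
                    perror("socket");
                    return 1;
            }

    #ifdef SO_MEM_RESERVE              /* hypothetical option, does not exist */
            if (setsockopt(fd, SOL_SOCKET, SO_MEM_RESERVE,
                           &reserve, sizeof(reserve)) < 0)
                    perror("setsockopt");
    #endif
            /* kernel side (sketch): at demux, a packet for this socket is kept
             * as long as the socket's accounted usage stays below 'reserve';
             * packets for sockets without a reserve are dropped and their
             * memory immediately recycled into the global pool. */
            return 0;
    }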
Re: Distributed storage.
On Friday 03 August 2007 06:49, Evgeniy Polyakov wrote: ...rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... And when the global reserve is entirely used up your system goes back to dropping vm writeout acknowledgements, not so good. I like your approach, and specifically the copying idea cuts out considerable complexity. But I believe the per-socket flag to mark a socket as part of the vm writeout path is not optional, and in this case it will be a better world if it is a slightly unfair world in favor of vm writeout traffic. Ssh will still work fine even with vm getting priority access to the pool. During memory crunches, non-vm ssh traffic may get bumped till after the crunch, but vm writeout is never supposed to hog the whole machine. If vm writeout hogs your machine long enough to delay an ssh login then that is a vm bug and should be fixed at that level. Regards, Daniel
Re: Distributed storage.
On Friday 03 August 2007 07:53, Peter Zijlstra wrote: On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote: On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra wrote: ...my main position is to allocate a per-socket reserve from the socket's queue, and copy data there from the main reserve, all of which is allocated either in advance (the global one) or per socket option, so that there would be no fairness issue about what to mark as special and what not. Say we have a page per socket: each socket can assign a reserve for itself from its own memory, and this accounts for both the tx and rx sides. Tx is not interesting, it is simple. Rx has a global reserve (always allocated on startup, or at least way before reclaim/oom) where data is originally received (including skb, shared info and whatever else is needed; a page is just an example); the data is then copied into the per-socket reserve and the global one is reused for the next packet. Having a per-socket reserve allows progress in any situation, not only in cases where a single action must be received/processed, and it is completely fair for all users rather than only special sockets - thus the admin, for example, would be able to log in, ipsec would work and so on... Ah, I think I understand now. Yes this is indeed a good idea! It would be quite doable to implement this on top of what I already have. We would need to extend the socket with a sock_opt that would reserve a specified amount of data for that specific socket. And then on socket demux check if the socket has a non-zero reserve and has not yet exceeded said reserve. If so, process the packet. This would also quite neatly work for -rt where we would not want incoming packet processing to be delayed by memory allocations. At this point we need anything that works in mainline as a starting point. By erring on the side of simplicity we can make this understandable for folks who haven't spent the last two years wallowing in it. The page-per-socket approach is about as simple as it gets. I therefore propose we save our premature optimizations for later. It will also help our cause if we keep any new internal APIs to strictly what is needed to make deadlock go away. Not a whole lot more than just the flag to mark a socket as part of the vm writeout path when you get right down to essentials. Regards, Daniel