Re: The future of NetBSD by Charles M. Hannum
Jonathon McKitrick wrote:

: I'm starting to imagine the size of the Lisp image I could run on a cluster
: like the kind being discussed ;-)
:
: Jonathon McKitrick
: --
: My other computer is your Windows box.

Go and wath out your mouth with thoap! ;-)

Bill
Matthew Dillon wrote:

:On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
:: that 75% of the interest in our project has nothing to do with my
:: project goals but is instead directly associated with work being done
:: by our relatively small community. I truly appreciate that effort
:: because it allows me to focus on the part that is most near and dear
:: to my own heart.
:
:Big question: after all the work that will go into the clustering, other than
:scientific research, what will the average user be able to use such advanced
:capability for?
:
:Jonathon McKitrick

I held off answering because I became quite interested in what others thought the clustering would be used for. Let's take a big, big step back and look at what the clustering means from a practical standpoint.

There are really two situations involved here. First, we certainly can allow you to say 'hey, I am going to take down machine A for maintenance', giving the kernel time to migrate all resources off of machine A. But being able to flip the power switch on machine A without warning, or otherwise have a machine fail unexpectedly, is another ball of wax entirely. There are only a few ways to cope with such an event:

(1) Processes with inaccessible data are killed. High-level programs such as 'make' would have to be made aware of this possibility, process the correct error code, and restart the killed children (e.g. compiles and such). In this scenario, only a few programs would have to be made aware of this type of failure in order to reap large benefits from a big cluster, such as the ability to do massively parallel compiles or graphics or other restartable things.

(2) You take a snapshot every once in a while, and if a process fails on one machine you recover an earlier version of it on another (including rolling back any file modifications that were made).

(3) You run the cpu context in tandem on multiple machines so that if one machine fails another can take over without a break.
This is really an extension of the rollback mechanism, but with additional requirements, and it is particularly difficult to accomplish with a threaded program where there may be direct memory interactions between threads. Tandem operation is possible with non-threaded programs, but all I/O interactions would have to be synchronization points (and thus performance would suffer). Threaded programs would have to be aware of the tandem operation, or else we would have to make writing to memory a synchronization point too (and even then I am not convinced it is possible to keep two wholly duplicate copies of the program operating in tandem). Needless to say, a fully redundant system is very, very complex.

My 2-year goal is NOT to achieve #3. It is to achieve #1 and also have the ability to say 'hey, I'm taking machine BLAH down for maintenance, migrate all the running contexts and related resources off of it please'. Achieving #2 or #3 in a fully transparent fashion is more like a 5-year project, and you would take a very large performance hit in order to achieve it.

But let's consider #1... consider the things you actually might want to accomplish with a cluster: large simulations, huge builds, or simply providing resources to other projects that want to do large simulations or huge builds. Only a few programs like 'make' or the window manager have to actually be aware of the failure case in order to be able to restart the killed programs and make a cluster useful for a very large class of work product. Even programs like sendmail and other services can operate fairly well in such an environment.

So what can the average user do?

* The average user can support a third-party project by providing cpu, memory, and storage resources to that project. (Clearly there are security issues involved, but even so there is a large class of problems that can be addressed.)
* The average user wants to leverage the cpu and memory resources of all his networked machines for things like builds (buildworld, pkg builds, etc.)... batch operations which can be restarted if a failure occurs. So, consider: the average user has his desktop, and most processes are running locally, but he also has other machines, and they tie into a named cluster based on the desktop. The cluster would 'see' the desktop's filesystems but otherwise operate as a separate system. The average user would then be able to log in to the 'cluster' and run things that take advantage of all the machines' resources.

* The average user might be part of a large project that has access to a cluster.
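Failure mode #1 above amounts to a simple supervision pattern: a driver program (a 'make'-style tool, a build farm front end) detects the distinct error code meaning "killed because its node became inaccessible" and re-queues the job. A minimal sketch in Python, where the `NODE_LOST` exit status is purely hypothetical - the thread does not specify what error code a cluster kernel would actually report:

```python
import subprocess

# Hypothetical status: assume the cluster kills processes whose data became
# inaccessible and that this is reported to the parent as a distinct exit code.
NODE_LOST = 86  # illustrative value only, not a real DragonFly interface

def run_restartable(cmd, max_retries=3):
    """Run cmd, restarting it if it was killed due to node failure."""
    for attempt in range(max_retries + 1):
        result = subprocess.run(cmd)
        if result.returncode != NODE_LOST:
            # Normal completion: success, or a real error the caller handles.
            return result.returncode
        print(f"node lost while running {cmd!r}, "
              f"restarting (attempt {attempt + 1})")
    raise RuntimeError(f"gave up on {cmd!r} after {max_retries} restarts")

# A 'make'-style driver would wrap each compile job this way, so a machine
# dropping out of the cluster costs only the work in flight on that machine.
```

This is why only a few programs need changing: anything already structured as restartable batch jobs gets cluster fault tolerance from one small wrapper in the driver.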
On Sat, Sep 02, 2006 at 05:54:14PM +0800, Bill Hacker wrote:
: Jonathon McKitrick wrote:
:
: I'm starting to imagine the size of the Lisp image I could run on a cluster
: like the kind being discussed ;-)
:
: Jonathon McKitrick
: --
: My other computer is your Windows box.
:
: Go and wath out your mouth with thoap!

Sorry, but I'm never coming back after discovering Lisp. ;-P

Jonathon McKitrick
--
My other computer is your Windows box.
Jonathon McKitrick wrote:

: On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
: : that 75% of the interest in our project has nothing to do with my
: : project goals but is instead directly associated with work being done
: : by our relatively small community. I truly appreciate that effort
: : because it allows me to focus on the part that is most near and dear
: : to my own heart.
:
: Big question: after all the work that will go into the clustering, other than
: scientific research, what will the average user be able to use such advanced
: capability for?
:
: Jonathon McKitrick
: --
: My other computer is your Windows box.

Well, I for one would be thrilled if achieving high availability were as simple as putting another box in my cluster somewhere else on the internet, and if backups became easy with snapshots like those implemented in ZFS. It's real peace of mind to know that a box is expected to fail at some point, and that when one does I don't need to figure out what went wrong; I just tell that administrator across the ocean to put another box in the cluster and remove the bad one when he has the time. No more 24h administration, no more emergency calls because of bad hardware. I could finally get to the more important stuff in my job, like drinking coffee, socializing with that cute secretary, and recreating solutions that are just perfect for problems I'll never have. Or porting that platform-independent Python program that some brain-dead developer has found a way to make Linux-only.

my € 0,02
--
mph
On Fri, September 1, 2006 12:45 pm, Matthew Dillon wrote:
: So what can the average user do?
:
: * The average user can support a third-party project by providing cpu,
:   memory, and storage resources to that project. (Clearly there are
:   security issues involved, but even so there is a large class of
:   problems that can be addressed.)

It would be neat, in terms of both speed and community, if we could have binary builds of pkgsrc for DragonFly accomplished by *everyone*.
I'm starting to imagine the size of the Lisp image I could run on a cluster like the kind being discussed ;-)

Jonathon McKitrick
--
My other computer is your Windows box.
On Fri, 1 Sep 2006 09:45:32 -0700 (PDT) Matthew Dillon [EMAIL PROTECTED] wrote:

: :On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
: :: that 75% of the interest in our project has nothing to do with my
: :: project goals but is instead directly associated with work being done
: :: by our relatively small community. I truly appreciate that effort
: :: because it allows me to focus on the part that is most near and dear
: :: to my own heart.
: :
: :Big question: after all the work that will go into the clustering, other than
: :scientific research, what will the average user be able to use such advanced
: :capability for?
: :
: :Jonathon McKitrick
:
: I held off answering because I became quite interested in what others
: thought the clustering would be used for. Let's take a big, big step back
: and look at what the clustering means from a practical standpoint.
:
: There are really two situations involved here. First, we certainly can
: allow you to say 'hey, I am going to take down machine A for maintenance',
: giving the kernel time to migrate all resources off of machine A. But
: being able to flip the power switch on machine A without warning, or
: otherwise have a machine fail unexpectedly, is another ball of wax
: entirely. There are only a few ways to cope with such an event:
:
: (1) Processes with inaccessible data are killed. High-level programs such
: as 'make' would have to be made aware of this possibility, process the
: correct error code, and restart the killed children (e.g. compiles and
: such). In this scenario, only a few programs would have to be made aware
: of this type of failure in order to reap large benefits from a big
: cluster, such as the ability to do massively parallel compiles or graphics
: or other restartable things.

This is also quite good enough from my point of view. I think my post may have given the impression that I was expecting #3 to appear - I certainly was not; I know how hard that is.
In fact #1 is more than I was hoping for. Having the make fail and a few windows close, but being able to reopen them and restart the make by hand, would be orders of magnitude better than I can achieve now with periodic rsync and a fair amount of fiddling around to get environments running on a backup machine when I have a hardware failure.

--
C:WIN                        | Directable Mirror Arrays
The computer obeys and wins. | A better way to focus the sun
You lose and Bill collects.  | licences available see
                             | http://www.sohara.org/
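The "restart the make by hand" recovery described above can even be automated with a dumb retry loop around any restartable batch job. A sketch in Python; the command and retry policy are purely illustrative, not part of any proposed DragonFly interface:

```python
import subprocess
import time

def retry_build(cmd, retries=5, delay=2.0):
    """Re-run a restartable batch job (e.g. a 'make') until it succeeds,
    tolerating transient failures such as a cluster node dropping out."""
    for attempt in range(1, retries + 1):
        if subprocess.run(cmd).returncode == 0:
            return True  # build finished cleanly
        print(f"build failed (attempt {attempt}/{retries}), retrying...")
        time.sleep(delay)
    return False  # persistent failure; a human should look at it

# e.g. retry_build(["make", "-j8", "buildworld"])
```

The point of failure mode #1 is exactly that this kind of blunt outer loop is sufficient: the work lost is only what was in flight when the node died.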
:Hello,
:
:I found this message on the NetBSD mailing list, and it
:makes quite interesting reading. It talks about negative
:aspects of the NetBSD project and ways of fixing the
:project's problems.
:
:I hope it can be a useful read for others; for me it is
:quite interesting. He mentions DragonFly, so I think it
:is worth mentioning here because of that ;)
:
:http://mail-index.netbsd.org/netbsd-users/2006/08/30/0016.html
:
:Regards,
:timofonic

It's very interesting, and it should serve as a caution that no open-source project lasts forever - and that no open-source project ever truly dies, either. What happens is that people move on, others fill the gaps, and eventually, even if it winds up being 20 years later, the best pieces of the project morph into something else entirely.

For my part, I have a very clear set of personal goals that I want to achieve with DragonFly, but regardless of my own goals the concept of 'getting behind' in various areas is one that we, facing similarly low numbers of developers, have to deal with every day. In many respects, the interest in the DragonFly project is supported as much by the work that everyone is doing to keep the project up to date as it is by my lofty clustering goals. In fact, I would say that 75% of the interest in our project has nothing to do with my project goals but is instead directly associated with work being done by our relatively small community. I truly appreciate that effort because it allows me to focus on the part that is most near and dear to my own heart.

-Matt
Matthew Dillon [EMAIL PROTECTED]
On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
: that 75% of the interest in our project has nothing to do with my
: project goals but is instead directly associated with work being done
: by our relatively small community. I truly appreciate that effort
: because it allows me to focus on the part that is most near and dear
: to my own heart.

Big question: after all the work that will go into the clustering, other than scientific research, what will the average user be able to use such advanced capability for?

Jonathon McKitrick
--
My other computer is your Windows box.
Jonathon McKitrick wrote:

: On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
: : that 75% of the interest in our project has nothing to do with my
: : project goals but is instead directly associated with work being done
: : by our relatively small community. I truly appreciate that effort
: : because it allows me to focus on the part that is most near and dear
: : to my own heart.
:
: Big question: after all the work that will go into the clustering, other than
: scientific research, what will the average user be able to use such advanced
: capability for?

Heh. I'm probably not an average user (I'm just an amateur OS geek), but the day I can compile OpenOffice in under an hour on my cluster of five DragonFly PCs will be the day I die and go straight to Heaven. Yes, I once hoped to achieve World Peace, but I'm older now and my goals have become slightly less ambitious.
On Thu, August 31, 2006 3:42 pm, Jonathon McKitrick wrote:
: Big question: after all the work that will go into the clustering, other
: than scientific research, what will the average user be able to use such
: advanced capability for?

Lots. To get to a single system image, the operating system has to be made less obfuscated. A cleaner system means less pain for the developers and fewer bugs due to obscurity. Making the system multiprocessor-safe generally improves uniprocessor speed, and in any case we are quickly reaching the point where all processors are dual-core. New ideas, like Kip Macy's checkpointing work, can be tried out and judged on nothing but their technical merits. While Matt concentrates on his work, other people are free to add to the system. For instance, we haven't removed any commit bits because of disagreements over project direction. :)

Keep in mind that, as Matt pointed out, a lot of what makes DragonFly interesting is the other work being done. Clustering is one part of many.