*music* Don't tell anyone I'm free... *music* Don't tell anyone I'm free...
Commentary: Hello, and welcome to BSDTalk number 98. It's Wednesday, February 7, 2007. I just have an interview with Matthew Dillon today, so we'll go right to it.

Host: Today on BSDTalk, we're speaking to Matthew Dillon. Welcome to the show.

Matt: Thank you.

Host: So, recently there has been a release, so maybe you can start by giving us a basic overview of what's new in DragonFlyBSD.

Matt: The biggest user-visible item in this release is the virtual kernel support, which is basically similar to UML (User Mode Linux), but DragonFly on DragonFly: running a DragonFly kernel as a single process on a DragonFly system. My primary reason for doing this first is that it became very obvious that the engineering cycle time just to test and make changes was going way too slowly for kernel work, and for the features that we really want to get into the system, the ultimate goals of DragonFly ---the clustering, the new filesystem--- we really needed the virtual kernel to give us an instant-run, instant-test cycle.

Host: So for people who aren't familiar with the different levels of virtualization, where do these virtual kernels fit in between BSD jails versus Xen or QEMU?

Matt: You have a jail-like environment, you have a Xen-like environment, and you have a QEMU- or VMware-like environment, and they're different levels of virtualization. The jail doesn't really try to virtualize anything; it just restricts what the programs have access to while they run on the same system. It's not really virtualizing a kernel, it's just making it look that way, kind of faking it. It still has many uses, because it's very high performance, precisely because it's not virtualizing anything. In a Xen-like environment you're running a kernel that's aware of the Xen environment, so it's not 100% hardware virtualization. It's aware that it's in a Xen environment and it makes allowances for that by doing special calls when it needs to do something with the VM system, for example, or with drivers. A VMware or QEMU virtualization is a full hardware virtualization where you run the same kernel and system that you would run on bare hardware, but you run it on QEMU or VMware and it just runs---or at least, that's the idea.

Where we come in is closer to Xen than anything else. The DragonFly virtual kernel is very aware that it is running as a virtual kernel. It only runs a real DragonFly kernel, so it's not like Xen, where we're trying to allow or support multiple guest operating systems. What we're really doing is supporting a DragonFly kernel running under a DragonFly kernel, and we're doing that by giving the real kernel (and the virtual kernel) additional system calls that allow the manipulation of VM spaces, the interception of system calls, and that sort of thing. So the DragonFly virtual kernel isn't actually doing anything special other than using these new system calls to manage its own memory space and to manage the virtualized user processes that are running under it. The advantage of this is that the real kernel has very low resource overhead. The real kernel is only managing page tables for the virtualized processes being run by the virtual kernel, and on nearly all BSD systems (certainly FreeBSD and DragonFly), page tables are all throw-away, which means that the real kernel can throw them away at any time and take extra faults to regenerate them later on. So the resource overhead is very low, except for the memory issue, which is the amount of memory that you tell the virtual kernel it has.
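[Editor's note: a rough, hypothetical sketch of the arrangement described above, in which the virtual kernel is an ordinary process that uses host-provided system calls to run guest code in a separate VM space and regains control on every trap or system call. The vkspace_* names, the exit codes, and the signatures below are illustrative placeholders, not DragonFly's actual API.]

    /*
     * Hypothetical sketch only: the vkspace_* names, exit codes, and
     * signatures are placeholders, not DragonFly's real system calls.
     * The idea: the virtual kernel is one ordinary process that asks the
     * real kernel to execute guest code inside a separate VM space, and
     * it gets control back whenever the guest traps or makes a syscall.
     */
    #include <stdio.h>

    struct vk_trapframe;                    /* guest register state (opaque) */

    extern int vkspace_create(void);                            /* new VM space */
    extern int vkspace_run(int space, struct vk_trapframe *tf); /* run guest */

    #define VK_EXIT_SYSCALL   1    /* guest issued a system call */
    #define VK_EXIT_PAGEFAULT 2    /* guest faulted on its virtual page table */

    void
    run_virtualized_process(struct vk_trapframe *tf)
    {
        int space = vkspace_create();

        for (;;) {
            /* The real kernel switches the guest VM space in, runs it,
             * and returns here when the guest needs the virtual kernel. */
            switch (vkspace_run(space, tf)) {
            case VK_EXIT_SYSCALL:
                /* The real kernel intercepted the guest's system call;
                 * the virtual kernel services it itself. */
                break;
            case VK_EXIT_PAGEFAULT:
                /* Resolve the fault against the guest's page table,
                 * then resume the guest. */
                break;
            default:
                printf("guest exited\n");
                return;
            }
        }
    }

DragonFly's real interface differs in its details; the sketch is only meant to show the shape of the control flow, and why the real kernel's own bookkeeping for a guest stays small.]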
Host: What about filesystem space?

Matt: Well, the virtual kernel is basically a kernel, so right now you give it a root image, basically a virtualized hard drive: just a 50GB file that you hand to the virtual kernel, and that's what it uses as a block device for its filesystem. We've given it a network interface using the tap device, so the virtual kernel can communicate with the outside world via the network interface, which means that of course it can also use NFS and any other networking protocol to access filesystems. Eventually, as part of the clustering work, we're going to have a protocol called SYSLINK, which will be used to link the clustered hosts together, and that will also have the capability of doing a filesystem transport, a device transport, and transport for other resources such as CPU contexts and VM contexts and that sort of thing.
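[Editor's note: a minimal sketch of the tap-device interface mentioned above, assuming a BSD-style /dev/tap0 clone device. Device naming and the host-side bridge or routing setup vary from system to system, and this is not the virtual kernel's actual networking code.]

    /*
     * Illustrative only: read raw Ethernet frames from a BSD tap device,
     * the kind of interface the virtual kernel uses to reach the network.
     * Assumes a /dev/tap0 node; device naming varies by system.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        unsigned char frame[2048];
        ssize_t n;
        int fd = open("/dev/tap0", O_RDWR);   /* attach to the tap0 interface */

        if (fd < 0) {
            perror("open /dev/tap0");
            return 1;
        }

        /* Each read() returns one Ethernet frame sent to tap0 by the host's
         * network stack; write() would inject frames in the other direction. */
        while ((n = read(fd, frame, sizeof(frame))) > 0)
            printf("received a %zd-byte frame\n", n);

        close(fd);
        return 0;
    }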
Host: And why develop your own virtual kernel technology instead of using existing ones, perhaps like Xen or QEMU?

Matt: Well, the problem with QEMU and VMware is that they aren't really good for testing kernels. They're good for running systems. In many cases they can be higher performing than a virtual kernel, although theoretically, as things mature, I think the virtual kernel technology --that is, the Xen-like technology-- will be able to exceed that, because it is aware of its context. But the issue with VMware and QEMU is that since they're emulating a hardware environment, if a kernel takes 5 minutes to boot on real hardware, it takes at least 5 minutes to boot on QEMU or VMware, or longer, depending on what devices you've got. Even if you can get it down to 30 seconds, it still takes longer to boot than the 5 seconds it takes for a virtual kernel to boot. Another big difference is that since the virtual kernel is simply a process running under the real kernel, you can GDB it live on the real kernel, so it really makes debugging a whole lot easier when the virtual kernel is sitting there as a process that you can manipulate.

Host: And what other features are coming for 1.8? Or I should say, not coming, it's already here; what features are in 1.8?

Matt: Well, I can go through the diary... there are a ton of application updates, a ton of bug fixes, and a ton of driver updates, especially the wireless drivers and network drivers. One of our developers really had a focus on those. There's a lot of under-the-hood work to continue to get the DragonFly kernel staged and ready for the upcoming clustering work. For example, I'm slowly integrating the cache coherency management into the kernel. It's actually partially in there now in 1.8, it's just not doing anything yet. We revamped the way the namespace topology works a little and got rid of the aliases that we had there before to handle NULLFS mounts. NULLFS is basically the ability to take an existing portion of the filesystem and mount it somewhere else. It's kind of like a hardlinked directory, but it looks like a separate mount to the system, so you can construct jailed or chrooted environments with portions of your real filesystem mounted read-only, to avoid having to duplicate data. We did a major rewrite of NULLFS to make that very, very efficient. There were a lot of cleanups in the filesystem APIs to get them ready for the SYSLINK protocol, the ability to run filesystems over communications links, and basically a lot of under-the-hood work. We also brought in GCC 4.1. The way the DragonFly compiler environment works is that we can actually have multiple compilers installed on the system and switch between them with a simple environment variable. So our base compiler is still GCC 3.4, but we brought 4.1 in, and there are people who, simply by setting the environment variable, are building the tree with that compiler. We'll probably go live with it when 4.2 comes out; we're probably not going to go live with GCC 4.1.

Host: Are there other features that didn't quite make it into 1.8 that you're looking forward to putting into 2.0?

Matt: We did port the FreeBSD newATA IDE driver, and we're still working on stabilizing that; it's in there, but it's really just for debugging. Hopefully that will go live sometime during this semester and be the default in 2.0.

Host: One of the things that you did with DragonFlyBSD was to start with FreeBSD and take a different path. Have you found it more difficult over time to import work from the other BSDs as you've headed down your own road?

Matt: For the work that makes sense to import, it's been fairly easy. The real issue... well, there are two issues. One is bringing in application support, and as you may know, we went with pkgsrc for that, and that's pretty painless compared to trying to continuously keep the FreeBSD ports up to date. The second issue is, of course, hardware drivers. For wholesale replacements, for example bringing in nATA, it is a considerable amount of work now, because the DragonFly driver infrastructure has really diverged significantly from FreeBSD. But for bug fixes and incremental fixes, things like that, it's pretty painless. You just take a diff of the FreeBSD tree, see what they fixed, and you fix it in the DragonFly tree, or vice versa.

Host: Some people seem to have this fixation with major-number bumps, such as the 1-series to the 2-series. Are you just doing a continuation of numbers, or is there anything in particular...

Matt: Well, I don't want to go to 1.10. I think that's a little confusing, and it doesn't sort well---1.10 is less than 1.8 if you string sort it. Will it be a 2.0 as in boldface 2.0? Probably not, but we're in the second half of the project now. We're at the point where, sure, there's still a lot of SMP work to go, but the infrastructure is really ready for the second part of the project, and the second part of the project is really the clustering, the system link, and the new filesystem. 2.0 will not have those in production, but 2.1 certainly will, or at least it will certainly have the new filesystem in production. I don't really think that you can have new features in production in a dot-zero anyway, so I think 2.0 is fairly well positioned, but 2.1 will be the real deal.

Host: And if people read your blog on the DragonFly website, they will see that every once in a while there is some talk about ZFS or other filesystems that you're considering for your clustered filesystem. Can you talk a little more about that?

Matt: Yeah, I've been thinking a lot about that. There are a lot of issues. One of the issues with ZFS is that it would probably take as much time to port it as it would to build a new filesystem from scratch using similar principles. Another issue with ZFS is that it's not really a cluster filesystem; it's really a mass-storage filesystem. You have hundreds of terabytes and you don't want to lose any of it---that's ZFS.
What we really need for DragonFly is a cluster filesystem capable of supporting hundreds of terabytes without losing it. ZFS doesn't quite fit the bill, but many of the features of ZFS are features that I really like, like the hashing of the data and having hash checks on the data and that kind of thing---the snapshotting. To give you an example of what we need versus what ZFS has: you have a snapshot capability in ZFS, but what we really need in DragonFly is more like a database transaction, where you have an effectively infinite number of snapshots, one every time the filesystem syncs, and at any point you can mount a snapshot as of any point in the past whose old data you haven't yet deleted. The reason we need that in a clustered system is that you need to be able to do cache coherency management, and you need to be able to do it without transferring unnecessary amounts of data. So if every bit of data has, say, a transaction id, some kind of relative timestamp, you can say, "OK, here's my cached data, here's the timestamp that came with the cached data, and here's this other program that I need to be synchronized with, and it has an idea of its synchronization point. If those timestamps match, then my cache is good; if they don't match, then I need to go through some kind of synchronization mechanism to get the most up-to-date data." In a cluster system, where you potentially have execution contexts all over the place, you can't afford to just copy the data back and forth willy-nilly. You really have to copy only the data that needs to be copied, and only when it needs to be updated.

Host: Are there any candidate cluster filesystems that are BSD-licensed or have a license that would be appropriate for DragonFly?

Matt: Not that I know of, but I'm not the type of person who really goes through and looks at the research papers very deeply. I'm the kind of person who just likes to invent---you know, sit down and say, "OK, here are the basic ideas of what filesystems A, B, and C do that I like; I'm going to pull those ideas in and implement it from scratch." That's kind of who I am.

Host: So, hopefully in the next couple of releases we'll be able to create a virtual cluster on a single piece of hardware to test out all this fun stuff.

Matt: That's one of the main reasons why that virtual kernel support is in 1.8. It's precisely what we'll be able to do. It's really the only way we'll be able to test the clustering code and the cache coherency management: to have these virtualized systems that, as far as they're concerned, are running on separate boxes with communications links between them. Otherwise, there's just no way we could test it and have an engineering cycle time that would get us to our goal in the next two years.

Host: So is the project looking for developers with any specific skillsets, or any particular hardware that people could help out with?

Matt: We're always looking for people who are interested in various aspects of the system. It doesn't necessarily have to be the clustering code; it doesn't necessarily have to be the filesystem code. There's still a lot of work that needs to be done with SMP, for example on the network stack. There are a lot of loose ends in the codebase that work fine now but are still under the big giant lock and could, with some attention, be taken out from under that big giant lock and really made to operate in parallel on an SMP system. It just hasn't been a priority for me, because the main project goal is the clustering.
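[Editor's note: a toy sketch of the cache-validity check described above, in which each piece of cached data carries a transaction id and a peer's synchronization point is compared against it before any data is transferred. The structure and names are hypothetical and purely illustrative.]

    /*
     * Toy illustration of transaction-id based cache validation; the names
     * and layout are hypothetical, not DragonFly code.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cached_block {
        uint64_t tid;         /* transaction id (sync point) of the cached copy */
        char     data[4096];  /* the cached data itself */
    };

    /*
     * If the peer's idea of the current synchronization point matches the
     * transaction id stored with our cached copy, the cache is still good
     * and nothing needs to be transferred; otherwise we must resynchronize.
     */
    static bool
    cache_is_current(const struct cached_block *cb, uint64_t peer_sync_tid)
    {
        return cb->tid == peer_sync_tid;
    }

    int
    main(void)
    {
        struct cached_block cb = { .tid = 42 };

        printf("peer at 42: %s\n",
            cache_is_current(&cb, 42) ? "cache good" : "resync needed");
        printf("peer at 43: %s\n",
            cache_is_current(&cb, 43) ? "cache good" : "resync needed");
        return 0;
    }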
Host: Are you working on any other projects that are BSD-related, but aren't DragonFlyBSD, these days?

Matt: No, not right now. DragonFly is the main project. I have this list of projects that I want to do in my lifetime, and I'm never going to be able to do all of them. Right now my focus is DragonFly, probably for the next two years, and at the end of those two years my expectation is that we're going to have this clustered system with a very redundant filesystem that's fairly high-performance.

Host: Some people are never satisfied with running a released system. They really want to run the bleeding edge. If people want to start working on and testing 1.9, how can they get a hold of that?

Matt: You just go to our site. We have the release branch and the HEAD branch, and people can choose to run what they want; of course, we suggest the release branch for anyone running any production UNIX stuff. We've managed to keep the HEAD branch fairly stable throughout most of the development cycles for the 1.2, 1.4, 1.6, and 1.8 releases, and I expect that to continue. It's always good to have a fairly stable HEAD. But occasionally something on the HEAD branch will break, and it will be noticed and fixed or reported, or we'll just have to say "this is going to be broken for a week" or something like that. You just have to be aware of that when you're working on the HEAD of the source development tree.

Host: Is it possible to install 1.8 and then run the HEAD branch on a virtual kernel?

Matt: Yeah, in fact, you can. They should remain compatible all the way through 2.0. The only issue there is if we make changes to the API. Since the only thing using the new system calls is the virtual kernel, and not any third-party apps, we still feel free to change that API, and if we do change it, it means that a person running release has to update to the latest release to get the changes, and then update HEAD to the latest HEAD to be able to run the virtual kernel. But so far there haven't been any changes to the API that require that.

Host: Are there any other topics that you want to talk about today?

Matt: Actually, I would like to talk about the virtual kernel a little more, because it turned out to be very easy to implement, and I think it's something that the other BSD projects should look at very seriously, and not necessarily depend on VMware or QEMU for their virtualization, or at least not depend on it entirely. To implement the virtual kernel support, I think I only had to do three things. It's very significant work, but it's still three things, and it's fairly generic. The first was signal mailboxes: you tell a signal to basically write data into a mailbox instead of generating an exception on a signal stack and all that. The second was a new feature for memory map --the mmap() call-- that allows you to memory map a virtual page table, plus another system call to specify where in the backing of that map the page directory is. This is basically virtual page table support. And the third is VM space management: the ability to create, manage, and execute code in VM spaces that are completely under your control. So this is in lieu of forking processes to handle the "virtualized user processes." Instead, the virtual kernel has the ability to manipulate VM spaces: at least with the current virtual kernel, it switches a VM space in, runs it, and when control has to come back to the virtual kernel, it switches the VM space back out, and everything runs under a single process. And it turned out to be fairly easy to implement these things, fairly straightforward, and they're generic mechanisms. Especially the virtual page table memory map mechanism. It's completely generic. And the signal mailbox mechanism---completely generic. And so I think it's well worth the other BSDs looking at those mechanisms, looking at those implementations, and perhaps bringing that code into their systems, because I think it's very, very useful.
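[Editor's note: for context on the first mechanism, this is the traditional pattern Matt contrasts it with in his next answer: a signal handler that does nothing but set a global flag for the main loop to notice. This is standard C, not DragonFly-specific code.]

    /*
     * Standard C: the classic "handler just sets a flag" pattern that
     * signal mailboxes are intended to streamline.  The handler can't
     * safely do real work, so it only records that the signal arrived,
     * and the main loop notices the flag later.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_winch;

    static void
    winch_handler(int sig)
    {
        (void)sig;
        got_winch = 1;              /* set a global variable and return */
    }

    int
    main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = winch_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGWINCH, &sa, NULL);   /* terminal window size changes */

        for (;;) {
            if (got_winch) {
                got_winch = 0;
                printf("window size changed; re-query the terminal here\n");
            }
            sleep(1);               /* stand-in for the program's real event loop */
        }
    }

With a signal mailbox, as described above, the kernel would write such a flag directly on delivery, so the application would not need to install a handler at all.]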
Host: Do these changes provide any benefit for the general operating system kernel, or is it only in the context of virtual kernels?

Matt: I think they do provide a benefit to general applications. A lot of applications, especially with signals---you can't really do much in a signal handler, because you're interrupting code and you really don't know what the context is. So if you look at applications such as shell code, or really any program that needs to handle a terminal --a terminal window size change or something like that-- usually what it does is specify a signal handler, and all the signal handler does is set a global variable and return, and that's it. What you really want to do is use a signal mailbox there rather than a signal handler that does the same thing. So it condenses that and makes it very easy. The same goes for any application that needs to manage a VM space, and this includes user-level clustering applications. DragonFly's goal is kernel-level clustering, but there are already many existing user-level clustering systems, and being able to manage a VM space with this page table memory map mechanism makes that a whole lot easier. So if that feature became generally used in the BSDs, and ported to Linux and generally used in the UNIX world, I can see a lot of advantage for third-party application development there.

Host: Alright, well, thank you very much for speaking with me today.

Matt: Yeah, thank you very much.

Commentary: If you'd like to read comments on the website, or reach the show archives, you can find them at http://bsdtalk.blogspot.com/ or, if you'd like to send me an email, you can reach me at bitgeist at yahoo dot com. Thank you for listening; this has been BSDTalk number 98.
