*music* Don't tell anyone I'm free... *music* Don't tell anyone I'm free...
Commentary: Hello, and welcome to BSDTalk number 98. It's Wednesday, February 7, 2007. I just have an interview with Matthew Dillon today, so we'll go right to it.

Host: Today on BSDTalk, we're speaking to Matthew Dillon. Welcome to the show.

Matt: Thank you.

Host: So, recently there has been a release, so maybe you can start by giving us a basic overview of what's new in DragonFlyBSD.

Matt: The biggest user-visible item in this release is the virtual kernel support, which is basically similar to UML (User Mode Linux), but DragonFly on DragonFly: running a DragonFly kernel as a single process on a DragonFly system. My primary reason for doing this first is that it became very obvious that the engineering cycle time just to test and make changes was going way too slowly for kernel work, and for the features that we really want to get into the system, the ultimate goals of DragonFly ---the clustering, the new filesystem--- we really needed the virtual kernel to give us an instant-run, instant-test cycle.

Host: So for people who aren't familiar with the different levels of virtualization, where do these virtual kernels fit in between BSD jails versus Xen or QEMU?

Matt: You have a jail-like environment, you have a Xen-like environment, and you have a QEMU- or VMware-like environment, and they're different levels of virtualization. The jail doesn't really try to virtualize anything; it just restricts what the programs have access to while they run on the same system. It's not really virtualizing a kernel, it's just making it look that way, kind of faking it. It still has many uses, because it's very high performance, precisely because it's not virtualizing anything. In a Xen-like environment you're running a kernel that's aware of the Xen environment, so it's not 100% hardware virtualization. It's aware that it's in a Xen environment and it makes allowances for that by doing special calls when it needs to do something with the VM system, for example, or with drivers. A VMware or QEMU virtualization is a full hardware virtualization where you run the same kernel and system that you would run on bare hardware, but you run it on QEMU or VMware and it just runs---or at least, that's the idea.

Where we come in is closer to Xen than anything else. The DragonFly virtual kernel is very aware that it is running as a virtual kernel. It only runs a real DragonFly kernel, so it's not like Xen, where we're trying to allow or support multiple guest operating systems. What we're really doing is supporting a DragonFly kernel running under a DragonFly kernel, and we're doing that by giving the real kernel (and the virtual kernel) additional system calls that allow the manipulation of VM spaces, the interception of system calls, and that sort of thing. So the DragonFly virtual kernel isn't actually doing anything special other than using these new system calls to manage its own memory space and to manage the virtualized user processes that are running under it. The advantage of this is that the real kernel has very low resource overhead. The real kernel is only managing page tables for the virtualized processes being run by the virtual kernel, and on nearly all BSD systems (certainly FreeBSD and DragonFly), page tables are all throw-away, which means that the real kernel can throw them away at any time and take extra faults to regenerate them later on. So the resource overhead is very low, except for the memory issue, which is the amount of memory that you tell the virtual kernel it has.
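[Editor's note: a rough, hypothetical sketch of the arrangement described above, in which the virtual kernel is an ordinary process that uses host-provided system calls to run guest code in a separate VM space and regains control on every trap or system call. The vkspace_* names, the exit codes, and the signatures below are illustrative placeholders, not DragonFly's actual API.]

    /*
     * Hypothetical sketch only: the vkspace_* names, exit codes, and
     * signatures are placeholders, not DragonFly's real system calls.
     * The idea: the virtual kernel is one ordinary process that asks the
     * real kernel to execute guest code inside a separate VM space, and
     * it gets control back whenever the guest traps or makes a syscall.
     */
    #include <stdio.h>

    struct vk_trapframe;                    /* guest register state (opaque) */

    extern int vkspace_create(void);                            /* new VM space */
    extern int vkspace_run(int space, struct vk_trapframe *tf); /* run guest */

    #define VK_EXIT_SYSCALL   1    /* guest issued a system call */
    #define VK_EXIT_PAGEFAULT 2    /* guest faulted on its virtual page table */

    void
    run_virtualized_process(struct vk_trapframe *tf)
    {
        int space = vkspace_create();

        for (;;) {
            /* The real kernel switches the guest VM space in, runs it,
             * and returns here when the guest needs the virtual kernel. */
            switch (vkspace_run(space, tf)) {
            case VK_EXIT_SYSCALL:
                /* The real kernel intercepted the guest's system call;
                 * the virtual kernel services it itself. */
                break;
            case VK_EXIT_PAGEFAULT:
                /* Resolve the fault against the guest's page table,
                 * then resume the guest. */
                break;
            default:
                printf("guest exited\n");
                return;
            }
        }
    }

DragonFly's real interface differs in its details; the sketch is only meant to show the shape of the control flow, and why the real kernel's own bookkeeping for a guest stays small.]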
Host: What about filesystem space?

Matt: Well, the virtual kernel is basically a kernel, so right now you give it a root image, basically a virtualized hard drive: just a 50GB file that you hand to the virtual kernel, and that's what it uses as a block device for its filesystem. We've given it a network interface using the tap device, so the virtual kernel can communicate with the outside world via the network interface, which means that of course it can also use NFS and any other networking protocol to access filesystems. Eventually, as part of the clustering work, we're going to have a protocol called SYSLINK, which will be used to link the clustered hosts together, and that will also have the capability of doing a filesystem transport, a device transport, and transport for other resources such as CPU contexts and VM contexts and that sort of thing.
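[Editor's note: a minimal sketch of the tap-device interface mentioned above, assuming a BSD-style /dev/tap0 clone device. Device naming and the host-side bridge or routing setup vary from system to system, and this is not the virtual kernel's actual networking code.]

    /*
     * Illustrative only: read raw Ethernet frames from a BSD tap device,
     * the kind of interface the virtual kernel uses to reach the network.
     * Assumes a /dev/tap0 node; device naming varies by system.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        unsigned char frame[2048];
        ssize_t n;
        int fd = open("/dev/tap0", O_RDWR);   /* attach to the tap0 interface */

        if (fd < 0) {
            perror("open /dev/tap0");
            return 1;
        }

        /* Each read() returns one Ethernet frame sent to tap0 by the host's
         * network stack; write() would inject frames in the other direction. */
        while ((n = read(fd, frame, sizeof(frame))) > 0)
            printf("received a %zd-byte frame\n", n);

        close(fd);
        return 0;
    }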
Host: And why develop your own virtual kernel technology instead of using existing ones, perhaps like Xen or QEMU?

Matt: Well, the problem with QEMU and VMware is that they aren't really good for testing kernels. They're good for running systems. In many cases they can be higher performing than a virtual kernel, although theoretically, as things mature, I think the virtual kernel technology --that is, the Xen-like technology-- will be able to exceed that, because it is aware of its context. But the issue with VMware and QEMU is that since they're emulating a hardware environment, if a kernel takes 5 minutes to boot on real hardware, it takes at least 5 minutes to boot on QEMU or VMware, or longer, depending on what devices you've got. Even if you can get it down to 30 seconds, it still takes longer to boot than the 5 seconds it takes for a virtual kernel to boot. Another big difference is that since the virtual kernel is simply a process running under the real kernel, you can GDB it live on the real kernel, so it really makes debugging a whole lot easier when the virtual kernel is sitting there as a process that you can manipulate.

Host: And what other features are coming for 1.8? Or I should say, not coming, it's already here; what features are in 1.8?

Matt: Well, I can go through the diary... there are a ton of application updates, a ton of bug fixes, and a ton of driver updates, especially the wireless drivers and network drivers. One of our developers really had a focus on those. There's a lot of under-the-hood work to continue to get the DragonFly kernel staged and ready for the upcoming clustering work. For example, I'm slowly integrating the cache coherency management into the kernel. It's actually partially in there now in 1.8, it's just not doing anything yet. We revamped the way the namespace topology works a little and got rid of the aliases that we had there before to handle NULLFS mounts. NULLFS is basically the ability to take an existing portion of the filesystem and mount it somewhere else. It's kind of like a hardlinked directory, but it looks like a separate mount to the system, so you can construct jailed or chrooted environments with portions of your real filesystem mounted read-only, to avoid having to duplicate data. We did a major rewrite of NULLFS to make that very, very efficient. There were a lot of cleanups in the filesystem APIs to get them ready for the SYSLINK protocol, the ability to run filesystems over communications links, and basically a lot of under-the-hood work. We also brought in GCC 4.1. The way the DragonFly compiler environment works is that we can actually have multiple compilers installed on the system and switch between them with a simple environment variable. So our base compiler is still GCC 3.4, but we brought 4.1 in, and there are people who, simply by setting the environment variable, are building the tree with that compiler. We'll probably go live with it when 4.2 comes out; we're probably not going to go live with GCC 4.1.

Host: Are there other features that didn't quite make it into 1.8 that you're looking forward to putting into 2.0?

Matt: We did port the FreeBSD newATA IDE driver, and we're still working on stabilizing that; it's in there, but it's really just for debugging. Hopefully that will go live sometime during this semester and be the default in 2.0.

Host: One of the things that you did with DragonFlyBSD was to start with FreeBSD and take a different path. Have you found it more difficult over time to import work from the other BSDs as you've headed down your own road?

Matt: For the work that makes sense to import, it's been fairly easy. The real issue... well, there are two issues. One is bringing in application support, and as you may know, we went with pkgsrc for that, and that's pretty painless compared to trying to continuously keep the FreeBSD ports up to date. The second issue is, of course, hardware drivers. For wholesale replacements, for example bringing in nATA, it is a considerable amount of work now, because the DragonFly driver infrastructure has really diverged significantly from FreeBSD. But for bug fixes and incremental fixes, things like that, it's pretty painless. You just take a diff of the FreeBSD tree, see what they fixed, and you fix it in the DragonFly tree, or vice versa.

Host: Some people seem to have this fixation with major-number bumps, such as the 1-series to the 2-series. Are you just doing a continuation of numbers, or is there anything in particular...

Matt: Well, I don't want to go to 1.10. I think that's a little confusing, and it doesn't sort well---1.10 is less than 1.8 if you string sort it. Will it be a 2.0 as in boldface 2.0? Probably not, but we're in the second half of the project now. We're at the point where, sure, there's still a lot of SMP work to go, but the infrastructure is really ready for the second part of the project, and the second part of the project is really the clustering, the system link, and the new filesystem. 2.0 will not have those in production, but 2.1 certainly will, or at least it will certainly have the new filesystem in production. I don't really think that you can have new features in production in a dot-zero anyway, so I think 2.0 is fairly well positioned, but 2.1 will be the real deal.

Host: And if people read your blog on the DragonFly website, they will see that every once in a while there is some talk about ZFS or other filesystems that you're considering for your clustered filesystem. Can you talk a little more about that?

Matt: Yeah, I've been thinking a lot about that. There are a lot of issues. One of the issues with ZFS is that it would probably take as much time to port it as it would to build a new filesystem from scratch using similar principles. Another issue with ZFS is that it's not really a cluster filesystem; it's really a mass-storage filesystem. You have hundreds of terabytes and you don't want to lose any of it---that's ZFS.
What we really need for DragonFly is a cluster filesystem capable of supporting hundreds of terabytes without losing it. ZFS doesn't quite fit the bill, but many of the features of ZFS are features that I really like, like the hashing of the data and having hash checks on the data and that kind of thing---the snapshotting. To give you an example of what we need versus what ZFS has: you have a snapshot capability in ZFS, but what we really need in DragonFly is more like a database transaction, where you have an effectively infinite number of snapshots, one every time the filesystem syncs, and at any point you can mount a snapshot as of any point in the past whose old data you haven't yet deleted. The reason we need that in a clustered system is that you need to be able to do cache coherency management, and you need to be able to do it without transferring unnecessary amounts of data. So if every bit of data has, say, a transaction id, some kind of relative timestamp, you can say, "OK, here's my cached data, here's the timestamp that came with the cached data, and here's this other program that I need to be synchronized with, and it has an idea of its synchronization point. If those timestamps match, then my cache is good; if they don't match, then I need to go through some kind of synchronization mechanism to get the most up-to-date data." In a cluster system, where you potentially have execution contexts all over the place, you can't afford to just copy the data back and forth willy-nilly. You really have to copy only the data that needs to be copied, and only when it needs to be updated.

Host: Are there any candidate cluster filesystems that are BSD-licensed or have a license that would be appropriate for DragonFly?

Matt: Not that I know of, but I'm not the type of person who really goes through and looks at the research papers very deeply. I'm the kind of person who just likes to invent---you know, sit down and say, "OK, here are the basic ideas of what filesystems A, B, and C do that I like; I'm going to pull those ideas in and implement it from scratch." That's kind of who I am.

Host: So, hopefully in the next couple of releases we'll be able to create a virtual cluster on a single piece of hardware to test out all this fun stuff.

Matt: That's one of the main reasons why that virtual kernel support is in 1.8. It's precisely what we'll be able to do. It's really the only way we'll be able to test the clustering code and the cache coherency management: to have these virtualized systems that, as far as they're concerned, are running on separate boxes with communications links between them. Otherwise, there's just no way we could test it and have an engineering cycle time that would get us to our goal in the next two years.

Host: So is the project looking for developers with any specific skillsets, or any particular hardware that people could help out with?

Matt: We're always looking for people who are interested in various aspects of the system. It doesn't necessarily have to be the clustering code; it doesn't necessarily have to be the filesystem code. There's still a lot of work that needs to be done with SMP, for example on the network stack. There are a lot of loose ends in the codebase that work fine now but are still under the big giant lock and could, with some attention, be taken out from under that big giant lock and really made to operate in parallel on an SMP system. It just hasn't been a priority for me, because the main project goal is the clustering.
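[Editor's note: a toy sketch of the cache-validity check described above, in which each piece of cached data carries a transaction id and a peer's synchronization point is compared against it before any data is transferred. The structure and names are hypothetical and purely illustrative.]

    /*
     * Toy illustration of transaction-id based cache validation; the names
     * and layout are hypothetical, not DragonFly code.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cached_block {
        uint64_t tid;         /* transaction id (sync point) of the cached copy */
        char     data[4096];  /* the cached data itself */
    };

    /*
     * If the peer's idea of the current synchronization point matches the
     * transaction id stored with our cached copy, the cache is still good
     * and nothing needs to be transferred; otherwise we must resynchronize.
     */
    static bool
    cache_is_current(const struct cached_block *cb, uint64_t peer_sync_tid)
    {
        return cb->tid == peer_sync_tid;
    }

    int
    main(void)
    {
        struct cached_block cb = { .tid = 42 };

        printf("peer at 42: %s\n",
            cache_is_current(&cb, 42) ? "cache good" : "resync needed");
        printf("peer at 43: %s\n",
            cache_is_current(&cb, 43) ? "cache good" : "resync needed");
        return 0;
    }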
Host: Are you working on any other projects that are BSD-related, but aren't DragonFlyBSD, these days?

Matt: No, not right now. DragonFly is the main project. I have this list of projects that I want to do in my lifetime, and I'm never going to be able to do all of them. Right now my focus is DragonFly, probably for the next two years, and at the end of those two years my expectation is that we're going to have this clustered system with a very redundant filesystem that's fairly high-performance.

Host: Some people are never satisfied with running a released system. They really want to run the bleeding edge. If people want to start working on and testing 1.9, how can they get a hold of that?

Matt: You just go to our site. We have the release branch and the HEAD branch, and people can choose to run what they want; of course, we suggest the release branch for anyone running any production UNIX stuff. We've managed to keep the HEAD branch fairly stable throughout most of the development cycles for the 1.2, 1.4, 1.6, and 1.8 releases, and I expect that to continue. It's always good to have a fairly stable HEAD. But occasionally something on the HEAD branch will break, and it will be noticed and fixed or reported, or we'll just have to say "this is going to be broken for a week" or something like that. You just have to be aware of that when you're working on the HEAD of the source development tree.

Host: Is it possible to install 1.8 and then run the HEAD branch on a virtual kernel?

Matt: Yeah, in fact, you can. They should remain compatible all the way through 2.0. The only issue there is if we make changes to the API. Since the only thing using the new system calls is the virtual kernel, and not any third-party apps, we still feel free to change that API, and if we do change it, it means that a person running release has to update to the latest release to get the changes, and then update HEAD to the latest HEAD to be able to run the virtual kernel. But so far there haven't been any changes to the API that require that.

Host: Are there any other topics that you want to talk about today?

Matt: Actually, I would like to talk about the virtual kernel a little more, because it turned out to be very easy to implement, and I think it's something that the other BSD projects should look at very seriously, and not necessarily depend on VMware or QEMU for their virtualization, or at least not depend on it entirely. To implement the virtual kernel support, I think I only had to do three things. It's very significant work, but it's still three things, and it's fairly generic. The first was signal mailboxes: you tell a signal to basically write data into a mailbox instead of generating an exception on a signal stack and all that. The second was a new feature for memory map --the mmap() call-- that allows you to memory map a virtual page table, plus another system call to specify where in the backing of that map the page directory is. This is basically virtual page table support. And the third is VM space management: the ability to create, manage, and execute code in VM spaces that are completely under your control. So this is in lieu of forking processes to handle the "virtualized user processes." Instead, the virtual kernel has the ability to manipulate VM spaces: at least with the current virtual kernel, it switches a VM space in, runs it, and when control has to come back to the virtual kernel, it switches the VM space back out, and everything runs under a single process. And it turned out to be fairly easy to implement these things, fairly straightforward, and they're generic mechanisms. Especially the virtual page table memory map mechanism. It's completely generic. And the signal mailbox mechanism---completely generic. And so I think it's well worth the other BSDs looking at those mechanisms, looking at those implementations, and perhaps bringing that code into their systems, because I think it's very, very useful.
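[Editor's note: for context on the first mechanism, this is the traditional pattern Matt contrasts it with in his next answer: a signal handler that does nothing but set a global flag for the main loop to notice. This is standard C, not DragonFly-specific code.]

    /*
     * Standard C: the classic "handler just sets a flag" pattern that
     * signal mailboxes are intended to streamline.  The handler can't
     * safely do real work, so it only records that the signal arrived,
     * and the main loop notices the flag later.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_winch;

    static void
    winch_handler(int sig)
    {
        (void)sig;
        got_winch = 1;              /* set a global variable and return */
    }

    int
    main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = winch_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGWINCH, &sa, NULL);   /* terminal window size changes */

        for (;;) {
            if (got_winch) {
                got_winch = 0;
                printf("window size changed; re-query the terminal here\n");
            }
            sleep(1);               /* stand-in for the program's real event loop */
        }
    }

With a signal mailbox, as described above, the kernel would write such a flag directly on delivery, so the application would not need to install a handler at all.]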
Host: Do these changes provide any benefit for the general operating system kernel, or is it only in the context of virtual kernels?

Matt: I think they do provide a benefit to general applications. A lot of applications, especially with signals---you can't really do much in a signal handler, because you're interrupting code and you really don't know what the context is. So if you look at applications such as shell code, or really any program that needs to handle a terminal --a terminal window size change or something like that-- usually what it does is specify a signal handler, and all the signal handler does is set a global variable and return, and that's it. What you really want to do is use a signal mailbox there rather than a signal handler that does the same thing. So it condenses that and makes it very easy. The same goes for any application that needs to manage a VM space, and this includes user-level clustering applications. DragonFly's goal is kernel-level clustering, but there are already many existing user-level clustering systems, and being able to manage a VM space with this page table memory map mechanism makes that a whole lot easier. So if that feature became generally used in the BSDs, and ported to Linux and generally used in the UNIX world, I can see a lot of advantage for third-party application development there.

Host: Alright, well, thank you very much for speaking with me today.

Matt: Yeah, thank you very much.

Commentary: If you'd like to read comments on the website, or reach the show archives, you can find them at http://bsdtalk.blogspot.com/ or, if you'd like to send me an email, you can reach me at bitgeist at yahoo dot com. Thank you for listening; this has been BSDTalk number 98.
