Sun keyboard on i386?
I have a desk on which (for reasons not immediately relevant) the main head is an i386 machine (4.0.1). But this has meant I'm stuck using a crappy peecee keyboard. Today, I put together the interface electronics to put one of my good (Sun type 3) keyboards on one of the serial ports. It works, in that a program that talks to the serial port can speak the keyboard's protocol and get keystrokes and suchlike. I can, if I have to, bludgeon X into being such a program. But I thought I would first try to use the existing kernel code for Sun keyboards (which would, I would expect, have the additional advantage of working in the text console). Looking at the kernel configs, I see that on sparc64 (and on sparc, though the comments say it's just for test building) kbd can attach at com, which is convenient because it's exactly what I want to do. So I appended a handful of lines to my i386 machine's kernel config, mostly lifted from sparc64:

    define firm_events
    file dev/sun/event.c firm_events needs-flag
    device kbd: firm_events, wskbddev
    file dev/sun/kbd.c kbd needs-flag
    file dev/sun/kbd_tables.c kbd
    file dev/sun/wskbdmap_sun.c kbd & wskbd
    attach kbd at com with kbd_tty
    file dev/sun/sunkbd.c kbd_tty
    file dev/sun/kbdsun.c kbd_tty
    kbd0 at com0

I had to change an #include and remove another to get the kernel to compile, and rip a little code out of kbd.c and sunkbd.c to get it to link, but surprisingly little. Less than I was expecting. (Specifically: in sunkbd.c, machine/kbd.h -> sys/dev/kbd_reg.h, remove machine/vuid_event.h, and rip out both arms of the if (args->kmta_consdev) test in sunkbd_attach(); in kbd.c, remove sunkbd_wskbd_cn{getc,pollc,bell} and sunkbd_bell_off, remove sunkbd_wskbd_consops and the code in kbd_enable that conditionally uses it. Exact diffs available if anyone wants.) But it doesn't work. I added a printf to sunkbd_match, and it's never even getting called. Is there some kind person here who has any idea why not and can point me in a useful direction?
I daresay it's something that will be blindingly obvious once I see it /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Sun keyboard on i386?
Oh well. It would have been a nice hack, but it's sounding like more effort than it's worth to me. Make it a line discipline, maybe? Possibly. The kbd-at-com attachment is already close to that, according to the comments (I haven't looked at the code enough to be competent to remark on whether the comments are accurate). The major difference I see between what I think you're suggesting and the sparc64 way is the use of a userland utility versus autoconf machinery.
Re: Multiple device attachments
/sys/device.h -- seems to indicate that a device driver can attach to multiple parent drivers (e.g. busses, controllers, ?) Does anyone know how this is done in practice?

    device wdc: ata, wdc_common
    attach wdc at isa with wdc_isa
    attach wdc at isapnp with wdc_isapnp
    attach wdc at ofisa with wdc_ofisa
    attach wdc at pcmcia with wdc_pcmcia

com is another example. So is le. I'm sure there are plenty of others. In some cases they don't even need the with stuff, I think.
Re: Multiple device attachments
The examples you cite seem to indicate that for example the le device may attach to many alternative devices (e.g. pci, tc, ?), but only one attachment is made when autoconf is complete. For any particular instance of le, yes. I may have read the code examples incorrectly -- please pardon me if I did; but what I want to know is -- can a device have multiple attachments (more than one parent device) when autoconf is complete. A device can in the sense that, for example, ne0 and ne1 might not attach to the same parent, or even the same kind of parent (eg, one ISA and one PCMCIA). But a single node, a single instance of a driver (eg, ne0), always has at most one parent (exactly one, I think, except for the autoconf root most ports call mainbus). To put it another way, the autoconf tree is a tree, not a dag.
Re: rsync very slow with current kernel (select issue?)
So select blocks (maybe because there's effectively nothing to read at this time), but instead of waking up when there's data ready it wakes up when the timeout expires. This seems rather similar to something I was looking at back in January. [...] I had a similar symptom once, which turned out to be fixed by having both ends of the TCP connection set TCP_NODELAY. (Just one end might have been enough, but, since I was in there anyway, I did both.) This case doesn't sound quite similar enough for that to be it here, but I could have missed something.
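For anyone wanting to try the TCP_NODELAY experiment mentioned above, here is a minimal sketch of setting the option and reading it back. The function names set_nodelay and get_nodelay are made up for illustration; the setsockopt/getsockopt calls themselves are the standard sockets API. It is only a sketch of the mechanism, not a claim that this is the fix for the rsync case.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Disable Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 * (set_nodelay is a hypothetical helper name.)
 */
int
set_nodelay(int sock)
{
	int on = 1;

	return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/*
 * Read the option back: 1 if set, 0 if clear, -1 on error.
 * (get_nodelay is a hypothetical helper name.)
 */
int
get_nodelay(int sock)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

Each end of the connection would call set_nodelay on its own socket; whether one end alone suffices depends on which side is sending the small writes.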
Re: Adding linux_link(2) system call (Was: Re: link(2) on a symlink to a directory fails)
What about adding a linux_link(2) that would do exactly what link(2) does but without the FOLLOW flag to NDINIT on the path argument? How about just fixing link(2) that way? If linux_link(2) seems unreasonable, it could be lazy_link(2), weak_link(2), braindead_link(2) or whatever. You'll also need to update every filesystem to allow this and update all the various fsck programs to allow filesystems to be in this state. Hardly. The most that needs to be done to every filesystem is to reject these operations. The filesystem(s) that we want to support hardlinks to symlinks can then be updated, one at a time, along with their fscks. I'd disagree with this as it seems like a nonsensical thing to do. Why? I usually can understand your point of view, even when I disagree with it, but this time I'm baffled. What's nonsensical about hardlinking to a symlink? Two names pointing to the same inode, which inode happens to be a symlink - I see nothing nonsensical about that. Of course, some filesystems may not implement symlinks as (their analog of) inodes; they presumably will refuse such attempts. Again, nothing nonsensical; pretty much everything about symlinks can potentially vary with the filesystem; this is no different. What am I missing?
Re: Adding linux_link(2) system call (Was: Re: link(2) on a symlink to a directory fails)
I'd disagree with this as it seems like a nonsensical thing to do. Why? Because symlinks are a special type of filesystem object with their own semantics Every filesystem object is. :) Also, from a more operational standpoint, because there's no way to update a symlink in place, so there's no difference between two symlinks and two hard links to the same symlink except confusion and the number of inodes used. (a) You're forgetting that symlinks have other attributes than the link-to string. The most obvious is mode bits (which have no effect unless you mount -o symperm, but (a1) that can be done and (a2) they can be queried with lstat(2) even if the filesystem doesn't use them), but there are others, such as owner, or even inumber. (b) If you have a lot of symlinks, inode usage may actually matter. (c) I've long thought there should be a way to update a symlink in-place. FWIW, I just asked some linux guys about the linux behavior and the answer was "we sell rope". That would be my answer too, though I'd probably phrase it as "not preventing you from doing stupid things because it would also prevent you from doing clever things".
Re: autoclean mode for tmpfs
How hard would it be to add a mount option for tmpfs to automatically drop files after a given timeout? Anyone think this is worthwhile? Sounds like a job for the userland and cron(8). Sort of. Personally? I'd say that scheduling it should be done by userland, but that putting the actual removal in the filesystem makes sense. I'm not sure whether I'd prefer to do it with a new and idiosyncratic syscall, a vfs.something sysctl, some sort of filesystem-level analog to ioctl, or what.
Re: autoclean mode for tmpfs
It's a security FAQ. If you do rm -rf (or nearly any of the other obvious/easy alternatives) in a world-writable directory, a hostile user can interact with it to erase any file on the system. I believe that this is partially fixable: provided there is at least one file descriptor available per directory level, I think it is possible to safely remove everything but directories. Most briefly, fchdir to each directory, stat . and make sure it matches the directory we thought we chdired into (to avoid doing damage if we lose a symlink race). Then delete things using relative-to-. paths and fchdir back out. However, since there's no way to make rmdir(2) use NOFOLLOW, we have to either leave directory structure in place or risk removing an attacker's choice of empty directories. Not that this makes it any easier to do the usual find | xargs rm style of cleanup, though. To do it safely in the way I refer to above would require doing it all inside rm. Might be worth doing, but quite possibly better done in the filesystem, to (a) avoid the need for the file descriptors, (b) delete a file here and a file there rather than the wholesale destruction of rm -rf (even if I'm right about it being possible to make it safe against hostile users), and (c) get directories right.
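The chdir-then-verify idea described above can be sketched for a single directory level roughly as follows. This is an illustration under assumptions, not audited code: the function name remove_files_in is made up, it handles one level only (a recursive version would repeat the same steps and hold one descriptor per level), and it deliberately leaves subdirectories alone, as the message suggests.

```c
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Remove every non-directory entry inside the directory "name"
 * (relative to the current directory), guarding against a symlink
 * being swapped in for it.  Returns the number of entries removed,
 * or -1 if the directory could not be entered safely.
 * (remove_files_in is a hypothetical name for this sketch.)
 */
int
remove_files_in(const char *name)
{
	struct stat before, after, st;
	struct dirent *de;
	DIR *d;
	int back, removed = 0;

	/* Keep a descriptor on where we are so we can fchdir back out. */
	back = open(".", O_RDONLY);
	if (back < 0)
		return -1;

	/* lstat, not stat: a symlink here must be refused, not followed. */
	if (lstat(name, &before) < 0 || !S_ISDIR(before.st_mode) ||
	    chdir(name) < 0) {
		close(back);
		return -1;
	}

	/* Make sure "." really is the directory we examined. */
	if (stat(".", &after) < 0 ||
	    after.st_dev != before.st_dev || after.st_ino != before.st_ino ||
	    (d = opendir(".")) == NULL) {
		if (fchdir(back) < 0) {
			/* nothing more we can do in this sketch */
		}
		close(back);
		return -1;
	}

	/* Delete using relative-to-. names only; leave directories alone. */
	while ((de = readdir(d)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		if (lstat(de->d_name, &st) == 0 && !S_ISDIR(st.st_mode) &&
		    unlink(de->d_name) == 0)
			removed++;
	}
	closedir(d);

	if (fchdir(back) < 0) {
		close(back);
		return -1;
	}
	close(back);
	return removed;
}
```

Because unlink(2) acts on the directory entry itself, a symlink found inside the directory is removed rather than followed.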
Re: autoclean mode for tmpfs
However, since there's no way to make rmdir(2) use NOFOLLOW, [...] ? lrwx-- 1 dholland notmp 3 Aug 7 12:32 baz -> bar valkyrie% rmdir baz rmdir: baz: Not a directory ! Hm, I see that too. I wonder where I got the idea it followed symlinks. My apologies for spreading misinformation.
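The shell experiment quoted above can be reproduced from C. The sketch below creates a directory and a symlink to it, calls rmdir(2) on the symlink, and reports the resulting errno; the function name rmdir_symlink_errno is made up, and both names are assumed to live in the current directory so that the relative symlink resolves.

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create directory "dir" and a symlink "link" pointing at it, then try
 * rmdir() on the symlink.  Returns the errno from that attempt (0 if it
 * unexpectedly succeeded), or -1 if the setup failed.
 * (rmdir_symlink_errno is a hypothetical name for this sketch.)
 */
int
rmdir_symlink_errno(const char *dir, const char *link)
{
	int err = 0;

	if (mkdir(dir, 0700) < 0)
		return -1;
	if (symlink(dir, link) < 0) {
		rmdir(dir);
		return -1;
	}
	if (rmdir(link) < 0)
		err = errno;

	/* Clean up: unlink removes the symlink itself, not its target. */
	unlink(link);
	rmdir(dir);
	return err;
}
```

On the systems where I would expect this to be run, the result is ENOTDIR, matching the "Not a directory" message in the transcript.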
Re: what to do on memory or cache errors?
besides panicking, of course. Ideally, I think... Corrected error: Usually, log and ignore. Maybe watch for elevated levels of corrected errors and disable either the containing page or the containing memory stick, depending on how much the hardware lets the kernel determine and maybe policy sysctls. Maybe even allow paranoid sysadmins to configure "elevated levels of" to mean "any". Uncorrectable error: Log. Disable the containing page and/or stick, as mentioned above. If it's for the contents of a dirty page, about all we can do is deliver a memory-error signal. If it's for a clean page (including (most) instruction-stream fetches), re-fetch the virtual page into a new physical page and carry on. This is going to involve a lot of help from UVM. Probably. Maybe the pmap, too, for things such as figuring out what regions of RAM would have to be disabled to stop using the affected memory stick, or the like. If uvm_page_error can't correct the error, it would panic. I'd recommend doing that only for kernel accesses; for userland, I'd much prefer to blow up at most the process incurring the fault. Preemptively, we could have a thread force dirty cache lines to memory if they've been in L2 too long (thereby reducing the problem to an ECC error on a clean cache line which means you just toss the cache-line contents.) Depends. Are we talking ECC on L2 cache, or on main memory? I'd say the results should be different. We can also have a thread that reads all of memory (slowly) thereby causing any single bit errors to be corrected before they become double-bit errors. Well, to be detected. Whether the correct action upon detecting them is to silently correct them is a policy matter I'd prefer to avoid wiring into the kernel. I'm not familiar enough with UVM internals to actually know what to do but I hope someone else reading this is. Me neither. I have just about zero idea how implementable any of the above is; I've been speaking in ideal generalities.
(My idea of ideal generalities, that is, of course.)
Re: Where are the specific WARNS=n defined?
[...] gcc errors due to comparison of signed and unsigned values. It is best to fix the errors. What errors? It is not necessarily an error to compare signed and unsigned values. In my experience, that warning produces so many more false positives than useful warnings that I normally shut it off entirely.
Re: Where are the specific WARNS=n defined?
[...] gcc errors due to comparison of signed and unsigned values. It is best to fix the errors. In my experience, that warning produces so many more false positives than useful warnings that I normally shut it off entirely. and that one time that using it might have warned you about a serious vulnerability? When was that? Except for a few that also provoked, or would have provoked, the warning about how a conditional's value is constant due to limited range of data type, I can't recall ever finding a bug that -Wsign-compare warned about (or would have warned about). Ever. In anyone's code. Yes, it's possible there is such an occasion lurking in my future. It's also possible I've forgotten about one in the past. But I judge the expected cost of possibly having to track down such a bug directly to be well below the expected cost (both immediate and in down-the-road maintenance) of pervasive manual uglification of code to fix non-errors.
Re: Where are the specific WARNS=n defined?
It is not necessarily an error to compare signed and unsigned values. [...] And it is not an error to put assignments in conditionals, or not place parentheses to clarify operator precedence, etc. It is a warning [...]. For some of us this is helpful. The compiler writers try to help protect programmers against common mistakes. If you don't like the warnings you are free to turn them off. That's what I do - along with a handful of other such warnings. The question asked what the appropriate action was, whether to turn the warnings off (the way real kernel compiles apparently do anyway) or uglify the code to work around the warning [ok, my phrasing]. I believe the former is better, because in my experience the mistake the warning warns about is anything but common.
Re: Adding pulse support to gpio(4), gpioctl(8)
Well, you need to open it first, before you can do ioctl, and if only one process can open it, only one process can ioctl it, right? Wrong. Agreed. Multiple threads can ioctl and nobody prevents one from having a single process with multiple threads (pthreads, if you like). Not only that, but even without threading, there are at least two ways I can think of offhand that a file descriptor, once opened, can end up in multiple processes' open file tables: fork() and SCM_RIGHTS. (There are probably others, too.)
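The fork() case is easy to demonstrate without any gpio hardware. In the sketch below a single open file is inherited by a child, which moves the file offset through its copy of the descriptor; the parent then sees the moved offset, showing that both processes hold the same open file. The function name and the use of a tmpfile() scratch file are assumptions made for illustration; SCM_RIGHTS passing is not shown.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Open one file, fork, and let the child seek on the inherited
 * descriptor.  Returns the offset the parent observes afterwards,
 * or -1 on error.
 * (offset_after_child_seek is a hypothetical name for this sketch.)
 */
long
offset_after_child_seek(long where)
{
	FILE *f = tmpfile();
	pid_t pid;
	int fd, status;
	long off;

	if (f == NULL)
		return -1;
	fd = fileno(f);

	pid = fork();
	if (pid < 0) {
		fclose(f);
		return -1;
	}
	if (pid == 0) {
		/* Child: only one open() ever happened, yet we can use it. */
		_exit(lseek(fd, (off_t)where, SEEK_SET) == (off_t)where ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0 ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
		fclose(f);
		return -1;
	}

	/* Parent: the offset moved, because the open file is shared. */
	off = (long)lseek(fd, 0, SEEK_CUR);
	fclose(f);
	return off;
}
```

The same sharing applies to ioctl(2): an exclusive-open device limits how many times open() succeeds, not how many processes end up holding the resulting descriptor.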
Re: KASSERTMSG fix
And I am not aware of a solution where you can have two ... in a C function. You can't actually _write_ something like void foo(const char *, ..., int, const char *, const char *, ...); but, except for -Wformat issues, you can get that effect with suitable use of va_* calls within the implementation of foo(). (If you expect to use vprintf or relatives to consume the first ... list, this involves unwarranted chumminess with the stdarg implementation. But if you walk the first ... list yourself, it's no problem at all.)
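A minimal sketch of the "walk the first list yourself" approach follows. Everything in it is invented for illustration: two_lists, walk, and the tiny format language in which 'd' consumes an int and 's' consumes a const char *. The call is conceptually two_lists(fmt1, ..., int scale, const char *fmt2, ...), even though the prototype can only say two_lists(const char *, ...).

```c
#include <stdarg.h>
#include <string.h>

/*
 * Consume the arguments described by fmt from *ap: 'd' takes an int,
 * 's' takes a const char *.  Returns the sum of the ints plus the
 * lengths of the strings.  Taking a pointer to the caller's va_list
 * lets the caller carry on where this function stopped.
 * (walk and its format letters are hypothetical.)
 */
static long
walk(const char *fmt, va_list *ap)
{
	long total = 0;

	for (; *fmt != '\0'; fmt++) {
		if (*fmt == 'd')
			total += va_arg(*ap, int);
		else if (*fmt == 's')
			total += (long)strlen(va_arg(*ap, const char *));
	}
	return total;
}

/*
 * Conceptually: two_lists(fmt1, <args per fmt1>,
 *                         int scale, const char *fmt2, <args per fmt2>)
 * Returns (total of first list) * scale + (total of second list).
 */
long
two_lists(const char *fmt1, ...)
{
	va_list ap;
	long a, b;
	int scale;
	const char *fmt2;

	va_start(ap, fmt1);
	a = walk(fmt1, &ap);			/* first "..." list */
	scale = va_arg(ap, int);		/* fixed args in the middle */
	fmt2 = va_arg(ap, const char *);
	b = walk(fmt2, &ap);			/* second "..." list */
	va_end(ap);
	return a * scale + b;
}
```

For example, two_lists("dd", 1, 2, 10, "sd", "abc", 4) walks 1 and 2, picks up scale 10 and the second format, then walks "abc" and 4. As the message notes, the compiler's format checking cannot help with a call like this.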
Re: KASSERTMSG fix
You can't actually _write_ something like void foo(const char *, ..., int, const char *, const char *, ...); But you can do: void foo(const char *, va_list, const char *, ...) if you need to add some extra args. Yeah, but then you have to pass a va_list, not separate args. Of course, for some uses, that's entirely tolerable. What would be useful is a format effector that processes a format string and a va_list (recursive call inside vsnprintf). But adding non-standard effectors is not really a good idea. I once added such a thing (I think I used %@). It was easy, but I never used it very much and never rolled it forward (it was 1.4T I added it to). Never even got around to adding it to -Wformat. As for using nonstandard formats, don't we already do that with %b?
Re: Perform mmap and poll on PUD character devices
I do not understand why it is desirable to involve additional context switches to and from userspace into this data path. Instead of writing a bunch of fairly dubious page mapping code [...] in the kernel to support user-space daemons handling various virtual disk formats, why not put the effort into just doing the various desired virtual disk formats in-kernel? I can't speak for Roger, but it seems to me that an appropriate answer would be the same reasons you do _anything_ in userspace rather than the kernel: better insulation of pieces of the system from one another and ease of changing if you want to run something else instead strike me as the biggest ones.
Re: MAXNAMLEN vs NAME_MAX
MAXNAMLEN = 511 NAME_MAX = 255 [...] We want to make them consistent. Do you want to increase NAME_MAX, or decrease MAXNAMLEN? My opinion is that [versioning userland] is not worth the trouble. The only programs that can fail are ones that do things like: char name[NAME_MAX]; strcpy(name, d->d_name); This sounds as though you are contemplating increasing NAME_MAX. sizeof(d->d_name) does not change. It is just that d_namelen can be 255 (NAME_MAX). Only programs that use NAME_MAX to store directory entries can fail. Not quite. Such things can also find their way into code in subtler ways. For example, I've written code that knows it can store a directory entry length in an unsigned char (which amounts to assuming NAME_MAX <= UCHAR_MAX). I think all the recent examples of that I've written have been FFS-specific and therefore safe (if I'm reading things right, FFS uses a single octet to store directory entry length on disk), but I'm probably not the only one who's done such stuff. My vote is to bump without versioning, what's yours? I probably agree with you. But what's the motivation for increasing NAME_MAX rather than decreasing MAXNAMLEN?
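Both failure modes discussed above (copying a name into a fixed buffer, and squeezing a name length into one octet) can be guarded against with explicit checks instead of relying on the value of NAME_MAX. A small sketch, with made-up helper names copy_entry_name and store_name_len:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a directory entry name into dst only if it fits, terminating
 * NUL included.  Returns 0 on success, -1 if the name is too long.
 * (copy_entry_name is a hypothetical helper.)
 */
int
copy_entry_name(char *dst, size_t dstsize, const char *name)
{
	size_t len = strlen(name);

	if (len >= dstsize)
		return -1;
	memcpy(dst, name, len + 1);
	return 0;
}

/*
 * Store a name length in a single octet only if it is representable,
 * rather than silently assuming NAME_MAX <= UCHAR_MAX.
 * (store_name_len is a hypothetical helper.)
 */
int
store_name_len(unsigned char *out, size_t len)
{
	if (len > UCHAR_MAX)
		return -1;
	*out = (unsigned char)len;
	return 0;
}
```

Code written this way keeps working (by refusing long names) whichever way the two constants end up being reconciled.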
Re: MAXNAMLEN vs NAME_MAX
Certainly the original 14 byte limit was occasionally a nuisance (but even that was better than 8+3 which was typical), but longer than 255? I've run into the 255 limit. On only a few occasions, but definitely more than zero. (About three times, I think.) In my case it is usually files named after URLs; I will typically put (for example) http://www3.telus.net/~bhilpert/tmp/touchTone1969.gif in a file called www3.telus.net%~bhilpert%tmp%touchTone1969.gif. I regularly see (though seldom want to fetch) URLs long enough to blow out a 255-character limit under this transformation. I'm sure other people have their own uses for long pathname components, too, though I don't know of any offhand.
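The URL-to-filename transformation in the example above appears to be: drop the scheme and "://", then replace each '/' with '%'. That reading is inferred from the single example given, so treat the sketch below as an assumption; url_to_name and its name_max parameter are made up, the latter standing in for whatever component-length limit applies.

```c
#include <stddef.h>
#include <string.h>

/*
 * Turn a URL into a single pathname component by dropping "scheme://"
 * and replacing every '/' with '%'.  Returns 0 on success, or -1 if
 * the result would exceed name_max characters or not fit in out.
 * (url_to_name is a hypothetical helper.)
 */
int
url_to_name(const char *url, char *out, size_t outsize, size_t name_max)
{
	const char *p = strstr(url, "://");
	size_t len, i;

	p = (p != NULL) ? p + 3 : url;
	len = strlen(p);
	if (len > name_max || len >= outsize)
		return -1;

	/* i <= len so that the terminating NUL is copied too. */
	for (i = 0; i <= len; i++)
		out[i] = (p[i] == '/') ? '%' : p[i];
	return 0;
}
```

With name_max set to 255, any URL whose post-scheme part is longer than 255 characters is rejected, which is the limit being run into.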
Re: A simple cpufreq(9)
If that periodically-threatened pdp10 port (or some other off-size port) ever appears, it's not likely to care about the size that appears in some other environment (unlike for on-disk structures) and using an explicit size will if anything make life more complicated. Especially if it's a size that doesn't exist on that port. Is uint32_t 32 bits or at least 32 bits? The former may well not exist on a pdp10 port.
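As far as I know, C99 defines uint32_t as exactly 32 bits and makes it optional (an implementation with no such type simply does not provide it), while uint_least32_t is at least 32 bits and always present. A sketch of code that only needs "32 bits' worth of value" and so avoids the exact-width type; add_mod32 and have_exact_uint32 are invented names:

```c
#include <stdint.h>

/*
 * Add two counters modulo 2^32 using a type every C99 implementation
 * must provide, masking explicitly in case the type is wider than 32
 * bits (as it would be on a 36-bit machine).
 * (add_mod32 is a hypothetical helper.)
 */
uint_least32_t
add_mod32(uint_least32_t a, uint_least32_t b)
{
	return (a + b) & UINT32_C(0xFFFFFFFF);
}

/*
 * Report whether this implementation has an exact-width uint32_t;
 * UINT32_MAX is defined only when the type exists.
 * (have_exact_uint32 is a hypothetical helper.)
 */
int
have_exact_uint32(void)
{
#ifdef UINT32_MAX
	return 1;
#else
	return 0;
#endif
}
```

For on-disk or on-wire layouts the exact width still matters; this only covers in-memory values where the range is what counts.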
Re: A simple cpufreq(9)
I'm sure at this point someone could put together a 36-bit machine out of FPGAs that ran fast enough to be used as a low-volume web server, and there are certainly heterogeneity advantages to such a platform. Maybe someone who knows enough about such things should actually do this :-) If I had a source of FPGAs that were decently documented - in particular, that didn't demand use of a vendor-provided opaque binary blob to generate the programming data, I'd probably be doing that (among other things) already. (Such things may well already exist, but I haven't found them. Not that I've put all _that_ much effort into looking; finding needles in haystacks is not exactly my forte - unless the needles are bugs and the haystacks are code.)
Re: UNIX kernel notification system
[Do you really mean to use paragraph-length lines? I'd suggest against it; they impair readability significantly, at least for me. Manually rewrapped in the quotes below.] less(1) (or more(1)) doesn't take care of you? Maybe; see below. The nice thing about such formatting is that the text can be wrapped at relatively arbitrary word boundaries, making it more readably displayable on a wider range of display widths (e.g. mobile phones, tablets). That would have been true if the mail were marked format=flowed, which yours wasn't. Since it wasn't, the UA has to assume that that single long line is supposed to be a single long line, and rewrapping it arbitrarily is wrong. Actually, my software deals with it moderately poorly. Depending on exactly which piece is handling the text in question, it either wraps with no regard to word boundaries or truncates - I'm not sure whether this counts as tak[ing] care of it for [me] or not. When mail displays uglily because my software doesn't know how to interpret correctly-marked mail, I don't mention it - but when the mail isn't marked as rewrappable, it is hardly a UA fault to not rewrap it. Again, I'll be manually repairing the damage for purposes of this email. What about embedded? [...] What about machines with multiple keyboard/screen heads [...] I'd argue that embedded is a degenerate case of lights-out, [...] Certainly defensible. The multi-bottle+keyboard ( possibly mouse, though last we met, there was only one of you ...) is arguably the standard multi-user timesharing system set up, with a little more complicated terminal Hm. I think I agree. The handling could even be look up the appropriate language for this message to match what the users of this system know how to read, e.g. catgets(3) in NLS message catalogs. See? i18n handled! I think it's more like i18n handwaved, but never mind. (OK, except for the translation part, but I'll put on an MIT X11 hat here and say, mechanism, not policy!) Agreed. 
For unclassified text messages, I think just passing the text message to userland (for display, translation, ignoring, checking against admin-configured swatch-style patterns, whatever) is about as good as we're likely to get. /dev/log basically _is_ that answer, though; that part's already done.
Mail.app idiocy [was Re: UNIX kernel notification system]
[Do you really mean to use paragraph-length lines? [...]] less(1) (or more(1)) doesn't take care of you? [...] I assume you're using Mail.app as a user-agent. Apple used to do this right -- they wrapped the lines at 72 columns or thereabouts, but then marked the text as format=flowed in the MIME headers, so readers wishing to rewrap it could do so. In recent versions they've broken this again. Woo hoo. Because, you know, all the world's a Mac and every user-agent is Mail.app. There actually is approximately zero chance this will get fixed. I know someone who works at Apple (I have no reason to think he's on this list) who checked their logs and tells me the breakage was done for Exchange-compatibility reasons. I would have hoped Apple would have had the balls to close the bug report with a "cannot fix; it's Exchange that's broken here" response, or _at least_ provide a "bug compatibility with Exchange" tickbox, but apparently they prefer to just ship broken software, silently gulling their users into inflicting second-hand Microsoft brain damage on the rest of the net. It doesn't help that most Apple users - heck, most users - aren't competent to understand the issue and thus don't see what the problem is. (I hasten to add that Erik is not on that list of most [] users whom I fear are not competent to understand the issue.) The current Mail.app behaviour is broken enough in its own way. Try sending a nicely-laid-out table like

    Host      Size    Connection
    frodo     74290   10-only
    bilbo     81288   10/100
    samwise   41442   10-only
    sauron    940061  10/100/gig
    aragorn   286166  10/100
    merry     40447   10/100

to such a user and watch it get converted into

    Host Size Connection frodo 74290 10-only bilbo 81288 10/100 samwise 41442 10-only sauron 940061 10/100/gig aragorn 286166 10/100 merry 40447 10/100

One of my correspondents at work has exactly this problem when I send nicely formatted text.
This is, admittedly, less evil than the Android mail client's unalterable behavior of base64-encoding *all* message data, even that which would be perfectly readable as plain ASCII text. I don't think so. That at least is correctly labeled and thus _can_ be mechanically corrected for, by those for whom it needs correction. This is a case of Mail.app suckering its users into sending out mislabeled mail without even telling them it's doing so.
Re: fs-independent quotas
(ufs is unix filesystems, isn't it ?) On the few occasions when I've seen it expanded (usually in Sun documentation - my impression is that the name came to BSD from Sun), the U has been expanded to User. However, regardless of the expansion, the name has come to refer to what is perhaps more properly called FFS, and I think using the ufs name as part of something that is filesystem-independent is a mistake. If nothing else, it will confuse humans.
Re: fs-independent quotas
Nor in the tree-based dictionary, or in the multidimensional array. No, in an array the unused locations do exist. I don't understand this. If you have a 2-dimension array quota[id][type], and quota[class=group] doesn't exist for this filesystem, you have quota[class=group]=NULL and no memory associated with it. Not no memory. The memory for that pointer, the one that's nil, still exists. If you use an array, representing a quota for id=0 and a quota for id=99 requires 100 array elements, even if 98 of them are nil. With a suitable sparse data structure, the memory cost is... substantially lower.
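To put rough numbers on the id=0 / id=99 example: a dense array of pointers needs one slot per possible id, used or not, while a sparse structure pays only per entry actually present. The sketch below uses a singly linked list as the simplest possible sparse structure; struct sparse_node, dense_bytes and sparse_bytes are illustrative names, not anything from the quota code under discussion.

```c
#include <stddef.h>

struct quota;			/* opaque for this sketch */

/* One entry of a (hypothetical) sparse id -> quota mapping. */
struct sparse_node {
	unsigned id;
	struct quota *q;
	struct sparse_node *next;
};

/*
 * Bytes needed for a dense array indexed by id, covering 0..max_id:
 * every slot exists, even the ones holding a nil pointer.
 */
size_t
dense_bytes(unsigned max_id)
{
	return ((size_t)max_id + 1) * sizeof(struct quota *);
}

/* Bytes needed for a linked list holding only the entries in use. */
size_t
sparse_bytes(size_t nentries)
{
	return nentries * sizeof(struct sparse_node);
}
```

For the example in the text that is dense_bytes(99), i.e. 100 pointers, against sparse_bytes(2). Each sparse entry is individually larger than a pointer, so the sparse form wins only when the id space is thinly populated, which is the usual case for uids.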
Re: Patch: rework kernel random number subsystem
The critical values for the statistical tests are set so that p=.0001, so there should be one false positive (the null hypothesis being that the data _are_ random) in 10,000 rekeyings. In that case the right thing to do is simply to rekey -- though for a hardware generator that fails the test, the conservative thing to do, I believe, is to detach that particular random source, so that is the behavior I intend to leave in place in that case. Conservative, but not necessarily correct. Some systems stay up a long time, and if working hardware RNGs get auto-detached whenever a 1-in-10000 test trips, long-lived systems _will_ lose their RNGs. I think this is suboptimal. Indeed, a hardware RNG that _didn't_ fail that test once in a while would be suspect.
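The "long-lived systems will lose their RNGs" point can be made quantitative. If each test independently gives a false positive with probability p, the chance that a perfectly good generator has tripped at least once after n tests is 1 - (1 - p)^n. The sketch below just evaluates that; it assumes the tests are independent, and the function name is made up.

```c
/*
 * Probability that a correctly working generator has failed at least
 * one of n independent tests, each with false-positive probability p.
 * Computed as 1 - (1 - p)^n by repeated multiplication.
 * (p_spurious_failure is a hypothetical name for this sketch.)
 */
double
p_spurious_failure(double p, unsigned long n)
{
	double none = 1.0;	/* probability of no false positive so far */
	unsigned long i;

	for (i = 0; i < n; i++)
		none *= 1.0 - p;
	return 1.0 - none;
}
```

With p = 0.0001 the result after 10,000 tests is already well over one half, and after 100,000 tests detachment of a healthy source is close to certain.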
4.x -> 5.x locking?
I've got some questions about locking. I've got some kernel code to move from 4.x to 5.x. Looking at it, I can see it effectively assumes the kernel is giantlocked: it assumes that at most one CPU is executing in the kernel. (This particular code never runs in device interrupt context, though it may get called from callouts and the like.) I seem to recall being told, when I asked more or less this question on the lists some time back, that this was a safe assumption under 4.x - and, indeed the thing seems to work on 4.x. Thus, my first question: is this also true of 5.x? I found mutex(9), condvar(9), and the like. But it is not clear to me what I need to do to be MP-ready. Do I need to use the stuff from mb(9), or membar_ops(3), or what? It's not clear from the manpages whether, for example, membar_enter is usable within the kernel; the reference from mutex(9) seems to imply so, but I've been surprised before. It's also not clear whether it would even work; I see no statements promising that if I, for example, do

    mutex_enter(mtx);
    ...update a data structure...
    membar_sync();
    mutex_exit(mtx);

that the updates will necessarily be visible to another processor which later takes the same mutex; membar_sync() is specified to synchronize memory accesses with respect to other memory accesses, not necessarily with respect to (for example) mutex operations, and it's not clear whether the "other memory accesses" include accesses by other processors. I could have the other processor do a membar_enter() after taking the mutex, but, again, it's not clear whether the accesses the manpage talks about refer to this CPU or any CPU. ("Any CPU" is more useful here (and probably more expensive), but "this CPU" is what I'd expect from what I've read of memory barriers in CPU documentation.) The mb(9) page specifically warns that it does not entail any promises about pushing stores to visibility by other processors, so I don't think it's useful here - am I wrong?
And, finally, with reference to the membar_ops(3) page, what does it mean for a load to reach global visibility?
Re: 4.x -> 5.x locking?
membar_sync(); mutex_exit(mtx); mutex_enter and mutex_exit are implicit memory barriers (reads and writes respectively are not allowed to be reordered). Oh! Thank you. Has that made it into mutex(9) in -current? If not, I offer my opinion that it should. Does mutex_exit also implicitly push writes to main RAM, or whatever else is necessary to make them visible to other CPUs? (A reordering barrier does not necessarily imply a global visibility barrier.)
Re: 4.x - 5.x locking?
> You can still have non MP-SAFE drivers in netbsd-5 these days. If you do not set D_MPSAFE flag they will be giant-locked[1], [2], [3].

Some of the code is not a device driver, in the sense that it is entered other than via {b,c}devsw[] entries or interrupts.

> [...mutex...condvar...]
> b) In cv_wait mtx is used this mutex is released before thread went to sleep and acquired before it's woken up,

Yes.

> which means that you can safely do required work in side mtx guarded producer area.

Not quite. On modern MP systems, mutual exclusion such as you outline (which affects flow-of-control only) is not enough; you also need memory barriers. Joerg Sonnenberger just said that mutex_enter() and mutex_exit() include the appropriate reordering barriers. I don't yet know whether they include global visibility barriers (data cache pushes on this CPU and invalidates or snoops on others) or not - you may have seen my note to the list asking - but those are needed too; if they are part of the mutex routines, then your skeleton code is correct, though your explanation omits part of the reason why.
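For readers who want to see the shape of the producer/consumer skeleton under discussion, here is a minimal userland analogue in Python using threading.Condition. It is not kernel code and does not use mutex(9) or condvar(9); the names cond, queue, producer, consumer and run are made up for illustration. It shows only the flow-of-control property agreed on above (the wait releases the lock while asleep and reacquires it before returning) and says nothing about which memory barriers the kernel primitives provide.

```python
import threading
from collections import deque

cond = threading.Condition()      # a lock plus a condition variable
queue = deque()                   # shared state, touched only with cond held

def producer(items):
    for item in items:
        with cond:                # roughly mutex_enter() ... mutex_exit()
            queue.append(item)
            cond.notify()         # roughly cv_signal()

def consumer(count):
    received = []
    while len(received) < count:
        with cond:
            while not queue:      # re-check the predicate after every wakeup
                cond.wait()       # releases the lock while asleep, reacquires before returning
            received.append(queue.popleft())
    return received

def run(items):
    """Run one consumer thread against a producer in the calling thread."""
    result = []
    t = threading.Thread(target=lambda: result.extend(consumer(len(items))))
    t.start()
    producer(items)
    t.join()
    return result
```

Calling run([1, 2, 3]) hands the three items from one thread to the other in order; all access to queue happens with the lock held.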
Re: 4.x - 5.x locking?
> However, since we aren't talking about non-cache-coherent architectures (which require even more manual manipulation) it's only about access reordering in the memory hierarchy.

I'm not totally clear on what cache coherency is. Based on these remarks, I'm going to guess that a cache-coherent architecture is one on which, as far as the model visible to the programmer (including kernel programmer) goes, it is not possible to have conflicting data in two CPUs' caches: either different CPUs don't have distinct caches, or there is automatic cache update and/or invalidation in hardware (at least optionally, and if it's optional then NetBSD runs the hardware in that mode). Correct?

If so, that completely annuls the hairiest of my worries.
Re: fs-independent quotas
> The arguments that ufs_quota_entry (or whatever its name is) will be good enough for any future filesystem is just not true.

You have asserted that. Proof by repeated assertion is...unconvincing.

Not that I think numeric IDs will be good forever. But, given the lack of any _other_ filesystem interfaces that represent such things as strings, I think they are far enough off that it is much too early to try to design them in here.
Re: bumping ARG_MAX
> > valkyrie% grep foo */*/Makefile
> Use: grep -r --include Makefile foo .

That (a) will include Makefiles at other depths than two (which may not be a problem in the specific example of pkgsrc, but in general makes it non-equivalent), (b) is grep-specific, and (c) will walk the whole tree to full depth even if there aren't any more Makefiles for it to find.

I think the point still stands.
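Points (a) and (c) can be illustrated with a small sketch. It uses Python's glob and os.walk as stand-ins for the shell pattern and for a recursive search; the directory layout and the names compare and demo are invented for the example, and it only lists matching files rather than grepping them.

```python
import glob
import os
import tempfile

def compare(root):
    """Return (glob_hits, walk_hits) as sorted lists of paths relative to root."""
    # Counterpart of the shell pattern */*/Makefile: exactly two levels down.
    glob_hits = sorted(
        os.path.relpath(p, root)
        for p in glob.glob(os.path.join(root, "*", "*", "Makefile"))
    )
    # Counterpart of a recursive search: every directory is visited, at every depth.
    walk_hits = sorted(
        os.path.relpath(os.path.join(d, "Makefile"), root)
        for d, _subdirs, files in os.walk(root)
        if "Makefile" in files
    )
    return glob_hits, walk_hits

def demo():
    """Build a throwaway tree with Makefiles at depths 0, 2 and 3, then compare."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("Makefile", "a/b/Makefile", "a/b/c/Makefile"):
            path = os.path.join(root, rel)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        return compare(root)
```

In this layout the two-level pattern matches only a/b/Makefile, while the recursive walk reports all three files and has to descend into every directory to do so.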
Re: Patch: new random pseudodevice
> In what sense are bits really ever taken out?

Revealed to userland, of course. The idea here is that entropy that has been revealed to userland might as well not be present. With good mixing at appropriate points, this is of questionable truth, but it is, as you said, a very conservative approach; it amounts to assuming userland has unlimited computational power available to invert the mixing. Combined with the conservative approach to estimating how much entropy was put into the pool, it is a reasonably good way of making sure that when you ask for strongly random bits, you get strongly random bits uncorrelated with anyone else's bits.

Your change loses this property, depending on something I might call an "entropy stretcher", something which takes some number of random bits and produces a much larger number of no-longer-random bits: essentially, even the supposedly-strongly random device becomes just a PRNG. (A complex one among PRNGs, but still a PRNG.)

> If there is some kind of correlation between the bits you get from the pool now and the bits you got from the pool then, the right answer is not to put more bits in and hope the correlation gets worse; it is to correct the output function so that finding such a correlation is actually cryptographically hard.

It's true that better mixing on output is a good thing. However, it does not fix the fundamental problem that you can't get out more information than you put in. Even a _good_ PRNG can't avoid correlated output bits, even if the correlation is complex enough to be hard to exploit. You are replacing a very conservative and well-defined concept (the amount of unrevealed information remaining in the pool), even if a somewhat misleading term ("entropy") is used for it, with a vague hope/belief that your PRNGs are hard to invert.

> Before, bits were extracted from the pool with a construct nobody had really studied, and we counted every bit output as if it had been somehow consumed.
> Even though we didn't actually understand what "consumed" meant.

Maybe you didn't. I thought it was perfectly clear: information exposed to userland reduces the amount of secret random information content remaining in the pool.

In practice, I doubt your changes weaken it much...yet. In theory, they are pretty horrible. Information content is a fairly well-defined concept, and the old code took a conservative approach to measuring it and doling it out. You are replacing that with something that appears to think it can turn a small amount of information into a large amount, which is not possible; the information content of the output of your per-device PRNG cannot be more than the amount of information you keyed it with, even if the correlations are currently difficult to see.

I would welcome better mixing on output. But this information stretching for the supposedly-strongly-random device is, in my opinion, just plain broken.

> And note that at least one highly-thought-of modern design for an entropy collector (Fortuna) doesn't even _try_ to keep an entropy estimate

Because one popular system makes a mistake, we should make the same mistake? (Actually, see my last paragraph, below.)

> -- the whole concept is pretty fuzzy when you start trying to count how many bits you took out.

Not fuzzy at all. Read "unrevealed information content" for "entropy" and it makes a whole lot of sense. The number of bits you took out is the number of bits of information revealed to elsewhere. (The amount of information content, not necessarily the number of bits of apparent information - if you feed 32 bits of information into a hash function, you get at most 32 bits of information content out, even if they're spread across multiple hundreds of output bits.)

It's possible there's something going on here I don't understand, which invalidates these arguments. I'd welcome any pointers to such a thing.
But until then, I'm going to stick with the information-theoretical point of view that you can't get more information out than you put in, and call this "key a PRNG and then generate more bits of output than there were in the key" implementation of the supposedly-strongly-random device broken.
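The counting argument can be checked on a toy scale. The sketch below is an illustration only, not the code from the patch under discussion: stretch is a hypothetical "entropy stretcher" built from SHA-256 in counter mode, and distinct_outputs enumerates every possible seed of a given size. However many output bytes are requested, the number of different outputs can never exceed the number of seeds.

```python
import hashlib

def stretch(seed: int, seed_bits: int, out_bytes: int) -> bytes:
    """Toy 'entropy stretcher': expand a small seed into out_bytes of output."""
    key = seed.to_bytes((seed_bits + 7) // 8, "big")
    out = b""
    counter = 0
    while len(out) < out_bytes:
        # Hash the seed together with a running counter to get more output.
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:out_bytes]

def distinct_outputs(seed_bits: int, out_bytes: int) -> int:
    """Count how many different outputs exist across every possible seed."""
    return len({stretch(s, seed_bits, out_bytes) for s in range(2 ** seed_bits)})
```

With a 10-bit seed and 64 bytes (512 bits) of output, distinct_outputs(10, 64) cannot be larger than 2**10, so those 512 bits carry at most 10 bits of information no matter how well mixed they look.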
Re: Patch: new random pseudodevice
[I'm pulling together multiple mails from tls here. The second-level quotes are from varying people; I've marked their authors according to the info I have.]

> > [Mouse] Revealed to userland, of course.
> > Combined with the conservative approach to estimating how much entropy was put into the pool, it is a reasonably good way of making sure that when you ask for strongly random bits, you get strongly random bits uncorrelated with anyone else's bits.
> Look at the implementation. It *never* worked that way.

It came reasonably close.

> To cause bits to actually be taken out, you'd have to maintain two pools, discard the entire contents of one every time any bits were revealed to userspace, and switch to the other. Or something along those lines. And that's just not how it ever worked.

To be certain of it, yes. It should have been done that way, and I would support changing it to be done that way. What it actually was is a hybrid.

> > [Alan Barrett] Fair enough, but you still seem to be talking about how good a CSPRNG it is, whereas my concern is that it's pseudorandom, not random.
> So was the output from the old entropy pool.

Only sort of.

> As soon as you start accumulating random bits in any manner that leaves the old ones in -- that is, does not entirely eliminate them as inputs to the accumulator function -- even after you take them out -- that is, disclose the accumulator function's output -- you are dealing in precisely what you say you want not to: cryptographically secure pseudorandomness.

Not entirely, in two respects: (1) provided you don't draw on the pool for more bits than it has information content, you are in principle getting information out. If the mixing function is good, you will actually be doing so. (2) If you are stirring new randomness into the pool in the meantime, it...complicates things; it is no longer a pure PRNG.
To an approximation as good as the mixing function is good, you can draw on the pool as much as you like, provided you never draw enough to reduce the information content to zero. I'm not particularly happy with the old mixing function; see below.

I would much prefer to do what you suggest...

> To get the property you seem to want, you basically have to buffer the purportedly true randomness into pools, blocks, [...]

...to accumulate entropy into unread blocks and release it to the rest of the subsystem in blocks, to be consumed as such. Preferably, it should be whitened as it's accumulated into blocks. _That_ would actually be fixing one of the potential problems the current system has, without introducing more.

> Let me put it this way: before, you may have thought you were getting some kind of true randomness. You weren't.

But you were, to a decent approximation, assuming the input entropy estimate was not an overestimate. You appear to think that anything that's not pure true randomness must be pure PRNG. The old randomness pool was neither; it was a hybrid. If the mixing function is good - and that's the first weakness; we have only heuristic reasons to think it is - then to a correspondingly good approximation, it returned real randomness - bits uncorrelated with anything else - as long as the actual entropy-in-pool was greater than the number of bits returned. This is the second weakness; nobody really knows how good the estimate of the entropy supplied by input is. Personally, I suspect it's an underestimate, but I have no particular evidence for that; as long as I'm right, it's OK in this respect.

> Now, you still aren't, but at least what sits between you and the entropy source is a lot more clear, and a lot better analyzed

I don't know; it sounds a lot less clear to me. I'd still rather fix it right.
If you go ahead with inflicting a PRNG on /dev/random, it really really needs to be prominently marked as being a PRNG even for the stronger device, since that's a nontrivial regression over previous versions (which may not have been perfect, but did return at least _some_ real randomness in the bits it produced, the exact amount depending on how good the mixing was and how good the input estimates were).

> The bottom line: pseudorandomness is the best you're going to get.

With your design, perhaps. But you actually outlined a design that does not suffer from those problems; why not do that instead?

> It might as well be done in the safest way possible,

That's fine, for /dev/urandom. If you want to hook what you've outlined, or something like it, up to /dev/urandom, that's fine. For /dev/random, I consider it a bug. That the old system was partially broken (to an extent nobody really knows) does not excuse replacing it with something even more broken.

> I have been trying to follow the Yarrow/Fortuna design paradigm of rekeying the stream generator from the entropy pool at each request. [And asking, what's a request?] However, when applications use /dev/random, we could consider a request to be a single read from
Re: Patch: new random pseudodevice
> You are aware of the fact that 99.99% of computers don't have true random number generators and the bits you claim that are random are not random at all?

Actually, practically all computers have true random number generators. The first problem is that neither they nor their interfaces are designed as such, so getting the randomness out of them and into the system is...interesting. The second problem is that nobody really knows just how good the resulting randomness is - that is, while there is true randomness there, nobody knows just how much information content there is in each random bit. (The latter is one reason for whitening input bits as they are gathered.)

These random number generators are things like the turbulence inside disk drives and the noise in sound input.
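As one concrete idea of what "whitening" can mean, here is a sketch of von Neumann debiasing, a classic textbook technique. It is offered only as an illustration; nothing above says this is what the NetBSD code does, and it removes bias only if the input bits are independent of one another. The function name is made up.

```python
def von_neumann(bits):
    """Debias a sequence of 0/1 values by examining non-overlapping pairs."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:          # 01 -> 0, 10 -> 1; pairs 00 and 11 are discarded
            out.append(a)
    return out
```

The output is shorter than the input, which fits the point made above: each raw bit carries less than one bit of information, and whitening concentrates it rather than creating more.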
Re: Patch: new random pseudodevice
> a small one, but even the small one has sufficient entropy for your purposes.
> Notably, [Mouse's opinion] differs from the opinions of the people who wrote the several relevant FIPS and X9 standards, who _require_ that cryptosystem keys be generated by an approved DRBG (their terminology for a CSPRNG) -- though they also impose minimum entropy requirements for keying the DRBG itself -- and of SP800-90, which explicitly discusses this issue.

I don't know why they chose that. Perhaps the standards body simply made a mistake. Perhaps they had to compromise for any of many possible reasons. Perhaps they simply considered the minimum input entropy to be enough for the purpose the standard is intended for and just use the PRNG as a whitening and stretching function. But I value information-theoretic considerations, such as any deterministic computation's inability to contain more information in its output than is present in its input, over any standards body's output.

[Paul Koning, quoting me]
> > [uses for RNGs]
> > (1) Strong bits suitable for direct use as things like crypto keys. Using a PRNG here, even a really good one, is a major fail. The only time it's acceptable is when the data drawn is no larger than the PRNG key, and then you might as well return the bits directly.
> I don't think this is correct. One thing to keep in mind is that the current standard of quality for a cipher is that its output is indistinguishable from a random string (up to a length limit, 2^blocksize or 2^(blocksize/2), I'm not sure which).

Computationally indistinguishable, today. It is never theoretically indistinguishable, as can easily be seen by considering trying all possible keyings and seeing if any of them match. This is why I'm hammering on the information-theoretic considerations: they are fundamental, not subject to change with advances in cryptanalysis.
The security of bits drawn from a properly-designed entropy pool depends on much weaker assumptions than the security of bits produced by a PRNG (when the PRNG seed is smaller than the number of bits produced).

In practice, today, what Thor is proposing (or, in view of what he's said, perhaps I should say imposing) is probably good enough for most purposes. It is not good enough in theory. That's why it does not satisfy me, especially when there is an easy way to get something that is as good, theoretically, as is available - indeed, Thor himself outlined it.

[smb]
> In my opinion ([and presumably others']), a CSPRNG is more secure. Why? Because we *know* what it does, all the time. True RNGs are devilishly hard to get right, and are susceptible to all sorts of environmental perturbations. Imagine what would happen if someone upgraded the disk to a flash disk or one with a large flash cache

You still need a true RNG (to seed your PRNG), though, or you get predictable bits. A CSPRNG makes a good mixing function. But that's really all it's doing, because that's all it's capable of doing.

In any case, it's no skin off my nose. I have plenty of other reasons for not using whichever NetBSD this ends up in. I've pointed out the problems; if NetBSD is determined to carry on regardless, that's its lookout.
Re: Use consistent errno for read(2) failure on directories
> According to the online OpenGroup specification for read(2) available at [1], read(2) on directories is implementation dependent. If unsupported, it shall fail with EISDIR.
> Not all our file systems comply, and return random errno values in this case (mostly EINVAL or ENOTSUP).

How does that not comply with "implementation dependent"? From a standards-conformance point of view, that's equivalent to "in this implementation, read(2) on directories is supported: on $FILESYSTEM, it always returns EINVAL; on $OTHER_FILESYSTEM, it works according to $REFERENCE; on $THIRD_FILESYSTEM, it always returns EOPNOTSUPP".

This is not to say that it shouldn't be cleaned up. Just that I don't think it's actually nonconformant.
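To see what a given system and filesystem actually do, a small probe along the following lines can help. It is a sketch for a POSIX-like system; the function name is made up, and it makes no claim about which result any particular filesystem produces.

```python
import errno
import os

def read_dir_result(path="."):
    """Try read(2) on a directory; return the errno name, or None if the read succeeded."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, 512)
        return None                      # this system allows reading directories
    except OSError as e:
        # Map the numeric errno to its symbolic name where one is known.
        return errno.errorcode.get(e.errno, str(e.errno))
    finally:
        os.close(fd)
```

Running it against directories on different mounted filesystems shows whether they agree on the error value.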
Re: Debian OpenSSL desaster (was: Patch: new random pseudodevice)
> [I tried to send this as private mail, but get
>   host Sparkle-4.Rodents-Montreal.ORG[216.46.5.7] refused to talk to me: 550-.de's whois server, whois.denic.de, is completely broken, [...]

I wrote up a point-by-point reply to this, but then realized, this is tech-kern, not tech-broken-network-governance. So I'll confine myself to saying my response is at {ftp,http}://ftp.rodents-montreal.org/mouse/ccTLD-thoughts.txt for anyone interested. (Actually, "will be at"; as I send this mail, I'm still writing it - the draft is available at .../ccTLD-thoughts-draft.txt and I'll move it when I'm done.)

As for the content...

> > I don't recall full details, but I think it was a Linux distro
> It was the Debian OpenSSL desaster. In essence, they patched OpenSSL's entropy gathering to the point where the PID was the only entropy source being used.

Ah. Yeah, that'll do it. Thanks for the correction; I'm not surprised I got some of the details wrong - but the actual incident works just as well for the argument I was making with it.
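Why a PID-only seed is fatal can be shown with a toy model. This is not OpenSSL's code: keygen is a hypothetical stand-in for any deterministic key generator, and PID_MAX = 32768 is an assumption about the PID range. The point is only that when the seed space is that small, anyone can enumerate it.

```python
import hashlib

PID_MAX = 32768   # assumption: size of the PID space being modelled

def keygen(pid: int) -> bytes:
    """Hypothetical key generator whose only seed material is the PID."""
    return hashlib.sha256(b"seed:" + str(pid).encode()).digest()

def recover_pid(observed_key: bytes):
    """Enumerate every possible PID and return the one that produces observed_key."""
    for pid in range(1, PID_MAX):
        if keygen(pid) == observed_key:
            return pid
    return None   # the key did not come from this generator
```

A 256-bit key produced this way holds about 15 bits of information, and recover_pid(keygen(12345)) finds the seed after at most a few tens of thousands of hash computations.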
Re: Debian OpenSSL desaster (was: Patch: new random pseudodevice)
> [...] The short answer is that Mouse likes tilting at windmills. :-)

Eh. I think that is at least a little of a misstatement. I don't do such things because I enjoy doing them. Quite the opposite. I do them because I must.

I'm not entirely sure what I mean by that. It's difficult to explain, even to myself.
Re: Lost file-system story
> [...], you do indeed seem to think that async-mounted Unix-based filesystems should be able to be repaired, at least some of the time,

There's a huge difference between "this isn't promised" and "this never happens". They _can_ be repaired...some of the time. When they can, it is because, by coincidence, it just so happens that the stuff that got written produces a filesystem fsck can repair.

> The probability of any Unix-based filesystem being repairable after a crash is zero (0) if it has been mounted with MNT_ASYNC, and if there was _any_ activity that affected its structure since mount time up to the time of the crash.

This is simply false. I just tried it. On a 5.1 i386 system, I used fdisk and disklabel to make a half-gig partition, newfsed it, mounted it normally, copied a file into it, unmounted it, mounted it async, removed the file, and hit the power switch. After the machine came back up, I tried fsck on the filesystem. It said it was clean. I used fsck -f. It was happy. I mounted it and, as far as I can tell, fsck was correct in thinking the filesystem was OK.

So, there is an existence-proof-by-example that there are circumstances under which a filesystem mounted async can be changed and still be left in a state fsck can repair.

> It still might survive after some types of changes, but it _probably_ won't.

Right. But that's not "probability ... is zero (0)".

> Linux ext2 is not a Unix-based filesystem and Linux itself is not a Unix-based kernel.

It's about as Unix-based as NetBSD is. Unless you mean something strange by "Unix-based" - what _do_ you mean by it?

> For Unix-based filesystems and their repair tools, any probability of recovery less than one is as good as if it were zero.

That's not how I feel about it when I've lost a filesystem. I'll take a filesystem with a nonzero probability of recovering something useful from over one that guarantees to trash everything any day (other things being equal, of course).
Re: ccTLD filtering (was: Debian OpenSSL desaster)
> You can make [your] point, but you won't win against Mouse as he just doesn't care outside of his wall [...]

Yeah. I used to. Then I realized that it was sucking away a huge amount of time, energy, and stress tolerance, for, as far as I could tell, zero benefit to anyone, including me.
Re: Lost file-system story
> > They _can_ be repaired...some of the time.
> That's totally irrelevant.

I don't think so, not when I'm replying to a claim otherwise.

> Possibilities other than zero or one are not useful in manual pages,

Then we can throw away fsck, because there is always _some_ chance the filesystem will be irreparable. Memory, CPUs, disks, and the transports between them do fail, occasionally transiently.
Re: RFC: import of posix_spawn GSoC results
> What's clean about importing the VMS process model to Unix?

That's hardly the VMS process model - or at least it wasn't back in the '80s when I used VMS. In particular, in the VMS paradigm, the CLI (as close as VMS gets to the shell) and the program being run run in the same process.
Re: RFC: import of posix_spawn GSoC results
> > > What's clean about importing the VMS process model to Unix?
> > That's hardly the VMS process model - or at least it wasn't back in the '80s when I used VMS. In particular, in the VMS paradigm, the CLI (as close as VMS gets to the shell) and the program being run run in the same process.
> Well, the process model as such will not change just because another system call is introduced in Unix.

No, of course not. That's why I think talking about importing the VMS process model is irrelevant - that's not what's happening.

> Also, the CLI is not really related to the process model either.

Well, it is in that the CLI and the program being run run in the same process, which means that process is very long-lived. Starting new processes under VMS is - well, was - a very heavyweight operation, far more costly than fork() under Unix. The paradigm was, a process was created on login and it lived until logout; that the CLI inhabits the same process as programs you run is relevant only in that it eliminates one of the principal reasons Unix needs lots of processes.

> However, yes, this system call looks like it comes very close to how tasks are run under VMS.

Even down to the name. LIB$SPAWN, I think it was. But I don't know about "very close"; aside from relatively trivial things, like file descriptors having no particularly close analog, the real problem is that LIB$SPAWN is not how most things are - were - run under VMS. Rather, everything runs in the same process, with each program you run replacing the previous one. Additional processes are necessary only if you want to run something detached, which is not common.

Or at least, that's how it was. With POSIX's dominance, it would not surprise me if VMS had tried to jump on the bandwagon and support the Unix paradigm at least to the extent necessary to implement many of the POSIX interfaces. You've probably used more recent VMS than I have; has this happened?
> I won't go into the pros and cons of different ways of starting a new process to run something.

And there we are: under VMS, you don't - well, didn't, back in the '80s when I used it - start a new process to run something. You ran it in the same process you ran everything else in.
Re: RFC: import of posix_spawn GSoC results
> From the annals of the POSIX wars: the rationale for posix_spawn() was to support systems without MMUs, where fork() is expensive, and vfork() impossible.

I would quibble with calling vfork() `impossible'. Perhaps I'm missing something, but vfork() seems particularly well-suited to such a system to me - the "borrow the VM" semantics strike me as exactly what you want when context-switching is expensive. (Though of course the `V' of `VM' is a bit of a misnomer in that circumstance.)

In any case, even if I'm wrong, vfork isn't impossible, just, at worst, ludicrously expensive. :-)
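For readers who have not met the interface this thread is about, the sketch below shows the calling pattern: one call names the program, its arguments and its environment, and the caller never runs any code of its own in a forked copy. It uses Python's os.posix_spawn wrapper on a Unix-like system rather than the C function; spawn_and_wait is a made-up helper name.

```python
import os
import sys

def spawn_and_wait(argv):
    """Start argv[0] with posix_spawn and return its exit status (None if killed by a signal)."""
    # One call replaces the usual fork()/vfork() followed by exec*().
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
```

For example, spawn_and_wait([sys.executable, "-c", "raise SystemExit(7)"]) starts a second interpreter and returns its exit status.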
Re: NetBSD/usermode (Was: CVS commit: src)
This all seems simple and elegant enough, but it does not (quite) work:

A) It still requires a new system call on the outer kernel.

*Perhaps* this could be avoided by using ptrace, which might be simpler with this approach because the rule is simple: just say no to all system calls. Well...sort of.

B) There is no way for the usermode userspace process to allocate memory. I don't really see a clean way to fix this:

ptrace can support this too. It can let sbrk/mmap through, but tell the usermode kernel as it does so. Or it can consult with the usermode kernel first, and then let them through in a possibly modified form.

2) Using ptrace to allow, but validate, sbrk and mmap arguments seems questionable at best. How would this interact with the NetBSD VM system in the usermode kernel?

With difficulty. :) I am confident this could be dealt with; if necessary, sbrk and mmap could be intercepted and turned into different calls of some sort. SysV shm calls? mmap() of /proc/something? I've long thought that something akin to SCM_RIGHTS should exist for passing memory regions between unrelated processes. That would come in extremely handy here. (But given how badly SCM_RIGHTS got botched, it probably would end up exploding somehow.)

4) How exactly does the usermode kernel _end_ the usermode userspace processes in a clean way?

Use ptrace to force it to call exit(), and let the exit() call through to the real kernel. Or just use PT_KILL.

Working through it makes me really wonder whether there's _any_ portable way to do this stuff.

Sure - by instruction-level emulation if naught else. (Not a great way, but it certainly can work.)
raidframe rebuild question
I've got a bit of a practical issue with raidframe. The machine is at 4.0.1. The RAID devices are

raid0: L5 /dev/raid5e /dev/raid6e /dev/raid7e /dev/raid4e /dev/raid9e /dev/raid10e /dev/raid11e[failed] /dev/raid12e /dev/raid8e
raid1: L1 /dev/raid2e /dev/raid3e
raid2: L1 /dev/ld0e
raid3: L1 /dev/ld5e /dev/wd3e
raid4: L1 /dev/ld8e
raid5: L1 /dev/ld2e
raid6: L1 /dev/ld4e
raid7: L1 /dev/ld3e
raid8: L1 /dev/wd4e
raid9: L1 /dev/ld1e
raid10: L1 /dev/ld7e
raid11: L1 /dev/ld6e
raid12: L1 /dev/wd2e

Just recently, /dev/ld6e decided it didn't like us any longer. (Actually, I think it is probably the twe it's connected to, not ld6 itself.) I manually failed /dev/wd3e in raid3 and added it as a spare to raid11, but now I find myself stymied as to how to get it to rebuild.

raid11 is of course failed in raid0; I could raidctl -R it, but that won't help until raid11 is back in operational shape. I can't reconstruct raid11, because it has no operational members. I can't unconfigure it (preparatory to reconfiguring it), because it's held open by raid0.

What's the right way to do this? Am I stuck needing a reboot?
Re: fifo and [acm]time
> On ufs (and tmpfs and perhaps others), reading from or writing to a fifo updates its [acm]time [...] Note that the same [acm]time updates do not apply to sockets.

Aside from whether this is a good idea, this difference may make sense. A FIFO is a single shared object; multiple opens result in multiple references to a single FIFO.

You don't say whether the socket case is SOCK_STREAM or SOCK_DGRAM sockets. For SOCK_STREAM, a socket in the filesystem is more like a cloning device: a connection established using it as a rendezvous point results in new sockets, distinct from the socket corresponding to the filesystem entry (though one of them is derived from it). It is these new sockets that the I/O occurs on and whose [acm]time would logically be updated (if they had [acm]times, which they don't, because they don't have [iv]nodes).

For SOCK_DGRAM, the above argument does not apply. As a matter of theory, read()/recv()/etc on such a socket should update the atime and write()/send()/etc should update the mtime (and, of course, in each case the ctime as well). (I/O on the peer socket, the one that doesn't have a filesystem entry, should not do anything of the sort, because that socket doesn't have any [iv]node to update the [acm]time of.)

Pragmatically speaking, it's not clear to me that there's enough value in either stance to make it worth changing whichever behaviour the implementation happens to provide.

> And what applications would ever rely on the [acm]time of a fifo?

The only value I can see in the [acm]time of either a FIFO or an AF_LOCAL socket file is to see when the relevant software last did anything with it. This is less a matter of an application proper using the timestamp and more one of a human who's investigating something looking for relevant (or possibly-relevant) data.
> One consequence of this is that in a vanilla NetBSD install, Postfix triggers disk I/O every minute when master tickles the pickup daemon by writing to the fifo /var/spool/postfix/public/pickup.

Is one inode update per minute enough to be a significant issue? (Sure, there will be cases where it is enough to matter. But are they common enough that it's worth doing anything to NetBSD in consequence, or are they outliers that call for per-system custom tuning? I can think of at least one approach which likely would address this problem, in cases where it _is_ a problem, already, and that's with no more than a minute or so of thought.)
Re: raidframe rebuild question
[raidframe woes] What's the right way to do this? What about creating a (Level 1) raid13 consisting of wd3e, adding (a partition on) that as a spare to raid0, and failing raid0's raid11e component? That's probably what I should have done. I don't seem able to do it now, though; raidctl -r refuses to remove /dev/wd3e from raid11's spares. (It doesn't complain, but wd3e is still listed as a spare when I check with -G afterwards.) I should probably go read the code to see if I can figure out what's really going on here...might be worth setting up a test machine I _can_ reboot casually. (The machine in question is a production machine and I'm not in the right city to deal with it personally.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: fifo and [acm]time
The only value I can see in the [acm]time of either a FIFO or an AF_LOCAL socket file is to see when the relevant software last did anything with it. Diagnostic information is useful, but is it useful to store on disk? In many cases storing it in core is good enough, though I'm sure there are at least some cases where it needs to go to disk. It seems to me that for the investigation you describe, systems such as ktrace, dtrace, and filemon would be more appropriate than the [acm]time of the inode. Possibly. I'm not familiar with dtrace or filemon, but ktrace cannot produce that information unless the relevant processes were being traced when they last did I/O. [acm]times allow after-the-fact investigation without needing to leave the processes traced during routine operation. However, I suppose they monitor processes, rather than inodes, There's that too. Is one inode update per minute enough to be a significant issue? It means the disk must continue spinning and, e.g., will continue to draw power from a laptop battery to do so, even when the system is functionally idle. Aren't there lots of things that already do that? Some of them can be suppressed by various mechanisms (eg, nodevmtime mounts); one possibility is to use the same or similar mechanisms here. Another is to address it some other way. In the case of postfix, the first thing that occurs to me is to make the FIFO path a symlink into a tiny mfs mount dedicated to the purpose; updating the mtime of an inode in a ramdisk is very fast, very cheap, and does not require keeping a disk spinning. Depending on whether the relevant support has bitrotted, it could even be turned into a direct mount of a ramdisk whose root inode is a FIFO rather than a directory. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: raidframe rebuild question
i seem to recall that we found some missing close calls inside raidframe that are now fixed in -current, and possibly pulled upto netbsd-5 and probably not netbsd-4? Worse than that - see below. i think you do need a reboot, unfortunately. I think so too. I finally got around to looking at the code, and it turns out the ioctl backing raidctl -r is totally unimplemented (quoted code here is from the source tree from which the kernel on that machine was built):

	case RAIDFRAME_REMOVE_HOT_SPARE:
		return(retcode);

Not only that, but rf_remove_hot_spare, even were it called, is unimplemented too:

	int
	rf_remove_hot_spare(RF_Raid_t *raidPtr, RF_SingleComponent_t *sparePtr)
	{
		int spare_number;

		if (raidPtr->numSpare==0) {
			printf("No spares to remove!\n");
			return(EINVAL);
		}

		spare_number = sparePtr->column;

		return(EINVAL); /* XXX not implemented yet */
	#if 0
		if (spare_number < 0 || spare_number > raidPtr->numSpare) {
			return(EINVAL);
		}

		/* verify that this spare isn't in use... */

		/* it's gone.. */

		raidPtr->numSpare--;

		return(0);
	#endif
	}

So, yeah, I don't see any way out of this but a reboot. :( /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: DADHI drivers for Asterisk?
PCI boards for Asterisk require kernel drivers, [...] [...], which have a FreeBSD port here: http://svn.digium.com/svn/dahdi/freebsd/trunk Anyone started working on porting that to NetBSD? Have you realized that these drivers are apparently GPL/LGPL and thus not suitable for NetBSD kernel inclusion? I once started looking at writing a native driver for one of the Digium FXO/FXS cards, only to find that, as far as I could tell, the only hardware documentation available was the Linux driver source. I spent a little time reading over the Linux drivers, but it was an ugly enough mess to try to glark hardware interfaces from the driver that I lost interest before getting anywhere. If anyone does manage to find hardware doc, I'd be interested. I'm not likely to produce a driver soon (I don't expect to have the time), but I'd like to have the doc against future possibility. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: buffer cache ufs changes (preliminary ffsv2 extattr support)
I'm working on porting the FreeBSD FFSv2 extended attributes support. [...] 1) Add a new bflag, B_ALTDATA. [...] 2) instead of using a new flag, add a new 'int type' member [...] Although I've done 1 as a POC, I prefer solution 2 ([...]). What do others think? As a choice of approach to implementing what you want, I think 2 is better. It's far more generalizable. As a piece of SF I read once said, the number two is ridiculous and can't possibly exist. It was talking about universes, but the basic concept applies here too: there's very little excuse for any number between one and many. However, I think that constitutes a good implementation of a bad idea. This makes a file no longer a long list of octets; it becomes multiple long lists of octets. The Mac did this, with resource forks and data forks, and you may note OS X doesn't do it any longer. I suspect these will seem like a good idea for a while, until people start discovering all the things they break, or that break them, and realize that they didn't learn from history and thus had to repeat it. That said, it's no skin off my nose. I've said my piece, and it won't be affecting me, pragmatically, either way. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: buffer cache ufs changes (preliminary ffsv2 extattr support)
This makes a file no longer a long list of octets; it becomes multiple long lists of octets. [...] [...] I have always found the idea flaky myself (and sorry for the rant): [...] Yeah. I think it's a very interesting direction to take filesystems. But this, interesting as it is, is research experimentation; we do not even nearly understand how to fit multi-fork (to adopt the MacOS term) files into a Unix paradigm (witness all the programs that we don't understand how to change for this), and investigating non-understood things is what research _is_. And I think the master tree for a (supposedly-)production OS is not the place to be carrying out research experiments, not even if another such OS is already doing it. But my opinions seem to correlate negatively with NetBSD's these days. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: NetBSD on current AMD motherboards
Any RAID controller where the management interface works under NetBSD? Well, it's old enough it's more in the nature of an existence proof by example, but at work there's a NetBSD machine with a 3ware Escalade 12-port SATA RAID card. There's a management program that is depressingly poorly documented, but it does work for us. However, we don't actually use the card's RAID facilities, just using it as a multi-port SATA interface and doing the RAID with RAIDframe; all we use the management program for is getting lists of disks attached with serial numbers and suchlike. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: RFC: New bus_space routine: bus_space_sync
Even if originally intended for something else, [...] Why do you think BUS_SPACE_BARRIER_SYNC was intended for something else? I can't see how a write barrier that doesn't ensure the write has reached the target (main or device memory) can be useful. I can't comment on why someone else thinks something. But barriers that have nothing to do with write completion to the target can still be useful. There are algorithms that don't require that writes complete on any particular schedule, but do require that _this_ write complete before _that_ one. When faced with write coalescing and reordering, a write barrier that does nothing but enforce ordering (in the sequence A-barrier-B, the barrier enforces the constraint that there is no time at which write B has completed but write A hasn't) can be useful. For example, the standard double-buffering trick of "write inactive copy, then write variable indicating which is the active copy" does not work if the indicator's write can complete before the (formerly-)inactive copy's writes complete - but, in many uses, there is no requirement that those writes, as a sequence, be pushed to their target at any particular time. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: RFC: New bus_space routine: bus_space_sync
I can't see how a write barrier that doesn't ensure the write has reached the target (main or device memory) can be useful. [...]. But barriers that have nothing to do with write completion to the target can still be useful. [...] That's not what the manpage documenting BUS_SPACE_BARRIER_SYNC says. Read the manpage. Oh, what I wrote wasn't about BUS_SPACE_BARRIER_SYNC specifically. It was about barriers more generally, in response to "I can't see how a write barrier that doesn't ensure the write has reached the target (main or device memory) can be useful." /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: extattr namespaces
One thing that I'm wondering: what are the character constraints on those class names in the Linux API? The reason is that if UTF8 is allowed, it'd be possible for two names to show as an equivalent representation to humans, while they'd be different for the system, [...] Only if userland insists on rendering the octet sequences as UTF-8 characters. That would be stupid of it (in security-important contexts, at least) for this reason if no others. I think the kernel should be as encoding-agnostic as feasible, just as it is now for pathname components, file contents, data flowing through pipes and sockets - pretty much all places where octet strings of any sort cross the user/kernel boundary. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: O_NOACCESS?
Why not use O_DIRECTORY (which is part of -current) and add that to flags? Backporting that might be a better alternative. What are its semantics? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: O_NOACCESS?
Why not use O_DIRECTORY (which is part of -current) and add that to flags? Backporting that might be a better alternative. What are its semantics? It means the open will only succeed if the file is a directory. Worth having, but not sufficient by itself, because it still requires something in the low two bits, and without something like O_NOACCESS there is nothing you can pass there that will let you open a directory you have neither read nor write access to (even if you have search access to it). In a private exchange with someone else, I've determined that it definitely needs more restrictions than I've got on it now, because what I have lets anyone flock() anything - flock does not require FREAD or FWRITE - and lets anyone open any device special file (for no access, but depending on the driver that can still be substantial) and lets anyone keep a big file from being destroyed by having an open descriptor on it, in each case requiring no more access than the ability to name the object (ie, search access on the containing directory and the path leading to it). I really should not have needed to have those pointed out to me. My current plan is to add O_DIRECTORY as well and make O_NOACCESS work only when combined with it. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: O_NOACCESS?
Right. You add O_DIRECTORY to that check. Ah, I misunderstood. My apologies.

	if ((flags & (FREAD|FWRITE)) == 0 && (flags & O_DIRECTORY) == 0)
		return EINVAL;
	if ((flags & O_DIRECTORY) != 0 && (flags & (FREAD|FWRITE)) != 0)
		return EINVAL;

Actually, what I have now skips the latter of those two checks, because I can't see any reason why O_DIRECTORY shouldn't be specifiable with, eg, O_RDONLY. Am I missing something important there? After how I missed something pretty blatantly obvious before, I don't trust myself tonight. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: O_NOACCESS?
There is no problem. O_NOACCESS would be 3. When converted from O_* to F* it becomes 0. And that is indeed what I did. FFLAGS() and OFLAGS() become more complex than just adding and subtracting 1, but that's not difficult to deal with. (If anyone's curious exactly what I did, look at the three commits ending with 5215f8f6551df407d7c87c8e6a80c7b04e9ee844 in the git repo git://git.rodents-montreal.org/Mouse/netbsd-fork/4.0.1/src.) The fact that the O_ flags were not intelligently specified aeons ago so that a conversion is required is regrettable, but at this point unfixable. Actually, I disagree. It is totally fixable. Well, it's unfixable in the sense that we can't change the past choice. But it is fixable in that we do not have to remain crippled by that choice. Quite aside from someone just having the courage to bite the bullet and write off compatibility with such ancient code (are there any known extant examples?), it's possible to do something like

	#define O_MODERN 4 /* or whatever */
	#define O_RDONLY (_FREAD|O_MODERN)
	#define O_WRONLY (_FWRITE|O_MODERN)
	#define O_RDWR (_FREAD|_FWRITE|O_MODERN)

In the libc open() stub, check O_MODERN. If set, just call the syscall. If not, call _really_ancient_compat_open or some such (which latter would be under the control of the relevant COMPAT_* option, and would perform the historical mapping) - or, maybe, just do the mapping in libc. I'm not sure it's worth it, though, just for the sake of eliminating FFLAGS() and OFLAGS(). /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: O_NOACCESS?
There are, however, at least three possible things there's currently no open flags for. (1) search/lookup on a directory, as described; (2) execute on an (executable) regular file; (3) really nothing at all. #1 and #2 could be legitimately combined (as the --x permission setting is combined) into something we could reasonably call O_EXEC. That actually makes the most sense, I think. O_NOACCESS as I implemented it is a quick kludge to graft the effect I want onto the existing framework. It definitely is not the rightest answer, especially with the ugly "works only when O_DIRECTORY is given" `fix'. (Note that while there may be no use for #2 in userlevel code, unless perhaps if we add an fexecve() call, having it would be convenient in the kernel.) fexecve() makes a lot of sense too. So would an flink(), and indeed f* versions of any other call which uses a path just to name an object rather than as a relevant part of the syscall. But, taken to its logical conclusion, that also means that all the pathname-taking calls should have versions which take a directory fd and a single pathname component. This would be nice in some respects, though I'm not sure about bind(2) for AF_LOCAL sockets. C is not a right language for where my mind is going with this. #3, which is what I'd call O_NOACCESS, is something else though; [...] That is, it would let you use open() to create a fd for any path you can name, including devices and whatever else, without granting any access permissions at all. And, indeed, without calling device-level open() routines and such. This would also support what Mouse is trying to do, Actually, I don't think it would, not without creating other problems. If it addresses my desire, then it must keep a reference to the underlying object. And if it does that, then it can be used as a DoS by preventing large files (coredumps, logfiles) from being destroyed upon unlinking them. I'm not sure that needs fixing. It does need more thought.
I have implemented #3 in research kernels and it doesn't cause the world to blow up, although it does require some extra logic for calling device open routines, and NetBSD in particular might be missing checks entirely in certain places (like flock, as previously cited) that would need to be added. What, if anything, did you do about the "anyone who can stat a file can keep it around consuming diskspace indefinitely" issue? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Respawn crashed PUFFS filesystems?
Of course the feature would be broken in some cases, but we could make the thing optional using a vfs.puffs.respawn sysctl, which would contain a colon-separated list of mount points subjected to respawn. What happens if a mount point contains a colon? More to the point, I think this puts the information in the wrong place. Is there any way it could be set as an option at mount time? (That's a serious question; I don't know puffs enough to answer it.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Respawn crashed PUFFS filesystems?
Is there any way it could be set as an option at mount time? The problem is that mount(8) passes the options verbatim to /sbin/mount_xxx, which is supposed to start the xxx filesystem. The filesystem will parse the options on its own before passing appropriate flags to mount(2). We have no way to make sure a third party software will not choke on an unexpected option, and no way to make it pass the option to mount(2). As for "choke on an unexpected option", well, third-party software can choke for any reason or none. But I don't see any reason we can't document this as one of the likely options and let anyone who doesn't handle it, or doesn't pass it back at mount(2) time, deal with user censure for not supporting a useful and easy-to-support facility. Alternatively: this could be useful in other contexts, from post-unmount cleanup in general to auto-remount of non-puffs filesystems. Perhaps it's appropriate to add vfsctl(2), with an option which can set a "run this on unmount" command? Or maybe a "wait for unmount" operation? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: vnode_to_path()
It's not just $ORIGIN that can make use of it. Imagine for a moment getting a backtrace automatically on a segfault. It's a lot easier and more reliable if you get access to the debug sections. Those are normally not mapped though, so you need access to the path. Actually, you don't. You need read access to the executable; whether you get that access via a path or not is irrelevant. /proc/curproc/file can address this; so could some kind of get_RO_fd_on_my_executable(). /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: vnode_to_path()
I have a question regarding the vnode_to_path() function [...] The problem is that it works if and only if [...]. That's the immediate pragmatic problem. More serious, I think, is that it exhibits a much more fundamental confusion: it is confusing objects with names for objects. The correct way to handle this is to call getcwd there instead, but there's so far no agreement to accept the possible extra overhead on every exec call. Also, there *are* race conditions and it's not at all clear what the consequences might be. The current directory may not have any name, and if it does, it may not be determinable by the user doing the exec. $ORIGIN is a poorly conceived interface, unfortunately. Not as if _that_'s anything new. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: raidframe questions
[...raid1...2 x 500G...each disk fails, replaced with larger...] I want to know if I can recover my lost 100Go... I read that changing raid size is not possible. That's only semi-true. It's not possible in that there's no clean well-defined interface for it, perhaps. But of course it's possible. With RAID 1, you've got a fairly easy case. Here's how I'd do it:

 - Unconfigure the RAID.
 - Re-disklabel both disks, enlarging the relevant partitions.
 - Patch one component label, increasing the size.
 - Configure the RAID with only the patched component, with the other one missing.
 - Hot-add the other component.
 - Let it resync the hot-added component.

When the resync finishes, you should be back in operation with a larger RAID. (As with any resync to a hot-added component, you may then want to unconfigure and reconfigure to get the second drive changed from used spare to ordinary member.) Of course, you then have to figure out what to do with the extra space. The RAID pseudo-disk's disklabel is your friend here; you could give it a separate filesystem, or, if the filesystem supports it, grow the existing partition and then grow the filesystem. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: NetBSD-based file servers
So I need some data for that upcoming discussion. Who is using NetBSD to ope$ Please don't use paragraph-length lines. So I need some data for that upcoming discussion. Who is using NetBSD to operate a file server on a scale comparable to or larger than ours, i.e. ~200 users, ~1TB storage? If so, which version on what kind of hardware? I'm not sure what counts as a file server, but at one of my jobs our main backup host has a dozen 1T (actually about 931G) disks in two RAIDs, one providing a little under 1T of space and the other providing about 7¼T of space. NetBSD 4.0.1 with a few tweaks (stock 4.0.1 had some 32-bit issues that started breaking things in the TB range) on peecee-architecture hardware. Rackmount server, but it's NetBSD/i386. Fairly old hardware, too; when it was first set up, 2.0 had just been released. User count...that depends on how you measure users. The machine has very few logins, because end users don't log in to it directly. It pulls backups from customer machines. But, while I haven't counted (and in some cases am not in a position to count), I'd be surprised if there were fewer than 200 end users whose data this machine handles. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
sin_zero, redux
Back about two weeks ago I wrote about sin_zero and its relevance to the radix tree used by AF_INET's routing table. Sunday, I finally got together the round tuits to try eliminating sin_zero altogether, this reinforced by remarks during the previous thread that not everyone has sin_zero and thus code with suitable portability aspirations won't, or at least shouldn't, use it explicitly. Of course, I don't know whether anyone will care. Consider this a report back to the community on an experiment, if you will. :) Working with 4.0.1, because that's what I have easy to build at the moment, I removed sin_zero from the struct definition in sys/netinet/in.h. Then it took fixing only two other files to get the kernel to build, sys/nfs/nfs_export.c and sys/netinet/raw_ip.c. Then I did a sweep for files which textually contained sin_zero and fixed them; this meant gnu/dist/gdb/sim/arm/main.c, gnu/dist/gdb6/sim/arm/main.c, share/man/man4/inet.4, usr.bin/talk/ctl.c, and usr.sbin/nfsd/nfsd.c, and ignoring a few other occurrences (eg, in RFCs quoted in files under dist/). All of these fixes were fairly obvious upon looking at the references. Then upon attempting a build of the world, I found I had to fix sbin/routed/output.c and sbin/routed/table.c, which contained initializers which assume the presence of a compound field after sin_addr. Then the world built. AF_INET doesn't work, and I think I know why. This also explains why it didn't work when I just reconfigured the routing table in inetdomain. ARP processing uses sockaddr_inarp (netinet/if_inarp.h) for its routes. This struct is just like standard sockaddr_in except that, in place of sin_zero, it has a second struct in_addr and two 16-bit values, and it actually cares what's there, so for inetdomain's routing table to ignore that data breaks it. 
Unwarranted chumminess with the implementation at its finest; at the very least this deserves big comments on inetdomain where its routing table is configured, and on struct sockaddr_in explaining why sin_zero has to be there. AF_INET networking seems to work on-subnet anyway, and I'm not sure why, since that uses ARPs too - perhaps it was total coincidence. I am relieved to finally (think I) understand why sin_zero is as necessary as it appears to be. I still think requiring sin_zero to be zero for most interfaces (bind(), routing socket messages, etc) to work is a bug, or at best a misfeature; I think it should be zeroed upon reception by the kernel from userland rather than letting userland leave trash there. But I feel much better understanding why things broke so mysteriously when I shrank the routing table. I'm now going back to a source tree with sin_zero and will be adding prominent comments to it explaining why sin_zero is necessary. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Adding SMBus block transfers to iic(4)
[...] i2c block-mode transfers [...] #define I2C_F_BLOCK 0x20 Comments? Suggestions? Alternatives? BLOCK has a second, quite different, meaning (as in, blocking I/O). It may not apply here, but defining a bit in the interface that can be misunderstood as indicating it does could, at the very least, be confusing. Might I suggest BLKMODE instead of BLOCK? At least to my eye, that's a lot less ambiguous. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: GSOC 2012 project clarification
The deletion is generally not at the time of unlink. It happens when the file isn't referenced by anything anymore. Yeah, but in most cases isn't that at unlink time? File destruction _can_ be delayed well beyond the "no names refer to it" point, but at least in my experience that is very much more the exception than the rule. Mouse
Re: add disk size to struct disk?
Design question: do you expect the checks to be performed in userland, so anyone can be free to have overlaps/overflows, or let the kernel do the checks and return errors using the size obtained through disk(9)? Speaking as someone who occasionally causes overlaps and such deliberately: I don't care, as long as, wherever it is, it's easy to disable the check, or at least downgrade the error to a warning. To put it another way, I think this is a good time to apply the principle "Unix does not prevent you from doing stupid things, because that would also prevent you from doing clever things." /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: How to get a struct mount
I have a filesystem mounted on /foo, how do I retrieve its struct mount? Using namei_simple_user(), I can get a vnode for /foo but its v_mount is the one for the root filesystem. Looking up /foo/ produces an error. (bad address). Isn't that what v_mountedhere is for? Or has that gone away in the NetBSD version you're using? (I don't see any indication what version you're doing this under.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ENOATTR vs ENODATA
There is a choice to be made about returning ENOATTR or ENODATA [...] In order to get the broader compatibility, I suggest patching our errno.h to define ENOATTR as ENODATA. Opinions? As a code author, I don't like this. A similar situation already exists with EAGAIN and EWOULDBLOCK: some systems define only one, some only the other, some both with different values, and some both with the same value (often one in terms of the other). The last of these is rather annoying, because it means that a simple

	#ifdef EAGAIN
	 case EAGAIN:
	#endif
	#ifdef EWOULDBLOCK
	 case EWOULDBLOCK:
	#endif

produces a compile-time error. So, my opinion would be to prefer one of the other alternatives. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Thinking about branes for netbsd...
After spending some time thinking about what would be required to implement branes as part of the SMP networking project, [...] What's a brane in this context? The only meaning I'm familiar with for the term is from particle physics and makes no sense here. I did a little searching, and, while my Web-fu is admittedly weak, I didn't find anything the least bit helpful. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
watching dynamic device attachment?
I have an application where I want to watch USB devices come and go. I've written code based on usb(4), and it works - but the devices in question are disks, which show up as, for example,

	umass0: using SCSI over Bulk-Only
	scsibus1 at umass0: 2 targets, 1 lun per target
	sd0 at scsibus1 target 0 lun 0: Generic, External, 2.10 disk fixed

So, I'm wondering if there's some way to watch devices come and go beyond what /dev/usb gives me (which stops with the umass attachment). Even some way to inspect the current device tree would help; I can watch umass attach and then query the tree to see what's underneath it. I have a fuzzy memory of seeing something that looked like device-tree data, but the memory's too fuzzy to be of much use here. kern.drivers appears to be part of what I'd want, but only part; I'd much rather not have to scrape dmesg output. :/ The machine in question is currently at 4.0.1. If this is possible with a more recent version but not with 4.0.1, I might be able to talk its admins into switching, but I suspect they'd rather not; it _is_ a production machine. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: watching dynamic device attachment?
Even some way to inspect the current device tree would help; [...] Device properties: drvctl -p That's close. I don't have an easy way to test it in the particular case at issue (umass -> scsibus -> sd), since the machine in question is running a kernel based on GENERIC, so drvctl means a new kernel, and, as a production machine, I can't casually reboot it now. But I do have another 4.0.1 machine I can play with; I built a new kernel for it with drvctl(4) in it and poked around with drvctl(8). It doesn't look suitable. For example, my test machine's disk attaches via

	pci0 at mainbus0 bus 0: configuration mode 1
	...
	piixide0 at pci0 dev 31 function 2
	...
	atabus0 at piixide0 channel 0
	...
	wd0 at atabus0 drive 0: ST9500420AS

but drvctl -p on pci0, piixide0, and atabus0 print, essentially, nothing - they print output, but it contains an empty dict. Nothing that would let me walk the device tree down from the umass attachment to the relevant sd. Or is there an option I'm missing? Neither the manpage nor the source leads me to think so. I may be able to extend it a little, to include parent and/or child data in the result, but if there's something already present that will let me get the info I want I'd prefer that. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: watching dynamic device attachment?
drvctl -tl will give you a recursive list of the device tree Maybe you'd expect it to, but what it actually gives me is

    drvctl: unknown option -- t
    Usage: drvctl -r [-a attribute] busdevice [locator ...]
           drvctl -d device
           drvctl -p device

So I assume you're talking about something newer than 4.0.1, which comes back to what I said about switching versions. I suspect my tweaking 4.0.1 will be an easier sell to them than switching versions, especially to a version which isn't even released yet (I have access to a 5.1 machine, and drvctl still says t is an unknown option there, though the list of options it shows is longer, so presumably what you describe won't work before 6.x).
Re: choosing the file system block size
at least if you ignore the space used by the inode. I guess I can indeed ignore inodes because the space occupied by them doesn't vary with fsbsize, or does it? I think you're right that an inode's size is independent of the filesystem's blocksize; possibly also relevant is that inode space is not available for ordinary data storage even if the inodes in question are not being used. (Whether this means you can ignore them depends on exactly what you care about; you know that better than I.) For files large enough to need indirect blocks, (a) the size is rounded up to the block size, not the frag size, Oops, I didn't know that. Also, it has to occupy only whole blocks. (This can lead to an out-of-space error while there is still space on the disk, if all the available space is in sub-block fragments but the file being extended is large enough that it uses only whole blocks. I have a fuzzy memory that there's code in the allocation routines that tries to allocate fragments out of sub-block pieces rather than splitting whole blocks, in an attempt to reduce this effect.) and (b) you also need to account for indirect blocks. Ah, yes. Although that's probably negligible Again, it depends on your purpose. It is a fairly small fraction, though; even at its most egregious - a 512/512 filesystem - single indirect block overhead adds 1/128 (128 = 512/4), double indirect adds 1/16384 above the single indirect overhead, and triple another 1/2097152 above that. More typical would be, say, a 1k/8k filesystem, for which the fractions are single 1/2048 (2048 = 8192/4), double 1/4194304, and triple 1/8589934592. (This is for FFSv1. I think FFSv2 has 64-bit block addresses, in which case those fractions increase - 1/64, 1/4096, and 1/262144 for 512; 1/1024, 1/1048576, and 1/1073741824 for 8k.)
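The fractions quoted above all follow from one rule: an indirect block holds (block size / address size) block addresses, and each further level of indirection divides by that count again. A minimal sketch of the arithmetic follows; the function name is made up for illustration, and the 8-byte address size for FFSv2 is the same assumption the post makes.

```python
from fractions import Fraction

def indirect_overhead(block_size, pointer_size):
    """Return the (single, double, triple) indirect-block overhead fractions.

    An indirect block holds block_size // pointer_size block addresses, so
    each level of indirection costs one extra block per that many blocks of
    the level below it.
    """
    ptrs = block_size // pointer_size
    return tuple(Fraction(1, ptrs ** level) for level in (1, 2, 3))

# 4-byte addresses (FFSv1) and, by assumption, 8-byte addresses (FFSv2)
for bsize in (512, 8192):
    for psize in (4, 8):
        single, double, triple = indirect_overhead(bsize, psize)
        print(f"bsize={bsize} addr={psize}: {single}, {double}, {triple}")
```

This only reproduces the per-level fractions; it does not account for the direct blocks held in the inode itself, which make the real overhead for small files smaller still.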
Re: choosing the file system block size
What I'm trying to do is to figure out the optimal block/fragment size for a filesystem. My idea is to take the given data set and, for various block/fragment sizes, compute the overhead caused by that choice. This is reasonable, if your only metric for optimal is amount of overhead space required. (Which may be true in your case, but it seems to me to be worth mentioning anyway.) Since I'm not interested in the real amount of space required, I can ignore super blocks, cylinder group heads and inodes, since the space required by them doesn't vary with the choice of block size. Yes and no. There are aspects of cylinder groups which do change with block size, though I haven't thought about it enough to figure out whether the proportion of space dedicated to overhead changes (ie, whether there are multiple effects which cancel out). I do know that when I newfs with different block sizes, I often get different numbers of CGs and thus different proportions of space dedicated to CG heads. Now, what else do I need? [...freelist...cluster map...] But what do I need for the summary information? What else have I forgotten? I'm not sure. I'd suggest reading over the source to newfs and/or fsck; they know a good deal about that stuff, and are much smaller and more comprehensible than the filesystem kernel code.
Re: choosing the file system block size
I'd suggest reading over the source to newfs and/or fsck; they know a good deal about that stuff, and are much smaller and more comprehensible than the filesystem kernel code. Basically, I gave up on that after realising that both number of blocks and number of data blocks were actually in units of -- fragments! There seems to be a lot of stuff in that which is probably perfectly clear for those actually dealing with FS code, but close to incomprehensible for a newcomer in that area. Heh. The FFS code is full of delightful little surprises like that. In fsresize.c, the source to my program which became resize_ffs, there are a number of minor rants about other filesystem programs, such as fsck and newfs/mkfs. Most/all of them are still present in resize_ffs source as of 4.0.1; I haven't bothered checking anything more recent. Is there any good book on the subject? I don't know of any such book, but I've never looked; my own knowledge of such things comes from experimentation and code reading.
Re: accessing another process' resource limits
Is there an interface for reading (or even writing) another process' ulimits? Yes. Command-line: sysctl proc.$PID.rlimit.$RESOURCE.{soft,hard} (use -w to change them, of course). API: sysctl(3) with the analogous MIB.
Re: raw/block device disc throughput
It seems that I have to update my understanding of raw and block devices for discs. [...performance oddities...] Mostly I have nothing useful to say here. But... 2. I would have expected increasing the block size above MAXPHYS not improving the performance. There is at least one aspect of performance that will not be cut off by MAXPHYS, that being syscall overhead. I don't know your system (you don't say which port you're running on, for example), but if syscall overhead for your hardware is not ignorably small compared to the costs of doing the disk transfer, then doing one syscall per 256K will be four times as costly in syscall overhead as doing one syscall per 1M, even if it is no more costly in disk transfers.
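To put numbers on the syscall-overhead point, here is a small sketch. The per-call cost is a purely hypothetical placeholder (measure it on the hardware in question); the only thing the code really shows is that the number of calls, and hence the fixed per-call cost, scales inversely with the transfer size even when the bytes moved are the same.

```python
def syscall_count(total_bytes, bytes_per_call):
    """Number of write(2)-style calls needed to move total_bytes."""
    return -(-total_bytes // bytes_per_call)  # ceiling division

def syscall_overhead(total_bytes, bytes_per_call, cost_per_call):
    """Total fixed per-call cost, in whatever unit cost_per_call uses."""
    return syscall_count(total_bytes, bytes_per_call) * cost_per_call

GIB = 1 << 30
HYPOTHETICAL_COST_US = 5  # assumed microseconds per call, not a measurement

for size in (256 * 1024, 1024 * 1024):
    calls = syscall_count(GIB, size)
    overhead = syscall_overhead(GIB, size, HYPOTHETICAL_COST_US)
    print(f"bs={size}: {calls} calls, {overhead} us of per-call overhead")
```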
Re: raw/block device disc throughput
dd if=/dev/zero od=/dev/[r]sd0b bs=nn, count=xxx. (I've been assuming od= should be of=) The block device will cause readahead at the OS layer. I thought of that too, but didn't mention it, because it's not relevant. dd isn't reading from the disk; it's writing to it. I suspect that if you double-buffered at the client application layer this effect might disappear, I suspect this is a significant effect. I was once using dd to copy from one disk to another, and both drives happened to have activity lights on them. Watching each drive wait for the other convinced me dd is an inefficient way to do that. I built a program that uses two processes, one reading and one writing, with a large chunk of memory shared between them for buffer space. Disk-to-disk copies (not on the same spindle) got significantly faster. :) In this case, dd has to block after each disk write to wait for its buffer to be (unnecessarily, as it happens, though it can't know that) zeroed for the next write. This both imposes additional delay and enforces a lack of overlap between each write and the next. I speculate that the cooked device helps because it means that dd's write finishes when the bits are in the buffer cache, rather than waiting for them to hit disk. Flushing from the buffer cache to the disk then (a) gets overlapped with zeroing memory for the next cycle and (b) allows writes to adjacent disk blocks to be collapsed as they get pushed from the buffer cache to the drive. Unless the host is unusually slow, writing to the disk will be the limiting factor here, meaning the buffer cache will have a large number of writes pending, so coalescing writes is plausible, even likely.
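The overlapped-copy idea can be sketched in a few lines. This is not the two-process, shared-memory program described above; it swaps in a reader thread and a bounded queue, which gives the same overlap of reads and writes with less machinery. The function name, chunk size, and queue depth are all illustrative choices.

```python
import io
import queue
import threading

def overlapped_copy(src, dst, chunk_size=1 << 20, depth=4):
    """Copy src to dst, reading the next chunks while earlier ones are written.

    A reader thread fills a bounded queue; the calling thread drains it.
    Returns the number of bytes copied.
    """
    chunks = queue.Queue(maxsize=depth)

    def reader():
        try:
            while True:
                data = src.read(chunk_size)
                chunks.put(data)      # an empty bytes object marks end of input
                if not data:
                    return
        except Exception as exc:      # hand read errors to the writer side
            chunks.put(exc)

    # daemon=True so a failed write does not leave the process hanging on exit
    t = threading.Thread(target=reader, daemon=True)
    t.start()
    total = 0
    while True:
        item = chunks.get()
        if isinstance(item, Exception):
            raise item
        if not item:
            break
        dst.write(item)
        total += len(item)
    t.join()
    return total

if __name__ == "__main__":
    source = io.BytesIO(b"\0" * (3 << 20))
    sink = io.BytesIO()
    print(overlapped_copy(source, sink, chunk_size=1 << 16))
```

With real disks, src and dst would be files opened in binary mode on the two devices; the queue depth bounds how far the reader can run ahead of the writer.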
Re: Should kqueue descriptors work outside of the creating process?
Recently we found out (PR kern/46463) that kqueue() file descriptors, which originally were designed to be local process only objects, could be passed with SCM_RIGHTS messages to other processes. [...] I propose to not allow sending kqueue file descriptors [...] Or are there any legit uses for foreign kqueue()s? It seems to me, for what it may be worth, that this is asking the wrong question. Rather, I would ask whether there are illegitimate uses for `foreign' kqueue descriptors, and, if not, fix them to be passable like any other descriptors. It's certainly possible there are such uses we want to forbid. I don't know kqueue well enough to address that point myself. But your post doesn't give any particular reason to think there are. I don't see any, the alien process could just create its own kqueue() and add the same events instead of passing the file descriptor over. The same argument could be applied to descriptors on /dev/null, too, but we don't forbid passing them. That's a somewhat silly analogy, but I think at its core it's basically my argument: we shouldn't forbid things by default, and `there are other ways to accomplish the same effects' isn't reason enough to prohibit something.
Re: per-mount maxvnodes
Therefore comes the idea to have a per-mount maxvnodes. I tried implementing it, the biggest problem is how to set the value. sysctl kern./usr/local.maxvnodes? It's a little ambiguous, in that it's possible - or at least it was last time I tried it - to have multiple mounts with the same mounted-on string. But that's definitely an unusual case, and I see nothing wrong with accessing the topmost mount in that case; that's what normal filesystem accesses will do, after all.
Re: selectively disabling atime updates?
I can think of two ways to achieve this (each of which may be absurd given better knowledge of fs internals than I have): Either a per-process switch disabling atime updates or a way to obtain a read-only clone of a block device which can be mounted ro,noatime. The latter will not work, at least not for FFS and probably not for any filesystem whose implementation was not specifically designed to support it. The problem is that the `read-only' device is changing behind the filesystem's back. Unless you mean a read-only *snapshot* of a block device, in which case you're basically back at snapshots, only at the block device level instead of the filesystem level. (Actually, looking at the existing snapshot support, it's not clear to me that's not exactly what it already is.)
Re: RAID 1 with 3 disks
Is RAIDframe smarter when it has 3 disks in a RAID 1? Does RF even support 3-disk RAID 1? It didn't last time I looked, but that was long enough ago it could very well have changed since then.
Re: software interrupts scheduling oddities
Maybe this has happened to you: you tune your NetBSD router for fastest packet-forwarding speed. Presented with a peak packet load, [...] the user interface doesn't get any CPU cycles. [...] [I]f there is any software interrupt pending, then it will run before any user process gets a timeslice. So if the softint rate is really high, then userland will scarcely ever run. Or that is my current understanding. Is it incorrect? No, I think. At least, that's how I'd expect it to work, and I've occasionally seen behaviour close enough to that to make me think it's reasonably accurate. I find your discovery about changing a user process's priority making a difference surprising.
Re: Quota on tmpfs
I don't care about low-level storage and only manage visible file sizes. Sparse files are [...]. Counting them as if they [weren't sparse] [will] [...] or render your new quota system unuseful to a large number of users. How large a number? I have very little basis for more than wild guessing here; I rarely use sparse files, and even more rarely use files sparse enough to make a significant difference from a quota point of view. Furthermore, those few uses are generally administrative, the kind of thing that either is on a non-quota filesystem or is owned by a user who can be exempted from quotas without harm (eg, root). Do you have experience or studies indicating that this is another respect in which I am an outlier?
Re: Quota on tmpfs
I would also guess that sparse files are very rarely used. I suspect process 'core' files are written sparse. I just tried one, and it did not appear so. But that was just one test, and quite possibly one of the probably numerous differences between your test and mine is relevant. (On 1.4T/sparc and 4.0.1/i386, I ran sleep 60 and typed my quit character to get a core dump. In each case, dd conv=notrunc if=sleep.core of=sleep.core did not change the number reported by ls -s.) I had to uncompress one yesterday and it would have been a lot smaller if written as a sparse file. It may have had large runs of 0x00s, but could that have been because the process's VM contained them? That is, was it actually sparse when written, or was it just a file which happened to contain data such that some disk space could be saved by making it sparse? You say you uncompressed it, and most compression programs do not distinguish between a sparse file and a file with long runs of 0x00s, so that's not evidence for whether it was dumped sparse.
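A different check from the dd/ls -s test above, and one that looks at the file as it currently sits on disk, is to compare the allocated size with the logical length. A minimal sketch, assuming st_blocks is counted in 512-byte units (true on the BSDs and Linux, but an assumption nonetheless) and using made-up function names:

```python
import os

def looks_sparse(size_bytes, allocated_blocks, block_unit=512):
    """True if less space is allocated on disk than the file's length."""
    return allocated_blocks * block_unit < size_bytes

def file_looks_sparse(path):
    """Apply looks_sparse to a real file via stat(2)."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)

if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, file_looks_sparse(name))
```

This says nothing about whether a file *could* be made sparse (it may be full of explicitly written zeros), and indirect blocks or fragment rounding can push the allocated size slightly above the length, so a False result is the expected one for an ordinary file.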
Re: Syscall kill(2) called for a zombie process should return 0
    +	if (p != NULL && P_ZOMBIE(p)) {
    +		mutex_exit(proc_lock);
    +		return 0;
    +	}
    	mutex_exit(proc_lock);
    	return ESRCH;

This is a general question, not necessarily specific to the patch. Which is more costly? Two function calls as above, or storing the return value in a variable to return with just one function call to mutex_exit? It depends. A good optimizer could turn either one into the other, so it may make no real difference. If optimization is disabled or limited, the version quoted above will probably be marginally larger and, assuming larger code doesn't mean more cache line fills, marginally faster. Which is `more costly' depends on what costs you care about and to what extent.
Re: disklabel problems on 3TB disc
hello. You can put a wedge on the disk or put the raid on the raw disk itself. Can you RAID the raw disk? I thought you had to use partitions of type RAID for that, which RAW_PART isn't. Am I just confused?
Re: RAID on raw partitions (was: disklabel problems on 3TB disc)
You could always have raid sets on raw partitions. I thought I just learned from Greg Oster on May 11 (in 2012053355.38f88...@mickey.usask.ca) that I couldn't have raw partitions as RAIDframe components. There are two meanings of `raw' as applied to disk partitions. There's `raw' as in the message you mention, which is (eg) /dev/rsd0a instead of /dev/sd0a. This is `raw' in that I/O goes more directly to the disk. In this sense, you cannot use raw partitions as RAIDframe members. And there's `raw' as in RAW_PART, which is (eg) /dev/sd0d instead of some other /dev/sd0? (on x86; on most other ports, /dev/sd0c). This is `raw' in that it bypasses partitioning, allowing access to the whole disk regardless of partitioning. In this sense, you can use raw partitions as RAIDframe members, provided you don't autoconfigure, or provided you apply the patch that appeared upthread (or a suitable porting of it if you're not using the version the patch is for).
Re: full-disc partition (was: RAID on raw partitions)
This is `raw' in that it bypasses partitioning allowing access to the whole disk regardless of partitioning. It looks like I remember incorrectly what ws@ taught me back in the day about how the ``full disc'' partition works. What I remember is that (let's assume sparc) partition ``c'' was actually present in the disklabel and it was just by convention that one would allocate that 0-$. The only magic I thought there was is that in case the kernel can't find a disklabel, it would invent one having a single ``c'' partition spanning the whole disc. Perhaps that's how it was back in the day. But now, looking at, for example, sd.c, I see code that actually does bypass partitioning when using RAW_PART. For example,

    	if (SDPART(bp->b_dev) == RAW_PART) {
    		if (bounds_check_with_mediasize(bp, DEV_BSIZE,
    		    sd->params.disksize512) <= 0)
    			goto done;
    	} else {
    		if (bounds_check_with_label(&sd->sc_dk, bp,
    		    (sd->flags & (SDF_WLABEL|SDF_LABELLING)) != 0) <= 0)
    			goto done;
    	}

(Using RAW_PART also affects various other things; for example, using RAW_PART prevents the driver from loading the disklabel off the drive the way it does for other partitions.) But then, due to the 32-bit-limitation of the disklabel structure, it couldn't make a ``c'' partition spanning 3TB. Is my memory wrong? Has that been changed? Depends on the version and the disk driver in question. The above quote is from sd.c,v 1.258.2.1, the one that shipped with 4.0.1. Of course, for the full truth, you should check the source to the version you're using. Since this is done in the disk drivers, it will also depend on what driver you're using (sd? wd? ld? etc).
Re: disklabel problems on 3TB disc
or put the raid on the raw disk itself. That doesn't work. It truncates the component capacity to the truncated value in the disklabel. That could easily just be a code bug. Back some years ago, I had occasion to (for work) set up a RAID of something like six or seven TB. The individual drives were well under the 2T limit, but even so I had some 32-bit bugs to fix. It's possible there's another one in the code path that passes the raw disk size to the raidframe code. Of course, it might not be, either; I haven't looked. I mention it mostly to say don't just give up on it...well, unless for your purposes custom bugfixes aren't acceptable, or finding them isn't going to happen, or some such.
Re: disklabel problems on 3TB disc
Back some years ago, I had occasion to (for work) set up a RAID of something like six or seven TB. The individual drives were well under the 2T limit, but even so I had some 32-bit bugs to fix. support for >2TB raidframe was not implemented until this commit:

    date: 2010/11/01 02:35:25; author: mrg; state: Exp; lines: +19 -21
    add support for >2TB raid devices.

which was later pulled up to netbsd-5. before this, you could not create or use rf devices larger than 2TB, regardless of the size of the individual components.

Perhaps not, but it was pretty close. It took only a few fixes. I did the first version of this under 2.x, but I no longer have that tree (or, if I do, I don't know where). Looking at my 4.0.1 tree, I see only these changesets touching sys/dev/raidframe:

* aec5aaa Bugfix: when clearing diskwatch on a raid `disk', use the correct value for `none set'.
* 096b537 Add diskwatch support to raidframe pseudo-disks.
* 5ba6878 Autoconfiguration rework in raidframe.
* 4d7b7d3 A few 32-bit fixes in raidframe.
* 6174918 Handle shutdown during reconstruction/rebuild better.
* 20a3038 Import newer raidframe code.

The first two are not relevant to this discussion; they're entirely diskwatch-related. Taking the other four in chronological order,

* 20a3038 Import newer raidframe code.
  This moved five files (rf_reconmap.c, rf_reconmap.h, rf_reconstruct.c, rf_reconstruct.h, and rf_revent.c) to 2008-05-25 versions (I can give CVS version numbers, but don't see much point). Looking at the diffs, none of them look directly related to 32-bit issues.
* 6174918 Handle shutdown during reconstruction/rebuild better.
  This is not 32-bit related.
* 4d7b7d3 A few 32-bit fixes in raidframe.
  This is most of the 32-bit stuff. There are only four hunks in the diff, all applying to rf_netbsdkintf.c. Three of them are writing 100ULL instead of 100 when computing percentages; the fourth is

    -	if (lp->d_secperunit != rs->sc_size)
    +	if ((long)lp->d_secperunit != (long)rs->sc_size)

* 5ba6878 Autoconfiguration rework in raidframe.
  A quick skim here makes me think none of these are related to 32-bit issues either.

The resulting 4.0.1 tree does indeed work for a RAID 5 that's about 7T or so. Of course, the filesystem is on the d partition of the resulting RAID, since I didn't do anything to deal with the disklabel on >2T disk issue. (Actually, it's a RAID 15 - a RAID 5 whose members are RAID 1s.) I can supply full diffs if desired, or anyone with git installed is welcome to clone the repo and look at the aforementioned changesets. (For those interested, it's at git://git.rodents-montreal.org/Mouse/netbsd-fork/4.0.1/src.) It's possible there were other fixes required elsewhere in the tree, but I don't think so.
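Why writing 100ULL instead of 100 matters for the percentage computations can be shown with a small emulation. This is only a sketch: it assumes the counts involved were 32-bit unsigned values, which the post does not state, and it mimics C's wraparound by masking the product to 32 bits. The function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF

def percent_32bit(done, total):
    """Percentage as computed when 100 * done wraps around at 32 bits."""
    return ((100 * done) & MASK32) // total

def percent_64bit(done, total):
    """Percentage when the product is widened first, as 100ULL does in C."""
    return (100 * done) // total

# Small counts agree; large ones do not, because 100 * done no longer fits.
for done, total in ((10, 40), (50_000_000, 100_000_000)):
    print(done, total, percent_32bit(done, total), percent_64bit(done, total))
```

The breakage starts once done exceeds about 42.9 million (2^32 / 100), which a multi-terabyte array counted in sectors or stripes passes almost immediately.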
Re: pinning down dk? assignment
Let wd1 disappear and the raid will try to use wd0a (dk0) and sd0a (dk1). This is one reason to use autoconfigured RAIDs when you can. They are far more immune (completely immune, in my experience) to confusion from disks attaching in new orders or at new places.