Re: Understanding PR kern/43997 (kernel timing problems / qemu)
k...@munnari.oz.au (Robert Elz) writes:

> | This is not to be confused with the kernel idea of wall-clock time
> | (i.e. what date reports). wall-clock time is usually maintained
> | by hardware separated from the interrupt timers. The 'date; sleep 5; date'
> | sequence therefore can show that 10 seconds passed.

>But that is totally broken.

Broken is the HZ=100 configuration that doesn't match the broken
(emulated) hardware. Given a sane time reference you could adjust HZ
automatically. Without one you could make it a boot time parameter.

For the Anita test harness you could probably just boot into ddb and
set the hz variable.

--
--
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Date:        Sun, 30 Jul 2017 16:04:38 - (UTC)
From:        mlel...@serpens.de (Michael van Elst)
Message-ID:

  | There are slower emulated systems that don't have these issues. (*)

Yes, that it is not qemu's execution speed was (really, always) becoming
obvious.

  | If the host misses interrupts, time in the guest just passes slower
  | than real-time. But inside the guest it is consistent.

If we could achieve that (which changing the timecounter in qemu
apparently achieves) it would at least make the world become rational.

Of course, keeping the timing running faster would be better - if we
were able to get to a state where the client/guest were actually able
to talk to the outside world (that part is easy) and run NTP, and act
as a time server that others could trust, that would be ideal.

  | This is not to be confused with the kernel idea of wall-clock time
  | (i.e. what date reports). wall-clock time is usually maintained
  | by hardware separated from the interrupt timers. The 'date; sleep 5; date'
  | sequence therefore can show that 10 seconds passed.

But that is totally broken.  While there is no guarantee that a sleep
will wake up after exactly the time requested, it should be as close as
is reasonably possible - and on an unloaded system, where there is
sufficient RAM, and nothing swapped out, and nothing competing for cpu
cycles, that sequence should (always) show something between 5 and 5
and a bit seconds have passed.

If the cpu is busy, or things are getting swapped/paged out, then we
can expect slower (not only for processes waiting upon timer signals,
but for everything), and that's acceptable.  But otherwise,
inconsistent timing is not acceptable.  All kinds of applications
(including network protocols) require time to be kept in a way that is
at least close to what others observe, even if not identical.
One easy (poor) fix is simply to do as used to be done, and have kernel
wall clock time maintained by the tick interrupt - that makes things
consistent, but without any real expectation of accuracy.

The alternative is to make the tick counts depend upon the external
wall clock time source, so they keep in sync - much the same as the
power companies do with frequency: over any short period, the nominal
50/60 Hz frequency can drift around a lot, but when measured over any
reasonable period, those things are highly accurate (which is why old
AC frequency based tick systems used to have very good long term time
stability, provided they never lost clock interrupts.)

  | The problem with qemu is that it's running on a NetBSD host and
  | therefore cannot issue interrupts based on host time unless the
  | host has a larger HZ value.

In the system of most interest, the host, and the guest, are the exact
same system (the exact same binary kernel) - unless we alter the config
of one of them explicitly to avoid this issue, they cannot help but
have the same HZ value.

As long as the emulated qemu client has access to a reasonably accurate
ToD value (which it obviously does, as the host's time is available to
qemu, and can be, and is it seems, made available to the guest) there's
no reason at all the guest cannot produce the correct number of ticks.

And doing so (since it is just a generic NetBSD) would solve the
similar, but less blatant, issue for any other system using ticks,
where the occasional clock interrupt might get lost, and where there is
some other ToD source available.

  | With host and guest running at HZ=100, it's obvious that interrupts
  | mostly come just too late and require two ticks on the host, thus
  | slowing down guest time by a factor of two.
Yes, that is a very good explanation for the observed behaviour, and I
cannot help but be grateful that simply beginning to discuss this issue
has provided so many insights into what is happening, and what we can
do to fix things.

When there is no alternative to tick interrupts, we can, and do, use
those to measure time, and everything works - just if the ticks are not
received at the expected rate, time keeping drifts away from real time
(but invisibly when considered only within the system.)

When there is some better measure of real time we can use, we can make
sure that keeps all time keeping synchronised better, regardless of
whether the system is "tickless" or still tick based - it isn't
required that every single tick be 1/HZ apart (they never are precisely
anyway) just that over the long term (which in computing is a half
second or so) the correct number of ticks have occurred.

I think it should be possible to make that happen, and that is what I
am going to see if I can do.  Then we can see if we can find a (good
enough) way to make nanosleep() less ticky - whether by giving up on
ticks altogether (which is probably not the best solution - even if we
don't use ticks for timing, we'd end up emulating them for other
things, if only to avoid needing to rewrite too much of the kernel in
one step).
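[Editorial note: the catch-up idea sketched above - deriving the tick
count from an independent ToD/timecounter reading rather than counting
interrupts - can be modelled in a few lines of userland C.  All names
here are invented for illustration; this is not the NetBSD kernel API,
just a sketch of the arithmetic.]

```c
#include <assert.h>
#include <stdint.h>

#define HZ 100	/* nominal tick rate, as in the discussion */

/* Ticks that should have occurred after `elapsed_ns' of reference time. */
static uint64_t
expected_ticks(uint64_t elapsed_ns)
{
	return elapsed_ns * HZ / 1000000000ULL;
}

/*
 * Called from the (possibly late) clock interrupt: how many ticks to
 * process now so the long-term tick count stays correct, even if some
 * interrupts were lost.
 */
static uint64_t
ticks_to_run(uint64_t elapsed_ns, uint64_t ticks_done)
{
	uint64_t want = expected_ticks(elapsed_ns);

	return want > ticks_done ? want - ticks_done : 0;
}
```

Individual ticks may then arrive late or in bursts, but over kre's
"half second or so" the count converges on the reference time.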
Re: kmem_alloc(0, f)
On Sun, Jul 30, 2017 at 03:23:50PM -, Michael van Elst wrote:
> So what does kmem_alloc(0, KM_SLEEP) do? fail where KM_SLEEP says it
> cannot fail? I don't think that it can return a zero sized allocation
> (i.e. ptr != NULL that cannot be dereferenced).

Sure it could, return a pointer inside some red zone unmapped (but
reserved kva) page.  On typical setups and modulo sysctl
vm.user_va0_disable e.g. "return (void*)16;" just as a simple example.

Martin
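[Editorial note: Martin's suggestion - a distinguished non-NULL pointer
for zero-size allocations that must never be dereferenced - can be
modelled in userland.  In the kernel the sentinel would point into a
reserved but unmapped red-zone page; here `ZERO_SIZE_PTR' is an
invented name and malloc stands in for kmem_alloc.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Non-NULL sentinel for zero-size allocations; never dereferenceable. */
#define ZERO_SIZE_PTR ((void *)16)

static void *
alloc_maybe_zero(size_t n)
{
	if (n == 0)
		return ZERO_SIZE_PTR;	/* succeeds, but don't touch it */
	return malloc(n);
}

static void
free_maybe_zero(void *p, size_t n)
{
	if (n == 0) {
		assert(p == ZERO_SIZE_PTR);	/* catches mismatched sizes */
		return;
	}
	free(p);
}
```

This lets kmem_alloc(0, KM_SLEEP) "succeed" without returning NULL,
while any dereference of the result faults immediately.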
Re: kmem API to allocate arrays
On Sun, Jul 30, 2017 at 03:30:59PM -, Michael van Elst wrote:
> Reallocation is usually a reason for memory fragmentation. I would
> rather try to avoid it instead of making it easier.

Agreed.  Also for kernel drivers, resizing an array allocation is a
very rare operation and no good reason to overcomplicate the api.

Martin
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
g...@gson.org (Andreas Gustafsson) writes:

>Frank Kardel wrote:
>> Fixing that requires some more work. But I am surprised that the qemu
>> interrupt rate is seemingly somewhat around 50Hz.

It shouldn't have a problem on Linux.

--
--
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
Re: kmem API to allocate arrays
On 30.07.2017 16:51, Taylor R Campbell wrote:
>> Date: Sun, 30 Jul 2017 16:24:07 +0200
>> From: Kamil Rytarowski
>>
>> I would allow size to be 0, like with the original reallocarr(3). It
>> might be less pretty, but more compatible with the original model and
>> less vulnerable to accidental panics for no good reason.
>
> Hard to imagine a legitimate use case for size = 0.  Almost always,
> the parameter will be sizeof(struct foo), or some kind of blocksize
> which necessarily has to be nonzero.
>
> I started writing some example code, and I'm not too keen on having to
> write kmem_reallocarr for initial allocation and final freeing, so if
> we adopted this, I'd like to have
>
> int	kmem_allocarr(void *ptrp, size_t size, size_t count, km_flag_t flags);
> int	kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt,
> 	    km_flag_t flags);
> void	kmem_freearr(void *ptrp, size_t size, size_t count);
>
> ...at which point it's actually not clear to me that we have much of a
> use for kmem_reallocarr.  Maybe we do -- I haven't surveyed many
> users.
>
> This still doesn't address the question of whether or how we should
> express bounds on the allowed sizes of the arrays.
>

I see, perhaps it's legitimate to avoid realloc due to fragmentation.
Without this, reallocarr has little point.
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
k...@munnari.oz.au (Robert Elz) writes:

>kern/43997 is the "qemu is too slow, clock interrupts get lost, timing
>gets all messed up" problem that plagues many of the ATF tests that kind
>of expect time to be maintained rationally.

There are slower emulated systems that don't have these issues. (*)

>The problem is really (again from the PR)
> The routines sleep(3), usleep(3), and nanosleep(2) wake-up based on the
> occurrence of clock ticks. However, the timer interrupt routine
> determines the actual absolute time.

If the host misses interrupts, time in the guest just passes slower
than real-time. But inside the guest it is consistent.

This is not to be confused with the kernel idea of wall-clock time
(i.e. what date reports). wall-clock time is usually maintained
by hardware separated from the interrupt timers. The 'date; sleep 5; date'
sequence therefore can show that 10 seconds passed.

The problem with qemu is that it's running on a NetBSD host and
therefore cannot issue interrupts based on host time unless the
host has a larger HZ value.

With host and guest running at HZ=100, it's obvious that interrupts
mostly come just too late and require two ticks on the host, thus
slowing down guest time by a factor of two.

(*) This emulator derives timer information from the emulation itself.
I.e. after N emulated cycles, the timer advances accordingly. If the
emulation is too slow, it may even skip timer values to keep pace. Only
when it is too slow to even adjust the timers once per tick do you see
similar issues as with qemu.

dummy# date; sleep 5; date
Sun Jul 30 18:01:23 CEST 2017
Sun Jul 30 18:01:28 CEST 2017
dummy# atf-run t_ldp_regen | atf-report
Tests root: /usr/tests/net/mpls

t_ldp_regen (1/1): 1 test cases
    ldp_regen: [48.646148s] Passed.
[48.646421s]

Summary for 1 test programs:
    1 passed test cases.
    0 failed test cases.
    0 expected failed test cases.
    0 skipped test cases.
--
--
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
Re: kmem API to allocate arrays
campbell+netbsd-tech-k...@mumble.net (Taylor R Campbell) writes:

>Initially I was reluctant to do that because (a) we don't even have a
>kmem_realloc, perhaps for some particular reason, and (b) it requires
>an extra parameter for the old size.  But I don't know any particular
>reason in (a), and perhaps (b) not so bad after all.  Here's a draft:

Reallocation is usually a reason for memory fragmentation. I would
rather try to avoid it instead of making it easier.

--
--
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
Re: kmem_alloc(0, f)
mar...@duskware.de (Martin Husemann) writes:

>On Sat, Jul 29, 2017 at 02:04:42PM +, Taylor R Campbell wrote:
>> This seems like a foot-oriented panic gun, and it's been a source of
>> problems in the past.  Can we change it?

>I think it is a valuable tool to catch driver bugs early during
>development, but wouldn't mind to reduce it to a KASSERT.

So what does kmem_alloc(0, KM_SLEEP) do? fail where KM_SLEEP says it
cannot fail? I don't think that it can return a zero sized allocation
(i.e. ptr != NULL that cannot be dereferenced).

--
--
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
>> # time sleep 10
>>    10.02 real         0.00 user         0.00 sys
>> This actually took 20 seconds of real time (manually timed with a
>> stopwatch).
> [...], but an error of a factor 2 looks suspicious.

This is tickling old memories.  I think I ran into a case where
requesting timer ticks at 100Hz actually got them at 50Hz instead, even
though the kernel was running with 100Hz ticks.  I've done some
searching and completely failed to find either the program exhibiting
the symptom (I _think_ it was userland) or the fix, but it might be
worth looking into the possibility that this is another manifestation
of the same underlying problem, whatever it was.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: kmem API to allocate arrays
> Date: Sun, 30 Jul 2017 16:24:07 +0200
> From: Kamil Rytarowski
>
> I would allow size to be 0, like with the original reallocarr(3). It
> might be less pretty, but more compatible with the original model and
> less vulnerable to accidental panics for no good reason.

Hard to imagine a legitimate use case for size = 0.  Almost always,
the parameter will be sizeof(struct foo), or some kind of blocksize
which necessarily has to be nonzero.

I started writing some example code, and I'm not too keen on having to
write kmem_reallocarr for initial allocation and final freeing, so if
we adopted this, I'd like to have

int	kmem_allocarr(void *ptrp, size_t size, size_t count, km_flag_t flags);
int	kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt,
	    km_flag_t flags);
void	kmem_freearr(void *ptrp, size_t size, size_t count);

...at which point it's actually not clear to me that we have much of a
use for kmem_reallocarr.  Maybe we do -- I haven't surveyed many
users.

This still doesn't address the question of whether or how we should
express bounds on the allowed sizes of the arrays.
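[Editorial note: the overflow check such an array allocator would
centralize can be sketched in userland C.  `allocarray' is an invented
name standing in for whichever kmem API is adopted; malloc stands in
for kmem_alloc.  The `(n|size) >= SQRT_SIZE_MAX' trick skips the
division in the common case: if both operands fit in half the bits of
size_t, the product cannot overflow.]

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* 2^(half the bits of size_t): below this, n*size cannot overflow. */
#define SQRT_SIZE_MAX ((size_t)1 << (sizeof(size_t) * CHAR_BIT / 2))

static void *
allocarray(size_t n, size_t size)
{
	if (n == 0 || size == 0)
		return NULL;	/* reject empty arrays, per the thread */
	if ((n | size) >= SQRT_SIZE_MAX && n > SIZE_MAX / size)
		return NULL;	/* n * size would overflow size_t */
	return malloc(n * size);
}
```

A caller then writes `allocarray(iocmd->ncookies, sizeof(struct
xyz_cookie))' and gets the overflow check for free, instead of
open-coding the SIZE_MAX division at every call site.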
Re: kmem API to allocate arrays
On 30.07.2017 15:45, Taylor R Campbell wrote:
>> Date: Sun, 30 Jul 2017 10:22:11 +0200
>> From: Kamil Rytarowski
>>
>> I think we should go for kmem_reallocarr(). It has been designed for
>> overflows like reallocarray(3) with an option to be capable to resize a
>> table from 1 to N elements and back from N to 0 including freeing.
>
> Initially I was reluctant to do that because (a) we don't even have a
> kmem_realloc, perhaps for some particular reason, and (b) it requires
> an extra parameter for the old size.  But I don't know any particular
> reason in (a), and perhaps (b) not so bad after all.  Here's a draft:
>
> int
> kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt, int flags)
> {
> 	void *optr, *nptr;
>
> 	KASSERT(size != 0);
> 	if (__predict_false((size|ncnt) >= SQRT_SIZE_MAX &&
> 		ncnt > SIZE_MAX/size))
> 		return ENOMEM;
>
> 	memcpy(&optr, ptrp, sizeof(void *));
> 	KASSERT((ocnt == 0) == (optr == NULL));
> 	if (ncnt == 0) {
> 		nptr = NULL;
> 	} else {
> 		nptr = kmem_alloc(size*ncnt, flags);
> 		KASSERT(nptr != NULL || flags == KM_NOSLEEP);
> 		if (nptr == NULL)
> 			return ENOMEM;
> 	}
> 	KASSERT((ncnt == 0) == (nptr == NULL));
> 	if (ocnt != 0 && ncnt != 0)
> 		memcpy(nptr, optr, size*MIN(ocnt, ncnt));
> 	if (ocnt != 0)
> 		kmem_free(optr, size*ocnt);
> 	memcpy(ptrp, &nptr, sizeof(void *));
>
> 	return 0;
> }
>

I would allow size to be 0, like with the original reallocarr(3). It
might be less pretty, but more compatible with the original model and
less vulnerable to accidental panics for no good reason.
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Frank Kardel wrote:
> Fixing that requires some more work. But I am surprised that the qemu
> interrupt rate is seemingly somewhat around 50Hz.
> Could it be a bug in qemu getting the frequency not right? qemu should
> read the clock to get the frequencies right and possibly skip
> usleeps less than 1/HZ, possibly managing an error-budget. I haven't
> looked into qemu at all, but an error of a factor 2 looks suspicious.

I fully agree.
--
Andreas Gustafsson, g...@gson.org
Re: kmem API to allocate arrays
> Date: Sun, 30 Jul 2017 10:22:11 +0200
> From: Kamil Rytarowski
>
> I think we should go for kmem_reallocarr(). It has been designed for
> overflows like reallocarray(3) with an option to be capable to resize a
> table from 1 to N elements and back from N to 0 including freeing.

Initially I was reluctant to do that because (a) we don't even have a
kmem_realloc, perhaps for some particular reason, and (b) it requires
an extra parameter for the old size.  But I don't know any particular
reason in (a), and perhaps (b) not so bad after all.  Here's a draft:

int
kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt, int flags)
{
	void *optr, *nptr;

	KASSERT(size != 0);
	if (__predict_false((size|ncnt) >= SQRT_SIZE_MAX &&
		ncnt > SIZE_MAX/size))
		return ENOMEM;

	memcpy(&optr, ptrp, sizeof(void *));
	KASSERT((ocnt == 0) == (optr == NULL));
	if (ncnt == 0) {
		nptr = NULL;
	} else {
		nptr = kmem_alloc(size*ncnt, flags);
		KASSERT(nptr != NULL || flags == KM_NOSLEEP);
		if (nptr == NULL)
			return ENOMEM;
	}
	KASSERT((ncnt == 0) == (nptr == NULL));
	if (ocnt != 0 && ncnt != 0)
		memcpy(nptr, optr, size*MIN(ocnt, ncnt));
	if (ocnt != 0)
		kmem_free(optr, size*ocnt);
	memcpy(ptrp, &nptr, sizeof(void *));

	return 0;
}
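[Editorial note: the draft's grow/shrink semantics can be exercised
outside the kernel with a userland transliteration - malloc/free stand
in for kmem_alloc/kmem_free, assert for KASSERT, and the overflow check
is elided (see the draft's SQRT_SIZE_MAX test).  `reallocarr_model' is
an invented name for this sketch.]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static int
reallocarr_model(void *ptrp, size_t size, size_t ocnt, size_t ncnt)
{
	void *optr, *nptr;

	/* ptrp points at the caller's array pointer, as in the draft. */
	memcpy(&optr, ptrp, sizeof(void *));
	assert((ocnt == 0) == (optr == NULL));

	if (ncnt == 0) {
		nptr = NULL;		/* shrink-to-zero frees everything */
	} else {
		nptr = malloc(size * ncnt);
		if (nptr == NULL)
			return ENOMEM;
	}
	if (ocnt != 0 && ncnt != 0)	/* preserve the surviving elements */
		memcpy(nptr, optr, size * MIN(ocnt, ncnt));
	if (ocnt != 0)
		free(optr);
	memcpy(ptrp, &nptr, sizeof(void *));
	return 0;
}
```

The model shows the API covering Taylor's three cases in one entry
point: initial allocation (ocnt 0), resize, and final freeing (ncnt 0).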
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Hi Andreas !

On 07/30/17 15:20, Andreas Gustafsson wrote:
> Frank Kardel wrote:
>> Could you check which timecounter is used under qemu?
>>
>> sysctl kern.timecounter.hardware
>
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware = hpet0
>
>> Usually the timecounters are hardware-based and have no relation
>> to the clockinterrupt. In case of qemu you might get a good
>> emulated timecounter, but a suboptimal clockinterrupt.
>> If this is the case it helps to use the clockinterrupt
>> itself as timecounter for the wall clock time to avoid a discrepancy
>> between clockinterrupt-driven timeout handling and wall-clock time tracking.
>>
>> sysctl -w kern.timecounter.hardware=clockinterrupt
>
> # sysctl -w kern.timecounter.hardware=clockinterrupt
> kern.timecounter.hardware: hpet0 -> clockinterrupt
> # time sleep 10
>    10.02 real         0.00 user         0.00 sys
>
> This actually took 20 seconds of real time (manually timed with a
> stopwatch).
>
>> This is the opposite of deducing the missed clock interrupts
>> from the wall clock time, and keeps timeout handling and the
>> wall-time observed in the emulation synchronized no matter how slow
>> the clock-interrupts are - the emulated wall clock time will be
>> at the same rate.
>
> Right, but I would still rather see the bug fixed than worked around
> this way.

Fixing that requires some more work. But I am surprised that the qemu
interrupt rate is seemingly somewhat around 50Hz.
Could it be a bug in qemu getting the frequency not right? qemu should
read the clock to get the frequencies right and possibly skip
usleeps less than 1/HZ, possibly managing an error-budget. I haven't
looked into qemu at all, but an error of a factor 2 looks suspicious.

Frank
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Frank Kardel wrote:
> Could you check which timecounter is used under qemu?
>
> sysctl kern.timecounter.hardware

# sysctl kern.timecounter.hardware
kern.timecounter.hardware = hpet0

> Usually the timecounters are hardware-based and have no relation
> to the clockinterrupt. In case of qemu you might get a good
> emulated timecounter, but a suboptimal clockinterrupt.
> If this is the case it helps to use the clockinterrupt
> itself as timecounter for the wall clock time to avoid a discrepancy
> between clockinterrupt-driven timeout handling and wall-clock time tracking.
>
> sysctl -w kern.timecounter.hardware=clockinterrupt

# sysctl -w kern.timecounter.hardware=clockinterrupt
kern.timecounter.hardware: hpet0 -> clockinterrupt
# time sleep 10
   10.02 real         0.00 user         0.00 sys

This actually took 20 seconds of real time (manually timed with a
stopwatch).

> This is the opposite of deducing the missed clock interrupts
> from the wall clock time, and keeps timeout handling and the
> wall-time observed in the emulation synchronized no matter how slow
> the clock-interrupts are - the emulated wall clock time will be
> at the same rate.

Right, but I would still rather see the bug fixed than worked around
this way.
--
Andreas Gustafsson, g...@gson.org
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Could you check which timecounter is used under qemu?

sysctl kern.timecounter.hardware

Usually the timecounters are hardware-based and have no relation
to the clockinterrupt. In case of qemu you might get a good
emulated timecounter, but a suboptimal clockinterrupt.
If this is the case it helps to use the clockinterrupt
itself as timecounter for the wall clock time to avoid a discrepancy
between clockinterrupt-driven timeout handling and wall-clock time tracking.

sysctl -w kern.timecounter.hardware=clockinterrupt

This is the opposite of deducing the missed clock interrupts
from the wall clock time, and keeps timeout handling and the
wall-time observed in the emulation synchronized no matter how slow
the clock-interrupts are - the emulated wall clock time will be
at the same rate.

This might be a workaround for the current qemu issue and does not
affect any discussion about improving sleep timing or migrating to a
tick-less kernel.

BTW: even a tick-less kernel will need to have a minimum interrupt
frequency in order to avoid undetected timecounter wrapping.

Frank

On 07/30/17 14:22, Robert Elz wrote:

Date:        Sun, 30 Jul 2017 13:01:50 +0300
From:        Andreas Gustafsson
Message-ID:  <22909.44686.188004.117...@guava.gson.org>

  | I don't think the slowness of qemu's emulation is the actual cause of
  | its inability to simulate clock interrupts at 100 Hz.

Yes, I was wondering about that, as if it was, there'd often be no time
left for anything else...

  | If my theory is correct, there are at least three ways the problem
  | could be fixed:
  |
  | - Improve the time resolution of sleeps on the host system,
  | - Make qemu deal better with hosts unable to sleep for short periods

Either, or both, of those should be fixed, and I might get to take a
look at the first one (the insides of qemu are not all that
appealing...) but

  | - Make the guest system deal better with missed timer interrupts.

This one needs to be fixed.  An idle system that says it takes 13
seconds to do a sleep 10 is simply broken.

Fixing the other issues (or either one of them) would make it much
harder to work on this one - that is, keeping the qemu/host
relationship stable allows a platform where the timekeeping issues in
the kernel are known to occur, so a good way to verify any fix, so I
think this should be fixed first.

kre
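[Editorial note: Frank's wrapping caveat has simple arithmetic behind
it - an N-bit timecounter running at f Hz wraps every 2^N / f seconds,
so the kernel must read it at least that often (in practice much more
often) or a whole wrap goes unnoticed.  Illustrative sketch only.]

```c
#include <stdint.h>

/* Seconds until an N-bit counter at freq_hz wraps around. */
static double
wrap_seconds(unsigned bits, double freq_hz)
{
	return (double)((uint64_t)1 << bits) / freq_hz;
}
```

For example, the 24-bit ACPI PM timer at 3.579545 MHz wraps in under
five seconds, which is why a kernel using it as timecounter must take
at least an occasional interrupt even when otherwise idle.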
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Date:        Sun, 30 Jul 2017 13:01:50 +0300
From:        Andreas Gustafsson
Message-ID:  <22909.44686.188004.117...@guava.gson.org>

  | I don't think the slowness of qemu's emulation is the actual cause of
  | its inability to simulate clock interrupts at 100 Hz.

Yes, I was wondering about that, as if it was, there'd often be no time
left for anything else...

  | If my theory is correct, there are at least three ways the problem
  | could be fixed:
  |
  | - Improve the time resolution of sleeps on the host system,
  | - Make qemu deal better with hosts unable to sleep for short periods

Either, or both, of those should be fixed, and I might get to take a
look at the first one (the insides of qemu are not all that
appealing...) but

  | - Make the guest system deal better with missed timer interrupts.

This one needs to be fixed.  An idle system that says it takes 13
seconds to do a sleep 10 is simply broken.

Fixing the other issues (or either one of them) would make it much
harder to work on this one - that is, keeping the qemu/host
relationship stable allows a platform where the timekeeping issues in
the kernel are known to occur, so a good way to verify any fix, so I
think this should be fixed first.

kre
Re: kmem_alloc(0, f)
On Sat, Jul 29, 2017 at 02:04:42PM +, Taylor R Campbell wrote:
> This seems like a foot-oriented panic gun, and it's been a source of
> problems in the past.  Can we change it?

I think it is a valuable tool to catch driver bugs early during
development, but wouldn't mind to reduce it to a KASSERT.

Martin
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Robert Elz wrote:
> I want to leave /bin/sh to percolate for a while, make sure there are
> no issues with it as it is, before starting on the next round of
> cleanups and bug fixes, so I was looking for something else to poke
> my nose into ...
>
> [Aside: the people I added to the cc of this message are those who have
> added text to PR kern/43997 and so I thought might be interested, if you're
> not, just say...]
>
> kern/43997 is the "qemu is too slow, clock interrupts get lost, timing
> gets all messed up" problem that plagues many of the ATF tests that kind
> of expect time to be maintained rationally.

Thank you for looking into this.

> Now there's no question that qemu is slow, for example, on my amd64 Xen
> DomU test system, the shell arithmetic test of ++x (etc) takes:
>     var_preinc: [0.077617s] Passed.
> whereas from the latest completed b5 (qemu) test run (as of this e-mail)
>     var_preinc    Passed    N/A    6.200489s
>
> That's about 80 times slower (and most of the other tests show similar
> factors).  I don't think we can blame qemu for that, given what it is
> doing.
>
> So, it is hardly surprising that, to borrow Paul's words from the PR:
>     On (at least) amd64 architecture, qemu cannot simulate clock
>     interrupts at 100Hz.

I don't think the slowness of qemu's emulation is the actual cause of
its inability to simulate clock interrupts at 100 Hz.  Rather, I think
it is more likely caused by the inability of qemu to sleep for periods
shorter than 10 ms due to limitations of the underlying host OS, such
as that documented in the BUGS section of nanosleep(2).

That this is at least partly a host system issue is supported by the
observation that when qemu is hosted on a Linux system, the timing in
the NetBSD guest is much more accurate than when qemu is hosted on
NetBSD, on similar hardware:

NetBSD-on-qemu-on-NetBSD# time sleep 10
   13.00 real         0.00 user         0.03 sys

NetBSD-on-qemu-on-Linux# time sleep 10
   10.13 real         0.02 user         0.02 sys

If my theory is correct, there are at least three ways the problem
could be fixed:

- Improve the time resolution of sleeps on the host system, as
  recently discussed on tech-kern in a thread starting with
  http://mail-index.netbsd.org/tech-kern/2017/07/02/msg022024.html
- Make qemu deal better with hosts unable to sleep for short periods
  of time, or
- Make the guest system deal better with missed timer interrupts.
--
Andreas Gustafsson, g...@gson.org
Re: kmem API to allocate arrays
On 29.07.2017 16:19, Taylor R Campbell wrote:
> It's stupid that we have to litter drivers with
>
> 	if (SIZE_MAX/sizeof(struct xyz_cookie) < iocmd->ncookies) {
> 		error = EINVAL;
> 		goto out;
> 	}
> 	cookies = kmem_alloc(iocmd->ncookies*sizeof(struct xyz_cookie),
> 	    KM_SLEEP);
> 	...
>
> and as you can tell from some recent commits, it hasn't been done
> everywhere.  It's been a consistent source of problems in the past.
>
> This multiplication overflow check, which is all that most drivers do,
> also doesn't limit the amount of wired kmem that userland can request,
> and there's no way for kmem to say `sorry, I can't satisfy this
> request: it's too large' other than to panic or wedge.
>
> In userland we now have reallocarr(3).  I propose that we add
> something to the kernel, but I'm not immediately sure what it should
> look like because kernel is a little different.  Solaris/illumos
> doesn't seem to have anything we could obviously parrot, from a
> cursory examination.
>
> We could add something like
>
> void	*kmem_allocarray(size_t n, size_t size, int flags);
> void	kmem_freearray(void *ptr, size_t n, size_t size);
>
> That wouldn't address bounding the amount of wired kmem userland can
> request.  Perhaps that's OK: perhaps it's enough to have drivers put
> limits on the number of (say) array elements at the call site,
> although then there's not as much advantage to having a new API.
> Instead, we could make it
>
> void	*kmem_allocarray(size_t n, size_t size, size_t maxn, int flags);
> or
> void	*kmem_allocarray(size_t n, size_t size, size_t maxbytes, int flags);
>
> It's not clear that the call site is exactly the right place to
> compute a bound on the number of bytes a user can allocate.  On the
> other hand, if it's not clear up front what the bound is, then that
> makes a foot-oriented panic gun, or an instawedge, if the kernel
> decides at run-time that some number of bytes is more than it can
> possibly ever satisfy, which is not so great either.  If you specify
> up front in the source, at least you can say by examination of the
> source whether it has a chance of working or not on some particular
> platform.  And maybe we can make it easier to write an expression for
> `no more than 10% of the machine's current RAM' or something.
>
> Either way, kmem_allocarray would have to have the option of returning
> NULL, unlike kmem_alloc(..., KM_SLEEP), which is a nontrivial change
> to the contract now that chuq@ recently dove in deep to make sure it
> never returns NULL.
>
> Thoughts?
>

I think we should go for kmem_reallocarr(). It has been designed for
overflows like reallocarray(3) with an option to be capable to resize a
table from 1 to N elements and back from N to 0 including freeing.