use cpu sensor for cpuspeed

2022-05-14 Thread Ted Unangst
The cpu hz sensor is more accurate and updates faster than the value
currently used for hw.cpuspeed. So return that value (scaled).

This doesn't set cpuspeed directly because the acpi code already does that
and it's hard to create a whole system of priority overrides. I think it's
simpler, and maybe even better, to track every value and then return the
best one available.


Index: identcpu.c
===
RCS file: /home/cvs/src/sys/arch/amd64/amd64/identcpu.c,v
retrieving revision 1.124
diff -u -p -r1.124 identcpu.c
--- identcpu.c  26 Apr 2022 10:48:20 -  1.124
+++ identcpu.c  15 May 2022 00:15:39 -
@@ -64,6 +64,7 @@ void  cpu_check_vmm_cap(struct cpu_info *
 /* sysctl wants this. */
 char cpu_model[48];
 int cpuspeed;
+uint64_t sensorspeed;
 
 int amd64_has_xcrypt;
 #ifdef CRYPTO
@@ -244,6 +245,8 @@ int
 cpu_amd64speed(int *freq)
 {
*freq = cpuspeed;
+   if (sensorspeed)
+   *freq = sensorspeed / 1ULL;
return (0);
 }
 
@@ -337,6 +340,8 @@ cpu_hz_update_sensor(void *args)
val = (adelta * 100) / mdelta * tsc_frequency;
val = ((val + FREQ_50MHZ / 2) / FREQ_50MHZ) * FREQ_50MHZ; 
ci->ci_hz_sensor.value = val;
+   if (CPU_IS_PRIMARY(ci))
+   sensorspeed = val;
}
 
 	atomic_clearbits_int(&curproc->p_flag, P_CPUPEG);
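
For anyone wanting to eyeball the result of the change, here is a minimal
userland sketch (not part of the diff) that reads hw.cpuspeed through
sysctl(2), the value the message says this patch feeds:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

int
main(void)
{
	/* hw.cpuspeed: the MHz figure the kernel's cpuspeed hook returns */
	int mib[2] = { CTL_HW, HW_CPUSPEED };
	int speed;
	size_t len = sizeof(speed);

	if (sysctl(mib, 2, &speed, &len, NULL, 0) == -1) {
		perror("sysctl");
		return 1;
	}
	printf("hw.cpuspeed = %d MHz\n", speed);
	return 0;
}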



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Philip Guenther
On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> Stuart Henderson wrote in
...
>  |what's the perceived problem you're wanting to solve? and does that
>  |problem actually exist in the first place?
> 
> The problem is that if you have a low upper bound then modulo will "remove a 
> lot of randomization".  For example, if you have a program which 
> generates Lotto numbers (1..49), then using _uniform() as it is will 
> generate many duplicates.

Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform 
distribution.  Says right so in the manpage.  If an implementation of that 
API fails to do that, it's a broken implementation.

The OpenBSD implementation achieves that.  NetBSD's implementation has the 
same core logic.  Here's my quick lotto pick demo program, doing 4900000 
picks, so they should all be somewhere near 100000:

#include <stdio.h>
#include <stdlib.h>
#define LIMIT   49
int
main(void)
{
	unsigned counts[LIMIT] = { 0 };
	int i;
	for (i = 0; i < 100000 * LIMIT; i++)
		counts[arc4random_uniform(LIMIT)]++;
	for (i = 0; i < LIMIT; i++)
		printf("%2d\t%7u\n", i+1, counts[i]);
	return 0;
}

And a sample run:

: bleys; ./a.out
   
 1   100100
 2   100639
 3    99965
 4    99729
 5    99641
 6    99650
 7   100299
 8   100164
 9    99791
10   100101
11   100657
12   100042
13    99661
14    99927
15    99426
16    99491
17    99646
18   100133
19   100013
20    99942
21    99873
22    99924
23    99567
24   100152
25   100688
26   100011
27   100481
28    99980
29   100406
30    99726
31    99808
32    99929
33   100050
34    99983
35   100048
36    99771
37    99906
38   100215
39   100261
40   100426
41    99847
42    99533
43   100368
44    99695
45   100041
46   100465
47    99875
48   100034
49    99920
: bleys; 

Looks pretty good to me, with repeated runs showing the values bouncing around.


...
> I was a bit interested when i saw Luke's message, but i am no
> mathematician and, like i said, i in fact never needed the
> functionality.  And the question i would have is how small
> upper_bound has to be for the modulo problem to really kick in.

If the implementation isn't broken, it's not a problem, period.
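
For reference, the uniformity comes from a rejection step before the modulo.
A sketch of that standard scheme, written from memory rather than copied
from the libc source, so treat the details as approximate:

#include <stdint.h>
#include <stdlib.h>

/*
 * Discard the lowest (2**32 % upper_bound) values of arc4random() so
 * that the accepted range is an exact multiple of upper_bound; only
 * then reduce.  Every result below upper_bound is equally likely.
 */
uint32_t
uniform_sketch(uint32_t upper_bound)
{
	uint32_t r, min;

	if (upper_bound < 2)
		return 0;

	/* 2**32 % x == (2**32 - x) % x, computed without 64-bit math */
	min = -upper_bound % upper_bound;

	for (;;) {
		r = arc4random();
		if (r >= min)
			break;
	}
	return r % upper_bound;
}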


> And even if, whether it matters.  For example, if you have
> a bitset of 1024 "in-flight things", and the small upper bound
> hits a used slot, you could simply search forward until you find
> a free one.  I mean, with 1024 possible values the number of
> possibilities is anyhow restricted.

Well hey, let's give Luke's a try and see how uniform it is:


#include <stdio.h>
#include <stdlib.h>
#define LIMIT   49
int
main(void)
{
	unsigned counts[LIMIT] = { 0 };
	unsigned counts2[LIMIT] = { 0 };
	int i;
	for (i = 0; i < 100000 * LIMIT; i++) {
		counts[arc4random_uniform(LIMIT)]++;
		/* arc4random_uniform_fast_simple() is the proposed function under test */
		counts2[arc4random_uniform_fast_simple(LIMIT)]++;
	}
	for (i = 0; i < LIMIT; i++)
		printf("%2d\t%7u\t%7u\n", i+1, counts[i], counts2[i]);
	return 0;
}

: bleys; ./a.out  
 1   100188   76502
 2    99983   76602
 3   100521   76522
 4   100416   76682
 5   100171   76387
 6   100163   76759
 7   100024   76336
 8   19   76703
 9    99769   76237
10    99892   76532
11   100197   76730
12   100483   76398
13    99769   76310
14   100075   76474
15    99781   76599
16    99846   76439
17    99814   76430
18   100313   76648
19   100259   76813
20    99885   77068
21   100302   76546
228   76698
23    99491   76678
24   100340   76324
25    99763  115263
26    99872  153008
27   100022  152979
28    99481  153793
29   100018  210714
30    99617  229286
31   100167  297003
32   100270  449664
33   100468   76790
34    99115   76452
35    99921   76392
36    99862   76140
37   100485   76607
38   100029   75885
39    99577   76498
40    99479   76727
41   100139   76746
42   100883   76698
43   100102   76474
44    99801   76592
45   100117   76124
46    99678   76417
47    99770   76639
48    99524   77034
49   100151   76658
: bleys; 

Wow, that last column is *bad*.  Repeated runs show 25-32 are consistently 
high, 32 being *outrageously* off, and the others are low.

Luke's implementation does not correctly implement the API.  Doesn't 
matter if it's a million times faster when it doesn't deliver the goal.


Philip Guenther



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Steffen Nurpmeso
Stuart Henderson wrote in
 :
 |On 2022/05/14 06:56, Luke Small wrote:
 |> If I use arc4random_uniform() repeatedly to create a random distribution of
 |> say numbers less than 0x1000 or even something weird like 0x1300 will the
 |> random distribution be better with arc4random_uniform() or with mine?
 |
 |there's no point to have a choice of different arc4random_uniform_xyz
 |with different characteristics, how is somebody going to pick the
 |"right" one?
 |
 |the bottom of netbsd's version of the arc4random(3) manual says it well:
 |
 | One may be tempted to create new APIs to accommodate different security
 | models and performance constraints without unpleasant surprises on dif-
 | ferent operating systems.  This should not be done lightly, though,
 | because there are already too many different choices, and too many oppor-
 | tunities for programmers to reach for one and pick the wrong one.
 |
 |what's the perceived problem you're wanting to solve? and does that
 |problem actually exist in the first place?

The problem is that if you have a low upper bound then modulo will
"remove a lot of randomization".  For example, if you have
a program which generates Lotto numbers (1..49), then using
_uniform() as it is will generate many duplicates.

I used the approach i found in a TeX file over twenty years ago
("RANDOM.TEX, v.1 (Donald Arseneau) from TeTeX,
texmf/tex/generic/misc/random.tex"; and now that i look, he
published a .v2 not too long ago), it was named _clip() as it
produces something in between a minimum and a maximum:

_max -= _min;
++_max;
_min = Maximum::sir - 2; // m - 2 = 2^(32|64) - 3
if (_max != 0)
_min /= _max;

for(;;) {
uir i;

i = _random;
i += M1::uir;
if(_min != 0)
i /= _min;

if(i >= _max && (i != 0 || _max != 0)) {
_random *= 16807; // just MUL
if(_random == 0)
_random = 1;
continue;
}
_random = i;
break;
}

_random += origmin;
_Assert(1 && _random >= origmin && _random <= origmax);

That is, the random number is multiplied with a, eh, non-prime, as
long as we do not fit.
However i found this became a terrible mess with 64-bit numbers,
so i removed the function, aka did not "port it over" when
i implemented a random number generator for my own use cases.
Maybe i should have increased the multiplicator.

It is really, really bad.  :-)

I only ever used it for a Lotto number program anyhow, i think,
and so i thought it would be better to simply read a one-byte
random value and skip those that do not fit, instead of making
them always fit with modulo 50.  That is, an application-specific
workaround.  Things are less harmful for large numbers, anyway.
In fact i have to say i was astonished to see someone using this
function; not too long ago i saw a FreeBSD commit fly by which
uses it!
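
A sketch of that byte-level rejection (an interpretation, using
arc4random_buf(3) for the byte source here; not the original TeX-era code):

#include <stdint.h>
#include <stdlib.h>

/* Draw a Lotto number 1..49 by skipping bytes that do not fit, so no
 * modulo is ever applied.  Wasteful, but trivially uniform. */
static int
lotto_pick(void)
{
	uint8_t b;

	for (;;) {
		arc4random_buf(&b, sizeof(b));
		if (b >= 1 && b <= 49)
			return b;
	}
}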

I was a bit interested when i saw Luke's message, but i am no
mathematician and, like i said, i in fact never needed the
functionality.  And the question i would have is how small
upper_bound has to be for the modulo problem to really kick in.
And even if, whether it matters.  For example, if you have
a bitset of 1024 "in-flight things", and the small upper bound
hits a used slot, you could simply search forward until you find
a free one.  I mean, with 1024 possible values the number of
possibilities is anyhow restricted.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Matthew Martin
#include <stdio.h>
#include <stdlib.h>

int
main() {
	int results[3] = { 0, 0, 0 };
	for (int i = 0; i < 100000; i++) {
		/* arc4random_uniform_fast_simple() is the proposed function under test */
		results[arc4random_uniform_fast_simple(3)]++;
	}
	for (int i = 0; i < 3; i++)
		printf("%d: %d\n", i, results[i]);

	return 0;
}

% ./a.out
0: 24809
1: 50011
2: 25180

You can't reuse bits because they'll be biased.  With the sliding-window
scheme described elsewhere in the thread, an upper bound of 3 means the only
rejected two-bit pattern is 11; shifting in one fresh bit leaves the low bit
at 1, so the retry can only ever produce 1 (or another reject), which is why
1 shows up about twice as often above.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Stuart Henderson
On 2022/05/14 06:56, Luke Small wrote:
> If I use arc4random_uniform() repeatedly to create a random distribution of
> say numbers less than 0x1000 or even something weird like 0x1300 will the
> random distribution be better with arc4random_uniform() or with mine?

there's no point to have a choice of different arc4random_uniform_xyz
with different characteristics, how is somebody going to pick the
"right" one?

the bottom of netbsd's version of the arc4random(3) manual says it well:

 One may be tempted to create new APIs to accommodate different security
 models and performance constraints without unpleasant surprises on dif-
 ferent operating systems.  This should not be done lightly, though,
 because there are already too many different choices, and too many oppor-
 tunities for programmers to reach for one and pick the wrong one.

what's the perceived problem you're wanting to solve? and does that
problem actually exist in the first place?



Re: stop using mquery(2) inside realloc(3)

2022-05-14 Thread Philip Guenther
On Sat, 14 May 2022, Philip Guenther wrote:
> On Sat, 14 May 2022, Theo de Raadt wrote:
> > I worry a little about having libc use an undocumented mmap(2) flag.
> > About as much as using mquery, which is non-standard.
> > 
> > Maybe __MAP_NOREPLACE should get documentation?  __MAP_NOFAULT is in the
> > same situation.  The behaviour of these flags should be documented
> > (set in stone), which may also discourage accidental behaviour changes
> > by kernel developers in the future?
> 
> To throw some spaghetti at the wall...

Fix the grammar of the __MAP_NOFAULT description and mention signals there.

Index: sys/mmap.2
===
RCS file: /data/src/openbsd/src/lib/libc/sys/mmap.2,v
retrieving revision 1.68
diff -u -p -r1.68 mmap.2
--- sys/mmap.2  31 Mar 2022 17:27:16 -  1.68
+++ sys/mmap.2  14 May 2022 19:31:28 -
@@ -58,8 +58,19 @@ The
 argument describes the address where the system should place the mapping.
 If the
 .Dv MAP_FIXED
-flag is specified, the allocation will happen at the specified address,
+flag is specified and the
+.Dv __MAP_NOREPLACE
+flag is not specified,
+the allocation will happen at the specified address,
 replacing any previously established mappings in its range.
+If both the
+.Dv MAP_FIXED
+and
+.Dv __MAP_NOREPLACE
+flags are specified,
+the allocation will happen at the specified address,
+failing with no effect if any previously established mappings are
+in its range.
 Otherwise, the mapping will be placed at the available spot at
 .Fa addr ;
 failing that it will be placed "close by".
@@ -153,6 +164,18 @@ mappings)
 must be multiples of the page size.
 Existing mappings in the address range will be replaced.
 Use of this option is discouraged.
+.It Dv __MAP_NOFAULT
+Indicate that access to pages that are not backed by the mapped
+file or device will result in zero-filled anonymous pages being
+provided instead of a signal being delivered.
+This flag must not be used with an anonymous mapping.
+.It Dv __MAP_NOREPLACE
+Indicates that a
+.Dv MAP_FIXED
+mapping should fail if it would require replacing any existing
+mappings.
+This flag must be used in combination with
+.Dv MAP_FIXED .
 .It Dv MAP_STACK
 Indicate that the mapping is used as a stack.
 This flag must be used in combination with
@@ -278,6 +301,12 @@ device does not support memory mapping.
 The allocation
 .Fa len
 was 0.
+.It Bq Er EINVAL
+.Dv __MAP_NOFAULT
+was specified for an anonymous mapping or
+.Dv __MAP_NOREPLACE
+was specified without
+.Dv MAP_FIXED .
 .It Bq Er ENOMEM
 .Dv MAP_FIXED
 was specified and the
@@ -323,6 +352,14 @@ A fully functional
 system call first appeared in SunOS 4.0
 and has been available since
 .Bx 4.3 Net/2 .
+The
+.Dv __MAP_NOFAULT
+flag appeared in
+.Ox 5.7 .
+The
+.Dv __MAP_NOREPLACE
+flag appeared in
+.Ox 5.3 .
 .Sh CAVEATS
 .St -p1003.1-2008
 specifies that references to pages beyond the end of a mapped object
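
As a usage illustration of the semantics documented above (a sketch; it
assumes only what the text says, namely that a refused
MAP_FIXED | __MAP_NOREPLACE mapping fails with no effect):

#include <sys/mman.h>
#include <stddef.h>

/*
 * Try to grow an existing page-aligned mapping in place by "grow"
 * bytes.  With MAP_FIXED | __MAP_NOREPLACE the kernel either places
 * the new pages exactly at hint or fails without touching the address
 * space, so no separate mquery(2) probe or cleanup munmap(2) is needed.
 */
static void *
try_extend(void *base, size_t oldsz, size_t grow)
{
	void *hint = (char *)base + oldsz;	/* oldsz assumed page-rounded */

	if (mmap(hint, grow, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_FIXED | __MAP_NOREPLACE,
	    -1, 0) == MAP_FAILED)
		return NULL;	/* the space after the mapping is taken */
	return base;		/* mapping now covers oldsz + grow bytes */
}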



Re: stop using mquery(2) inside realloc(3)

2022-05-14 Thread Philip Guenther
On Sat, 14 May 2022, Theo de Raadt wrote:
> I worry a little about having libc use an undocumented mmap(2) flag.
> About as much as using mquery, which is non-standard.
> 
> Maybe __MAP_NOREPLACE should get documentation?  __MAP_NOFAULT is in the
> same situation.  The behaviour of these flags should be documented
> (set in stone), which may also discourage accidental behaviour changes
> by kernel developers in the future?

To throw some spaghetti at the wall...

Index: sys/mmap.2
===
RCS file: /data/src/openbsd/src/lib/libc/sys/mmap.2,v
retrieving revision 1.68
diff -u -p -r1.68 mmap.2
--- sys/mmap.2  31 Mar 2022 17:27:16 -  1.68
+++ sys/mmap.2  14 May 2022 19:06:00 -
@@ -58,8 +58,19 @@ The
 argument describes the address where the system should place the mapping.
 If the
 .Dv MAP_FIXED
-flag is specified, the allocation will happen at the specified address,
+flag is specified and the
+.Dv __MAP_NOREPLACE
+flag is not specified,
+the allocation will happen at the specified address,
 replacing any previously established mappings in its range.
+If both the
+.Dv MAP_FIXED
+and
+.Dv __MAP_NOREPLACE
+flags are specified,
+the allocation will happen at the specified address,
+failing with no effect if any previously established mappings are
+in its range.
 Otherwise, the mapping will be placed at the available spot at
 .Fa addr ;
 failing that it will be placed "close by".
@@ -153,6 +164,17 @@ mappings)
 must be multiples of the page size.
 Existing mappings in the address range will be replaced.
 Use of this option is discouraged.
+.It Dv __MAP_NOFAULT
+Indicate that access to pages that are not backed by the mapped
+file or device will be replaced by zero-filled anonymous pages.
+This flag must not be used with an anonymous mapping.
+.It Dv __MAP_NOREPLACE
+Indicates that a
+.Dv MAP_FIXED
+mapping should fail if it would require replacing any existing
+mappings.
+This flag must be used in combination with
+.Dv MAP_FIXED .
 .It Dv MAP_STACK
 Indicate that the mapping is used as a stack.
 This flag must be used in combination with
@@ -278,6 +300,12 @@ device does not support memory mapping.
 The allocation
 .Fa len
 was 0.
+.It Bq Er EINVAL
+.Dv __MAP_NOFAULT
+was specified for an anonymous mapping or
+.Dv __MAP_NOREPLACE
+was specified without
+.Dv MAP_FIXED .
 .It Bq Er ENOMEM
 .Dv MAP_FIXED
 was specified and the
@@ -323,6 +351,14 @@ A fully functional
 system call first appeared in SunOS 4.0
 and has been available since
 .Bx 4.3 Net/2 .
+The
+.Dv __MAP_NOFAULT
+flag appeared in
+.Ox 5.7 .
+The
+.Dv __MAP_NOREPLACE
+flag appeared in
+.Ox 5.3 .
 .Sh CAVEATS
 .St -p1003.1-2008
 specifies that references to pages beyond the end of a mapped object



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Luke Small
Look at my code. I don’t even use a modulus operator. I perform hit and
miss with a random bitstream.

How can I have a bias of something I don’t do? I return a bitstream which
meets the parameters of being a value less than the upper bound. Much like
arc4random_buf().

If I use arc4random_uniform() repeatedly to create a random distribution of,
say, numbers less than 0x1000 or even something weird like 0x1300, will the
random distribution be better with arc4random_uniform() or with mine? For
0x1000 mine will simply pluck 12 bits of random data straight from
arc4random() (and preserve the remaining 20 bits for later) on the first
try, just like arc4random_buf() does.

arc4random_uniform() will perform a modulus of a 32 bit number which adds
data to the bitstream. Does it make it better? Perhaps it makes it harder
to guess the source bits.

I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
looking at modulo bias.

I didn’t know what it was, before, but I basically “rejection sample”:

https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/

On Sat, May 14, 2022 at 6:14 AM Otto Moerbeek  wrote:

> On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote:
>
> > arc4random_uniform_fast2 that I made, streams in data from arc4random()
> and
> > uses the datastream directly and uses it as a bit by bit right "sliding
> > window" in the last loop. arc4random_uniform() uses a modulus which I is
> > simple to implement, but I wonder how cryptographically sound or even how
> > evenly it distributes. Adding a modulus seems sloppy without something
> > better. I did make arc4random_fast_simple() which merely takes an
> > upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
> > the top function was into it which binary searches to find the correct
> size
> > bitfield (return value) needed to barely fit the upperbound while also
> > being able to discover every possible value below the upperbound. It
> isn't
> > as fast as arc4random_uniform_fast2 if it were used repeatedly after a
> > single use of arc4random_uniform_fast_bitsearch() , but it does exactly
> the
> > same thing and appears faster than repeatedly using arc4random_uniform()
> > and it's wasteful use of arc4random() and calling the expensive rekeying
> > function more often.
> >
> > It may be interesting to determine even without looking at performance,
> > whether arc4random_fast_simple() creates a more superior, secure use of
> the
> > chacha20 stream than arc4random_uniform() with the modulus. what exactly
> > does all that extra data from the modulus do to the random distribution?
> >
> > -Luke
>
> I don't follow you at all. Your blabbering does not even use the terms
> "uniform" and "modulo bias". I wonder even if you realize what they
> mean in this context.
>
> -Otto
>
> --
-Luke


Re: stop using mquery(2) inside realloc(3)

2022-05-14 Thread Theo de Raadt
I worry a little about having libc use an undocumented mmap(2) flag.
About as much as using mquery, which is non-standard.

Maybe __MAP_NOREPLACE should get documentation?  __MAP_NOFAULT is in the
same situation.  The behaviour of these flags should be documented
(set in stone), which may also discourage accidental behaviour changes
by kernel developers in the future?



Re: librthread: validate timespec inputs with timespecisvalid(3)

2022-05-14 Thread Todd C . Miller
On Sat, 14 May 2022 09:31:26 -0500, Scott Cheloha wrote:

> ok?

OK millert@

 - todd



librthread: validate timespec inputs with timespecisvalid(3)

2022-05-14 Thread Scott Cheloha
ok?

Index: rthread_rwlock_compat.c
===
RCS file: /cvs/src/lib/librthread/rthread_rwlock_compat.c,v
retrieving revision 1.1
diff -u -p -r1.1 rthread_rwlock_compat.c
--- rthread_rwlock_compat.c 13 Feb 2019 13:15:39 -  1.1
+++ rthread_rwlock_compat.c 14 May 2022 14:29:27 -
@@ -143,8 +143,7 @@ int
 pthread_rwlock_timedrdlock(pthread_rwlock_t *lockp,
 const struct timespec *abstime)
 {
-   if (abstime == NULL || abstime->tv_nsec < 0 ||
-   abstime->tv_nsec >= 1000000000)
+   if (abstime == NULL || !timespecisvalid(abstime))
return (EINVAL);
return (_rthread_rwlock_rdlock(lockp, abstime, 0));
 }
@@ -210,8 +209,7 @@ int
 pthread_rwlock_timedwrlock(pthread_rwlock_t *lockp,
 const struct timespec *abstime)
 {
-   if (abstime == NULL || abstime->tv_nsec < 0 ||
-   abstime->tv_nsec >= 1000000000)
+   if (abstime == NULL || !timespecisvalid(abstime))
return (EINVAL);
return (_rthread_rwlock_wrlock(lockp, abstime, 0));
 }
Index: rthread_sem.c
===
RCS file: /cvs/src/lib/librthread/rthread_sem.c,v
retrieving revision 1.32
diff -u -p -r1.32 rthread_sem.c
--- rthread_sem.c   6 Apr 2020 00:01:08 -   1.32
+++ rthread_sem.c   14 May 2022 14:29:27 -
@@ -254,8 +254,7 @@ sem_timedwait(sem_t *semp, const struct 
int error;
PREP_CANCEL_POINT(tib);
 
-   if (!semp || !(sem = *semp) || abstime == NULL ||
-  abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000) {
+   if (!semp || !(sem = *semp) || !abstime || !timespecisvalid(abstime)) {
errno = EINVAL;
return (-1);
}
Index: rthread_sem_compat.c
===
RCS file: /cvs/src/lib/librthread/rthread_sem_compat.c,v
retrieving revision 1.1
diff -u -p -r1.1 rthread_sem_compat.c
--- rthread_sem_compat.c8 Jun 2018 13:53:01 -   1.1
+++ rthread_sem_compat.c14 May 2022 14:29:27 -
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -266,8 +267,7 @@ sem_timedwait(sem_t *semp, const struct 
int r;
PREP_CANCEL_POINT(tib);
 
-   if (!semp || !(sem = *semp) || abstime == NULL ||
-   abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000) {
+   if (!semp || !(sem = *semp) || !abstime || !timespecisvalid(abstime)) {
errno = EINVAL;
return (-1);
}
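
For context, timespecisvalid(3) is just the range check the removed lines
spelled out by hand.  From memory it is roughly the following <sys/time.h>
macro; treat the exact definition as an assumption:

/* Assumed expansion: a timespec is valid iff tv_nsec is in [0, 1000000000). */
#define	TIMESPEC_IS_VALID(tsp) \
	((tsp)->tv_nsec >= 0 && (tsp)->tv_nsec < 1000000000L)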



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Otto Moerbeek
On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote:

> arc4random_uniform_fast2, which I made, streams in data from arc4random() and
> uses the datastream directly and uses it as a bit by bit right "sliding
> window" in the last loop. arc4random_uniform() uses a modulus which I is
> simple to implement, but I wonder how cryptographically sound or even how
> evenly it distributes. Adding a modulus seems sloppy without something
> better. I did make arc4random_fast_simple() which merely takes an
> upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
> the top function was into it which binary searches to find the correct size
> bitfield (return value) needed to barely fit the upperbound while also
> being able to discover every possible value below the upperbound. It isn't
> as fast as arc4random_uniform_fast2 if it were used repeatedly after a
> single use of arc4random_uniform_fast_bitsearch() , but it does exactly the
> same thing and appears faster than repeatedly using arc4random_uniform()
> and its wasteful use of arc4random() and calling the expensive rekeying
> function more often.
> 
> It may be interesting to determine even without looking at performance,
> whether arc4random_fast_simple() creates a superior, more secure use of the
> chacha20 stream than arc4random_uniform() with the modulus.  What exactly
> does all that extra data from the modulus do to the random distribution?
> 
> -Luke

I don't follow you at all. Your blabbering does not even use the terms
"uniform" and "modulo bias". I wonder even if you realize what they
mean in this context.

-Otto



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Luke Small
arc4random_uniform_fast2, which I made, streams in data from arc4random() and
uses the datastream directly and uses it as a bit by bit right "sliding
window" in the last loop. arc4random_uniform() uses a modulus which I is
simple to implement, but I wonder how cryptographically sound or even how
evenly it distributes. Adding a modulus seems sloppy without something
better. I did make arc4random_fast_simple() which merely takes an
upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
the top function was into it which binary searches to find the correct size
bitfield (return value) needed to barely fit the upperbound while also
being able to discover every possible value below the upperbound. It isn't
as fast as arc4random_uniform_fast2 if it were used repeatedly after a
single use of arc4random_uniform_fast_bitsearch() , but it does exactly the
same thing and appears faster than repeatedly using arc4random_uniform()
and its wasteful use of arc4random() and calling the expensive rekeying
function more often.

It may be interesting to determine even without looking at performance,
whether arc4random_fast_simple() creates a superior, more secure use of the
chacha20 stream than arc4random_uniform() with the modulus.  What exactly
does all that extra data from the modulus do to the random distribution?

-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Otto Moerbeek
In general, wasting some arc4random bits is not a problem at all.

I see a lot of code with no demonstration, explanation or proof why it
would be correct (i.e. produces uniform results). You only talk about
speed.

The current code might not be optimal, but at least it is very clear
it's correct.

-Otto


On Fri, May 13, 2022 at 09:43:26AM -0500, Luke Small wrote:

> I made a couple new versions of a new kind of arc4random_uniform-like
> function and some example functions which use them. Instead of having a
> sufficiently large random number greater than the modulus, I pick a random
> number using arc4random() from a bitfield where the length of the bitfield
> is just below or slightly beyond the value of the modulus and returns the
> bitfield it if it is less than the value of the modulus.
> 
> In both versions, I retrieve a bitfield from a static uint64_t which is
> refreshed from periodic arc4random() calls. The functions demand a bit
> length. but I provide a convenient bitsize search function:
> arc4random_uniform_fast_bitsearch()
> 
> in the first version, if the chosen return value isn't less than the
> modulus, the entire bitfield is trashed and a completely new bitfield is
> refreshed from the cache. This can be very inefficient and for certain
> upperbounds where the functions demand that all returned values less than
> the upperbound are possible. This can perform worse than
> arc4random_uniform() even with more random number generation churn.
> 
> in the second version, if the chosen return value isn't less than the
> modulus, the bitfield is shifted right, losing the least significant bit
> and a new random-value bit from the least significant bit of the cache is
> put into the most significant bit of the bitfield and it tries again. For
> significant expenditures of effort, this version is always more efficient
> than arc4random_uniform() and slightly to much more efficient than my first
> version.
> 
> 
> The performance boost comes from less pseudorandom number generation by not
> trashing perfectly good random data and preserving it so that rekeying is
> performed less.
> 
> 
> I suspect that the first version may be secure, but I'm not sure what you
> think of the second version, security-wise. Is there a way to test how
> well distributed and secure it is?!
> 
> Perhaps it's even better than the typical use case of arc4random_uniform
> since it feeds it a bitstream instead of performing a modulo operation on
> it!
> 
> I pledge("stdio") and unveil(NULL, NULL) at the beginning, so you know it
> doesn't do anything but malloc() an array, do stuff on the array and print
> stuff; if you don't profile, anyway.
> 
> What do you think of having such a thing in OpenBSD?
> 
> 
> 
> 
> newdata() generates random a-z characters performing arc4random_uniform(26);
> 
> newdataTypableFilename() generates random filename typable characters
> performing arc4random_uniform(92);
> 
> I perform other tests including presumed worst-cases on large numbers and
> numbers which will have a lot of misses.
> 
> 
> 
> 
> -Luke




Re: stop using mquery(2) inside realloc(3)

2022-05-14 Thread Otto Moerbeek
On Fri, May 13, 2022 at 08:21:19PM -0700, Philip Guenther wrote:

> If you try to grow a 'large' (at least half a page) allocation with 
> realloc(3), it'll try to extend the memory mapping for it and if that 
> works it won't need to move it.

(on a side note: it does not do anything at all if the # of pages does
not grow).
> 
> Currently, it does that by using mquery(2) to see if that area is 
> available and if so then trying to mmap it, munmaping it if mmap(2) didn't 
> return the desired address (perhaps due to a race with another thread).  
> Instead of doing mquery/mmap/munmap, this path can use the MAP_FIXED and 
> __MAP_NOREPLACE flags to directly request the specific address but not 
> change anything if it's not available.
> 
> 
> (We still use mquery in ld.so on i386 as a performance optimization, but 
> that use case differs in needing to find a base address for *multiple* 
> mappings, where unrolling partial hits grows to be more expensive than 
> probing with mquery and only trying to do all the mapping on a successful 
> set of probes: recall that on x86 munmap() is more expensive than mmap() 
> due to TLB shootdowns. This case in realloc() only has a single mapping to 
> probe/extend so it avoids those costs.  Indeed, this diff eliminates the 
> current "munmap on failed attempt" path/cost.)
> 
> 
> The regress/lib/libc/malloc tests continue to pass on my amd64 box, with 
> ktrace confirming the change in calls.
> 
> oks?

I was looking through the commit log to see when this flag was introduced.
It was in sys/mman.h rev 1.21, but a bit hidden because the commit
message talks about __MAP_NOREMAP.

Anyway, I cannot tell whether it matters that all pmap implementations
implement this, or whether this is handled in a higher layer.

Also a tiny bit worried the __MAP_NOREPLACE path in uvm isn't
exercised as much, so it might reveal bugs. But that's no reason not to
do this now.

So if my first question can be answered with: it's in a higher layer,
OK.

-Otto

> 
> 
> Philip
> 
> 
> Index: stdlib/malloc.c
> ===
> RCS file: /data/src/openbsd/src/lib/libc/stdlib/malloc.c,v
> retrieving revision 1.273
> diff -u -p -r1.273 malloc.c
> --- stdlib/malloc.c   26 Feb 2022 16:14:42 -  1.273
> +++ stdlib/malloc.c   14 May 2022 02:35:04 -
> @@ -100,9 +100,6 @@
>  #define MMAPA(a,sz,f)mmap((a), (sz), PROT_READ | PROT_WRITE, \
>  MAP_ANON | MAP_PRIVATE | (f), -1, 0)
>  
> -#define MQUERY(a,sz,f)   mquery((a), (sz), PROT_READ | PROT_WRITE, \
> -MAP_ANON | MAP_PRIVATE | MAP_FIXED | (f), -1, 0)
> -
>  struct region_info {
>   void *p;/* page; low bits used to mark chunks */
>   uintptr_t size; /* size for pages, or chunk_info pointer */
> @@ -1687,11 +1684,7 @@ orealloc(struct dir_info **argpool, void
>   size_t needed = rnewsz - roldsz;
>  
>   STATS_INC(pool->cheap_realloc_tries);
> - q = MQUERY(hint, needed, pool->mmap_flag);
> - if (q == hint)
> - q = MMAPA(hint, needed, pool->mmap_flag);
> - else
> - q = MAP_FAILED;
> + q = MMAPA(hint, needed, MAP_FIXED | __MAP_NOREPLACE | pool->mmap_flag);
>   if (q == hint) {
>   STATS_ADD(pool->malloc_used, needed);
>   if (pool->malloc_junk == 2)
> @@ -1709,9 +1702,6 @@ orealloc(struct dir_info **argpool, void
>   STATS_INC(pool->cheap_reallocs);
>   ret = p;
>   goto done;
> - } else if (q != MAP_FAILED) {
> - if (munmap(q, needed))
> - wrterror(pool, "munmap %p", q);
>   }
>   }
>   } else if (rnewsz < roldsz) {
>