Re: [RFC resend v1 2/2] Use arc4random_range() instead of arc4random_uniform() when appropriate

2023-01-02 Thread Alejandro Colomar



On 1/1/23 01:08, Alejandro Colomar wrote:



On 12/31/22 20:08, Alejandro Colomar wrote:

This makes the code much more readable and self-documented.  While doing
this, I noticed a few bugs, and other cases which may be bugs or not.
Switching to this specialized API makes it easier to spot such bugs, but
since I'm not familiar with the code, I kept some bugs unfixed.  The
most obvious ones (although I may be wrong) I fixed.  And in some
cases where it was very unclear, I didn't touch the old *_uniform() code.

Below are the cases where I changed the behavior (I considered it a bug):

*  usr.bin/ssh/auth.c:

    -  *cp = hashchars[arc4random_uniform(sizeof(hashchars) - 1)];
    +  *cp = hashchars[arc4random_range(0, sizeof(hashchars) - 1)];


Reconsidering, this one is probably better just as 
arc4random_uniform(sizeof(hashchars)).


I was also wrong here.
I was confused by the implicit strlen() calculation with sizeof()-1, whose -1 
was cancelled by the +1.
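
For reference, a minimal sketch of how a closed-interval helper in the
spirit of the proposed arc4random_range() could sit on top of
arc4random_uniform(3); this is only an illustration of the interface
being discussed (patch 1/2 is not quoted here), not the submitted
implementation:

	#include <stdint.h>
	#include <stdlib.h>

	/*
	 * Sketch only: uniform value in the closed interval [min, max].
	 * Assumes min <= max and max - min < UINT32_MAX.
	 */
	static uint32_t
	arc4random_range_sketch(uint32_t min, uint32_t max)
	{
		return min + arc4random_uniform(max - min + 1);
	}

With hashchars[] holding 64 useful characters plus the NUL terminator,
sizeof(hashchars) is 65, so the existing
arc4random_uniform(sizeof(hashchars) - 1) draws an index in [0, 63];
the faithful closed-interval equivalent would be
arc4random_range(0, sizeof(hashchars) - 2), and that "- 2" is exactly
the kind of off-by-one that is easy to get wrong.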



--
<http://www.alejandro-colomar.es/>




Re: [RFC resend v1 2/2] Use arc4random_range() instead of arc4random_uniform() when appropriate

2022-12-31 Thread Ingo Schwarze
Hi Alejandro,

Alejandro Colomar wrote on Sun, Jan 01, 2023 at 01:08:17AM +0100:
> On 12/31/22 20:08, Alejandro Colomar wrote:

>> This makes the code much more readable and self-documented.  While doing
>> this, I noticed a few bugs, and other cases which may be bugs or not.
>> Switching to this specialized API makes it easier to spot such bugs, but
>> since I'm not familiar with the code, I kept some bugs unfixed.  The
>> most obvious ones (although I may be wrong) I fixed.  And in some
>> cases where it was very unclear, I didn't touch the old *_uniform() code.
>> 
>> Below are the cases where I changed the behavior (I considered it a bug):
>> 
>> *  usr.bin/ssh/auth.c:
>> 
>> -  *cp = hashchars[arc4random_uniform(sizeof(hashchars) - 1)];
>> +  *cp = hashchars[arc4random_range(0, sizeof(hashchars) - 1)];

> Reconsidering, this one is probably better just as 
> arc4random_uniform(sizeof(hashchars)).

That seems to introduce exactly the same bug as your first try.
I already explained last year that the code is correct as-is.
We don't want NUL bytes in the password hash, hence the - 1.

Also, please avoid using MIME when you post to OpenBSD lists, with
the only exception of posting tarballs of new ports to ports@, in
which case attachments are OK.

Yours,
  Ingo



Re: [RFC resend v1 2/2] Use arc4random_range() instead of arc4random_uniform() when appropriate

2022-12-31 Thread Alejandro Colomar



On 12/31/22 20:08, Alejandro Colomar wrote:

This makes the code much more readable and self-documented.  While doing
this, I noticed a few bugs, and other cases which may be bugs or not.
Switching to this specialized API makes it easier to spot such bugs, but
since I'm not familiar with the code, I kept some bugs unfixed.  The
most obvious ones (although I may be wrong) I fixed.  And in some
cases where it was very unclear, I didn't touch the old *_uniform() code.

Below are the cases where I changed the behavior (I considered it a bug):

*  usr.bin/ssh/auth.c:

-  *cp = hashchars[arc4random_uniform(sizeof(hashchars) - 1)];
+  *cp = hashchars[arc4random_range(0, sizeof(hashchars) - 1)];


Reconsidering, this one is probably better just as 
arc4random_uniform(sizeof(hashchars)).




*  usr.sbin/ftp-proxy/ftp-proxy.c:

-  return (IPPORT_HIFIRSTAUTO +
    -  arc4random_uniform(IPPORT_HILASTAUTO - IPPORT_HIFIRSTAUTO));
+  return arc4random_range(IPPORT_HIFIRSTAUTO, IPPORT_HILASTAUTO);

*  usr.sbin/rad/engine.c:

-  tv.tv_sec = MIN_RTR_ADV_INTERVAL +
    -  arc4random_uniform(MAX_RTR_ADV_INTERVAL - MIN_RTR_ADV_INTERVAL);
+  tv.tv_sec = arc4random_range(MIN_RTR_ADV_INTERVAL, MAX_RTR_ADV_INTERVAL);

In the following change, I didn't use the temporary variable 'num3'.
AFAICS, this doesn't affect other uses of the variable in other places,
because they set it before use.  But please check carefully; I may have
missed something:

*  usr.sbin/cron/entry.c:

-  /* get a random number in the interval [num1, num2]
-   */
-  num3 = num1;
    -  num1 = arc4random_uniform(num2 - num3 + 1) + num3;
+  num1 = arc4random_range(num1, num2);

Signed-off-by: Alejandro Colomar 


--
<http://www.alejandro-colomar.es/>




Re: [RFC resend v1 2/2] Use arc4random_range() instead of arc4random_uniform() when appropriate

2022-12-31 Thread Ingo Schwarze
Hi Alejandro,

Alejandro Colomar wrote on Sat, Dec 31, 2022 at 08:08:14PM +0100:

> This makes the code much more readable and self-documented.

I object.  It is a needless detour that invites confusion and bugs.
In C code, it is customary to deal with half-open intervals,
and closed intervals are rare.

For example, array indices run from 0 to sizeof(array)-1,
sizeof(array) points to the storage location beyond the last element,
and C programmers are used to that.  So the way arc4random_uniform(3)
works feels familiar to C programmers, whereas your proposal
of arc4random_range(3) is idiosyncratic and provokes bugs.

> While doing this, I noticed a few bugs,

I doubt it.  I think you fell into your own trap.

> *  usr.bin/ssh/auth.c:
> 
>-  *cp = hashchars[arc4random_uniform(sizeof(hashchars) - 1)];
>+  *cp = hashchars[arc4random_range(0, sizeof(hashchars) - 1)];

There is no bug here.  The code goes:

const char hashchars[] = "./ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz0123456789"; /* from bcrypt.c */
char *cp;
/* ... */
for (cp = fake.pw_passwd + 7; *cp != '\0'; cp++)
    *cp = hashchars[arc4random_uniform(sizeof(hashchars) - 1)];

sizeof(hashchars) is the size of the char array, i.e. the string length
plus one (plus one for the terminating NUL byte).

So sizeof(hashchars)-1 is the number of useful characters in the string,
which is the same as the strlen(3).

So arc4random_uniform() returns a number in the half-open
interval [0, strlen).  The minimum index is 0 -> '.',
the maximum index is strlen-1 -> '9'.

This is all perfectly fine.

Your patch introduces a bug.  The terminating NUL character may now
be copied into fake.pw_passwd.

This is exactly one of the reasons why i objected to your
arc4random_range() proposal: it will cause confusion and bugs.

I think the fact that the first hunk of your patch introduces rather
than fixes a bug makes your patch unworthy of review.  It should be
summarily rejected.

Yours,
  Ingo



[RFC resend v1 2/2] Use arc4random_range() instead of arc4random_uniform() when appropriate

2022-12-31 Thread Alejandro Colomar
This makes the code much more readable and self-documented.  While doing
this, I noticed a few bugs, and other cases which may be bugs or not.
Switching to this specialized API makes it easier to spot such bugs, but
since I'm not familiar with the code, I kept some bugs unfixed.  The
most obvious ones (although I may be wrong) I fixed.  And in some
cases where it was very unclear, I didn't touch the old *_uniform() code.

Below are the cases where I changed the behavior (I considered it a bug):

*  usr.bin/ssh/auth.c:

   -  *cp = hashchars[arc4random_uniform(sizeof(hashchars) - 1)];
   +  *cp = hashchars[arc4random_range(0, sizeof(hashchars) - 1)];

*  usr.sbin/ftp-proxy/ftp-proxy.c:

   -  return (IPPORT_HIFIRSTAUTO +
   -  arc4random_uniform(IPPORT_HILASTAUTO - IPPORT_HIFIRSTAUTO));
   +  return arc4random_range(IPPORT_HIFIRSTAUTO, IPPORT_HILASTAUTO);

*  usr.sbin/rad/engine.c:

   -  tv.tv_sec = MIN_RTR_ADV_INTERVAL +
   -  arc4random_uniform(MAX_RTR_ADV_INTERVAL - MIN_RTR_ADV_INTERVAL);
   +  tv.tv_sec = arc4random_range(MIN_RTR_ADV_INTERVAL, MAX_RTR_ADV_INTERVAL);

In the following change, I didn't use the temporary variable 'num3'.
AFAICS, this doesn't affect other uses of the variable in other places,
because they set it before use.  But please check carefully; I may have
missed something:

*  usr.sbin/cron/entry.c:

   -  /* get a random number in the interval [num1, num2]
   -   */
   -  num3 = num1;
   -  num1 = arc4random_uniform(num2 - num3 + 1) + num3;
   +  num1 = arc4random_range(num1, num2);

Signed-off-by: Alejandro Colomar 
---
 games/boggle/boggle/bog.c   | 2 +-
 games/canfield/canfield/canfield.c  | 2 +-
 games/mille/init.c  | 2 +-
 gnu/gcc/gcc/cfgexpand.c | 2 +-
 lib/libevent/select.c   | 2 +-
 regress/lib/libc/malloc/malloc_general/malloc_general.c | 2 +-
 regress/sys/sys/tree/rb/rb-test.c   | 3 +--
 regress/sys/sys/tree/splay/splay-test.c | 3 +--
 sbin/iked/ikev2.c   | 2 +-
 sys/dev/pci/drm/drm_linux.c | 2 +-
 sys/dev/pci/drm/include/linux/random.h  | 2 +-
 sys/kern/kern_fork.c| 2 +-
 sys/net/if_spppsubr.c   | 7 ++-
 sys/net/pf.c| 2 +-
 sys/net/pf_lb.c | 4 ++--
 sys/netinet/ip_ipsp.c   | 2 +-
 usr.bin/nc/netcat.c | 2 +-
 usr.bin/skeyinit/skeyinit.c | 2 +-
 usr.bin/ssh/auth.c  | 2 +-
 usr.sbin/cron/entry.c   | 5 +
 usr.sbin/ftp-proxy/ftp-proxy.c  | 3 +--
 usr.sbin/pppd/chap.c| 5 +
 usr.sbin/rad/engine.c   | 3 +--
 usr.sbin/relayd/shuffle.c   | 2 +-
 24 files changed, 26 insertions(+), 39 deletions(-)

diff --git a/games/boggle/boggle/bog.c b/games/boggle/boggle/bog.c
index c0e19454a27..3ed4888fc43 100644
--- a/games/boggle/boggle/bog.c
+++ b/games/boggle/boggle/bog.c
@@ -607,7 +607,7 @@ newgame(char *b)
/* Shuffle the cubes using Fisher-Yates (aka Knuth P). */
p = ncubes;
while (--p) {
-       q = (int)arc4random_uniform(p + 1);
+   q = (int)arc4random_range(0, p);
tmp = cubes[p];
cubes[p] = cubes[q];
cubes[q] = tmp;
diff --git a/games/canfield/canfield/canfield.c 
b/games/canfield/canfield/canfield.c
index 346fd20a1d2..dec75f6531f 100644
--- a/games/canfield/canfield/canfield.c
+++ b/games/canfield/canfield/canfield.c
@@ -531,7 +531,7 @@ shuffle(struct cardtype *deck[])
deck[i]->paid = FALSE;
}
for (i = decksize - 1; i > 0; i--) {
-   j = arc4random_uniform(i + 1);
+   j = arc4random_range(0, i);
if (i != j) {
temp = deck[i];
deck[i] = deck[j];
diff --git a/games/mille/init.c b/games/mille/init.c
index a86157739dd..c0cc6ac1f02 100644
--- a/games/mille/init.c
+++ b/games/mille/init.c
@@ -90,7 +90,7 @@ shuffle(void)
CARDtemp;
 
for (i = DECK_SZ - 1; i > 0; i--) {
-   r = arc4random_uniform(i + 1);
+   r = arc4random_range(0, i);
temp = Deck[r];
Deck[r] = Deck[i];
Deck[i] = temp;
diff --git a/gnu/gcc/gcc/cfgexpand.c b/gnu/gcc/gcc/cfgexpand.c
index 17aff165f6d..0cb8a21289b 100644
--- a/gnu/gcc/gcc/cfgexpand.c
+++ b/gnu/gcc/gcc/cfgexpand.c
@@ -438,7 +438,7 @@ partition_stack

Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Theo de Raadt
We will not be using your code.

End of story.


Luke Small  wrote:

> Marc, all you all have to do is say is that you all refuse to provide it.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Luke Small
Marc, all you all have to do is say is that you all refuse to provide it.

I was asked to at least provide evidence for correctness. I did so; and I’d
say I did a stellar job, short of running some kind of statistical test program.

The following has an attached source code for my test (along with
referencing another post) in case you missed it in this ginormous thread:

https://marc.info/?l=openbsd-tech&m=165306234108572&w=2


Aside from multithreaded processes, if you provide this, folks might use
this instead of creating their own incorrect custom random number
generators to improve performance for smaller values.

That custom deterministic “arc4random()” earlier in the thread from Damien
was not very good by the way. I tested it.


> You're about one step away from getting into my spam rules.
>
> Whatever you say, the way you say it, *repeatedly* when all the people
> I know (me included) tell you to get your facts straight so you don't
> sound like a wacko, makes reading you a total waste of time.
>
-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Marc Espie
On Sat, May 21, 2022 at 07:47:40AM -0500, Luke Small wrote:
> Perhaps it was rude sending off list stuff to the list. Your email sounded
> "less than friendly" and more of a professional challenge that you were
> definitely in the works to produce; much like Damien Miller’s challenge to
> prove correctness. So, whatever.

You're about one step away from getting into my spam rules.

Whatever you say, the way you say it, *repeatedly* when all the people
I know (me included) tell you to get your facts straight so you don't
sound like a wacko, makes reading you a total waste of time.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Luke Small
Perhaps it was rude sending off list stuff to the list. Your email sounded
"less than friendly" and more of a professional challenge that you were
definitely in the works to produce; much like Damien Miller’s challenge to
prove correctness. So, whatever.

Aside from that unpleasantness:
I worked in __builtin_clz(upperbound-1), mentioned earlier in the thread,
instead of my binary search. It made the Knuth sort simulation run even
faster. arc4random_uniform now takes 130% of the time mine takes for a
series of random numbers decreasing from 65535 to 2.

“__builtin_clz((upperbound - 1) | 1)” is only needed if upperbound can be 1
and that possibility is eliminated by first checking to see that upperbound
is >= 2.

static uint32_t
arc4random_uniform_small_unlocked(uint32_t upper_bound)
{
	static uint64_t rand_holder = 0;
	static size_t rand_bits = 0;

	static size_t upper_bound0 = 2;
	static size_t bitmask = 0x01;
	static size_t bits = 1;

	const size_t ub = upper_bound;
	size_t ret;

	if (ub != upper_bound0) {

		if (ub < 2) {

			/*
			 * reset static cache for if a process needs to fork()
			 * to make it manually fork-safe
			 */
			if (ub == 0) {
				rand_holder = 0;
				rand_bits = 0;
			}
			return 0;
		}

		bits = 32 - __builtin_clz(ub - 1);
		bitmask = ((size_t)1 << bits) - 1;

		upper_bound0 = ub;
	}

	do {
		if (rand_bits < bits) {
			rand_holder |= ((uint64_t)arc4random()) << rand_bits;

			/*
			 * rand_bits will be between 0 and 31 here
			 * so the 0x20 bit will be empty
			 * rand_bits += 32;
			 */
			rand_bits |= 32;
		}

		ret = rand_holder & bitmask;
		rand_holder >>= bits;
		rand_bits -= bits;

	} while (ret >= ub);

	return (uint32_t)ret;
}


> Luke,
>
> It's very bad etiquette to deliberately re-post a private, off-list comment
> to a public mailing list.
>

> Also, please fix your email client to respect the Mail-Followup-To: header,
> this is another lack of etiquette on your part.
>

I am either using gmail app on a phone or gmail.com, so I don't know if I
can help you there.


Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Crystal Kolipe
On Fri, May 20, 2022 at 09:48:28PM -0500, Luke Small wrote:
> Crystal: You can prove that for random, repetitive, correct, database
> record name generation using small upperbounds, the demonstrated 1/3-1/2
> runtime isn't worth it for an upperbound like 26 - 92 in a business context
> that fights for every last millisecond?
> 
> Bring it.
> 
> Prove the correctness of whatever algorithm you're using while you're at
> it.

Luke,

It's very bad etiquette to deliberately re-post a private, off-list comment
to a public mailing list.

Just remember that you might need my help one day, and if your mail has been
sent to /dev/null or otherwise ignored then I will never see it.

It's also very rude to make demands when I've already told you that I have
other work of a much higher priority.

IF AND ONLY IF there is sufficient interest from OTHER people, (who can
indicate it to me OFF LIST), then I will consider writing up a comprehensive
article debunking your ideas and concepts.

_Quality_ work takes time, and I've only just finished writing up our
response to the claim that our asm optimisation in the rasops bit shifting
code wasn't beneficial over the C compiler output, (which of course it was).

Also, please fix your email client to respect the Mail-Followup-To: header,
this is another lack of etiquette on your part.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-20 Thread Luke Small
Crystal: You can prove that for random, repetitive, correct, database
record name generation using small upperbounds, the demonstrated 1/3-1/2
runtime isn’t worth it for an upperbound like 26 - 92 in a business context
that fights for every last millisecond?

Bring it.

Prove the correctness of whatever algorithm you’re using while you’re at it.

…unless a lot of those processes are threaded of course. :/

On Fri, May 20, 2022 at 7:27 PM Crystal Kolipe 
wrote:

>
> I've actually written a program to demonstrate how pointless your chacha
> bit
> saving routine is :).  I just haven't posted it yet because I'm too busy
> with
> other, (more useful), things to bother finishing it off.
>
> Your thread on -tech has already wasted more bits than your random number
> routine would save in a very long time, ha ha ha.
>
-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-20 Thread Luke Small
I appreciate your response, Damien.

I did do the bare minimum of correctness testing and it was the post right
after Guenther was congratulated on proving incorrectness:

https://marc.info/?l=openbsd-tech&m=165259528425835&w=2

The post includes software to reproduce the results.



I wrote a highly dynamically allocated program that tests at intervals of
intervals, to show at various stages the degree to which the output
remains random.

This is an example of some output:

testing arc4random_uniform(5) and arc4random_uniform_small_unlocked(5):

                     256X     2048X    16384X   131072X  1048576X

   0 original         246      2053     16400    131115   1048306
     mine             263      2042     16312    131258   1046989

   1 original         234      2013     16337    130894   1047810
     mine             249      2022     16304    131161   1049511

   2 original         236      2061     16367    130802   1047430
     mine             257      2117     16597    130718   1046375

   3 original         284      2089     16444    131092   1050332
     mine             266      2058     16379    131190   1049877

   4 original         280      2024     16372    131457   1049002
     mine             245      2001     16328    131033   1050128


max-min values:

     original          50        76       107       655      2902
     mine              21       116       293       540      3753


The program is set up to test values 2 through 50. This will create 224KiB
of output.
I suspected that you'd prefer that I didn't attach it.

Some progress output has been relegated to stderr so that you can easily
pipe it to a file.

I added some getrlimit an setrlimit functions to maximize memory limits if
necessary
In the case that I created for you, this will not be needed.

It uses 1276K RAM in the current configuration.


> Just to bring this back to where we came in: even putting thread-safety
> aside (which you absolutely can't): it doesn't matter how much faster
> it is, your replacement function isn't useful until you do the work to
> demonstrate correctness.
>
> You have done literally zero so far, not even the bare minimum of
> testing the output. As a result your first version was demonstrated to
> be completely broken by literally the most basic of possible tests, a
> mere 10 lines of code.
>
> That you left this to others to do tells me that you fundamentally don't
> understand the environment in which you're trying to operate, and that
> you don't respect the time of the people you're trying to convince.
>
> Please stop wasting our time.
>
> -d

-- 
-Luke
#include 
#include 
#include 
#include 
#include 
#include 
#include 

extern char *malloc_options;


/*
cc arc4random_uniform_small.c -O2 -march=native -mtune=native -flto=full \
-static -p -fno-inline && ./a.out && gprof a.out gmon.out

cc arc4random_uniform_small.c -O2 && ./a.out
 */


/* 
 * uint32_t
 * arc4random_uniform(uint32_t upper_bound);
 * 
 * for which this function is named, provides a cryptographically secure:
 * uint32_t arc4random() % upper_bound;
 * 
 * this function minimizes the waste of randomly generated bits,
 * for small upper bounds.
 * 
 * 'bits' is the requested number of random bits and it needs to be
 * within the following parameters:
 * 
 *   1 << bits >= upper_bound > (1 << bits) >> 1
 * 
 * I make a possibly incorrect presumption that size_t is
 * at the very least a uint32_t, but probably a uint64_t for speed
 */
static uint32_t
arc4random_uniform_small_unlocked(uint32_t upper_bound)
{	
	static uint64_t rand_holder = 0;
	static uint64_t rand_bits = 0;
	
	static uint64_t upper_bound0 = 2;
	static uint64_t bits0 = 1;

	uint64_t bits = 16, i = 1, n = 32, ret, ub = upper_bound;

	if (ub != upper_bound0) {
		
		if (ub < 2) {
			
			/*
			 * reset static cache for if a process needs to fork()
			 * to make it manually fork-safe
			 */
			if (ub == 0) {
				rand_holder = 0;
				rand_bits = 0;
			}
			return 0;
		}
		
		/*
		 * binary search for ideal size random bitstream selection
		 */
		for (;;) {

			if ( ub > ((uint64_t)1 << bits) ) {

				if (n - i == 1) {
					bits = n;
					break;
				}

				i = bits;
				bits = (i + n) / 2;
				continue;
			}
			if ( ub <= ((uint64_t)1 << bits) >> 1 ) {

				if (n - i == 1) {
					bits = n;
					break;
				}

				n = bits;
				bits = (i + n) / 2;
				continue;
			}

			break;
		}
		upper_bound0 = ub;
		bits0 = bits;
	} else
		bits = bits0;
		
	const uint64_t bitmask = ((uint64_t)1 << bits) - 1;
		
	do {
		if (rand_bits < bits) {
			rand_ho

Fwd: Picky, but much more efficient arc4random_uniform!

2022-05-20 Thread Luke Small
I appreciate your response, Damien.

I did do the bare minimum of correctness testing and it was the post right
after Guenther was congratulated on proving incorrectness:

https://marc.info/?l=openbsd-tech&m=165259528425835&w=2

The post includes software to reproduce the results.



I wrote a highly dynamically allocated program that tests at intervals of
intervals, to show at various stages the degree to which the output
remains random.

This is an example of some output:

testing arc4random_uniform(5) and arc4random_uniform_small_unlocked(5):

                     256X     2048X    16384X   131072X  1048576X

   0 original         246      2053     16400    131115   1048306
     mine             263      2042     16312    131258   1046989

   1 original         234      2013     16337    130894   1047810
     mine             249      2022     16304    131161   1049511

   2 original         236      2061     16367    130802   1047430
     mine             257      2117     16597    130718   1046375

   3 original         284      2089     16444    131092   1050332
     mine             266      2058     16379    131190   1049877

   4 original         280      2024     16372    131457   1049002
     mine             245      2001     16328    131033   1050128


max-min values:

     original          50        76       107       655      2902
     mine              21       116       293       540      3753


The program is set up to test values 2 through 50. This will create 224KiB
of output.
I suspected that you'd prefer that I didn't attach it.

Some progress output has been relegated to stderr so that you can easily
pipe it to a file.

I added some getrlimit and setrlimit functions to maximize memory limits if
necessary.  In the case that I created for you, this will not be needed.

It uses 1276K RAM in the current configuration.


> Just to bring this back to where we came in: even putting thread-safety
> aside (which you absolutely can't): it doesn't matter how much faster
> it is, your replacement function isn't useful until you do the work to
> demonstrate correctness.
>
> You have done literally zero so far, not even the bare minimum of
> testing the output. As a result your first version was demonstrated to
> be completely broken by literally the most basic of possible tests, a
> mere 10 lines of code.
>
> That you left this to others to do tells me that you fundamentally don't
> understand the environment in which you're trying to operate, and that
> you don't respect the time of the people you're trying to convince.
>
> Please stop wasting our time.
>
> -d

-- 
-Luke


arc4random_uniform_small.c
Description: Binary data


Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Steffen Nurpmeso
Joerg Sonnenberger wrote in
 :
 |Am Wed, May 18, 2022 at 12:49:21PM +0200 schrieb Steffen Nurpmeso:
 |> What surprised me was that the Apple code requires more calls, and
 |> that today divisions and multiplications still matter.  I think it
 |> was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
 |> one cycle, * was ten cycles, and %,/ was forty cycles.  But
 |> i think the Core i5 8th Gen that i have requires one cycle for all
 |> of them.
 |
 |I dare you to find any CPU optimized for speed where division and modulo
 |are as fast as multiplication. What you have nowadays is a single cycle
 |for basic operations like additions and shifts, about three cycles for
 |multiplications (on x86 at least) and an order of magnitude more for
 |division. That's not really surprising either if you consider that

Thanks.  I really had not looked since about 2005.  (In fact
i never really looked, i think a c't magazine CD once came over
with x86 interrupt lists, CPU cycle usage, and a (sketchy) RFC
collection.)  Hm, it even seems the number of cycles increased for
division .. i do not know which numbers i had seen by then, did
they include cache work at all?  Well.  So this was that.

 |integer multiplication has a fast circuit implementation where as
 |division is often implemented as cascading conditional subtraction.
 |The difference matters enough turning the reminder operation in the ELF
 |hash into essentially (x - x * (1/size) * size) give a 2% speed up for a
 |large application like Firefox ten years ago.

Interesting.  Well you live in pkgsrc with all the mess that all
those huge software suites introduce, whereas i still "jerk" if
i produce software which needs more than EXEC (/ RODATA) and BSS,
and "shudder" for all the CTOR and DTOR and TLS (but ok: cool) and
all the other stuff that is found in ELF, and handled by the
linker.  (I have ELF v1.2 and TLS-ABI, and even glanced over them
a really long time ago though.  But you know, it is so far off!
And it makes no fun thinking about that :))

 |Joerg
 --End of 

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Joerg Sonnenberger
Am Wed, May 18, 2022 at 12:49:21PM +0200 schrieb Steffen Nurpmeso:
> What surprised me was that the Apple code requires more calls, and
> that today divisions and multiplications still matter.  I think it
> was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
> one cycle, * was ten cycles, and %,/ was forty cycles.  But
> i think the Core i5 8th Gen that i have requires one cycle for all
> of them.

I dare you to find any CPU optimized for speed where division and modulo
are as fast as multiplication. What you have nowadays is a single cycle
for basic operations like additions and shifts, about three cycles for
multiplications (on x86 at least) and an order of magnitude more for
division. That's not really surprising either if you consider that
integer multiplication has a fast circuit implementation whereas
division is often implemented as cascading conditional subtraction.
The difference matters enough that turning the remainder operation in the ELF
hash into essentially (x - x * (1/size) * size) gave a 2% speed up for a
large application like Firefox ten years ago.
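
As an illustration of that strength reduction (the identifiers below are
made up, and this is not the Firefox/ELF-hash code itself): when the
divisor is a compile-time constant, an optimizing compiler already
rewrites the modulo as a multiply by a precomputed reciprocal plus a
subtraction, which is roughly the (x - x * (1/size) * size) shape above;
with a divisor only known at run time it has to emit a genuine division.

	#include <stdint.h>

	/* run-time divisor: a real division instruction is emitted */
	uint32_t
	hash_slot_runtime(uint32_t h, uint32_t size)
	{
		return h % size;
	}

	/* compile-time divisor: strength-reduced to multiply + subtract
	 * (when optimizing); the constant is illustrative only */
	#define HASH_TABLE_SIZE	1021u
	uint32_t
	hash_slot_const(uint32_t h)
	{
		return h % HASH_TABLE_SIZE;
	}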

Joerg



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in
 <20220518104921.sr0ht%stef...@sdaoden.eu>:
 ...
 |The web site that has been linked from the man from the country
 |that has an even much worse Earth Country Overshoot Day than
 |Germany and is almost en par with Australia or even USA (April
 |3rd, pooh; never again a Saab!  Cuba: Nov 25th, Jamaica Dec 20th)

Finland is March 31st.  Man is that dirty.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Steffen Nurpmeso
Philip Guenther wrote in
 :
 |On Tue, May 17, 2022 at 1:10 PM Steffen Nurpmeso  \
 |wrote:
 |> Joerg Sonnenberger wrote in
 |>  :
 |>|Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
 |>|> I made a couple new versions of a new kind of arc4random_uniform-like
 |>  ...
 |>|If your main use case is limiting the amount of cryptography when using
 |>|small bounds, there is a much simpler approach to be taken here. For
 |>|boundaries below 256, use arc4random_buf to extract one byte if bound is
 |>|a power of two, otherwise two. This gives most of the performance
 ...
 |> You can use (really implemented) _buf() if you need a 8-bit or
 |> 16-bit etc number.
 |>
 |> I find that _uniform() often makes no difference to a simple
 |> modulo because like the comment in _uniform() says "p > 0.5 (worst
 |>case, usually far better", and usually RNGs sprinkle bits nicely,
 |
 |What does that statement mean?  You seem to be saying "modulo is uniform,
 |except when it isn't, which could be almost half the time for some cases,
 |but when it's uniform it's uniform, so why bother making it actually
 |correct and dependable".
 |
 |I mean, what does that _mean_???   It's as if I said "my text handling

Well it means this thread was too long.

 |program handles all characters uniformly, except those with accents, but
 |that's less than 10% of the characters I type, so it handles all characters
 |uniformly."  WTF, NO!

But it also means that

// calculates 2**32 % range
uint32_t t = (-range) % range;
for (;;) {
uint32_t r = rng();
if (r >= t)

where range is a very small number results in a very, very low
probability that r>=t is not true.  For 16-bit 0x two zero
bytes had to occur in the upper 16-bit.  And worse for 64-bit RNG.
So this is what i said.  It depends on the application.

This gets hypothetic and is anyway beyond my mathematical
capabilities.  I mean i look at /etc/ssh/moduli, so much on
cramping of random numbers.
The web site that has been linked from the man from the country
that has an even much worse Earth Country Overshoot Day than
Germany and is almost en par with Australia or even USA (April
3rd, pooh; never again a Saab!  Cuba: Nov 25th, Jamaica Dec 20th)
said the bias for the number 52 is 0.0121%.  And what i posted
had ~0.008 that rand<0x01FF aka 32-bit high byte is 0, for
32-bit from getrandom(2) as well as mine (in "strong" mode,
SipHash on ARC4).

You need quite some samples to be able to find a statistical
frequency of that order.  And it depends on the application of
that many samples exist.
And even TCP from RFC 793 / September 1981 has a 32-bit sequence
number.

But sure Philip, of course, yes, of course you are right: simply
call _uniform() and "milk the shit out of the random range" --
just use it and forget about the problem.

What surprised me was that the Apple code requires more calls, and
that today divisions and multiplications still matter.  I think it
was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
one cycle, * was ten cycles, and %,/ was forty cycles.  But
i think the Core i5 8th Gen that i have requires one cycle for all
of them.  (And somewhere i read that there are CPUs where the
shift operators are now more expensive than multiplication.)
I do not really track any of that since 2005.  That is nice: you
buy a laptop and suddenly have a NVME SSD that can 1.5GB/sec.
Wow.

 |> 0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
 |> most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
 |> NOT for Linux getrandom(2)), which is a pretty high cut off.
 ...
 |Where do these ideas come from, that "0 bytes 'do not occur'"??  If your
 |rand generator doesn't provide zero bytes at the expected frequency, you
 |know, 1 in 256, then you're using a garbage random number generator.
 |Please stop making such suggestions here because THEY ARE NOT TRUE ABOUT
 |OPENBSD.  Do ya'll not bother to test the claims that you make?
 ...
 |for (;;) {
 |u = arc4random();
 |if ((u & 0xff000000) == 0 ||
 |(u & 0x00ff0000) == 0 ||
 |(u & 0x0000ff00) == 0 ||
 |(u & 0x000000ff) == 0)
 |break;

That is any-byte-0.

 |00b82e5c	58
 |ab478800	36

Seems random to me.
:)

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Otto Moerbeek
On Wed, May 18, 2022 at 05:08:02PM +1000, Damien Miller wrote:

> On Wed, 18 May 2022, Otto Moerbeek wrote:
> 
> > instrumenting the code to count the number of arc4random calls I see this:
> > 
> > openbsd; elapsed = 2.835819; calls = 12340949
> > bitmask; elapsed = 4.335576; calls = 17836216
> > bitmask+reuse; elapsed = 3.710277; calls = 15245337
> > 
> > (this is a different number of trials on a slow machine).
> > 
> > So the promise of less calls is not fulfilled. Sounds like a bug.
> 
> Well, I don't know whether the simple bitmasking approach will really
> result in fewer calls. I *think* our current approach has a higher
> probability per loop of succeeding (at least for small upper_bounds) than
> mask-and-test, but my brain is too addled with a cold to calculate it :/

I *assumed* the speedups seen in the web page were because of less rnd
calls.  That was a mistake.

> What Theo said is 100% right - the cost is dominated by that of the
> underlying RNG. If anyone wants a faster arc4random_uniform() then the
> first place to look is at arc4random().

One more reason to keep the current implementation of arc4random_uniform().

> 
> It's entirely possible that the speedups measured in that article
> are because they are using a omgoptimised (non-crypto) RNG and the
> improvements they saw were due solely to reducing the overhead inside
> the uniformity code even if it came at the cost of more RNG queries.

Right.

> Anyway, here's a tweaked up version of the test harness that fakes out
> arc4random() with a deterministic RNG that counts how often it is called
> in case anyone wants to play with it further.

-Otto


> 
> 
> 
> 
> #include 
> #include 
> 
> #include 
> #include 
> 
> static uint32_t nqueries;
> 
> /* deterministic, query-counting arc4random() replacement for benchmarking */
> static uint32_t
> fake_random(void)
> {
>   static uint32_t ready, remain, msb;
>   uint32_t ret;
> 
>   if (!ready) {
>   srand_deterministic(31337);
>   ready = 1;
>   }
>   if (remain == 0) {
>   msb = (uint32_t)rand();
>   remain = 31; /* XXX makes assumption re RAND_MAX */
>   }
>   ret = (uint32_t)rand() | (msb << 31);
>   msb >>= 1;
>   remain--;
>   nqueries++;
>   return ret; 
> }
> 
> #define arc4random() fake_random()
> 
> #ifndef __has_builtin
> #define __has_builtin(x) 0
> #endif
> #if __has_builtin(__builtin_clz)
> #define arc4random_clz(x) __builtin_clz(x)
> #else
> #warning lacks __builtin_clz()
> /* Count most-significant zero bits, like __builtin_clz() */
> static int
> arc4random_clz(unsigned int x)
> {
>   int ret = 0;
>   unsigned int n;
>   const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
> 
>   while (x != 0) {
>   n = (x >> 28) & 0xf;
>   if (n != 0)
>   return ret + lut[n];
>   x <<= 4;
>   ret += 4;
>   }
>   return 32;
> }
> #endif
> 
> /* Current OpenBSD implementation */
> static uint32_t
> arc4random_uniform1(uint32_t upper_bound)
> {
>   uint32_t r, min;
> 
>   if (upper_bound < 2)
>   return 0;
> 
>   /* 2**32 % x == (2**32 - x) % x */
>   min = -upper_bound % upper_bound;
> 
>   /*
>* This could theoretically loop forever but each retry has
>* p > 0.5 (worst case, usually far better) of selecting a
>* number inside the range we need, so it should rarely need
>* to re-roll.
>*/
>   for (;;) {
>   r = arc4random();
>   if (r >= min)
>   break;
>   }
> 
>   return r % upper_bound;
> }
> 
> /*
>  * Like "Bitmask with Rejection" implementation from
>  * https://www.pcg-random.org/posts/bounded-rands.html
>  */
> static uint32_t
> arc4random_uniform2(uint32_t upper_bound)
> {
>   uint32_t mask, r;
> 
>   if (upper_bound < 2)
>   return 0;
> 
>   mask = 0xffffffff >> arc4random_clz((upper_bound - 1) | 1);;
>   for (;;) {
>   r = arc4random() & mask;
>   if (r < upper_bound)
>   return r;
>   }
>   /* NOTREACHED */
> }
> 
> /*
>  * Like Apple's
>  * 
> https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
>  */
> static uint32_t
> arc4random_uniform3(uint32_t upper_bound)
> {
>

Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Damien Miller
On Wed, 18 May 2022, Otto Moerbeek wrote:

> instrumenting the code to count the number of arc4random calls I see this:
> 
> openbsd; elapsed = 2.835819; calls = 12340949
> bitmask; elapsed = 4.335576; calls = 17836216
> bitmask+reuse; elapsed = 3.710277; calls = 15245337
> 
> (this is a different number of trials on a slow machine).
> 
> So the promise of less calls is not fulfilled. Sounds like a bug.

Well, I don't know whether the simple bitmasking approach will really
result in fewer calls. I *think* our current approach has a higher
probability per loop of succeeding (at least for small upper_bounds) than
mask-and-test, but my brain is too addled with a cold to calculate it :/
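
For a rough feel of those per-iteration odds, a small back-of-the-envelope
sketch is below; it only tabulates the acceptance conditions the two
approaches use, it is not either implementation:

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		const uint32_t bounds[] = { 3, 5, 26, 52, 1000, 65536 };
		size_t i;

		for (i = 0; i < sizeof(bounds) / sizeof(bounds[0]); i++) {
			uint32_t ub = bounds[i];
			/* 2**32 % ub, same trick as the current code */
			uint32_t min = -ub % ub;
			/* smallest all-ones mask covering ub - 1 */
			uint32_t mask = ub - 1;

			mask |= mask >> 1; mask |= mask >> 2;
			mask |= mask >> 4; mask |= mask >> 8;
			mask |= mask >> 16;

			printf("ub=%-6u reject: openbsd=%.9f mask-and-test=%.4f\n",
			    (unsigned)ub, (double)min / 4294967296.0,
			    (double)(mask - ub + 1) / ((double)mask + 1.0));
		}
		return 0;
	}

For small upper bounds the current rejection region (r < 2**32 % ub) is
vanishingly small, while mask-and-test can reject up to nearly half of the
draws, so fewer underlying RNG calls is not something the bitmask variants
can promise.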

What Theo said is 100% right - the cost is dominated by that of the
underlying RNG. If anyone wants a faster arc4random_uniform() then the
first place to look is at arc4random().

It's entirely possible that the speedups measured in that article
are because they are using a omgoptimised (non-crypto) RNG and the
improvements they saw were due solely to reducing the overhead inside
the uniformity code even if it came at the cost of more RNG queries.

Anyway, here's a tweaked up version of the test harness that fakes out
arc4random() with a deterministic RNG that counts how often it is called
in case anyone wants to play with it further.




#include 
#include 

#include 
#include 

static uint32_t nqueries;

/* deterministic, query-counting arc4random() replacement for benchmarking */
static uint32_t
fake_random(void)
{
static uint32_t ready, remain, msb;
uint32_t ret;

if (!ready) {
srand_deterministic(31337);
ready = 1;
}
if (remain == 0) {
msb = (uint32_t)rand();
remain = 31; /* XXX makes assumption re RAND_MAX */
}
ret = (uint32_t)rand() | (msb << 31);
msb >>= 1;
remain--;
nqueries++;
return ret; 
}

#define arc4random() fake_random()

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif
#if __has_builtin(__builtin_clz)
#define arc4random_clz(x) __builtin_clz(x)
#else
#warning lacks __builtin_clz()
/* Count most-significant zero bits, like __builtin_clz() */
static int
arc4random_clz(unsigned int x)
{
int ret = 0;
unsigned int n;
const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

while (x != 0) {
n = (x >> 28) & 0xf;
if (n != 0)
return ret + lut[n];
x <<= 4;
ret += 4;
}
return 32;
}
#endif

/* Current OpenBSD implementation */
static uint32_t
arc4random_uniform1(uint32_t upper_bound)
{
uint32_t r, min;

if (upper_bound < 2)
return 0;

/* 2**32 % x == (2**32 - x) % x */
min = -upper_bound % upper_bound;

/*
 * This could theoretically loop forever but each retry has
 * p > 0.5 (worst case, usually far better) of selecting a
 * number inside the range we need, so it should rarely need
 * to re-roll.
 */
for (;;) {
r = arc4random();
if (r >= min)
break;
}

return r % upper_bound;
}

/*
 * Like "Bitmask with Rejection" implementation from
 * https://www.pcg-random.org/posts/bounded-rands.html
 */
static uint32_t
arc4random_uniform2(uint32_t upper_bound)
{
uint32_t mask, r;

if (upper_bound < 2)
return 0;

mask = 0xffffffff >> arc4random_clz((upper_bound - 1) | 1);;
for (;;) {
r = arc4random() & mask;
if (r < upper_bound)
return r;
}
/* NOTREACHED */
}

/*
 * Like Apple's
 * 
https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
 */
static uint32_t
arc4random_uniform3(uint32_t upper_bound)
{
int zeroes, bits, remain;
uint32_t mask, r, rm;

if (upper_bound < 2)
return 0;

zeroes = arc4random_clz((upper_bound - 1) | 1);
bits = 32 - zeroes;
mask = (uint32_t)-1 >> zeroes;

for (;;) {
r = arc4random();
rm = r & mask;
if (rm < upper_bound)
return rm;
for (remain = zeroes; remain >= bits; remain -= bits) {
r >>= bits;
rm = r & mask;
if (rm < upper_bound)
return rm;
}
}
/* NOTREACHED */
}

#include 

static uint32_t
fixture(const char *s, uint32_t (* const rnd)(uint32_t))
{
const uint32_t trials = 20 * 1000 * 1000;
const uint32_t max = 0x8000;

Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Otto Moerbeek
On Wed, May 18, 2022 at 02:50:28PM +1000, Damien Miller wrote:

> On Tue, 17 May 2022, Raimo Niskanen wrote:
> 
> > Why reinvent the wheel?
> > 
> > Here is a pretty good walkthrough of established methods:
> > 
> > https://www.pcg-random.org/posts/bounded-rands.html
> > 
> > It sounds to me as if your suggested method essentially is
> > "Bitmask with Rejection -- Apple's Method" with the added twist
> > to keep the unused bits of the base generator's word and append
> > new, which might be an optimization for small ranges, but might
> > be just overhead for large ones.
> > 
> > In that post one can see that there might be other suggested smaller
> > optimizations that can be applied to the OpenBSD method, and that
> > the bitmask method is effective in many cases but not a clear winner.
> 
> Oh nice, I wasn't aware that Apple had an improved algorithm. I always
> thought the best avenue for a speedup was to consider only the bits that
> could satisfy the constraint, but it never occurred to me how to actually
> make use of this :)
> 
> The toy benchmark below compares the existing implementation with
> reimplementations of both their version as well as something more close
> to Apple's actual method (which reuses the high bits).
> 
> However, I don't see a speedup for either of the alternate approaches.
> 
> [djm@djm ~]$ cc -O2 -Werror -Wall -Wextra -o rb rb.c && doas nice -n -20 ./rb 
> openbsd; elapsed = 8.327954
> bitmask; elapsed = 13.306228
> bitmask+reuse; elapsed = 11.567389

instrumenting the code to count the number of arc4random calls I see this:

openbsd; elapsed = 2.835819; calls = 12340949
bitmask; elapsed = 4.335576; calls = 17836216
bitmask+reuse; elapsed = 3.710277; calls = 15245337

(this is a different number of trials on a slow machine).

So the promise of less calls is not fulfilled. Sounds like a bug.

-Otto



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Theo de Raadt
> However, I don't see a speedup for either of the alternate approaches.

Or maybe, just maybe, the underlying function (arc4random, which in the
dominant case it is just a small chacha step) is so inexpensive relative
to the proposed higher-level logic changes?  So this is all tilting at
windmills, and also something about measuring before optimizing^Wbreaking...



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Damien Miller
On Tue, 17 May 2022, Raimo Niskanen wrote:

> Why reinvent the wheel?
> 
> Here is a pretty good walkthrough of established methods:
> 
> https://www.pcg-random.org/posts/bounded-rands.html
> 
> It sounds to me as if your suggested method essentially is
> "Bitmask with Rejection -- Apple's Method" with the added twist
> to keep the unused bits of the base generator's word and append
> new, which might be an optimization for small ranges, but might
> be just overhead for large ones.
> 
> In that post one can see that there might be other suggested smaller
> optimizations that can be applied to the OpenBSD method, and that
> the bitmask method is effective in many cases but not a clear winner.

Oh nice, I wasn't aware that Apple had an improved algorithm. I always
thought the best avenue for a speedup was to consider only the bits that
could satisfy the constraint, but it never occurred to me how to actually
make use of this :)

The toy benchmark below compares the existing implementation with
reimplementations of both their version as well as something more close
to Apple's actual method (which reuses the high bits).

However, I don't see a speedup for either of the alternate approaches.

[djm@djm ~]$ cc -O2 -Werror -Wall -Wextra -o rb rb.c && doas nice -n -20 ./rb 
openbsd; elapsed = 8.327954
bitmask; elapsed = 13.306228
bitmask+reuse; elapsed = 11.567389

Maybe my benchmark is crap or maybe I need dlg@ to omgoptimize it for me...



#include 
#include 
#include 
#include 

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif
#if __has_builtin(__builtin_clz)
#define arc4random_clz(x) __builtin_clz(x)
#else
#warning lacks __builtin_clz()
/* Count most-significant zero bits, like __builtin_clz() */
static int
arc4random_clz(unsigned int x)
{
int ret = 0;
unsigned int n;
const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

while (x != 0) {
n = (x >> 28) & 0xf;
if (n != 0)
return ret + lut[n];
x <<= 4;
ret += 4;
}
return 32;
}
#endif

/* Current OpenBSD implementation */
static uint32_t
arc4random_uniform1(uint32_t upper_bound)
{
uint32_t r, min;

if (upper_bound < 2)
return 0;

/* 2**32 % x == (2**32 - x) % x */
min = -upper_bound % upper_bound;

/*
 * This could theoretically loop forever but each retry has
 * p > 0.5 (worst case, usually far better) of selecting a
 * number inside the range we need, so it should rarely need
 * to re-roll.
 */
for (;;) {
r = arc4random();
if (r >= min)
break;
}

return r % upper_bound;
}

/*
 * Like "Bitmask with Rejection" implementation from
 * https://www.pcg-random.org/posts/bounded-rands.html
 */
static uint32_t
arc4random_uniform2(uint32_t upper_bound)
{
uint32_t mask, r;

if (upper_bound < 2)
return 0;

mask = (uint32_t)-1 >> arc4random_clz((upper_bound - 1) | 1);;
for (;;) {
r = arc4random();
if ((r & mask) < upper_bound)
return r & mask;
}
/* NOTREACHED */
}

/*
 * Like Apple's
 * 
https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
 */
static uint32_t
arc4random_uniform3(uint32_t upper_bound)
{
int zeroes, bits, remain = 32;
uint32_t mask, r;

if (upper_bound < 2)
return 0;

zeroes = arc4random_clz((upper_bound - 1) | 1);
bits = 32 - zeroes;
mask = (uint32_t)-1 >> zeroes;

for (;;) {
r = arc4random();
if ((r & mask) < upper_bound)
return r & mask;
for (remain = zeroes; remain >= bits; remain -= bits) {
r >>= bits;
if ((r & mask) < upper_bound)
return r & mask;
}
}
/* NOTREACHED */
}


#include 

static uint32_t
fixture(const char *s, uint32_t (* const rnd)(uint32_t))
{
const uint32_t trials = 50 * 1000 * 1000;
const uint32_t max = 0x8000;
const uint32_t mul = 7;
unsigned int v, i, r;
struct timeval start, finish, delta;

gettimeofday(&start, NULL);
for (v = 3, r = 0; v < max; v *= mul) {
/* printf("%u\n", v); */
for (i = 0; i < trials; i++)
r |= rnd(v);
}
gettimeofday(&finish, NULL);
timersub(&finish, &start, &delta);
printf("%s; elapsed = %lld.%06lld\n", s,
(long long)delta.tv_sec, (long long)delta.tv_usec);
return r;
}

int main(void) {
fixture("openbsd", arc4random_uniform1);
fixture("bitmask", arc4random_uniform2);
fixture("bitmask+reuse",

Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Philip Guenther
On Tue, May 17, 2022 at 1:10 PM Steffen Nurpmeso  wrote:

> Joerg Sonnenberger wrote in
>  :
>  |Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
>  |> I made a couple new versions of a new kind of arc4random_uniform-like
>  ...
>  |If your main use case is limiting the amount of cryptography when using
>  |small bounds, there is a much simpler approach to be taken here. For
>  |boundaries below 256, use arc4random_buf to extract one byte if bound is
>  |a power of two, otherwise two. This gives most of the performance
>  |benefit without complicating the algorithm. Extracting two bytes ensures
>  |that the propability of success is > 99% and the double extracting
>  |doesn't eat up the benefits.
>
> You can use (really implemented) _buf() if you need a 8-bit or
> 16-bit etc number.
>
> I find that _uniform() often makes no difference to a simple
> modulo because like the comment in _uniform() says "p > 0.5 (worst
> case, usually far better", and usually RNGs sprinkle bits nicely,
>

What does that statement mean?  You seem to be saying "modulo is uniform,
except when it isn't, which could be almost half the time for some cases,
but when it's uniform it's uniform, so why bother making it actually
correct and dependable".

I mean, what does that _mean_???   It's as if I said "my text handling
program handles all characters uniformly, except those with accents, but
that's less than 10% of the characters I type, so it handles all characters
uniformly."  WTF, NO!



> 0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
> most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
> NOT for Linux getrandom(2)), which is a pretty high cut off.
> Using _uniform() just because of its name seems strange thus.
>

Where do these ideas come from, that "0 bytes 'do not occur'"??  If your
rand generator doesn't provide zero bytes at the expected frequency, you
know, 1 in 256, then you're using a garbage random number generator.
Please stop making such suggestions here because THEY ARE NOT TRUE ABOUT
OPENBSD.  Do ya'll not bother to test the claims that you make?

: bleys; cat f.c
#include 
#include 

int
main(void)
{
	uint32_t u;
	long count;

	count = 0;
	while ((u = arc4random()) > 0x1ff)
		count++;
	printf("%08x\t%ld\n", u, count);

	count = 0;
	for (;;) {
		u = arc4random();
		if ((u & 0xff000000) == 0 ||
		    (u & 0x00ff0000) == 0 ||
		    (u & 0x0000ff00) == 0 ||
		    (u & 0x000000ff) == 0)
			break;
		count++;
	}
	printf("%08x\t%ld\n", u, count);

	return 0;
}
: bleys; cc f.c
: bleys; ./a.out
00b82e5c	58
ab478800	36
: bleys;


Philip Guenther


Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in
 <20220517220924.xohqc%stef...@sdaoden.eu>:
 |Joerg Sonnenberger wrote in
 | :
 ||Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
 ||> I made a couple new versions of a new kind of arc4random_uniform-like
 ...
 |0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
 |most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
 |NOT for Linux getrandom(2)), which is a pretty high cut off.
 ...

  /*@ lotto.cxx - output six random numbers */

  #include 
  #include 

  #include 

  //#define NYDPROF_ENABLE
  //#define NYD_ENABLE
  //#define NYD2_ENABLE
  #include 
  NSPC_USE(su)

  static u32
  bounded_rand(u32 range, u32 rv){
 for(u32 t = -range % range;;){
if(rv >= t)
   return rv % range;
log::write(log::emerg, "NOFIT");
 }
  }

  int
  main(void){
 u32 uni[50], mod[50], rv;

 state::create_core(NIL, (state::debug | log::debug), state::err_nopass);

 su_mem_set(uni, 0, sizeof uni);
 su_mem_set(mod, 0, sizeof mod);

  u32 au=0;
 for(u32 i = 0; i < 10; ++i){
random::builtin_generate(&rv, sizeof rv, state::err_nopass);
//if(getrandom(&rv, sizeof rv, GRND_NONBLOCK) == -1)
//   log::write(log::emerg, "getrandom(2)");
  if(rv < 0x01FF)
  log::write(log::info, "AU %u - 0x%X", ++au,rv);
++mod[rv % NELEM(mod)];
++uni[bounded_rand(NELEM(mod), rv)];
 }

 u32 unilo, modlo, unihi, modhi;
 unilo = modlo = max::u32;
 unihi = modhi = 0;

 for(u32 i = 0; i < NELEM(uni); ++i){
unilo = get_min(unilo, uni[i]);
modlo = get_min(modlo, mod[i]);
unihi = get_max(unihi, uni[i]);
modhi = get_max(modhi, mod[i]);
log::write(log::info, "%u - %u / %u %s",
   i, uni[i], mod[i], (uni[i] != mod[i] ? "!!!" : "="));
 }
 log::write(log::info, "MIN %u / %u, MAX %u / %u - au %u",
   unilo, modlo, unihi, modhi,au);

 return 0;
  }

  #include 
  // s-it-mode

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Steffen Nurpmeso
Joerg Sonnenberger wrote in
 :
 |Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
 |> I made a couple new versions of a new kind of arc4random_uniform-like
 ...
 |If your main use case is limiting the amount of cryptography when using
 |small bounds, there is a much simpler approach to be taken here. For
 |boundaries below 256, use arc4random_buf to extract one byte if bound is
 |a power of two, otherwise two. This gives most of the performance
 |benefit without complicating the algorithm. Extracting two bytes ensures
 |that the probability of success is > 99% and the double extracting
 |doesn't eat up the benefits.

You can use (really implemented) _buf() if you need a 8-bit or
16-bit etc number.

I find that _uniform() often makes no difference to a simple
modulo because like the comment in _uniform() says "p > 0.5 (worst
case, usually far better", and usually RNGs sprinkle bits nicely,
0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
NOT for Linux getrandom(2)), which is a pretty high cut off.
Using _uniform() just because of its name seems strange thus.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Joerg Sonnenberger
Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
> I made a couple new versions of a new kind of arc4random_uniform-like
> function and some example functions which use them. Instead of having a
> sufficiently large random number greater than the modulus, I pick a random
> number using arc4random() from a bitfield where the length of the bitfield
> is just below or slightly beyond the value of the modulus and returns the
> bitfield it if it is less than the value of the modulus.

If your main use case is limiting the amount of cryptography when using
small bounds, there is a much simpler approach to be taken here. For
boundaries below 256, use arc4random_buf to extract one byte if bound is
a power of two, otherwise two. This gives most of the performance
benefit without complicating the algorithm. Extracting two bytes ensures
that the probability of success is > 99% and the double extracting
doesn't eat up the benefits.
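
A minimal sketch of that idea (the function name and the bounds handling
here are illustrative only, not a proposed interface): one byte when the
bound is a power of two, otherwise two bytes with rejection of the short
biased tail.

	#include <stdint.h>
	#include <stdlib.h>	/* arc4random_buf() */

	/* sketch only: assumes 2 <= upper_bound < 256 */
	static uint32_t
	small_bound_sketch(uint32_t upper_bound)
	{
		if ((upper_bound & (upper_bound - 1)) == 0) {
			uint8_t b;

			/* power of two: one byte and a mask are exact */
			arc4random_buf(&b, sizeof(b));
			return b & (upper_bound - 1);
		}

		for (;;) {
			uint16_t w;
			/* largest multiple of upper_bound that fits in 16 bits */
			uint32_t lim = 65536 - (65536 % upper_bound);

			arc4random_buf(&w, sizeof(w));
			if (w < lim)	/* rejected well under 1% of the time */
				return w % upper_bound;
		}
	}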

Joerg



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Damien Miller
On Mon, 16 May 2022, Luke Small wrote:

> Yeah, I see your point.
>
> I suppose it depends on how conservative you want to be and whether
> you want to supply options to people like getchar_unlocked when it
> isn’t essential.
>
> It could be made manually fork-safe if I could make a simple feature
> where “arc4random_uniform_unlocked(0);” with a 0 upperbound could
> trigger a reset of the static variables rand_bits and rand_holder
> which would be simple enough and could be added to a man page. I
> certainly read the man pages. (I’m surprised “man getchar” doesn’t
> “see also” to getchar_unlocked() by the way.)
>
> If you look at profiling with programs that call it a lot,
> arc4random() inside arc4random_uniform() calls the expensive rekey
> function which makes it take more time. That’s why I can get around
> 2X-3X performance on a typical repetitive small upperbound loop and
> an extra 20% improvement on a single 65536 Knuth shuffle, loop even
> though my function repeats a binary search for ‘bit’ size every single
> time and has misses which demands calling up more data.
>
> Otherwise my function would be particularly useful when there’s a loop
> with small upperbound like: 26+26+10 which, if I recall correctly, is
> in identd, which would call it frequently.

Just to bring this back to where we came in: even putting thread-safety
aside (which you absolutely can't): it doesn't matter how much faster
it is, your replacement function isn't useful until you do the work to
demonstrate correctness.

You have done literally zero so far, not even the bare minimum of
testing the output. As a result your first version was demonstrated to
be completely broken by literally the most basic of possible tests, a
mere 10 lines of code.

That you left this to others to do tells me that you fundamentally don't
understand the environment in which you're trying to operate, and that
you don't respect the time of the people you're trying to convince.

Please stop wasting our time.

-d


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Luke Small
Yeah, I see your point.

I suppose it depends on how conservative you want to be and whether you
want to supply options to people like getchar_unlocked when it isn’t
essential.

It could be made manually fork-safe if I could make a simple feature where
“arc4random_uniform_unlocked(0);” with a 0 upperbound could trigger a reset
of the static variables rand_bits and rand_holder which would be simple
enough and could be added to a man page. I certainly read the man pages.
(I’m surprised “man getchar” doesn’t “see also” to getchar_unlocked() by
the way.)

If you look at profiling with programs that call it a lot, arc4random()
inside arc4random_uniform() calls the expensive rekey function which makes
it take more time. That’s why I can get around 2X-3X performance on a
typical repetitive small upperbound loop and an extra 20% improvement on a
single 65536 Knuth shuffle, loop even though my function repeats a binary
search for ‘bit’ size every single time and has misses which demands
calling up more data.

Otherwise my function would be particularly useful when there’s a loop with
small upperbound like: 26+26+10 which, if I recall correctly, is in identd,
which would call it frequently.

I have a project that I generate MANY random record names much like that
and I use fork()/fork()-exec() on many processes well before calling
randomizing functions. Maybe I’m not the only one.

I don’t want to even try multithreading, especially as often as you guys
say that it is too unpredictable. I believe you. Using lua scripts in
multi-worker-process redis calls to avoid race conditions is awkward enough
for me.


But otherwise it’s certainly your prerogative to not have it. At least you
can’t legitimately say that it adds too much extra code or is too complex.
It’s pretty simple to understand if you’re familiar with bitwise stuff.

On Mon, May 16, 2022 at 5:35 PM Stuart Henderson 
wrote:

> On 2022/05/16 15:13, Luke Small wrote:
> > If you’re not running a threaded program, my function wouldn’t be “less
> > safe.”
> >
> > I’d imagine that 99% of programs aren’t multithreaded.
>
> code is reused in different places. non threaded programs are sometimes
> turned into threaded programs and when that happens, sometimes
> non-thread-safe calls are missed. so i'd argue that it is still less
> safe.
>
> in some cases there might be benefits that would mean it's worth it,
> especially if the failure modes would be obvious such that they can
> be detected. really not seeing that here. (how often are you even
> calling arc4random_uniform to consider it slow?!)
>
> if the consequence is not a crash but instead subtly broken randomness,
> how long do you think it's going to take to notice and report/fix it?
> even *very* broken randomness in widespread software distributions
> has been known to sit around for a long time before it's noticed:
>
> - predictable rng in a popular os. *serious* bug. introduced 2006,
> discovered/reported nearly 2 years later.
>
> - non-fork-safe rng in a popular ssl library, introduced sometime before
> sept 2018, reported may 2019.
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Stuart Henderson
On 2022/05/16 15:13, Luke Small wrote:
> If you’re not running a threaded program, my function wouldn’t be “less
> safe.”
> 
> I’d imagine that 99% of programs aren’t multithreaded.

code is reused in different places. non threaded programs are sometimes
turned into threaded programs and when that happens, sometimes
non-thread-safe calls are missed. so i'd argue that it is still less
safe.

in some cases there might be benefits that would mean it's worth it,
especially if the failure modes would be obvious such that they can
be detected. really not seeing that here. (how often are you even
calling arc4random_uniform to consider it slow?!)

if the consequence is not a crash but instead subtly broken randomness,
how long do you think it's going to take to notice and report/fix it?
even *very* broken randomness in widespread software distributions
has been known to sit around for a long time before it's noticed:

- predictable rng in a popular os. *serious* bug. introduced 2006,
discovered/reported nearly 2 years later.

- non-fork-safe rng in a popular ssl library, introduced sometime before
sept 2018, reported may 2019.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Luke Small
No...am I incorrect, especially on OpenBSD?

Of course since you made such a remark, you seem like the kind of fellow
that would put the nail in the coffin for spite.

...now I sound like an asshole.

On Mon, May 16, 2022 at 4:00 PM Theo de Raadt  wrote:

> hey luke you know you sound like an asshole right?
>
>
> Luke Small  wrote:
>
> > If you’re not running a threaded program, my function wouldn’t be “less
> > safe.”
> >
> > I’d imagine that 99% of programs aren’t multithreaded.
> >
> > On Mon, May 16, 2022 at 1:01 PM  wrote:
> >
> > > > There is the specifically non-threadsafe call getchar_unlocked() on
> > > OpenBSD
> > > > which is presumably available for performance reasons alone, when
> > > getchar()
> > > > is a perfectly viable option and is even an ISO conforming function.
> > > What I
> > > > submitted could be such a higher performance non-threadsafe function.
> > > >
> > > > so, how about arc4random_uniform_unlocked() ?!
> > >
> > > getchar_unlocked is mandated by POSIX.
> > >
> > > OpenBSD has not yet invented an alternate function that only exists to
> > > give away safety for performance. It has only gone in the opposite
> > > direction if anything.
> > >
> > > --
> > -Luke
>


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Theo de Raadt
hey luke you know you sound like an asshole right?


Luke Small  wrote:

> If you’re not running a threaded program, my function wouldn’t be “less
> safe.”
> 
> I’d imagine that 99% of programs aren’t multithreaded.
> 
> On Mon, May 16, 2022 at 1:01 PM  wrote:
> 
> > > There is the specifically non-threadsafe call getchar_unlocked() on
> > OpenBSD
> > > which is presumably available for performance reasons alone, when
> > getchar()
> > > is a perfectly viable option and is even an ISO conforming function.
> > What I
> > > submitted could be such a higher performance non-threadsafe function.
> > >
> > > so, how about arc4random_uniform_unlocked() ?!
> >
> > getchar_unlocked is mandated by POSIX.
> >
> > OpenBSD has not yet invented an alternate function that only exists to
> > give away safety for performance. It has only gone in the opposite
> > direction if anything.
> >
> > --
> -Luke



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Luke Small
If you’re not running a threaded program, my function wouldn’t be “less
safe.”

I’d imagine that 99% of programs aren’t multithreaded.

On Mon, May 16, 2022 at 1:01 PM  wrote:

> > There is the specifically non-threadsafe call getchar_unlocked() on
> OpenBSD
> > which is presumably available for performance reasons alone, when
> getchar()
> > is a perfectly viable option and is even an ISO conforming function.
> What I
> > submitted could be such a higher performance non-threadsafe function.
> >
> > so, how about arc4random_uniform_unlocked() ?!
>
> getchar_unlocked is mandated by POSIX.
>
> OpenBSD has not yet invented an alternate function that only exists to
> give away safety for performance. It has only gone in the opposite
> direction if anything.
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Steffen Nurpmeso
Philip Guenther wrote:
 |On Sun, 15 May 2022, Steffen Nurpmeso wrote:
 |> Stuart Henderson wrote in
 |...
 |>|what's the perceived problem you're wanting to solve? and does that
 |>|problem actually exist in the first place?
 |> 
 |> The problem is that if you have a low upper bound then modulo will "remove a 
 |> lot of randomization".  For example if you have a program which 
 |> generates Lotto numbers (1..49), then using _uniform() as it is will 
 |> generate many duplicates.
 |
 |Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform 
 |distribution.  Says right so in the manpage.  If an implementation of that 
 |API fails to do that, it's a broken implementation.

I always admired its source code comments and have been left
stunning more than just once.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Sven M . Hallberg
Luke Small on Sun, May 15 2022:
> The current implementation is nothing more than a naive arc4random() %
> upper_bound which trashes initial arc4random() calls it doesn’t like,

Yes, that's exactly what it is.

> The whole transformation by modulus of perfectly decent random data
> seems so awkward. [...] it seems like an ugly HACK

Let me point out that this way of doing it is absolutely standard. It is
not awkward or a hack.

> to simply meet demand

This is probably a good way of describing the design goal for this
highly sensitive piece of kernel interna.

-p

PS: I am not an OpenBSD developer, a tiny contributor at most. I speak
as a bystander who happens to have some domain knowledge.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
Yeah. It most likely won't go in. From past experience and advice, not
necessarily just from a perceived lack of merit.

However, many, if not all of the arguments are based upon non-facts and
misconceptions from earlier submissions or just not understanding what the
software is doing.

The only real thing that makes it not quite as good as the standard is mine
isn't threadsafe. If it could be accepted as a higher performance,
non-threadsafe call, it would perform better in many typical cases and
perhaps even give a safer return value, especially in large upper_bound
edge cases, I suspect.

There is the specifically non-threadsafe call getchar_unlocked() on OpenBSD
which is presumably available for performance reasons alone, when getchar()
is a perfectly viable option and is even an ISO conforming function. What I
submitted could be such a higher performance non-threadsafe function.

so, how about arc4random_uniform_unlocked() ?!

...other than making upper_bound a uint32_t instead of the submitted
uint64_t. That'd be somewhat of a problem.

-Luke


> You've had several developers tell you this is not going to go in. I'd
> suggest
> "read the room".
>
> If you want this for your own use, just keep it as a local diff. Nobody
> will
> know (or likely care).
>
> -ml
>


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
I’m not trying to be rude, but you don’t realize what’s going on here:

uuu is a bitmask:

‘uuu’ (or (1 << bits) - 1) in “ret = rand_holder & uuu;” only puts the
lower ‘bits’ bits of rand_holder into ret, then rand_holder is right-shifted
afterward to trash them every time in the loop when it’s done.

So if bits is 8, uuu is going to be 0xff

No, you aren't:
>
> > for (;;) {
> > if (rand_bits < bits) {
> > rand_holder |= ((uint64_t)arc4random()) <<
> > rand_bits;
> >
> > /*
> >  * rand_bits will be a number between 0 and 31
> here
> >  * so the 0x20 bit will be empty
> >  * rand_bits += 32;
> >  */
> > rand_bits |= 32;
> > }
> >
> > ret = rand_holder & uuu;
> > rand_holder >>= bits;
> > rand_bits -= bits;
> >
> > if (ret < upper_bound)
> > return ret;
> > }
>
> This isn't rejection sampling. This is reusing part of the rejected
> sample.

-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Mike Larkin
On Sun, May 15, 2022 at 08:40:19PM -0500, Luke Small wrote:
> https://marc.info/?l=openbsd-tech&m=165259528425835&w=2
>
> This one (which is strongly based upon my first of two versions) which I
> submitted after Guenther correctly trashed version 2, doesn’t reuse any
> part of the sample. It picks up a clean new bitfield upon failure.
>
> I think Guenther didn’t, perhaps like yourself, realize I submitted this
> later program. That’s why he said it wasn’t correct. It didn’t occur to me
> at the time of responding to him: “correct correct correct.”
>

You've had several developers tell you this is not going to go in. I'd suggest
"read the room".

If you want this for your own use, just keep it as a local diff. Nobody will
know (or likely care).

-ml

> On Sun, May 15, 2022 at 7:47 PM Damien Miller  wrote:
>
> > On Sat, 14 May 2022, Luke Small wrote:
> >
> > > Look at my code. I don’t even use a modulus operator. I perform hit and
> > > miss with a random bitstream.
> > >
> > > How can I have a bias of something I don’t do? I return a bitstream which
> > > meets the parameters of being a value less than the upper bound. Much
> > like
> > > arc4random_buf().
> > >
> > > If I use arc4random_uniform() repeatedly to create a random distribution
> > of
> > > say numbers less than 0x1000 or even something weird like 0x1300 will the
> > > random distribution be better with arc4random_uniform() or with mine? For
> > > 0x1000 mine will simply pluck 12 bits of random data straight from the
> > > arc4random() (and preserve the remaining 20 bits for later) on the first
> > > try, just like it’s arc4random_buf().
> > >
> > > arc4random_uniform() will perform a modulus of a 32 bit number which adds
> > > data to the bitstream. Does it make it better? Perhaps it makes it harder
> > > to guess the source bits.
> > >
> > > I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
> > > looking at modulo bias.
> > >
> > > I didn’t know what it was, before, but I basically “rejection sample”:
> > >
> > >
> > https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/
> >
> > No, you aren't:
> >
> > > for (;;) {
> > > if (rand_bits < bits) {
> > > rand_holder |= ((uint64_t)arc4random()) <<
> > > rand_bits;
> > >
> > > /*
> > >  * rand_bits will be a number between 0 and 31
> > here
> > >  * so the 0x20 bit will be empty
> > >  * rand_bits += 32;
> > >  */
> > > rand_bits |= 32;
> > > }
> > >
> > > ret = rand_holder & uuu;
> > > rand_holder >>= bits;
> > > rand_bits -= bits;
> > >
> > > if (ret < upper_bound)
> > > return ret;
> > > }
> >
> > This isn't rejection sampling. This is reusing part of the rejected
> > sample.
> >
> > Think of it like this: you want to uniformly generate a number in the
> > range [2:10] by rolling 2x 6-sided dice. What do you do when you roll
> > 11 or 12? You can't just reroll one of the dice because the other dice
> > is constrained to have rolled either 5 or 6, and so proceeding with
> > it would force the output to be in the range [6:11] for these ~5.6%
> > of initial rolls. Your output is no longer uniform.
> >
> > BTW the existing code already implements the preferred approach of the
> > article you quoted.
> >
> > -d
>
> --
> -Luke



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
https://marc.info/?l=openbsd-tech&m=165259528425835&w=2

This one (which is strongly based upon my first of two versions) which I
submitted after Guenther correctly trashed version 2, doesn’t reuse any
part of the sample. It picks up a clean new bitfield upon failure.

I think Guenther didn’t, perhaps like yourself, realize I submitted this
later program. That’s why he said it wasn’t correct. It didn’t occur to me
at the time of responding to him: “correct correct correct.”

On Sun, May 15, 2022 at 7:47 PM Damien Miller  wrote:

> On Sat, 14 May 2022, Luke Small wrote:
>
> > Look at my code. I don’t even use a modulus operator. I perform hit and
> > miss with a random bitstream.
> >
> > How can I have a bias of something I don’t do? I return a bitstream which
> > meets the parameters of being a value less than the upper bound. Much
> like
> > arc4random_buf().
> >
> > If I use arc4random_uniform() repeatedly to create a random distribution
> of
> > say numbers less than 0x1000 or even something weird like 0x1300 will the
> > random distribution be better with arc4random_uniform() or with mine? For
> > 0x1000 mine will simply pluck 12 bits of random data straight from the
> > arc4random() (and preserve the remaining 20 bits for later) on the first
> > try, just like it’s arc4random_buf().
> >
> > arc4random_uniform() will perform a modulus of a 32 bit number which adds
> > data to the bitstream. Does it make it better? Perhaps it makes it harder
> > to guess the source bits.
> >
> > I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
> > looking at modulo bias.
> >
> > I didn’t know what it was, before, but I basically “rejection sample”:
> >
> >
> https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/
>
> No, you aren't:
>
> > for (;;) {
> > if (rand_bits < bits) {
> > rand_holder |= ((uint64_t)arc4random()) <<
> > rand_bits;
> >
> > /*
> >  * rand_bits will be a number between 0 and 31
> here
> >  * so the 0x20 bit will be empty
> >  * rand_bits += 32;
> >  */
> > rand_bits |= 32;
> > }
> >
> > ret = rand_holder & uuu;
> > rand_holder >>= bits;
> > rand_bits -= bits;
> >
> > if (ret < upper_bound)
> > return ret;
> > }
>
> This isn't rejection sampling. This is reusing part of the rejected
> sample.
>
> Think of it like this: you want to uniformly generate a number in the
> range [2:10] by rolling 2x 6-sided dice. What do you do when you roll
> 11 or 12? You can't just reroll one of the dice because the other dice
> is constrained to have rolled either 5 or 6, and so proceeding with
> it would force the output to be in the range [6:11] for these ~5.6%
> of initial rolls. Your output is no longer uniform.
>
> BTW the existing code already implements the preferred approach of the
> article you quoted.
>
> -d

-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Damien Miller
On Sat, 14 May 2022, Luke Small wrote:

> Look at my code. I don’t even use a modulus operator. I perform hit and
> miss with a random bitstream.
> 
> How can I have a bias of something I don’t do? I return a bitstream which
> meets the parameters of being a value less than the upper bound. Much like
> arc4random_buf().
> 
> If I use arc4random_uniform() repeatedly to create a random distribution of
> say numbers less than 0x1000 or even something weird like 0x1300 will the
> random distribution be better with arc4random_uniform() or with mine? For
> 0x1000 mine will simply pluck 12 bits of random data straight from the
> arc4random() (and preserve the remaining 20 bits for later) on the first
> try, just like it’s arc4random_buf().
> 
> arc4random_uniform() will perform a modulus of a 32 bit number which adds
> data to the bitstream. Does it make it better? Perhaps it makes it harder
> to guess the source bits.
> 
> I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
> looking at modulo bias.
> 
> I didn’t know what it was, before, but I basically “rejection sample”:
> 
> https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/

No, you aren't:

> for (;;) {
> if (rand_bits < bits) {
> rand_holder |= ((uint64_t)arc4random()) <<
> rand_bits;
> 
> /* 
>  * rand_bits will be a number between 0 and 31 here
>  * so the 0x20 bit will be empty
>  * rand_bits += 32;
>  */ 
> rand_bits |= 32;
> }
> 
> ret = rand_holder & uuu;
> rand_holder >>= bits;
> rand_bits -= bits;
> 
> if (ret < upper_bound)
> return ret;
> }

This isn't rejection sampling. This is reusing part of the rejected
sample.

Think of it like this: you want to uniformly generate a number in the
range [2:10] by rolling 2x 6-sided dice. What do you do when you roll
11 or 12? You can't just reroll one of the dice because the other dice
is constrained to have rolled either 5 or 6, and so proceeding with
it would force the output to be in the range [6:11] for these ~5.6%
of initial rolls. Your output is no longer uniform.
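
One quick way to see that effect (a throwaway sketch, nothing to do with
any proposed diff; it only assumes arc4random_uniform() behaves as
documented):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	unsigned full[11] = { 0 }, partial[11] = { 0 };
	int i, a, b;

	for (i = 0; i < 1000000; i++) {
		do {				/* reject the whole roll */
			a = arc4random_uniform(6) + 1;
			b = arc4random_uniform(6) + 1;
		} while (a + b > 10);
		full[a + b]++;

		a = arc4random_uniform(6) + 1;	/* keep the first die ... */
		do
			b = arc4random_uniform(6) + 1;
		while (a + b > 10);		/* ... and reroll only the second */
		partial[a + b]++;
	}
	for (i = 2; i <= 10; i++)
		printf("%2d %8u %8u\n", i, full[i], partial[i]);
	return 0;
}

The two columns disagree: the die you keep is not independent of the
rejection, so the distribution of the sums changes.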

BTW the existing code already implements the preferred approach of the
article you quoted.

-d


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Damien Miller
On Sun, 15 May 2022, Luke Small wrote:

> Do I really have to use specific terminology to make a point?
>
> I'm not educated enough on chacha20 enough to know whether, like I
> pointed out, whether choosing 5 bits from the middle of (or even from
> the tail end of one and the beginning of another) 32 bit pseudorandom
> cipher is "correct."

You don't need to understand chacha20 to understand arc4random_uniform().
I certainly didn't when I wrote it.

The underlying CSPRNG is irrelevant to how arc4random_uniform() works.
It is treated as an oracle that provides 32 random bits upon request. You
could swap it out for 32 coin-tossing monkeys and the implementation
wouldn't need to change.

It requests another 32 bit random value for each attempt at satisfying
the bounds check because they need to be independent - reusing parts of
a previous attempt is highly likely to introduce biases.

It's almost certainly possible to make this function faster, but it's
also very easy to get it wrong (e.g. I made one stupid math error in its
early implementation, forever immortalised by CVS). The existing code
has the advantage of being very obvious in how it works and therefore
has a very low risk of being wrong.
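
The overall shape being described is roughly the following (a sketch, not
the verbatim libc source; the function name is only for illustration):

#include <stdint.h>
#include <stdlib.h>

uint32_t
uniform_sketch(uint32_t upper_bound)
{
	uint32_t r, min;

	if (upper_bound < 2)
		return 0;
	min = -upper_bound % upper_bound;	/* == 2**32 % upper_bound */
	do {
		r = arc4random();	/* a fresh, independent 32-bit draw per attempt */
	} while (r < min);
	return r % upper_bound;
}

The one subtle line is the min computation: in 32-bit unsigned arithmetic
-upper_bound % upper_bound equals 2**32 % upper_bound, and rejecting
r < min leaves an accepted range whose size is an exact multiple of
upper_bound, so the final modulo introduces no bias.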

If someone is proposing to move to something less obvious then it's
incumbent upon them to do the work to prove that their alternative is
just as correct.

> ...correct correct correct. Did I use that buzzword enough?

Highly experienced people are taking the time to give you detailed,
critical feedback. This can be hard to receive, but if you ever want to
improve then you should consider it and try to engage constructively.

-d



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
Do I really have to use specific terminology to make a point?

I'm not educated enough on chacha20 enough to know whether, like I pointed
out, whether choosing 5 bits from the middle of (or even from the tail end
of one and the beginning of another) 32 bit pseudorandom cipher is
"correct."

...correct correct correct. Did I use that buzzword enough?

-Luke


On Sun, May 15, 2022 at 5:26 PM Philip Guenther  wrote:

> On Sun, 15 May 2022, Luke Small wrote:
> > The current implementation is nothing more than a naive arc4random() %
> > upper_bound which trashes initial arc4random() calls it doesn’t like,
> then
> > transforms over a desired modulus. The whole transformation by modulus of
> > perfectly decent random data seems so awkward. It’s not like it is used
> as
> > some majestic artistry of RSA it seems like an ugly HACK to simply meet a
> > demand lacking of something better.
>
> You fail to mention correctness at all or address the fact that your
> version isn't while the current one is.  Meanwhile, you talk about getting
> only just enough random data as if there's some sort of limited supply
> when there isn't.
>
> "My version may be wrong, but at least it doesn't look naive!"
>
> That is utterly the wrong attitude for OpenBSD code.
>
>
> Best wishes.
>
> Philip Guenther
>


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Philip Guenther
On Sun, 15 May 2022, Luke Small wrote:
> The current implementation is nothing more than a naive arc4random() %
> upper_bound which trashes initial arc4random() calls it doesn’t like, then
> transforms over a desired modulus. The whole transformation by modulus of
> perfectly decent random data seems so awkward. It’s not like it is used as
> some majestic artistry of RSA it seems like an ugly HACK to simply meet a
> demand lacking of something better.

You fail to mention correctness at all or address the fact that your 
version isn't while the current one is.  Meanwhile, you talk about getting 
only just enough random data as if there's some sort of limited supply 
when there isn't.

"My version may be wrong, but at least it doesn't look naive!"

That is utterly the wrong attitude for OpenBSD code.


Best wishes.

Philip Guenther



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
The current implementation is nothing more than a naive arc4random() %
upper_bound which trashes initial arc4random() calls it doesn’t like, then
transforms over a desired modulus. The whole transformation by modulus of
perfectly decent random data seems so awkward. It’s not like it is used as
some majestic artistry of RSA; it seems like an ugly HACK to simply meet a
demand, lacking something better.

If you understand what I’ve done, it streams in a bitfield into an integer
type like it’s a buffer for just enough or slightly more data to meet the
demands of the upperbound and if it exceeds upperbound-1, it is trashed and
reads in a completely new bitfield to check. It relies on arc4random()
supplying good random data regardless of how many bits are in the bitfield.
If it does so, it should and seems to supply a good distribution across the
length of the bitfield which may often be something like 5 for a common
26+26+10 upper_bound in /usr/src. It seems to me that it should be a pretty
good, if not superior, method; at least in the realm of cleaner results.
Perhaps it’s confusing what I’ve done with all the bitwise operators, but
it isn’t some random hacky thing I’ve cobbled together.
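
Sketched in its simplest form, with every attempt using a fresh
arc4random() call and nothing carried over from a rejected draw, the
masking idea looks something like this (illustrative only, not the
posted diff):

#include <stdint.h>
#include <stdlib.h>

uint32_t
mask_uniform_sketch(uint32_t upper_bound)
{
	uint32_t mask, r;

	if (upper_bound < 2)
		return 0;
	/* smallest mask of the form 2**k - 1 that covers upper_bound - 1 */
	mask = upper_bound - 1;
	mask |= mask >> 1;
	mask |= mask >> 2;
	mask |= mask >> 4;
	mask |= mask >> 8;
	mask |= mask >> 16;
	do {
		r = arc4random() & mask;	/* at worst ~50% of draws miss */
	} while (r >= upper_bound);
	return r;
}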

Or does arc4random() only provide decent random data 32 bits at a time; or
an even 8 bits at a time as arc4random_buf() would suggest?

All I would have to prove is that chacha20 provides good or superior random
bitfields regardless of how many bits are needed and regardless of whether
they begin at the beginning of a byte.

I don’t have the education for that, but “I got a ‘puter for Christmas.”
lol. I can perhaps run simulations if I have nothing better to do.


> I think I can say we know here uniformity is only *one* of the
> desirable properties of a secure random generator.
>
> But I don't think anybody else except Luke was talking about
> "improving".  The sole purpose of arc4random_uniform() is to give a
> good implementation of a random number function in a specific range
> using arc4random() as the source. This is needed because the naive
> implementation arc4random() % upper_bound is not uniform.
>
> -Otto
>
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Otto Moerbeek
On Sun, May 15, 2022 at 08:57:09AM -0300, Crystal Kolipe wrote:

> On Sun, May 15, 2022 at 11:44:29AM +0200, Otto Moerbeek wrote:
> > On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote:
> > > How did someone prove the current implementation was cryptographically
> > > sound? Did they run massive simulations which ran the gamut of the 
> > > uint32_t
> > > range which demanded tight tolerances over varying length runs?
> > > 
> > > How was rc4 cipher proven bad for pseudorandom numbers? I'd be willing 
> > > to
> > > bet it wasn't from a mathematical proof; it was from bad data.
> 
> > You miss the point completely. The point is: how do you derive a
> > uniformly distributed random function for a smaller range, given a
> > uniformly distributed random function over the range [0-2^32-1].
> > 
> > The current implementation does exactly that and has all the
> > information in the comments to verify the uniformity claim. You only
> > need to use basic mathematical properties of modulo arithmetic to
> > do the verification.
> 
> You do all realise that uniform distribution alone does not make a
> good random number generator, don't you?
> 
> For example:
> 
> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5...
> 
> That's uniformly distributed, and also useless as a random number
> stream.
> 
> Furthermore, _cryptographically secure_ random number generation is
> not the same as _mathematically good_ random number generation.
> 
> There are plenty of random number generation formulas which are
> considered good and useful from a mathematical basis, but which are
> useless for cryptography.
> 
> So random, (pun intended), hacks at the arc4random code are not
> likely to 'improve' it from the general standpoint, (although if you
> have a specific need for a specific private application, that's
> different).  I think Stuart has already more or less made that point.
> 

I think I can say we know here uniformity is only *one* of the
desirable properties of a secure random generator.

But I don't think anybody else except Luke was talking about
"improving".  The sole purpose of arc4random_uniform() is to give a
good implementation of a random number function in a specific range
using arc4random() as the source. This is needed because the naive
implementation arc4random() % upper_bound is not uniform.
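
To put rough numbers on that (a quick sketch, unrelated to the actual
libc code): with a plain arc4random() % upper_bound, every residue below
2**32 % upper_bound has one more 32-bit preimage than the rest.  For a
bound of 49 that skew is about one part in 10**8; for a bound like the
3,000,000,000 mentioned elsewhere in the thread, the low residues are
produced twice as often as the others.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t bounds[] = { 49, 3000000000U };	/* neither divides 2**32 */
	uint64_t space = (uint64_t)1 << 32;
	size_t i;

	for (i = 0; i < sizeof(bounds) / sizeof(bounds[0]); i++) {
		uint64_t lo = space / bounds[i] + 1;	/* residues below 2**32 % bound */
		uint64_t hi = space / bounds[i];	/* all the other residues */
		printf("bound %u: %llu vs %llu preimages per residue\n",
		    bounds[i], (unsigned long long)lo, (unsigned long long)hi);
	}
	return 0;
}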

-Otto




Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Crystal Kolipe
On Sun, May 15, 2022 at 11:44:29AM +0200, Otto Moerbeek wrote:
> On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote:
> > How did someone prove the current implementation was cryptographically
> > sound? Did they run massive simulations which ran the gamut of the uint32_t
> > range which demanded tight tolerances over varying length runs?
> > 
> > How was rc4 cipher proven bad for pseudorandom numbers? I'd be willing to
> > bet it wasn't from a mathematical proof; it was from bad data.

> You miss the point completely. The point is: how do you derive a
> uniformly distributed random function for a smaller range, given a
> uniformly distributed random function over the range [0-2^32-1].
> 
> The current implementation does exactly that and has all the
> information in the comments to verify the uniformity claim. You only
> need to use basic mathematical properties of modulo arithmetic to
> do the verification.

You do all realise that uniform distribution alone does not make a
good random number generator, don't you?

For example:

1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5...

That's uniformly distributed, and also useless as a random number
stream.

Furthermore, _cryptographically secure_ random number generation is
not the same as _mathematically good_ random number generation.

There are plenty of random number generation formulas which are
considered good and useful from a mathematical basis, but which are
useless for cryptography.

So random, (pun intended), hacks at the arc4random code are not
likely to 'improve' it from the general standpoint, (although if you
have a specific need for a specific private application, that's
different).  I think Stuart has already more or less made that point.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Marc Espie
Random generators have been analyzed for years. 

Pick up "Concrete mathematics" by Don E. Knuth, read it, then come back
with actual science.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Otto Moerbeek
On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote:

> I have a feeling that making it threadsafe under any quantity of threads
> and still perform is likely not feasible, but if I could; or if making a
> nonthreadsafe variant was possible:
> 
> Creating a mathematical proof that somehow is “simple and probably correct”
> enough, would be an impossible task even for a PhD mathematician.
> 
> How did someone prove the current implementation was cryptographically
> sound? Did they run massive simulations which ran the gamut of the uint32_t
> range which demanded tight tolerances over varying length runs?
> 
> How was rc4 cipher proven bad for pseudorandom numbers? I’d be willing to
> bet it wasn’t from a mathematical proof; it was from bad data.
> 
> I’m guessing that large upperbounds approaching 2**32 don’t feed very
> soundly in the current implementation using a modulus; although I suspect
> that there isn’t much of a call for such things, I’m pretty sure I saw a
> 3,000,000,000 upperbound in the src partition tonight.


You miss the point completely. The point is: how do you derive a
uniformly distributed random function for a smaller range, given a
uniformly distributed random function over the range [0-2^32-1].

The current implementation does exactly that and has all the
information in the comments to verify the uniformity claim. You only
need to use basic mathematical properties of modulo arithmetic to
do the verification.

-Otto

> 
> On Sun, May 15, 2022 at 3:15 AM Otto Moerbeek  wrote:
> 
> > On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote:
> >
> > > This is version 1, which I was pretty sure was secure.
> > >
> > > I revamped it with a few features and implanted the binary search for
> > 'bit'
> > >
> > > in most cases, which aren't intentionally worst-case, it's pretty darned
> > > fast!
> > >
> > > This is a sample run of my program with your piece of code included:
> > >
> > >
> > >  1  99319 100239
> > >  2 100353  99584
> > >  3 100032  99879
> > >  4  99704 100229
> > >  5  99530  99796
> > >  6  99617 100355
> > >  7 100490 100319
> > >  8  99793 100114
> > >  9 100426  99744
> > > 10 100315 100558
> > > 11  99279  99766
> > > 12  99572  99750
> > > 13  99955 100017
> > > 14 100413 15
> > > 15 100190 100052
> > > 16 101071 100195
> > > 17 100322 100224
> > > 18  99637  99540
> > > 19 100323  99251
> > > 20  99841 100177
> > > 21  99948  99504
> > > 22 100065 100031
> > > 23 100026  99827
> > > 24  99836  99818
> > > 25 100245  99822
> > > 26 100088  99678
> > > 27  99957  3
> > > 28 100065  99961
> > > 29 100701 100679
> > > 30  99756  99587
> > > 31 100220 100076
> > > 32 100067 15
> > > 33  99547  99984
> > > 34 100124 100031
> > > 35  99547 100661
> > > 36  99801  99963
> > > 37 100189 100230
> > > 38  99878  99579
> > > 39  99864  99442
> > > 40  99683 14
> > > 41  99907 100094
> > > 42 100291  99817
> > > 43  99908  99984
> > > 44 100044 100606
> > > 45 100065 100120
> > > 46  99358 100141
> > > 47 100152 100442
> > > 48 10 100279
> > > 49 100486  99848
> >
> > Sadly a sample run cannot be used to prove your implementation is
> > correct.  It can only be used to show it is not correct, like Philip
> > did.  To show your implementation produces uniform results in all
> > cases, you need to provide a solid argumentation that is easy to
> > follow. So far you failed to do that and I do not see it coming, given
> > the complexity of your implementation.  The current implementation
> > has a straightforward argumentation as it uses relatively simple
> > mathematical properties of modulo arithmetic.
> >
> > It is also clear your code (as it uses statics) is not thread safe.
> >
> > So to answer you original question clearly: no, we will not accept this.
> >
> > -Otto
> >
> -- 
> -Luke



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
I have a feeling that making it threadsafe under any quantity of threads
and still perform is likely not feasible, but if I could; or if making a
nonthreadsafe variant was possible:

Creating a mathematical proof that somehow is “simple and probably correct”
enough, would be an impossible task even for a PhD mathematician.

How did someone prove the current implementation was cryptographically
sound? Did they run massive simulations which ran the gamut of the uint32_t
range which demanded tight tolerances over varying length runs?

How was rc4 cipher proven bad for pseudorandom numbers? I’d be willing to
bet it wasn’t from a mathematical proof; it was from bad data.

I’m guessing that large upperbounds approaching 2**32 don’t feed very
soundly in the current implementation using a modulus; although I suspect
that there isn’t much of a call for such things, I’m pretty sure I saw a
3,000,000,000 upperbound in the src partition tonight.

On Sun, May 15, 2022 at 3:15 AM Otto Moerbeek  wrote:

> On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote:
>
> > This is version 1, which I was pretty sure was secure.
> >
> > I revamped it with a few features and implanted the binary search for
> 'bit'
> >
> > in most cases, which aren't intentionally worst-case, it's pretty darned
> > fast!
> >
> > This is a sample run of my program with your piece of code included:
> >
> >
> >  1  99319 100239
> >  2 100353  99584
> >  3 100032  99879
> >  4  99704 100229
> >  5  99530  99796
> >  6  99617 100355
> >  7 100490 100319
> >  8  99793 100114
> >  9 100426  99744
> > 10 100315 100558
> > 11  99279  99766
> > 12  99572  99750
> > 13  99955 100017
> > 14 100413 15
> > 15 100190 100052
> > 16 101071 100195
> > 17 100322 100224
> > 18  99637  99540
> > 19 100323  99251
> > 20  99841 100177
> > 21  99948  99504
> > 22 100065 100031
> > 23 100026  99827
> > 24  99836  99818
> > 25 100245  99822
> > 26 100088  99678
> > 27  99957  3
> > 28 100065  99961
> > 29 100701 100679
> > 30  99756  99587
> > 31 100220 100076
> > 32 100067 15
> > 33  99547  99984
> > 34 100124 100031
> > 35  99547 100661
> > 36  99801  99963
> > 37 100189 100230
> > 38  99878  99579
> > 39  99864  99442
> > 40  99683 14
> > 41  99907 100094
> > 42 100291  99817
> > 43  99908  99984
> > 44 100044 100606
> > 45 100065 100120
> > 46  99358 100141
> > 47 100152 100442
> > 48 10 100279
> > 49 100486  99848
>
> Sadly a sample run cannot be used to prove your implementation is
> correct.  It can only be used to show it is not correct, like Philip
> did.  To show your implementation produces uniform results in all
> cases, you need to provide a solid argumentation that is easy to
> follow. So far you failed to do that and I do not see it coming, given
> the complexity of your implementation.  The current implementation
> has a straightforward argumentation as it uses relatively simple
> mathematical properties of modulo arithmetic.
>
> It is also clear your code (as it uses statics) is not thread safe.
>
> So to answer you original question clearly: no, we will not accept this.
>
> -Otto
>
-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Otto Moerbeek
On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote:

> This is version 1, which I was pretty sure was secure.
> 
> I revamped it with a few features and implanted the binary search for 'bit'
> 
> in most cases, which aren't intentionally worst-case, it's pretty darned
> fast!
> 
> This is a sample run of my program with your piece of code included:
> 
> 
>  1  99319 100239
>  2 100353  99584
>  3 100032  99879
>  4  99704 100229
>  5  99530  99796
>  6  99617 100355
>  7 100490 100319
>  8  99793 100114
>  9 100426  99744
> 10 100315 100558
> 11  99279  99766
> 12  99572  99750
> 13  99955 100017
> 14 100413 15
> 15 100190 100052
> 16 101071 100195
> 17 100322 100224
> 18  99637  99540
> 19 100323  99251
> 20  99841 100177
> 21  99948  99504
> 22 100065 100031
> 23 100026  99827
> 24  99836  99818
> 25 100245  99822
> 26 100088  99678
> 27  99957  3
> 28 100065  99961
> 29 100701 100679
> 30  99756  99587
> 31 100220 100076
> 32 100067 15
> 33  99547  99984
> 34 100124 100031
> 35  99547 100661
> 36  99801  99963
> 37 100189 100230
> 38  99878  99579
> 39  99864  99442
> 40  99683 14
> 41  99907 100094
> 42 100291  99817
> 43  99908  99984
> 44 100044 100606
> 45 100065 100120
> 46  99358 100141
> 47 100152 100442
> 48 10 100279
> 49 100486  99848

Sadly a sample run cannot be used to prove your implementation is
correct.  It can only be used to show it is not correct, like Philip
did.  To show your implementation produces uniform results in all
cases, you need to provide a solid argumentation that is easy to
follow. So far you failed to do that and I do not see it coming, given
the complexity of your implementation.  The current implementation
has a straightforward argumentation as it uses relatively simple
mathematical properties of modulo arithmetic.

It is also clear your code (as it uses statics) is not thread safe.

So to answer you original question clearly: no, we will not accept this.

-Otto



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Luke Small
This is version 1, which I was pretty sure was secure.

I revamped it with a few features and implanted the binary search for 'bit'

in most cases, which aren't intentionally worst-case, it's pretty darned
fast!

This is a sample run of my program with your piece of code included:


 1  99319 100239
 2 100353  99584
 3 100032  99879
 4  99704 100229
 5  99530  99796
 6  99617 100355
 7 100490 100319
 8  99793 100114
 9 100426  99744
10 100315 100558
11  99279  99766
12  99572  99750
13  99955 100017
14 100413 15
15 100190 100052
16 101071 100195
17 100322 100224
18  99637  99540
19 100323  99251
20  99841 100177
21  99948  99504
22 100065 100031
23 100026  99827
24  99836  99818
25 100245  99822
26 100088  99678
27  99957  3
28 100065  99961
29 100701 100679
30  99756  99587
31 100220 100076
32 100067 15
33  99547  99984
34 100124 100031
35  99547 100661
36  99801  99963
37 100189 100230
38  99878  99579
39  99864  99442
40  99683 14
41  99907 100094
42 100291  99817
43  99908  99984
44 100044 100606
45 100065 100120
46  99358 100141
47 100152 100442
48 10 100279
49 100486  99848








newdata
time = 0.240530757 seconds

newdata_fast
time = 0.073868626 seconds

The original takes 325.620% of the runtime of newdata_fast






newdataTypableFilename
time = 0.236748989 seconds

newdataTypableFilename_fast
time = 0.115842333 seconds

The original takes 204.372% of the runtime of newdataTypableFilename_fast






rand_ideal
time = 0.235832820 seconds

rand_ideal_fast
time = 0.025300798 seconds

The original takes 932.116% of the runtime of rand_ideal_fast






rand_short_ideal
time = 0.236828684 seconds

rand_short_ideal_fast
time = 0.124025922 seconds

The original takes 190.951% of the runtime of rand_short_ideal_fast






rand_short_worst
time = 0.237142869 seconds

rand_short_worst_fast
time = 0.294156486 seconds

The original takes 80.618% of the runtime of rand_short_worst_fast






rand_worst
time = 0.237002775 seconds

rand_worst_fast
time = 0.377148420 seconds

The original takes 62.841% of the runtime of rand_worst_fast






shuffle
time = 0.003044472 seconds

shuffle_fast
time = 0.002590664 seconds

The original takes 117.517% of the runtime of shuffle_fast

If it crashed here, you are trying to profile.
Turn off pledge at the beginning of main().



On Sat, May 14, 2022 at 7:03 PM Philip Guenther  wrote:

> On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> > Stuart Henderson wrote in
> ...
> >  |what's the perceived problem you're wanting to solve? and does that
> >  |problem actually exist in the first place?
> >
> > The problem is that if you have a low upper bound then modulo will "remove a
> > lot of randomization".  For example if you have a program which
> > generates Lotto numbers (1..49), then using _uniform() as it is will
> > generate many duplicates.
>
> Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform
> distribution.  Says right so in the manpage.  If an implementation of that
> API fails to do that, it's a broken implementation.
>
> The OpenBSD implementation achieves that.  NetBSD's implementation has the
> same core logic.  Here's my quick lotto pick demo program, doing 490
> picks, so they should all be somewhere near 10:
>
> #include 
> #include 
> #define LIMIT   49
> int
> main(void)
> {
> unsigned counts[LIMIT] = { 0 };
> int i;
> for (i = 0; i < 10 * LIMIT; i++)
> counts[arc4random_uniform(LIMIT)]++;
> for (i = 0; i < LIMIT; i++)
> printf("%2d\t%7u\n", i+1, counts[i]);
> return 0;
> }
>
> And a sample run:
>
> : bleys; ./a.out
>
>  1   100100
>  2   100639
>  399965
>  499729
>  599641
>  699650
>  7   100299
>  8   100164
>  999791
> 10   100101
> 11   100657
> 12   100042
> 1399661
> 1499927
> 1599426
> 1699491
> 1799646
> 18   100133
> 19   100013
> 2099942
> 2199873
> 2299924
> 2399567
> 24   100152
> 25   100688
> 26   100011
> 27   100481
> 2899980
> 29   100406
> 3099726
> 3199808
> 3299929
> 33   100050
> 3499983
> 35   100048
> 3699771
> 3799906
> 38   100215
> 39   100261
> 40   100426
> 4199847
> 4299533
> 43   100368
> 4499695
> 45   100041
> 46   100465
> 4799875
> 48   100034
> 4999920
> : bleys;
>
> Looks pretty good to me, with repeated runs showing the values bouncing
> around.

Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Otto Moerbeek
On Sat, May 14, 2022 at 05:03:17PM -0700, Philip Guenther wrote:

> On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> > Stuart Henderson wrote in
> ...
> >  |what's the perceived problem you're wanting to solve? and does that
> >  |problem actually exist in the first place?
> > 
> > The problem is that if you have a low upper bound then modulo will "remove a 
> > lot of randomization".  For example if you have a program which 
> > generates Lotto numbers (1..49), then using _uniform() as it is will 
> > generate many duplicates.
> 
> Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform 
> distribution.  Says right so in the manpage.  If an implementation of that 
> API fails to do that, it's a broken implementation.
> 
> The OpenBSD implementation achieves that.  NetBSD's implementation has the 
> same core logic.  Here's my quick lotto pick demo program, doing 490 
> picks, so they should all be somewhere near 10:
> 
> #include 
> #include 
> #define LIMIT   49
> int
> main(void)
> {
> unsigned counts[LIMIT] = { 0 };
> int i;
> for (i = 0; i < 10 * LIMIT; i++)
> counts[arc4random_uniform(LIMIT)]++;
> for (i = 0; i < LIMIT; i++)
> printf("%2d\t%7u\n", i+1, counts[i]);
> return 0;
> }
> 
> And a sample run:
> 
> : bleys; ./a.out  
>  
>  1   100100
>  2   100639
>  399965
>  499729
>  599641
>  699650
>  7   100299
>  8   100164
>  999791
> 10   100101
> 11   100657
> 12   100042
> 1399661
> 1499927
> 1599426
> 1699491
> 1799646
> 18   100133
> 19   100013
> 2099942
> 2199873
> 2299924
> 2399567
> 24   100152
> 25   100688
> 26   100011
> 27   100481
> 2899980
> 29   100406
> 3099726
> 3199808
> 3299929
> 33   100050
> 3499983
> 35   100048
> 3699771
> 3799906
> 38   100215
> 39   100261
> 40   100426
> 4199847
> 4299533
> 43   100368
> 4499695
> 45   100041
> 46   100465
> 4799875
> 48   100034
> 4999920
> : bleys; 
> 
> Looks pretty good to me, with repeated runs showing the values bouncing 
> around.
> 
> 
> ...
> > I was a bit interested when i saw Luke's message, but i am no
> > mathematician and, like i said, i in fact never needed the
> > functionality.  And the question i would have is how small
> > upper_bound has to be for the modulo problem to really kick in.
> 
> If the implementation isn't broken, it's not a problem, period.
> 
> 
> > And even if, whether it matters.  For example, if you have
> > a bitset of 1024 "in-flight things", and the small upper bound
> > hits a used slot, you could simply search forward until you find
> > a free one.  I mean, with 1024 possible values the number of
> > possibilities is anyhow restricted.
> 
> Well hey, let's give Luke's a try and see how uniform it is:
> 
> 
> #include 
> #include 
> #define LIMIT   49
> int
> main(void)
> {
> unsigned counts[LIMIT] = { 0 };
> unsigned counts2[LIMIT] = { 0 };
> int i;
> for (i = 0; i < 10 * LIMIT; i++) {
> counts[arc4random_uniform(LIMIT)]++;
> counts2[arc4random_uniform_fast_simple(LIMIT)]++;
> }
> for (i = 0; i < LIMIT; i++)
> printf("%2d\t%7u\t%7u\n", i+1, counts[i], counts2[i]);
> return 0;
> }
> 
> : bleys; ./a.out  
>  1   100188   76502
>  299983   76602
>  3   100521   76522
>  4   100416   76682
>  5   100171   76387
>  6   100163   76759
>  7   100024   76336
>  8   19   76703
>  999769   76237
> 1099892   76532
> 11   100197   76730
> 12   100483   76398
> 1399769   76310
> 14   100075   76474
> 1599781   76599
> 1699846   76439
> 1799814   76430
> 18   100313   76648
> 19   100259   76813
> 2099885   77068
> 21   100302   76546
> 228   76698
> 2399491   76678
> 24   100340   76324
> 2599763  115263
> 2699872  153008
> 27   100022  1

Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Philip Guenther
On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> Stuart Henderson wrote in
...
>  |what's the perceived problem you're wanting to solve? and does that
>  |problem actually exist in the first place?
> 
> The problem is that if you have a low upper bound then modulo will "remove a 
> lot of randomization".  For example if you have a program which 
> generates Lotto numbers (1..49), then using _uniform() as it is will 
> generate many duplicates.

Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform 
distribution.  Says right so in the manpage.  If an implementation of that 
API fails to do that, it's a broken implementation.

The OpenBSD implementation achieves that.  NetBSD's implementation has the 
same core logic.  Here's my quick lotto pick demo program, doing 4900000 
picks, so they should all be somewhere near 100000:

#include <stdio.h>
#include <stdlib.h>
#define LIMIT   49
int
main(void)
{
unsigned counts[LIMIT] = { 0 };
int i;
for (i = 0; i < 100000 * LIMIT; i++)
counts[arc4random_uniform(LIMIT)]++;
for (i = 0; i < LIMIT; i++)
printf("%2d\t%7u\n", i+1, counts[i]);
return 0;
}

And a sample run:

: bleys; ./a.out
   
 1   100100
 2   100639
 3    99965
 4    99729
 5    99641
 6    99650
 7   100299
 8   100164
 9    99791
10   100101
11   100657
12   100042
13    99661
14    99927
15    99426
16    99491
17    99646
18   100133
19   100013
20    99942
21    99873
22    99924
23    99567
24   100152
25   100688
26   100011
27   100481
28    99980
29   100406
30    99726
31    99808
32    99929
33   100050
34    99983
35   100048
36    99771
37    99906
38   100215
39   100261
40   100426
41    99847
42    99533
43   100368
44    99695
45   100041
46   100465
47    99875
48   100034
49    99920
: bleys; 

Looks pretty good to me, with repeated runs showing the values bouncing around.


...
> I was a bit interested when i saw Luke's message, but i am no
> mathematician and, like i said, i in fact never needed the
> functionality.  And the question i would have is how small
> upper_bound has to be for the modulo problem to really kick in.

If the implementation isn't broken, it's not a problem, period.


> And even if, whether it matters.  For example, if you have
> a bitset of 1024 "in-flight things", and the small upper bound
> hits a used slot, you could simply search forward until you find
> a free one.  I mean, with 1024 possible values the number of
> possibilities is anyhow restricted.

Well hey, let's give Luke's a try and see how uniform it is:


#include <stdio.h>
#include <stdlib.h>
#define LIMIT   49
int
main(void)
{
unsigned counts[LIMIT] = { 0 };
    unsigned counts2[LIMIT] = { 0 };
int i;
for (i = 0; i < 100000 * LIMIT; i++) {
counts[arc4random_uniform(LIMIT)]++;
counts2[arc4random_uniform_fast_simple(LIMIT)]++;
}
for (i = 0; i < LIMIT; i++)
printf("%2d\t%7u\t%7u\n", i+1, counts[i], counts2[i]);
return 0;
}

: bleys; ./a.out  
 1   100188   76502
 2    99983   76602
 3   100521   76522
 4   100416   76682
 5   100171   76387
 6   100163   76759
 7   100024   76336
 8   19   76703
 9    99769   76237
10    99892   76532
11   100197   76730
12   100483   76398
13    99769   76310
14   100075   76474
15    99781   76599
16    99846   76439
17    99814   76430
18   100313   76648
19   100259   76813
20    99885   77068
21   100302   76546
228   76698
23    99491   76678
24   100340   76324
25    99763  115263
26    99872  153008
27   100022  152979
28    99481  153793
29   100018  210714
30    99617  229286
31   100167  297003
32   100270  449664
33   100468   76790
34    99115   76452
35    99921   76392
36    99862   76140
37   100485   76607
38   100029   75885
39    99577   76498
40    99479   76727
41   100139   76746
42   100883   76698
43   100102   76474
44    99801   76592
45   100117   76124
46    99678   76417
47    99770   76639
48    99524   77034
49   100151   76658
: bleys; 

Wow, that last column is *bad*.  Repeated runs show 25-32 are consistently 
high, 32 being *outrageously* off, and the others are low.

Luke's implementation does not correctly implement the API.  Doesn't 
matter if it's a million times faster when it doesn't deliver the goal.


Philip Guenther



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Steffen Nurpmeso
Stuart Henderson wrote:
 |On 2022/05/14 06:56, Luke Small wrote:
 |> If I use arc4random_uniform() repeatedly to create a random distribution \
 |> of
 |> say numbers less than 0x1000 or even something weird like 0x1300 will the
 |> random distribution be better with arc4random_uniform() or with mine?
 |
 |there's no point to have a choice of different arc4random_uniform_xyz
 |with different characteristics, how is somebody going to pick the
 |"right" one?
 |
 |the bottom of netbsd's version of the arc4random(3) manual says it well:
 |
 | One may be tempted to create new APIs to accommodate different \
 | security
 | models and performance constraints without unpleasant surprises \
 | on dif-
 | ferent operating systems.  This should not be done lightly, though,
 | because there are already too many different choices, and too \
 | many oppor-
 | tunities for programmers to reach for one and pick the wrong one.
 |
 |what's the perceived problem you're wanting to solve? and does that
 |problem actually exist in the first place?

The problem is that if you have a low upper bound then modulo will
"remove a lot of randomization".  For example if you have
a program which generates Lotto numbers (1..49), then using
_uniform() as it is will generate many duplicates.

I used the approach i found in a TeX file over twenty years ago
("RANDOM.TEX, v.1 (Donald Arseneau) from TeTeX,
texmf/tex/generic/misc/random.tex"; and now that i look, he
published a .v2 not too long ago), it was named _clip() as it
produces something in between a minimum and a maximum:

_max -= _min;
++_max;
_min = Maximum::sir - 2; // m - 2 = 2^(32|64) - 3
if (_max != 0)
_min /= _max;

for(;;) {
uir i;

i = _random;
i += M1::uir;
if(_min != 0)
i /= _min;

if(i >= _max && (i != 0 || _max != 0)) {
_random *= 16807; // just MUL
if(_random == 0)
_random = 1;
continue;
}
_random = i;
break;
}

_random += origmin;
_Assert(1 && _random >= origmin && _random <= origmax);

That is the random number is multiplied with a, eh, non-prime, as
long as we do not fit.
However i found this became a terrible mess with 64-bit numbers,
so i removed the function, aka did not "port it over" when
i implemented a random number generator for my own use cases.
Maybe i should have increased the multiplicator.

It is really, really bad.  :-)

I only ever used it for a Lotto number program anyhow, i think,
and so i thought it would be better to simply read a one byte
random and skip those that do not fit, instead of making them
always fit with modulo 50.  That is, application specific
workaround.  Things are less harmful for large numbers, anyway.
In fact i have to say i was astonished to see someone using this
function, not too long ago i saw a FreeBSD commit fly by which
uses it!

I was a bit interested when i saw Luke's message, but i am no
mathematician and, like i said, i in fact never needed the
functionality.  And the question i would have is how small
upper_bound has to be for the modulo problem to really kick in.
And even if, whether it matters.  For example, if you have
a bitset of 1024 "in-flight things", and the small upper bound
hits a used slot, you could simply search forward until you find
a free one.  I mean, with 1024 possible values the number of
possibilities is anyhow restricted.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Matthew Martin
int
main() {
int results[3] = { 0, 0, 0 };
for (int i = 0; i < 10; i++) {
results[arc4random_uniform_fast_simple(3)]++;
}
for (int i = 0; i < 3; i++)
printf("%d: %d\n", i, results[i]);

return 0;
}

% ./a.out
0: 24809
1: 50011
2: 25180

You can't reuse bits because they'll be biased.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Stuart Henderson
On 2022/05/14 06:56, Luke Small wrote:
> If I use arc4random_uniform() repeatedly to create a random distribution of
> say numbers less than 0x1000 or even something weird like 0x1300 will the
> random distribution be better with arc4random_uniform() or with mine?

there's no point to have a choice of different arc4random_uniform_xyz
with different characteristics, how is somebody going to pick the
"right" one?

the bottom of netbsd's version of the arc4random(3) manual says it well:

 One may be tempted to create new APIs to accommodate different security
 models and performance constraints without unpleasant surprises on dif-
 ferent operating systems.  This should not be done lightly, though,
 because there are already too many different choices, and too many oppor-
 tunities for programmers to reach for one and pick the wrong one.

what's the perceived problem you're wanting to solve? and does that
problem actually exist in the first place?



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Luke Small
Look at my code. I don’t even use a modulus operator. I perform hit and
miss with a random bitstream.

How can I have a bias of something I don’t do? I return a bitstream which
meets the parameters of being a value less than the upper bound. Much like
arc4random_buf().

If I use arc4random_uniform() repeatedly to create a random distribution of
say numbers less than 0x1000 or even something weird like 0x1300 will the
random distribution be better with arc4random_uniform() or with mine? For
0x1000 mine will simply pluck 12 bits of random data straight from the
arc4random() (and preserve the remaining 20 bits for later) on the first
try, just like it’s arc4random_buf().
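
For the power-of-two case, the bit-caching idea might be sketched like
this (an illustration with assumed names and a 64-bit cache, not Luke's
attached code):

#include <stdint.h>
#include <stdlib.h>

/*
 * Take 12 bits straight from cached arc4random() output, keeping the
 * leftover bits for later calls.  12-bit slices of uniform bits are
 * themselves uniform in [0, 0x1000).
 */
static uint32_t
draw12(void)
{
	static uint64_t cache;	/* buffered random bits, low bits first */
	static int nbits;	/* number of valid bits in cache */
	uint32_t r;

	if (nbits < 12) {
		cache |= (uint64_t)arc4random() << nbits;
		nbits += 32;
	}
	r = (uint32_t)(cache & 0xfff);	/* low 12 bits */
	cache >>= 12;
	nbits -= 12;
	return r;
}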

arc4random_uniform() will perform a modulus of a 32 bit number which adds
data to the bitstream. Does it make it better? Perhaps it makes it harder
to guess the source bits.

I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
looking at modulo bias.

I didn’t know what it was, before, but I basically “rejection sample”:

https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/

On Sat, May 14, 2022 at 6:14 AM Otto Moerbeek  wrote:

> On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote:
>
> > arc4random_uniform_fast2 that I made, streams in data from arc4random()
> and
> > uses the datastream directly and uses it as a bit by bit right "sliding
> > window" in the last loop. arc4random_uniform() uses a modulus which I is
> > simple to implement, but I wonder how cryptographically sound or even how
> > evenly it distributes. Adding a modulus seems sloppy without something
> > better. I did make arc4random_fast_simple() which merely takes an
> > upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
> > the top function was into it which binary searches to find the correct
> size
> > bitfield (return value) needed to barely fit the upperbound while also
> > being able to discover every possible value below the upperbound. It
> isn't
> > as fast as arc4random_uniform_fast2 if it were used repeatedly after a
> > single use of arc4random_uniform_fast_bitsearch(), but it does exactly
> the
> > same thing and appears faster than repeatedly using arc4random_uniform()
> > and its wasteful use of arc4random() and calling the expensive rekeying
> > function more often.
> >
> > It may be interesting to determine even without looking at performance,
> > whether arc4random_fast_simple() creates a more superior, secure use of
> the
> > chacha20 stream than arc4random_uniform() with the modulus. what exactly
> > does all that extra data from the modulus do to the random distribution?
> >
> > -Luke
>
> I don't follow you at all. Your blabbering does not even use the terms
> "uniform" and "modulo bias". I wonder even if you realize what they
> mean in this context.
>
> -Otto
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Otto Moerbeek
On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote:

> arc4random_uniform_fast2 that I made, streams in data from arc4random() and
> uses the datastream directly and uses it as a bit by bit right "sliding
> window" in the last loop. arc4random_uniform() uses a modulus which I is
> simple to implement, but I wonder how cryptographically sound or even how
> evenly it distributes. Adding a modulus seems sloppy without something
> better. I did make arc4random_fast_simple() which merely takes an
> upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
> the top function was into it which binary searches to find the correct size
> bitfield (return value) needed to barely fit the upperbound while also
> being able to discover every possible value below the upperbound. It isn't
> as fast as arc4random_uniform_fast2 if it were used repeatedly after a
> single use of arc4random_uniform_fast_bitsearch(), but it does exactly the
> same thing and appears faster than repeatedly using arc4random_uniform()
> and its wasteful use of arc4random() and calling the expensive rekeying
> function more often.
> 
> It may be interesting to determine even without looking at performance,
> whether arc4random_fast_simple() creates a superior, secure use of the
> chacha20 stream than arc4random_uniform() with the modulus. What exactly
> does all that extra data from the modulus do to the random distribution?
> 
> -Luke

I don't follow you at all. Your blabbering does not even use the terms
"uniform" and "modulo bias". I wonder even if you realize what they
mean in this context.

-Otto



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Luke Small
arc4random_uniform_fast2 that I made, streams in data from arc4random() and
uses the datastream directly and uses it as a bit by bit right "sliding
window" in the last loop. arc4random_uniform() uses a modulus which I is
simple to implement, but I wonder how cryptographically sound or even how
evenly it distributes. Adding a modulus seems sloppy without something
better. I did make arc4random_fast_simple() which merely takes an
upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
the top function was into it which binary searches to find the correct size
bitfield (return value) needed to barely fit the upperbound while also
being able to discover every possible value below the upperbound. It isn't
as fast as arc4random_uniform_fast2 if it were used repeatedly after a
single use of arc4random_uniform_fast_bitsearch(), but it does exactly the
same thing and appears faster than repeatedly using arc4random_uniform()
and its wasteful use of arc4random() and calling the expensive rekeying
function more often.

It may be interesting to determine even without looking at performance,
whether arc4random_fast_simple() creates a superior, secure use of the
chacha20 stream than arc4random_uniform() with the modulus. What exactly
does all that extra data from the modulus do to the random distribution?

-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-13 Thread Otto Moerbeek
In general, wasting some arc4random bits is not a problem at all.

I see a lot of code with no demonstration, explanation or proof why it
would be correct (i.e. produces uniform results). You only talk about
speed.

The current code might not be optimal, but at least it is very clear
it's correct.

-Otto


On Fri, May 13, 2022 at 09:43:26AM -0500, Luke Small wrote:

> I made a couple new versions of a new kind of arc4random_uniform-like
> function and some example functions which use them. Instead of having a
> sufficiently large random number greater than the modulus, I pick a random
> number using arc4random() from a bitfield where the length of the bitfield
> is just below or slightly beyond the value of the modulus and returns the
> bitfield it if it is less than the value of the modulus.
> 
> In both versions, I retrieve a bitfield from a static uint64_t which is
> refreshed from periodic arc4random() calls. The functions demand a bit
> length, but I provide a convenient bitsize search function:
> arc4random_uniform_fast_bitsearch()
> 
> in the first version, if the chosen return value isn't less than the
> modulus, the entire bitfield is trashed and a completely new bitfield is
> refreshed from the cache. This can be very inefficient, and for certain
> upperbounds (where the functions demand that all returned values less than
> the upperbound are possible) it can perform worse than
> arc4random_uniform() even with more random number generation churn.
> 
> in the second version, if the chosen return value isn't less than the
> modulus, the bitfield is shifted right, losing the least significant bit
> and a new random-value bit from the least significant bit of the cache is
> put into the most significant bit of the bitfield and it tries again. For
> significant expenditures of effort, this version is always more efficient
> than arc4random_uniform() and slightly to much more efficient than my first
> version.
> 
> 
> The performance boost comes from less pseudorandom number generation by not
> trashing perfectly good random data and preserving it so that rekeying is
> performed less.
> 
> 
> I suspect that the first version may be secure, but I'm not sure what you
> think of the second version, for being secure. Is there a way to test how
> well distributed and secure it is?!
> 
> Perhaps it's even better than the typical use case of arc4random_uniform
> since it feeds it a bitstream instead of performing a modulus operation on
> it!
> 
> I pledge("stdio") and unveil(NULL, NULL) at the beginning, so you know it
> doesn't do anything but malloc() an array, do stuff on the array and print
> stuff; if you don't profile, anyway.
> 
> What do you think of having such a thing in OpenBSD?
> 
> 
> 
> 
> newdata() generates random a-z characters performing arc4random_uniform(26);
> 
> newdataTypableFilename() generates random filename typable characters
> performing arc4random_uniform(92);
> 
> I perform other tests including presumed worst-cases on large numbers and
> numbers which will have a lot of misses.
> 
> 
> 
> 
> -Luke




Picky, but much more efficient arc4random_uniform!

2022-05-13 Thread Luke Small
I made a couple new versions of a new kind of arc4random_uniform-like
function and some example functions which use them. Instead of having a
sufficiently large random number greater than the modulus, I pick a random
number using arc4random() from a bitfield where the length of the bitfield
is just below or slightly beyond the value of the modulus and returns the
bitfield if it is less than the value of the modulus.

In both versions, I retrieve a bitfield from a static uint64_t which is
refreshed from periodic arc4random() calls. The functions demand a bit
length, but I provide a convenient bitsize search function:
arc4random_uniform_fast_bitsearch()

in the first version, if the chosen return value isn't less than the
modulus, the entire bitfield is trashed and a completely new bitfield is
refreshed from the cache. This can be very inefficient, and for certain
upperbounds (where the functions demand that all returned values less than
the upperbound are possible) it can perform worse than
arc4random_uniform() even with more random number generation churn.
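
Roughly, the first version as described could look like the following
sketch (an illustration of the description, not the attached file;
assumes 2 <= upper_bound <= 0x80000000 and gcc/clang __builtin_clz):

#include <stdint.h>
#include <stdlib.h>

/*
 * Take the smallest bit window that can hold upper_bound - 1, draw
 * that many bits from a 64-bit cache, and throw the whole window away
 * on a miss.
 */
static uint32_t
uniform_bitmask(uint32_t upper_bound)
{
	static uint64_t cache;	/* buffered random bits */
	static int nbits;	/* number of valid bits in cache */
	int want = 32 - __builtin_clz(upper_bound - 1);	/* window width */
	uint32_t v;

	for (;;) {
		if (nbits < want) {
			cache |= (uint64_t)arc4random() << nbits;
			nbits += 32;
		}
		v = (uint32_t)(cache & (((uint64_t)1 << want) - 1));
		cache >>= want;
		nbits -= want;
		if (v < upper_bound)
			return v;
		/* miss: discard the whole window and pull fresh bits */
	}
}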

in the second version, if the chosen return value isn't less than the
modulus, the bitfield is shifted right, losing the least significant bit
and a new random-value bit from the least significant bit of the cache is
put into the most significant bit of the bitfield and it tries again. For
significant expenditures of effort, this version is always more efficient
than arc4random_uniform() and slightly to much more efficient than my first
version.


The performance boost comes from less pseudorandom number generation by not
trashing perfectly good random data and preserving it so that rekeying is
performed less.


I suspect that the first version may be secure, but I'm not sure what you
think of the second version, for being secure. Is there a way to test how
well distributed and secure it is?!

Perhaps it's even better than the typical use case of arc4random_uniform
since it feeds it a bitstream instead of performing a modulus operation on
it!

I pledge("stdio") and unveil(NULL, NULL) at the beginning, so you know it
doesn't do anything but malloc() an array, do stuff on the array and print
stuff; if you don't profile, anyway.

What do you think of having such a thing in OpenBSD?




newdata() generates random a-z characters performing arc4random_uniform(26);

newdataTypableFilename() generates random filename typable characters
performing arc4random_uniform(92);

I perform other tests including presumed worst-cases on large numbers and
numbers which will have a lot of misses.




-Luke


arc4random_uniform_fast.c
Description: Binary data


Re: [patch] Convert modulus to arc4random_uniform

2015-12-17 Thread Ted Unangst
Theo Buehler wrote:
> I've now committed most of your diff, thanks once again.
> 
> o I asked for further review on the kernel parts
> o I'm going to skip hack for now
> 
> Here's a patch for libc, based on the previous discussion.
> I think this is easier to read and understand.  No binary change on
> amd64.
> 
> ok?

sure. the current code is safer for other values of randmax, but that seems an
unlikely problem.



Re: [patch] Convert modulus to arc4random_uniform

2015-12-17 Thread Theo Buehler
I've now committed most of your diff, thanks once again.

o I asked for further review on the kernel parts
o I'm going to skip hack for now

Here's a patch for libc, based on the previous discussion.
I think this is easier to read and understand.  No binary change on
amd64.

ok?

Index: lib/libc/stdlib/rand.c
===
RCS file: /var/cvs/src/lib/libc/stdlib/rand.c,v
retrieving revision 1.15
diff -u -p -r1.15 rand.c
--- lib/libc/stdlib/rand.c  13 Sep 2015 08:31:47 -  1.15
+++ lib/libc/stdlib/rand.c  16 Dec 2015 08:02:41 -
@@ -37,7 +37,7 @@ int
 rand_r(u_int *seed)
 {
*seed = *seed * 1103515245 + 12345;
-   return (*seed % ((u_int)RAND_MAX + 1));
+   return (*seed & RAND_MAX);
 }
 DEF_WEAK(rand_r);
 
@@ -50,7 +50,7 @@ int
 rand(void)
 {
if (rand_deterministic == 0)
-   return (arc4random() % ((u_int)RAND_MAX + 1));
+   return (arc4random() & RAND_MAX);
return (rand_r(&next));
 }
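
For what it's worth, assuming RAND_MAX is 2^31 - 1 as on OpenBSD, the
two forms agree bit for bit, and since 2^31 divides 2^32 the result is
exactly uniform; a throwaway check along those lines:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Masking with RAND_MAX and reducing modulo RAND_MAX + 1 give the
 * same value when RAND_MAX is 2^31 - 1. */
int
main(void)
{
	for (int i = 0; i < 1000000; i++) {
		unsigned int r = arc4random();
		assert((r & RAND_MAX) == (r % ((unsigned int)RAND_MAX + 1)));
	}
	printf("RAND_MAX = %#x; mask and modulo agree\n", RAND_MAX);
	return 0;
}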
 



Re: [patch] Convert modulus to arc4random_uniform

2015-12-08 Thread Jérémie Courrèges-Anglas
Matthew Martin  writes:

> On Mon, Dec 07, 2015 at 09:33:47AM +0100, Theo Buehler wrote:
>> I think some of these are ok, but I'm unsure about some of the others.
>> Here are some of my concerns:
>> 
>> - since arc4random_uniform can potentially loop indefinitely, it
>>   might interfere with predictable timing of some routines.  I can't
>>   tell if this is harmless in all cases below.  This applies in
>>   particular to the proposed changes in the kernel.
>
> I hadn't considered timing problems. I'll look at it again tonight, but
> someone more familiar with the code should certainly look at it before
> committing.
>
>> - changing random() to arc4random() in games might have undesired
>>   side-effects like interfering with the reproducibility of tests or
>>   games.  I think this might apply to awk for the same reason as in the
>>   shells.  
>
> The patch for awk was wrong.

If there's a concern about upstream for nsd/sqlite/libevent, then note
that awk has an upstream too.

[...]

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: [patch] Convert modulus to arc4random_uniform

2015-12-07 Thread Theo Buehler
> I'll look into hack tonight when I have more time.

Honestly, I would prefer to leave hack as it is right now since it will
take some work to repair it anyway.  I would not want to add another
layer of (potential) complications.

> > > Index: lib/libc/stdlib/rand.c
> > > ===

> It's safe but takes a bit of thinking. I first had it as
>   return (arc4random() & RAND_MAX);
> which to me is more obviously correct, but since it's safe as is, I have
> no strong opinion on this.

I have a mild preference for this version, and I think this version
would be preferable from the point of view of uniformity of usage,
especially after your patches have gone in.

> > > Index: usr.bin/awk/run.c
> > > ===
> > 
> > Unsure about this one.  I think deterministic sequences might be desired
> > in some circumstances (this one is deterministic when a seed was given).
> 
> theo@ also pointed out that awk can be deterministic. Since RAND_MAX is
> 1 below a power of 2, & is safe. How about
> 
> Index: run.c
> ===
> RCS file: /cvs/src/usr.bin/awk/run.c,v
> retrieving revision 1.39
> diff -u -p -r1.39 run.c
> --- run.c 5 Sep 2015 22:07:10 -   1.39
> +++ run.c 7 Dec 2015 19:28:31 -
> @@ -1581,7 +1581,7 @@ Cell *bltin(Node **a, int n)/* builtin 
>   u = (Awkfloat) system(getsval(x)) / 256;   /* 256 is unix-dep */
>   break;
>   case FRAND:
> - u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
> + u = (Awkfloat) (random() & RAND_MAX) / ((u_int)RAND_MAX + 1);
>   break;
>   case FSRAND:
>   if (isrec(x)) { /* no argument provided */
> 

ok from me on this one.



Re: [patch] Convert modulus to arc4random_uniform

2015-12-07 Thread Matthew Martin
On Mon, Dec 07, 2015 at 09:33:47AM +0100, Theo Buehler wrote:
> I think some of these are ok, but I'm unsure about some of the others.
> Here are some of my concerns:
> 
> - since arc4random_uniform can potentially loop indefinitely, it
>   might interfere with predictable timing of some routines.  I can't
>   tell if this is harmless in all cases below.  This applies in
>   particular to the proposed changes in the kernel.

I hadn't considered timing problems. I'll look at it again tonight, but
someone more familiar with the code should certainly look at it before
committing.

> - changing random() to arc4random() in games might have undesired
>   side-effects like interfering with the reproducibility of tests or
>   games.  I think this might apply to awk for the same reason as in the
>   shells.  

The patch for awk was wrong. Updated patch below. I'll look into hack
tonight when I have more time.


> > Index: lib/libc/stdlib/rand.c
> > ===
> > RCS file: /cvs/src/lib/libc/stdlib/rand.c,v
> > retrieving revision 1.15
> > diff -u -p -r1.15 rand.c
> > --- lib/libc/stdlib/rand.c  13 Sep 2015 08:31:47 -  1.15
> > +++ lib/libc/stdlib/rand.c  7 Dec 2015 06:42:17 -
> > @@ -50,7 +50,7 @@ int
> >  rand(void)
> >  {
> > if (rand_deterministic == 0)
> > -   return (arc4random() % ((u_int)RAND_MAX + 1));
> > +   return (arc4random_uniform((u_int)RAND_MAX + 1));
> > return (rand_r(&next));
> >  }
> >  
> 
> this is modulo 2^n, so I think this one is fine as it is.

It's safe but takes a bit of thinking. I first had it as
return (arc4random() & RAND_MAX);
which to me is more obviously correct, but since it's safe as is, I have
no strong opinion on this.


> > Index: usr.bin/awk/run.c
> > ===
> 
> Unsure about this one.  I think deterministic sequences might be desired
> in some circumstances (this one is deterministic when a seed was given).

theo@ also pointed out that awk can be deterministic. Since RAND_MAX is
1 below a power of 2, & is safe. How about

Index: run.c
===
RCS file: /cvs/src/usr.bin/awk/run.c,v
retrieving revision 1.39
diff -u -p -r1.39 run.c
--- run.c   5 Sep 2015 22:07:10 -   1.39
+++ run.c   7 Dec 2015 19:28:31 -
@@ -1581,7 +1581,7 @@ Cell *bltin(Node **a, int n)  /* builtin 
u = (Awkfloat) system(getsval(x)) / 256;   /* 256 is unix-dep */
break;
case FRAND:
-   u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
+   u = (Awkfloat) (random() & RAND_MAX) / ((u_int)RAND_MAX + 1);
break;
case FSRAND:
if (isrec(x)) { /* no argument provided */
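
For what it's worth, worked through under the assumption that RAND_MAX
is 2^31 - 1: random() already returns a value in [0, RAND_MAX], so
"& RAND_MAX" leaves it untouched, while the old "% RAND_MAX" folded the
single value RAND_MAX back onto 0 (giving 0 double weight).  Dividing
by RAND_MAX + 1 instead of RAND_MAX then makes u land on the multiples
of 1/(RAND_MAX + 1) in [0, 1), never reaching 1.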



Re: [patch] Convert modulus to arc4random_uniform

2015-12-07 Thread Theo Buehler
On Mon, Dec 07, 2015 at 12:49:17AM -0600, Matthew Martin wrote:
> 
> Theo's diff inspired me to look for other cases of modulo bias. The
> following diff converts most modulus operations on a random number to
> use arc4random_uniform or & as appropriate. I excluded
> 
> lib/libsqlite3/src/resolve.c
> regress/lib/libevent/test-time.c
> usr.sbin/nsd/rrl.c
> usr.sbin/nsd/util.c
> usr.sbin/nsd/xfrd.c
> 
> as they seem to have upstreams. The only other case is games/wump/wump.c
> which has
> 
> if (arc4random() % 2 == 1)
> 
> This is safe and seems obvious enough to me.
> 
> - Matthew Martin

Thank you.  I was going to do the same :)

I think some of these are ok, but I'm unsure about some of the others.
Here are some of my concerns:

- since arc4random_uniform can potentially loop indefinitely, it
  might interfere with predictable timing of some routines.  I can't
  tell if this is harmless in all cases below.  This applies in
  particular to the proposed changes in the kernel.

- changing random() to arc4random() in games might have undesired
  side-effects like interfering with the reproducibility of tests or
  games.  I think this might apply to awk for the same reason as in the
  shells.  

> Index: games/atc/update.c
> ===

ok

> Index: games/hack/hack.mklev.c
> ===
> Index: games/hack/rnd.c
> ===

unsure about these two

> Index: lib/libc/stdlib/rand.c
> ===
> RCS file: /cvs/src/lib/libc/stdlib/rand.c,v
> retrieving revision 1.15
> diff -u -p -r1.15 rand.c
> --- lib/libc/stdlib/rand.c13 Sep 2015 08:31:47 -  1.15
> +++ lib/libc/stdlib/rand.c7 Dec 2015 06:42:17 -
> @@ -50,7 +50,7 @@ int
>  rand(void)
>  {
>   if (rand_deterministic == 0)
> - return (arc4random() % ((u_int)RAND_MAX + 1));
> + return (arc4random_uniform((u_int)RAND_MAX + 1));
>   return (rand_r(&next));
>  }
>  

this is modulo 2^n, so I think this one is fine as it is.

> Index: sbin/dhclient/dhclient.c
> ===

I have already done this independently.

> Index: sys/dev/ic/sili.c
> ===
> Index: sys/netinet6/nd6_rtr.c
> ===
> Index: sys/ufs/ffs/ffs_alloc.c
> ===

These must definitely be looked at by somebody else.

> Index: usr.bin/awk/run.c
> ===

Unsure about this one.  I think deterministic sequences might be desired
in some circumstances (this one is deterministic when a seed was given).

> Index: usr.sbin/npppd/common/slist.c
> ===
> Index: usr.sbin/npppd/l2tp/l2tpd.c
> ===
> Index: usr.sbin/npppd/pppoe/pppoed.c
> ===
> Index: usr.sbin/npppd/pptp/pptpd.c
> ===

These four are ok with me.



[patch] Convert modulus to arc4random_uniform

2015-12-06 Thread Matthew Martin

Theo's diff inspired me to look for other cases of modulo bias. The
following diff converts most modulus operations on a random number to
use arc4random_uniform or & as appropriate. I excluded

lib/libsqlite3/src/resolve.c
regress/lib/libevent/test-time.c
usr.sbin/nsd/rrl.c
usr.sbin/nsd/util.c
usr.sbin/nsd/xfrd.c

as they seem to have upstreams. The only other case is games/wump/wump.c
which has

if (arc4random() % 2 == 1)

This is safe and seems obvious enough to me.

- Matthew Martin


Index: games/atc/update.c
===
RCS file: /cvs/src/games/atc/update.c,v
retrieving revision 1.16
diff -u -p -r1.16 update.c
--- games/atc/update.c  9 Dec 2014 05:01:14 -   1.16
+++ games/atc/update.c  7 Dec 2015 06:42:17 -
@@ -59,6 +59,15 @@ atcrandom()
return arc4random();
 }
 
+uint32_t
+atcrandom_uniform(uint32_t upper_bound)
+{
+   if (seeded)
+   return random() % upper_bound;
+   else
+   return arc4random_uniform(upper_bound);
+}
+
 void
 update(int dummy)
 {
@@ -212,7 +221,7 @@ update(int dummy)
 * Otherwise, prop jobs show up *on* entrance.  Remember that
 * we don't update props on odd updates.
 */
-   if ((atcrandom() % sp->newplane_time) == 0)
+   if (atcrandom_uniform(sp->newplane_time) == 0)
addplane();
 }
 
@@ -308,10 +317,10 @@ addplane(void)
memset(&p, 0, sizeof (p));
 
p.status = S_MARKED;
-   p.plane_type = atcrandom() % 2;
+   p.plane_type = atcrandom_uniform(2);
 
num_starts = sp->num_exits + sp->num_airports;
-   rnd = atcrandom() % num_starts;
+   rnd = atcrandom_uniform(num_starts);
 
if (rnd < sp->num_exits) {
p.dest_type = T_EXIT;
@@ -324,7 +333,7 @@ addplane(void)
/* loop until we get a plane not near another */
for (i = 0; i < num_starts; i++) {
/* loop till we get a different start point */
-   while ((rnd2 = atcrandom() % num_starts) == rnd)
+   while ((rnd2 = atcrandom_uniform(num_starts)) == rnd)
;
if (rnd2 < sp->num_exits) {
p.orig_type = T_EXIT;
Index: games/hack/hack.mklev.c
===
RCS file: /cvs/src/games/hack/hack.mklev.c,v
retrieving revision 1.7
diff -u -p -r1.7 hack.mklev.c
--- games/hack/hack.mklev.c 27 Oct 2009 23:59:25 -  1.7
+++ games/hack/hack.mklev.c 7 Dec 2015 06:42:17 -
@@ -66,8 +66,8 @@
 #include 
 #include "hack.h"
 
-#define somex() ((random()%(croom->hx-croom->lx+1))+croom->lx)
-#define somey() ((random()%(croom->hy-croom->ly+1))+croom->ly)
+#define somex() (arc4random_uniform(croom->hx-croom->lx+1)+croom->lx)
+#define somey() (arc4random_uniform(croom->hy-croom->ly+1)+croom->ly)
 
#define	XLIM	4	/* define minimum required space around a room */
#define	YLIM	3
Index: games/hack/rnd.c
===
RCS file: /cvs/src/games/hack/rnd.c,v
retrieving revision 1.5
diff -u -p -r1.5 rnd.c
--- games/hack/rnd.c27 Oct 2009 23:59:25 -  1.5
+++ games/hack/rnd.c7 Dec 2015 06:42:17 -
@@ -61,7 +61,7 @@
  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#define RND(x) ((random()>>3) % x)
+#define RND(x) (arc4random_uniform(x))
 
 #include 
 
Index: lib/libc/stdlib/rand.c
===
RCS file: /cvs/src/lib/libc/stdlib/rand.c,v
retrieving revision 1.15
diff -u -p -r1.15 rand.c
--- lib/libc/stdlib/rand.c  13 Sep 2015 08:31:47 -  1.15
+++ lib/libc/stdlib/rand.c  7 Dec 2015 06:42:17 -
@@ -50,7 +50,7 @@ int
 rand(void)
 {
if (rand_deterministic == 0)
-   return (arc4random() % ((u_int)RAND_MAX + 1));
+   return (arc4random_uniform((u_int)RAND_MAX + 1));
return (rand_r(&next));
 }
 
Index: sbin/dhclient/dhclient.c
===
RCS file: /cvs/src/sbin/dhclient/dhclient.c,v
retrieving revision 1.367
diff -u -p -r1.367 dhclient.c
--- sbin/dhclient/dhclient.c5 Dec 2015 13:09:11 -   1.367
+++ sbin/dhclient/dhclient.c7 Dec 2015 06:42:18 -
@@ -1285,15 +1285,13 @@ send_discover(void)
 */
if (!client->interval)
client->interval = config->initial_interval;
-   else {
-   client->interval += (arc4random() >> 2) %
-   (2 * client->interval);
-   }
+   else
+   client->interval += arc4random_uniform(2 * client->interval);
 
/* Don't backoff past cutoff. */
if (client->interval > config->backoff_cutoff)
-   client->interval 

Re: Arc4random_uniform

2012-06-09 Thread Jorden Verwer
> I agree that simply "min = -upper_bound % upper_bound" should be
> sufficient in all cases, since u_int32_t arithmetic is defined as
> modulo 2**32 by the C standard, at least as of C99 and I think C89
> too.  (Even if we supported any 1s-complement architectures, the
> compiler would still need to implement u_int32_t as modulo 2**32.)
Indeed. I was looking at it from a correctness point of view instead of
trying to determine if it would work in practice.

> I also think it makes sense to get rid of the LP64 test, because
> 64-bit division still takes more than twice as long as 32-bit division
> on most amd64 processors for example (according to
> http://gmplib.org/~tege/x86-timing.pdf).
And to reduce complexity, of course.

> Of course, the potential benefit here isn't that great, so I don't
> know whether this really makes sense to worry about.
Oh, there are certainly more important matters, but you know how these
things go. You see something that can be improved and it turns into an
itch that needs to be scratched. The quickest and best way to do so was
to send an email to this list. Then when I wrote the message, I started
thinking about whether this really was the best implementation or it could
be improved further. I freely admit that it doesn't make any difference
in the grand scheme of things, but there's also the minute scheme of
things. ;)



Re: Arc4random_uniform

2012-06-08 Thread Matthew Dempsky
[+djm, who wrote this code]

I agree that simply "min = -upper_bound % upper_bound" should be
sufficient in all cases, since u_int32_t arithmetic is defined as
modulo 2**32 by the C standard, at least as of C99 and I think C89
too.  (Even if we supported any 1s-complement architectures, the
compiler would still need to implement u_int32_t as modulo 2**32.)

I also think it makes sense to get rid of the LP64 test, because
64-bit division still takes more than twice as long as 32-bit division
on most amd64 processors for example (according to
http://gmplib.org/~tege/x86-timing.pdf).


Of course, the potential benefit here isn't that great, so I don't
know whether this really makes sense to worry about.


On Fri, Jun 8, 2012 at 5:13 AM, Jorden Verwer  wrote:
> Hello OpenBSD developers,
>
> Let me state in advance that I'm not entirely sure if I'm sending this
> to the right mailing list. Based on the descriptions, however, I do
> believe tech@ is the correct one instead of misc@. Furthermore, I'm not
> subscribed to any of the lists, so please CC me if you should want to
> reply. Finally, I've sent this message before from another account, but
> it didn't appear in any of the archives. If it did reach the list, I
> apologize. In the meantime, I did write some extra things that could be
> interesting.
>
> I've been looking at the way several (BSD) operating systems implement
> random numbers, in the context of an online election system (what to do
> with an equal number of votes per seat, etc.). My current implementation
> reads a byte from /dev/random and converts it to the required range,
> possibly reading more bytes to avoid the systematic bias a naive
> implementation has for ranges that are not a power of 2. It works
> correctly, but there are some considerations (transparency and chroot
> jail compatibility) that have led me to look at alternative
> implementations. This is where arc4random_uniform comes in. The way it
> avoids biased results is different from mine (my implementation uses the
> high-order bits instead), but the main idea of throwing away results
> that are outside the range that is a whole multiple of the desired range
> is exactly the same. It might be a viable alternative if I can convince
> myself (and others) that the values it generates do not favor one party
> over the other, no matter how slightly. Obviously, the criteria are
> different from those applied to cryptography, but I don't consider
> myself competent to express them formally.
>
> This is why I compared the function in lib/libc/crypt/arc4random.c to
> that in sys/dev/rnd.c. I was surprised to find that the 32-bit
> arithmetic in the libc version differs from that in the kernel version.
> Apparently, back in 2008 some change was only applied to the kernel, not
> to libc.
>
> This is what the kernel source code says:
> min = ((0xffffffff - upper_bound) + 1) % upper_bound;
>
> This is what the libc source code says:
> min = ((0xffffffff - (upper_bound * 2)) + 1) % upper_bound;
>
> Now, I'm not a math geek, but it seems pretty obvious to me that the
> version in the kernel source is to be preferred, given that it's simpler
> and does exactly what it has to do. I'm assuming this is also the reason
> the change was implemented, and that's what the commit log seems to say
> as well. So, I wonder why the libc source code was not changed. Was this
> an oversight?
>
> Secondly, the kernel code has the added advantage that it works for any
> possible value of upper_bound (you can check this for yourself if you
> don't believe me). This means that the upper_bound > 0x80000000 check
> can be removed, making the common case faster (upper_bound <= 0x80000000
> is definitely the common case in my book) and everything easier to
> understand and cleaner. Personally I don't see any real drawbacks, but
> feel free to disagree. ;)
>
> There are also other possible implementations that are even shorter,
> although they aren't necessarily easier to understand. In summary, the
> part between #else and #endif could be replaced by any of these lines:
> min = ((0xffffffff - upper_bound) + 1) % upper_bound;
> min = (1 + ~upper_bound) % upper_bound;
> min = (0 - upper_bound) % upper_bound;
> min = -upper_bound % upper_bound;
>
> The latter two implementations work correctly because the C standard
> defines modulo arithmetic for unsigned integers. Replace 0 by 2**32 and
> you'll see why it does what it's supposed to. I'm not entirely sure the
> last implementation would work on a machine that doesn't use two's
> complement. The standard is vague in its wording.
>
> Another option still would be to replace the entire #if block by this:
> min = (0x100000000 - upper_bound) % upper_bound;
>
> This will continue to work correctly even if int ever becomes 64-bit.
>
> Kind regards,
>
> Jorden Verwer



Arc4random_uniform

2012-06-08 Thread Jorden Verwer
Hello OpenBSD developers,

Let me state in advance that I'm not entirely sure if I'm sending this
to the right mailing list. Based on the descriptions, however, I do
believe tech@ is the correct one instead of misc@. Furthermore, I'm not
subscribed to any of the lists, so please CC me if you should want to
reply. Finally, I've sent this message before from another account, but
it didn't appear in any of the archives. If it did reach the list, I
apologize. In the meantime, I did write some extra things that could be
interesting.

I've been looking at the way several (BSD) operating systems implement
random numbers, in the context of an online election system (what to do
with an equal number of votes per seat, etc.). My current implementation
reads a byte from /dev/random and converts it to the required range,
possibly reading more bytes to avoid the systematic bias a naive
implementation has for ranges that are not a power of 2. It works
correctly, but there are some considerations (transparency and chroot
jail compatibility) that have led me to look at alternative
implementations. This is where arc4random_uniform comes in. The way it
avoids biased results is different from mine (my implementation uses the
high-order bits instead), but the main idea of throwing away results
that are outside the range that is a whole multiple of the desired range
is exactly the same. It might be a viable alternative if I can convince
myself (and others) that the values it generates do not favor one party
over the other, no matter how slightly. Obviously, the criteria are
different from those applied to cryptography, but I don't consider
myself competent to express them formally.
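
A minimal sketch of that kind of byte-based conversion (an illustration
with assumed names, using plain rejection rather than the high-order-bit
variant mentioned above):

#include <err.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Return a value uniform in [0, n) using bytes read from fd (e.g. an
 * open descriptor for /dev/random), rejecting the biased tail so that
 * ranges that are not a power of two stay unbiased.
 */
static unsigned
random_below(int fd, unsigned n)	/* 1 <= n <= 256 */
{
	uint8_t b;
	unsigned limit = 256 - 256 % n;	/* largest multiple of n <= 256 */

	for (;;) {
		if (read(fd, &b, sizeof(b)) != 1)
			err(1, "read");
		if (b < limit)
			return b % n;
	}
}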

This is why I compared the function in lib/libc/crypt/arc4random.c to
that in sys/dev/rnd.c. I was surprised to find that the 32-bit
arithmetic in the libc version differs from that in the kernel version.
Apparently, back in 2008 some change was only applied to the kernel, not
to libc.

This is what the kernel source code says:
min = ((0xffffffff - upper_bound) + 1) % upper_bound;

This is what the libc source code says:
min = ((0xffffffff - (upper_bound * 2)) + 1) % upper_bound;

Now, I'm not a math geek, but it seems pretty obvious to me that the
version in the kernel source is to be preferred, given that it's simpler
and does exactly what it has to do. I'm assuming this is also the reason
the change was implemented, and that's what the commit log seems to say
as well. So, I wonder why the libc source code was not changed. Was this
an oversight?

Secondly, the kernel code has the added advantage that it works for any
possible value of upper_bound (you can check this for yourself if you
don't believe me). This means that the upper_bound > 0x80000000 check
can be removed, making the common case faster (upper_bound <= 0x80000000
is definitely the common case in my book) and everything easier to
understand and cleaner. Personally I don't see any real drawbacks, but
feel free to disagree. ;)

There are also other possible implementations that are even shorter,
although they aren't necessarily easier to understand. In summary, the
part between #else and #endif could be replaced by any of these lines:
min = ((0xffffffff - upper_bound) + 1) % upper_bound;
min = (1 + ~upper_bound) % upper_bound;
min = (0 - upper_bound) % upper_bound;
min = -upper_bound % upper_bound;

The latter two implementations work correctly because the C standard
defines modulo arithmetic for unsigned integers. Replace 0 by 2**32 and
you'll see why it does what it's supposed to. I'm not entirely sure the
last implementation would work on a machine that doesn't use two's
complement. The standard is vague in its wording.

Another option still would be to replace the entire #if block by this:
min = (0x100000000 - upper_bound) % upper_bound;

This will continue to work correctly even if int ever becomes 64-bit.
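
A quick stand-alone check (illustrative only) that the listed
expressions agree, relying on uint32_t arithmetic being modulo 2**32:

#include <stdint.h>
#include <stdio.h>

/* All five spellings compute 2**32 % upper_bound. */
int
main(void)
{
	uint32_t bounds[] = { 1, 2, 3, 7, 1000, 0x80000000U, 0xfffffffeU, 0xffffffffU };

	for (size_t i = 0; i < sizeof(bounds) / sizeof(bounds[0]); i++) {
		uint32_t u = bounds[i];
		uint32_t a = ((0xffffffffU - u) + 1) % u;
		uint32_t b = (1 + ~u) % u;
		uint32_t c = (0 - u) % u;
		uint32_t d = -u % u;
		uint32_t e = (uint32_t)((0x100000000ULL - u) % u);

		printf("%u: %u %u %u %u %u\n", (unsigned)u, (unsigned)a,
		    (unsigned)b, (unsigned)c, (unsigned)d, (unsigned)e);
	}
	return 0;
}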

Kind regards,

Jorden Verwer