If the dictionary contains more than 65536 words (/usr/share/hunspell/en_US.dic does not) and you want the last words to have a chance of being picked, then -N 2 should become -N 3.

Well, except that it will then take a while to get a valid word number if there are, say, 65537 words. Indeed, 65537/2^24 = 0.00390631, that is, a probability below 0.4% of picking a valid word number. It takes log(0.5) / log(1-0.00390631) = 177 iterations (and accesses to the entropy pool) to have an even chance of having chosen each word, and 1/0.00390631 = 256 iterations on average.
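For anyone who wants to redo that arithmetic with other numbers, here is a small sketch. The function name rejection_stats is made up for this example; it only assumes an awk with log() and the ^ operator, and it treats each draw as an independent try that succeeds with probability p = words / 2^(8 x bytes).

```shell
# rejection_stats WORDS BYTES (hypothetical helper, for illustration only)
# p      = chance that one draw of BYTES random bytes is a valid word number
# mean   = 1/p, the average number of draws per word
# median = log(0.5)/log(1-p), the number of draws needed for an even chance
rejection_stats() {
    LC_ALL=C awk -v n="$1" -v b="$2" 'BEGIN {
        p = n / 2^(8 * b)
        printf "p=%.8f mean=%.0f median=%.0f\n", p, 1 / p, log(0.5) / log(1 - p)
    }'
}

rejection_stats 65537 3
```

It only makes sense when the number of words is smaller than 2^(8 x bytes); otherwise p is not a probability any more.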

The "if" test guarantees that every word has *exactly* the same probability of being picked. But it is not worth it. It is good enough to take a random, potentially large number (4 bytes below) and reduce it with a modulo:

$ words=6; dic=/usr/share/hunspell/en_US.dic; max=`wc -l < $dic`; for i in `seq $words`; do r=`od -A n -N 4 -t u4 /dev/random`; cut -d / -f 1 $dic | sed -n `expr $r % $max + 1`p; done | tr '\n' ' '
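In case the one-liner is hard to read, the same steps can be spelled out as a few functions. This is only a sketch of the idea, not a replacement: the names word_at, random_u32 and passphrase are invented here, and it assumes a shell with 64-bit $(( )) arithmetic.

```shell
# word_at R DIC: print the word on line (R mod number_of_lines) + 1 of DIC,
# dropping anything after "/" (the hunspell flags).
word_at() {
    max=$(wc -l < "$2")
    cut -d / -f 1 "$2" | sed -n "$(( $1 % $max + 1 ))p"
}

# random_u32: read 4 bytes from /dev/random as one unsigned decimal number.
random_u32() {
    od -A n -N 4 -t u4 /dev/random | tr -d ' '
}

# passphrase WORDS DIC: print WORDS random words separated by spaces.
passphrase() {
    i=0
    while [ "$i" -lt "$1" ]; do
        word_at "$(random_u32)" "$2"
        i=$((i + 1))
    done | tr '\n' ' '
}
```

With that, "passphrase 6 /usr/share/hunspell/en_US.dic" should behave like the command above.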

Unlike the command in the post right above, this one gives the first words in the dictionary a slightly greater probability of being picked. For instance, since /usr/share/hunspell/en_US.dic has 62155 words and 2^32 = 62155 x 69100 + 56796, the first 56796 words are more likely to be chosen by a factor of 69101 / 69100 = 1.000014472, i.e., about 0.001% more likely. Who cares? If you do, just increase the two numbers "4" in the command above a little and you will gain many more zeros in that last percentage. With "5", it would already be 0.000006%. (Note that GNU od only accepts 1, 2, 4 or 8 as the size after "u", so in practice the next step up is 8.)
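The bias figures can be checked for any dictionary size and byte count with a sketch like the following. The name modulo_bias is hypothetical, and it assumes 64-bit shell arithmetic, so it works for up to 7 bytes.

```shell
# modulo_bias WORDS BYTES (hypothetical helper, for illustration only)
# Writes 2^(8*BYTES) as WORDS x q + rem: the first rem words can be hit
# q + 1 times, the others q times, so the first ones are 100/q percent
# more likely.
modulo_bias() {
    range=$((1 << (8 * $2)))
    q=$((range / $1))
    rem=$((range % $1))
    pct=$(LC_ALL=C awk -v q="$q" 'BEGIN { printf "%.6f", 100 / q }')
    echo "q=$q rem=$rem bias=$pct%"
}

modulo_bias 62155 4   # prints: q=69100 rem=56796 bias=0.001447%
modulo_bias 62155 5
```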

Anyway, I am writing to myself, thinking up fun little solutions instead of actually working! But if somebody here wants to understand the command, I would be happy to have that additional opportunity to escape the (lot of) work I have to do!
