That script generates a 164KB file with 4096 entries in about five minutes
real time.
I edited my previous post to reduce the unnecessarily high precision. Now,
on my system, generating 4096 addresses takes ~15s.
grep -c -o -i :0 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4095
grep -c -o -i :00 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4053
grep -c -o -i :000 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 3599
Options -o and -i are useless here (the patterns contain no letter, so -i has
nothing to case-fold). You may believe that -o makes 'grep -c' count all
occurrences on a line. It does not: 'grep -c' still counts the number of
lines with at least one occurrence. Those outputs therefore mean that, in
IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt, all addresses but one have at
least one of their six random groups of four hexadecimal digits starting
with "0", 99.0% have at least one group starting with "00", and 87.9% have
at least one group starting with "000".
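To see the difference on a toy example (a made-up two-line file, not your actual list):

```shell
# Two addresses; three occurrences of ":0" in total, spread over two lines.
printf '2a02:2788:0:0:aaaa:bbbb:cccc:dddd\n2a02:2788:1:0:aaaa:bbbb:cccc:dddd\n' > demo.txt

grep -c -o ':0' demo.txt          # prints 2: -c counts matching *lines*, with or without -o
grep -o ':0' demo.txt | wc -l     # prints 3: -o emits one line per occurrence, wc -l counts them
```

Piping -o's output into 'wc -l' is the usual way to count every occurrence rather than every matching line.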
Extending Magic Banana's reasoning about the relative frequency of
occurrences of :0001, :0002 and :0003, the relative frequencies of the
occurrences of :0xxx, :00xx, and :000x in a 4096-row list of IPv6 addresses
ought to be 256/4096, 16/4096, and 1/4096, respectively. In a 65,536-address
list, prefix::0/128 may happen just once.
First of all, it is not a reasoning but a choice of distribution to sample
from. I believe groups of four hexadecimal digits chosen by local network
administrators approximately follow a Zipfian distribution. The exponent may
not be 1 though. A more realistic exponent could be fitted from real-world
addresses, by regression.
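To make that choice of distribution concrete, here is a minimal AWK sketch of inverse-CDF sampling of one group from a Zipfian distribution over the 65,536 possible values, assuming (as in my programs below) that rank i corresponds to the hexadecimal value i - 1; the exponent s and the seed are parameters:

```shell
awk -v s=1 -v seed=42 'BEGIN {
    n = 16^4                        # 65,536 four-hex-digit groups
    for (i = 1; i <= n; ++i)        # unnormalized cumulative Zipf weights
        cdf[i] = cdf[i - 1] + 1 / i^s
    srand(seed)
    u = rand() * cdf[n]             # uniform draw on [0, cdf[n])
    for (i = 1; cdf[i] < u; ++i) ;  # first rank whose cumulative weight reaches u
    printf "sampled group: %04x\n", i - 1
}'
```

With s = 1, the normalizing constant cdf[n] is the harmonic number H(65536) ≈ 11.6676, which is where the probability ≈ 0.0857 of sampling 0000 below comes from.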
Your math looks wrong. If mine is correct, the following AWK program outputs
the probabilities of sampling 0000, 000x, 00xx or 0xxx for one group:
$ awk 'BEGIN {
    i = 1
    for (p = 0; p != 5; ++p) {
        for (; i < 16^p + 1; ++i)
            cdf += 1 / i
        partial[p] = cdf }
    for (p = 0; p != 4; ++p)
        print partial[p] / cdf }'
0.0857076
0.289754
0.524903
0.762378
Technically, generating 0000 or not is the realization of a Bernoulli
variable of parameter 0.0857076, generating 000x or not is that of a
Bernoulli variable of parameter 0.289754, etc. Complementing the above
program, here are, among 4096 addresses, the expected numbers of addresses
with at least one 0000, at least one 000x, at least one 00xx and at least one
0xxx:
$ awk 'BEGIN {
    i = 1
    for (p = 0; p != 5; ++p) {
        for (; i < 16^p + 1; ++i)
            cdf += 1 / i
        partial[p] = cdf }
    for (p = 0; p != 4; ++p)
        print 4096 - 4096 * (1 - partial[p] / cdf)^6 }'
1703.4
3570.21
4048.9
4095.26
That looks compatible with the counts your 'grep -c' commands output. A
p-value could be computed... but I will stop here with the statistics!
It would appear that one needs to concatenate the variously randomized lists
of addresses, eliminate duplicates, and then apply the last pair of scripts
to achieve a relatively accurate evaluation of the target CIDR block.
Duplicates are unlikely. I will not do the math to compute the probability
of any duplicate. Notice however that the probability of getting the most
likely address, the one ending with 0000:0000:0000:0000:0000:0000, is
0.0857076^6 ≈ 0.000000396384. That is about 4 in 10 million. You can figure
out that getting it twice or more among 4096 addresses is therefore
extremely unlikely.
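For whoever wants a number anyway, here is a quick AWK sketch of the binomial computation for that single most likely address (not the full duplicate probability over all addresses):

```shell
awk 'BEGIN {
    p = 0.0857076^6                 # probability of the all-zero 96-bit suffix
    n = 4096                        # number of generated addresses
    none = (1 - p)^n                # no occurrence among the n addresses
    once = n * p * (1 - p)^(n - 1)  # exactly one occurrence
    printf "P(twice or more) = %.3g\n", 1 - none - once
}'
```

It prints 1.32e-06, i.e., about one chance in a million for the single address most favored by the distribution.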
Could it be that the 79,228,162,514,264,337,593,543,950,336 addresses in
2a02:2788::/32 are dynamically generated on demand?
If you could generate one billion addresses per second, it would take
79,228,162,514,264,337,594 seconds to generate them all. That is more than
2,510 billion years.
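The arithmetic behind those figures, for the record (a quick AWK check; the 365.25-day year is my choice):

```shell
awk 'BEGIN {
    addresses = 2^96                # a /32 leaves 128 - 32 = 96 free bits
    seconds = addresses / 1e9      # at one billion addresses per second
    years = seconds / (365.25 * 24 * 3600)
    printf "%.4g seconds, i.e. %.4g years\n", seconds, years
}'
```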