That script generates a 164KB file with 4096 entries in about five minutes real time.

I edited my previous post to reduce the unnecessarily high precision. Now, on my system, generating 4096 addresses takes ~15s.

grep -c -o -i :0 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4095
grep -c -o -i :00 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4053
grep -c -o -i :000 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 3599

Options -o and -i are useless here. You may believe that -o makes 'grep -c' count all occurrences on a line. It does not: 'grep -c' still counts the number of lines with at least one occurrence among the six random groups of four hexadecimal digits. Those outputs therefore mean that, in IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt, all addresses but one have at least one group that starts with "0", 99.0% have at least one group that starts with "00", and 87.9% have at least one group that starts with "000".
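
For instance, on a toy three-line input (not your file), -o does not change what 'grep -c' counts; to count every occurrence, print each match with -o and count the printed lines instead:

```shell
# Even with -o, 'grep -c' counts matching lines, not matches.
printf 'a0a0\nb0b\nccc\n' | grep -c -o 0
# To count every occurrence, print one match per line with -o and count them.
printf 'a0a0\nb0b\nccc\n' | grep -o 0 | wc -l
```

The first command prints 2 (two lines contain "0") whereas the second prints 3 (three occurrences in total).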

Extending Magic Banana's reasoning about the relative frequency of occurrences of :0001, :0002 and :0003, the relative frequencies of the occurrences of :0xxx, :00xx, and :000x in a 4096-row list of IPv6 addresses ought to be 256/4096, 16/4096, and 1/4096, respectively. In a 65,536-address list, prefix::0/128 may happen just once.

First of all, it is not a reasoning but a choice of distribution to sample from. I believe groups of four hexadecimal digits chosen by local network administrators approximately follow a Zipfian distribution. The exponent may not be 1 though. A more realistic exponent could be fitted from real-world addresses, by regression.
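As a sketch of that choice (this is not the actual script from the earlier posts), an AWK program can sample groups of four hexadecimal digits from a Zipfian distribution with exponent 1, rank 1 being 0000, rank 2 being 0001, etc., by inverting the cumulative distribution:

```shell
# Sketch only: draw N four-hexadecimal-digit groups from Zipf(1) over the
# 65,536 possible groups, 0000 being the most likely, by inverse CDF.
awk -v N=8 'BEGIN {
  srand()
  for (i = 1; i <= 65536; ++i) cdf[i] = cdf[i - 1] + 1 / i
  for (n = 0; n != N; ++n) {
    r = rand() * cdf[65536]
    lo = 1; hi = 65536
    while (lo < hi) {                # binary search for the sampled rank
      mid = int((lo + hi) / 2)
      if (cdf[mid] < r) lo = mid + 1; else hi = mid
    }
    printf "%04x\n", lo - 1         # rank 1 maps to the group 0000
  }
}'
```

A fitted exponent s would simply turn 1 / i into 1 / i ^ s.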

Your math looks wrong. If mine is correct, the following AWK program outputs the probabilities of sampling 0000, 000x, 00xx or 0xxx:
$ awk 'BEGIN { i = 1; for (p = 0; p != 5; ++p) { for (; i < 16^p + 1; ++i) cdf += 1 / i; partial[p] = cdf }; for (p = 0; p != 4; ++p) print partial[p] / cdf }'
0.0857076
0.289754
0.524903
0.762378

Technically, generating 0000 or not is the realization of a Bernoulli variable of parameter 0.0857076, generating 000x or not is that of a Bernoulli variable of parameter 0.289754, etc. Complementing the above program, here are, among 4096 addresses, the expected numbers of addresses with at least one 0000, at least one 000x, at least one 00xx and at least one 0xxx:
$ awk 'BEGIN { i = 1; for (p = 0; p != 5; ++p) { for (; i < 16^p + 1; ++i) cdf += 1 / i; partial[p] = cdf }; for (p = 0; p != 4; ++p) print 4096 - 4096 * (1 - partial[p] / cdf)^6 }'
1703.4
3570.21
4048.9
4095.26
That looks compatible with the counts your 'grep -c' commands output. A p-value could be computed... but I will stop here with the statistics!

It would appear that one needs to concatenate the variously randomized lists of addresses, eliminate duplicates, and then apply the last pair of scripts to achieve a relatively accurate evaluation of the target CIDR block.

Duplicates are unlikely. I will not do the math to compute the probability of any duplicate. Notice however that the probability of getting the most likely address, ending with 0000:0000:0000:0000:0000:0000, is 0.0857076^6 = 0.000000396384. That is about 4 in 10 million. You can figure out that getting it twice or more among 4096 addresses is therefore extremely unlikely.
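
Sticking to that most likely address, the binomial complement gives the probability that it shows up twice or more among 4096 draws:

```shell
# P(at least two all-zero tails among 4096 draws)
# = 1 - P(none) - P(exactly one), with p = 0.0857076^6 per draw.
awk 'BEGIN {
  p = 0.0857076 ^ 6
  print 1 - (1 - p) ^ 4096 - 4096 * p * (1 - p) ^ 4095
}'
```

That is on the order of one in a million.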

Could it be that the 79,228,162,514,264,337,593,543,950,336 addresses in 2a02:2788::/32 are dynamically generated on demand?

If you could generate one billion addresses per second, it would take 79,228,162,514,264,337,594 seconds to generate them all. That is more than 2,510 billion years.
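
That figure can be checked with AWK (31,556,952 being the number of seconds in an average Gregorian year):

```shell
# 2^96 addresses, one billion per second, converted to years.
awk 'BEGIN { print 2 ^ 96 / 1e9 / 31556952 }'
```

It prints about 2.51e+12, i.e., more than 2,510 billion years.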
