Re: Entropy Definition (was Re: passphrases with more than 160 bits of entropy)

2006-03-26 Thread John Denker

In the context of

>> 0 occurs with probability 1/2
>> each other number from 1 to 2^{160} happens with
>> probability 2^{-161}.

I wrote:

> This ... serves to illustrate, in an exaggerated way, the necessity
> of not assuming that the raw data words are IID (independent and identically
> distributed).

Correction:  IID isn't the issue here.  The raw data words described
above are IID.  That's not the problem.  The problem is that the
distribution is highly skewed.

Executive summary:  Small samples do not always exhibit "average" behavior.

Let me explain.  There is a very simple way to look at this.

Consider the expression for the entropy of the source,
   S := Sum_i P_i log(1/P_i)  [1]

where i runs over all symbols in the alphabet.

One may, with a little care, attribute log(1/P_i) bits of unpredictability
to the (i)th symbol in the alphabet.  Then S can be interpreted as the
appropriately weighted _average_ unpredictability per symbol.
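
To make that concrete, here is a small Python sketch for the distribution
quoted above (the 2^160 rare symbols are handled analytically, since their
individual probabilities underflow floating point):

    import math

    # The quoted source: "0" with probability 1/2; each of the 2^160
    # other symbols with probability 2^-161.
    surprisal_zero = math.log2(1 / 0.5)   # log(1/P_i) = 1 bit
    surprisal_rare = 161                  # log2(2^161) bits, kept symbolic

    # S = Sum_i P_i log(1/P_i).  The rare symbols together carry
    # probability 1/2, so the sum collapses to two terms:
    S = 0.5 * surprisal_zero + 0.5 * surprisal_rare
    print(S)   # 81.0 bits per symbol, on average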

In the example quoted above, the minimum log(1/P_i) is vastly smaller than
the average -- namely 1 bit for the zero symbol, versus an average S of 81
bits (each of the rare symbols carries 161 bits).  So focusing on the average
is unlikely to solve all the world's problems.

In crypto applications (including RNG construction), a crude yet provably
correct approach is to rely on the minimum (not the average) per-symbol
unpredictability.  Using this approach, it would require 80 samples of the
given distribution to produce an output with 80 bits of entropy.
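
A minimal sketch of that crude approach, assuming SHA-256 as the
conditioning hash (no particular hash is implied above) and a simulated
copy of the source:

    import hashlib
    import secrets

    def draw_sample() -> int:
        """Simulate the source: 0 with probability 1/2, otherwise
        uniform over the 2^160 values 1..2^160."""
        if secrets.randbits(1):
            return 0
        return 1 + secrets.randbits(160)

    def conservative_output() -> bytes:
        # Credit each sample with only the minimum per-symbol
        # unpredictability -- here 1 bit, the surprisal of the most
        # probable symbol -- so 80 samples are needed for 80 bits.
        h = hashlib.sha256()
        for _ in range(80):
            h.update(draw_sample().to_bytes(21, "big"))  # 2^160 fits in 21 bytes
        return h.digest()[:10]   # 80-bit output

    print(conservative_output().hex())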

Things can only get better from here:
 -- With full knowledge of the source statistics, one can distinguish the large
  log(1/P_i) words from the small log(1/P_i) words.  This would allow the number
  of required samples to be closer to the typical value (2) than to the
  worst-case value (80).  An example of this is the Huffman coding I discussed
  in my previous note.
 -- With even modest knowledge of the given source, one can (by histogramming
  if nothing else) estimate the probabilities well enough to yield worthwhile
  improvements in efficiency.
 -- If you want to produce a long sequence of output words, as you often do, a
  reasonable-sized buffer is all you really need to come fairly close to
  optimal efficiency, namely only about one sample of the distribution, on
  average, per 80-bit output word; see the sketch after this list.  The chance
  of getting fooled can be made very small, at a modest cost in buffer size
  and extra samples of the distribution.  That is, the law of large numbers
  comes to your rescue sooner or later, usually rather soon.
 -- It may be possible to engineer a different source with a larger minimum
  log(1/P_i).
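
Here is the sketch promised in the third item, under the same assumptions
as before (known source statistics, a simulated copy of the source, and
SHAKE-256 standing in for the output hash):

    import hashlib
    import secrets

    def draw_sample() -> int:
        # Same simulated source as in the previous sketch.
        if secrets.randbits(1):
            return 0
        return 1 + secrets.randbits(160)

    # Buffer samples, tallying the surprisal log2(1/P_i) of each symbol
    # actually seen, plus a safety margin; then emit (tally - margin)
    # bits through an extendable-output hash.
    MARGIN = 64                      # bits; the "modest cost" in the text
    TARGET = 80 * 100                # one hundred 80-bit output words
    buf, tally = [], 0
    while tally < TARGET + MARGIN:
        s = draw_sample()
        buf.append(s)
        tally += 1 if s == 0 else 161   # log2(1/P_i) of that symbol

    xof = hashlib.shake_256()
    for s in buf:
        xof.update(s.to_bytes(21, "big"))
    output = xof.digest((tally - MARGIN) // 8)

    # On average each sample carries 81 bits, so the cost approaches
    # one sample per 80-bit output word:
    print(len(buf), "samples yielded", tally - MARGIN, "output bits")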

Bottom line:  Setting up a highly-skewed distribution and then drawing from
it only a single sample is not guaranteed to produce "average" behavior.  Duh,
no surprise there!

The source entropy S in equation [1] is a single number that characterizes the
"average" behavior.  For a small sample from a highly-skewed distribution, S
doesn't tell you everything you need to know.  This has no bearing on the
definition of entropy;  entropy is defined by equation [1].  It just means that
equation [1] by itself doesn't solve all the world's problems.




Re: Entropy Definition (was Re: passphrases with more than 160 bits of entropy)

2006-03-26 Thread John Kelsey
>From: John Denker <[EMAIL PROTECTED]>
>Sent: Mar 24, 2006 11:57 AM
>To: Erik Zenner <[EMAIL PROTECTED]>, cryptography@metzdowd.com
>Subject: Re: Entropy Definition (was Re: passphrases with more than 160 bits 
>of entropy)

>Erik Zenner wrote:
>
>>>0 occurs with probability 1/2
>>>each other number from 1 to 2^{160} happens with 
>>>probability 2^{-161}.
>
>>  Is anyone aware of whether (and where) this was 
>> discussed in the literature, or what other approaches are taken?
>
>This particular problem is contrived or at least exaggerated.  

The example is a contrived, pathological case.  And there are a bunch
of easy solutions once you know this is the distribution.  The point
is to demonstrate that Shannon entropy gives you the wrong answer when
you try to use it here.

Now, you might ask if this is a problem in a real-world setting.  So
let's imagine a very simple distribution--sequences of flips of a
biased coin.  There are nice ways to remove the bias, but let's
imagine we're not doing that--instead, we're going to take a sequence
of coin flips, turn it into a 0/1 sequence, and then hash it down to
get a key.  
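
(The classic way to remove the bias is von Neumann's trick: look at
non-overlapping pairs of flips, output the first flip of each mixed pair,
and discard the HH and TT pairs.  A quick Python sketch, for reference,
with 1 = heads and Kelsey's 0.9 bias:)

    import random

    def von_neumann(flips):
        # A (1,0) pair emits 1; a (0,1) pair emits 0; (0,0) and (1,1)
        # pairs are discarded.  Both surviving pairs have probability
        # p*(1-p), so the output is unbiased -- provided the input
        # flips are independent.
        return [a for a, b in zip(flips[::2], flips[1::2]) if a != b]

    flips = [1 if random.random() < 0.9 else 0 for _ in range(10_000)]
    bits = von_neumann(flips)
    print(len(bits), sum(bits) / len(bits))   # ~900 bits, mean near 0.5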

Suppose the coin is biased so that heads comes up 0.9 of the time, and
that we generate 16-bit sequences at a time.  The Shannon entropy of a
16-bit sequence is about 7.5, but the most likely symbol (all heads)
comes up about 18% of the time.  So if we tried to generate a 7-bit
"key", we'd lose on the attacker's first guess 18% of the time. 

So, let's go further with this.  We want to generate a DES key, with
56 bits of entropy.  A 64-bit sequence produced with this biased coin
has Shannon entropy of about 30 bits, which might suggest an attacker's
first guess succeeds with probability on the order of 2^-30.  In fact
he has about a 1/1000 chance of guessing the DES key we derive from it
on his first try, which is unacceptable for just about any crypto
application.  (The min-entropy is only about ten bits, and 2^-10 is
roughly 1/1000.)
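
As a quick check of the arithmetic in the two paragraphs above
(a Python sketch):

    import math

    p = 0.9                                               # bias toward heads
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # Shannon bits/flip

    print(16 * H)              # ~7.50 bits for a 16-flip sequence
    print(p ** 16)             # ~0.185: all-heads about 18% of the time

    print(64 * H)              # ~30 bits for a 64-flip sequence
    print(64 * -math.log2(p))  # ~9.7 bits of min-entropy
    print(p ** 64)             # ~0.0012: about a 1/1000 first guess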

So yes, I used a contrived example to demonstrate the problem, but no,
the problem isn't just an ivory-tower concern.

The intuition here is that Shannon entropy is concerned with
communications channels, where we assume we have to send every
symbol.  So when you have lots of low-probability symbols, and some
low-probability symbol is chosen 999 times out of 1000, the bandwidth
you need is dominated by transmitting those symbols.  Most of the
time, you're sending one of a large set of symbols, and they all have
to be distinguished from one another.  

The situation for the attacker is a lot more like having a channel
where it's acceptable to only send a small fraction of those symbols,
and just drop the rest.  That is, it's okay for the attacker's model
to just ignore the huge set of low-probability symbols that occur
999/1000 of the time, and instead just transmit the highest
probability symbol 1/1000 of the time.  Instead of transmitting it,
he's just guessing it.  When he gets it right, he learns your DES
key.  

...
>This version serves to illustrate, in an exaggerated way, the necessity
>of not assuming that the raw data words are IID (independent and identically
>distributed).

Yes!  The "independent" part is much harder to deal with than the
per-symbol distribution, in many real-world applications.  The worst
of these are operating system events (used in Windows and various
Unixes to get random bits).  Peter Gutmann did some really nice work
on this, across a bunch of operating systems, in a Usenix paper; he
later updated it, and a version of it is on his website.  If you're
doing OS-event sampling to generate random bits, you really ought to
look at his stuff.  (But if you can get some hardware support, either
directly or via sampling the microphone like the Turbid design does,
you're probably on much firmer ground.)  

--John Kelsey, NIST

