Hello all,

  I've been working through the Wideskies paper and Walter's slides. I
think I have an idea that would improve usability by a lot, but I want to
run it past you guys to see if it would work.

Here's my understanding of how the Wideskies Responder works:
- It maintains the values Y_0, ..., Y_(r-1), which hold the encrypted data
returned by the algorithm. From a Comp Sci perspective, I see the
collection of these values functioning as, basically, a big encrypted byte
buffer holding our results. It also maintains a list of counters c_0, ...,
c_t, one for each possible value of the hash function H.
- When a pair (T, D) comes in, we hash T to get H(T) and break D into
8-bit chunks. We then fold each of those chunks into our Y_i values using
E_H(T), the encrypted query element for the hash of T. That element is
chosen so that either a) we aren't interested in T, in which case
combining the chunk with Y_i does not change the value Y_i decrypts to (we
essentially multiply in an encryption of zero); or b) we are interested in
T, in which case combining the chunk with Y_i will not overwrite any byte
chunks for *other* interesting terms that have also been written to Y_i.
- We then increment the counter c_H(T) so that future items that hash to
the same value write to a different Y_i and do not overwrite the value we
just wrote.
- If *any* of these counters c_H(T) would exceed r when some new pair
(T, D) arrives, then we can no longer ensure that adding the new D to
Y_0, ..., Y_(r-1) won't overwrite data we have already written. From a
Comp Sci perspective, this looks like "our buffer is full," since we don't
have room for new data. Therefore, we have to terminate the algorithm and
return our list of Y's as they stand, without processing all of the data.
This is true even if most of the terms that hashed to H(T) were ones we
weren't interested in; in other words, if ANY term is extremely common,
even one we aren't interested in, we run the risk of being unable to
process all of the data. (I've tried to capture this loop in the Java
sketch below.)
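
To check my understanding, here's a rough Java sketch of the responder
loop as I've described it above. To be clear, this is just my mental
model, not Pirk's actual code: all of the names (ResponderSketch,
processPair, queryElements, etc.) are placeholders I made up, and I'm
assuming Paillier-style arithmetic, where multiplying ciphertexts adds the
underlying plaintexts.

    import java.math.BigInteger;

    // My mental model of the responder loop (hypothetical names, not Pirk's API).
    public class ResponderSketch {
      BigInteger[] Y;             // Y_0 .. Y_(r-1): the encrypted "byte buffer"
      int[] counters;             // c_0 .. c_t: one counter per hash value
      BigInteger[] queryElements; // E_0 .. E_t: encrypted query element per hash value
      BigInteger nSquared;        // Paillier modulus N^2
      int r;

      // Fold one (T, D) pair into the response. Returns false if the
      // buffer is "full" and the algorithm has to terminate early.
      boolean processPair(int hashOfT, byte[] chunks) {
        if (counters[hashOfT] + chunks.length > r) {
          return false; // can no longer guarantee we won't overwrite data
        }
        for (byte chunk : chunks) {
          int i = counters[hashOfT]++; // next free row for this hash value
          // E^chunk decrypts to 0 if we aren't interested in T, so Y_i's
          // plaintext is unchanged; otherwise the chunk lands in a byte
          // position no other interesting term occupies.
          BigInteger contribution = queryElements[hashOfT]
              .modPow(BigInteger.valueOf(chunk & 0xFF), nSquared);
          Y[i] = Y[i].multiply(contribution).mod(nSquared);
        }
        return true;
      }
    }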

Please let me know if I'm off the mark on any of that :-)

So my idea is as follows: in the programming world, when we are processing
a large amount of data and fill up our current write buffer, we just
output that buffer, make a new one, and continue. Why wouldn't that work
here? In other words, when we encounter a pair (T, D) that would cause one
of the counters c_H(T) to exceed r, why couldn't we just push the current
Y_0, ..., Y_(r-1) entries onto a list, create a new set of entries Y'_0,
..., Y'_(r-1), reset all the counters to 0, and continue processing (T, D)
as normal? The algorithm could then return all of the buffers Y, Y', Y'',
etc. that it created instead of just Y, and we would be guaranteed to
process all of the data.
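
Concretely, extending the hypothetical sketch above, the change I'm
imagining would look something like this (again, made-up names;
finishedBuffers is the list of flushed Y's the responder would return):

    // New members for the ResponderSketch class above.
    java.util.List<BigInteger[]> finishedBuffers = new java.util.ArrayList<>();

    // Same as processPair, but flushes the buffer instead of giving up.
    void processPairWithFlush(int hashOfT, byte[] chunks) {
      if (counters[hashOfT] + chunks.length > r) {
        // "Output the buffer and make a new one": push the current Y's
        // onto the result list, allocate fresh Y'_0 .. Y'_(r-1), and
        // reset every counter to 0.
        finishedBuffers.add(Y);
        Y = freshBuffer();
        java.util.Arrays.fill(counters, 0);
      }
      // Now guaranteed to fit, as long as a single D needs at most r rows.
      processPair(hashOfT, chunks);
    }

    BigInteger[] freshBuffer() {
      BigInteger[] fresh = new BigInteger[r];
      // In Paillier, the ciphertext 1 decrypts to 0, so each fresh entry
      // starts out as an encryption of zero.
      java.util.Arrays.fill(fresh, BigInteger.ONE);
      return fresh;
    }

The responder would then return finishedBuffers (plus the final, partially
filled Y) instead of a single Y.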

As far as I can tell, this gives the responder no additional information
about our query terms beyond what it already had, so I don't think it
would weaken the privacy of the algorithm at all; please correct me if I'm
missing something. On the other hand, I think it would make PIRK much
easier to use, since a query would no longer risk failing when we don't
understand the data it's running over well enough to tune r appropriately.
I can also imagine this leading to some performance improvements, since we
could let PIRK choose r automatically so that a result set fits nicely
within a hard disk page, network frame, etc.

What do you guys think?

Regards,
-Ryan Carr
