On Tue, Mar 20, 2012 at 20:11, Ken Takusagawa II <[email protected]> wrote:
> 1. You need 2^8=256 templates, not just 8, to reach 6*12+8=80 bits.
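
To make the bit budget in that point concrete, here is a minimal sketch of the arithmetic, assuming six word slots of 12 bits each (one 4096-word dictionary per slot) plus an 8-bit template index. The names `split_bits` and `join_bits` and the bit ordering are hypothetical, chosen only for illustration; this is not the actual encoding, which isn't fixed yet.

```python
# Illustrative only: layout and names are assumptions, not the real scheme.
TEMPLATE_BITS = 8        # 2**8 = 256 templates
WORD_BITS = 12           # 2**12 = 4096 words per dictionary
WORDS_PER_SENTENCE = 6   # six word slots per template (assumed)
TOTAL_BITS = TEMPLATE_BITS + WORD_BITS * WORDS_PER_SENTENCE  # 8 + 6*12 = 80


def split_bits(value):
    """Split an 80-bit integer into (template_index, [six word indices])."""
    if not 0 <= value < (1 << TOTAL_BITS):
        raise ValueError("value does not fit in %d bits" % TOTAL_BITS)
    word_mask = (1 << WORD_BITS) - 1
    words = []
    for _ in range(WORDS_PER_SENTENCE):
        words.append(value & word_mask)  # peel off the lowest 12 bits
        value >>= WORD_BITS
    words.reverse()                      # first slot = most significant word
    return value, words                  # what remains is the 8-bit template


def join_bits(template, words):
    """Inverse of split_bits: rebuild the 80-bit integer."""
    value = template
    for w in words:
        value = (value << WORD_BITS) | w
    return value
```

With only 8 templates (3 bits) the same layout would carry 6*12+3=75 bits, which is why 256 templates are needed to reach 80.
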

We won't know for sure how it hashes out until we make both the dictionaries and the syntax generator. The ambiguity was intentional. But yes, it may well use a number of generated templates. We're thinking of making it symbolic-expansion based, which is more efficient on bits but also more complicated to describe before it's fixed (and it'll require a parser library).

> 2. Having toyed with this idea in the past, let me warn that forming a 4096
> word dictionary of memorable, non-colliding words for each word category is
> going to be very difficult. Too many words are semantically similar,
> phonetically similar, or just unfamiliar.

Our intention currently is to first take candidate dictionaries from WordNet, and use a combination of WordNet and Google 1-gram frequency data as part of the cutoff for whether words are adequately familiar. (N-grams with n >= 2 are rather irrelevant to our needs, AFAICT.)

> http://kenta.blogspot.com/2012/02/lefoezyy-some-notes-on-google-books.html

Thanks; that could be useful.

> Another way to go about it might be to first catalogue semantic categories
> (colors, animals, etc.) then list the most common (yet dissimilar) members
> of each category. An attempt at 64 words is here:

This is something that WordNet has already done.

> http://kenta.blogspot.com/2011/10/xpmqawkv-common-words.html

I think you omit far more common words, which you shouldn't: e.g. air, water, coal, man, house, etc. But quibbling at this level is pointless; we'll need to be dealing with dictionaries mostly on the order of a few thousand words, sorted by *constituent types*, not by semantic categories. (E.g. one dictionary would be "nouns that can be the target of a transitive verb".)

> I'd propose that the "right" way to do this is not just sentences, but
> entire semantically consistent stories, written in rhyming verse, with
> entropy of perhaps only a few bits per sentence. (Prehistoric oral
> tradition does prove we can memorize such poems.) However, synthesizing
> these seem extremely difficult, an AI problem.

I think it's currently impossible to do that, and furthermore, that it's *not* Right even if you could, because it would violate a key constraint: that it can be reasonably typed as a domain. It shouldn't take longer than a few seconds to remember and type. It won't be as fast as typing "google.com", and that's OK, but I think that level of redundant expansion is way too much.

Creating unambiguously parseable syntaxes and dictionaries that meet our stated constraints is already hard enough. ;-)

> 3. I presume people are familiar with Bubblebabble? It doesn't solve all
> the problems, but does make bit strings seem less "dense".

BubbleBabble produces nonwords; as such it fails a basic requirement. Making something merely look phonotactically valid isn't enough; it has to be grammatically valid and composed entirely of known terms.

- Sai

_______________________________________________
tor-dev mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
