On Thu, Jan 14, 2010 at 11:01 AM, David Gerard <[email protected]> wrote:
> 2010/1/14 Bryan Tong Minh <[email protected]>:
>> On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske
>> <[email protected]> wrote:
>
>>> * log search and SHA1 IP hash (anonymous!)
>
>> There are only 2 billion unique addresses and they can all be found in
>> half an hour probably.
>
>
> A count of search terms, with no IP info at all? Would be more useful
> than nothing.
>
> (modulo the issue Michael Snow raised re: searches on suppressable names)

Magnus was not suggesting disclosing the IP hash, as far as I can
tell. He was demonstrating an abundance of caution in suggesting only
logging that.

(er, well, yea, if he was suggesting disclosing that... we shouldn't do
that. Even if we add a secret to the hash, it's risky and allows
interesting correlation attacks)

Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics

Which has first been filtered by:
* Canonicalization of strings (at least ASCII case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs
  during the reporting interval

There will be useful information excluded by this process, e.g. gads
of misspellings which came from only two to four unique IPs... but the
output would still be *far* more useful than no information at all.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
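
For what it's worth, the three filters listed above could be sketched roughly as follows. This is only an illustration, not anything that exists in the search backend: the function names, the (ip, query) input shape, and the 64-character cutoff are all assumptions (the proposal only says "some length"). It also uses Python's full Unicode case folding, which goes beyond the ASCII minimum, and collapses whitespace, which the proposal does not mention.

```python
from collections import defaultdict

MAX_QUERY_LENGTH = 64   # assumed value; the proposal only says "some length"
MIN_DISTINCT_IPS = 5    # threshold named in the proposal


def canonicalize(query):
    """Case-fold the query and collapse runs of whitespace.

    casefold() covers ASCII case folding and more; the whitespace
    collapsing is an extra assumption of this sketch.
    """
    return " ".join(query.casefold().split())


def publishable_counts(log_entries,
                       max_length=MAX_QUERY_LENGTH,
                       min_ips=MIN_DISTINCT_IPS):
    """Return [(hits, query), ...] that are safe to publish.

    log_entries is an iterable of (ip, raw_query) pairs for one
    reporting interval. IPs are used only for the distinct-IP test
    and never appear in the output.
    """
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, raw_query in log_entries:
        query = canonicalize(raw_query)
        # Drop empty queries and anything over the length limit.
        if not query or len(query) > max_length:
            continue
        hits[query] += 1
        ips[query].add(ip)
    # Keep only queries seen from enough distinct IPs.
    rows = [(count, query) for query, count in hits.items()
            if len(ips[query]) >= min_ips]
    # Most frequent first, ties broken alphabetically.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


if __name__ == "__main__":
    # Six different IPs search for the same thing; one IP repeats a typo.
    sample = [("10.0.0.%d" % i, "Hot Grits") for i in range(6)]
    sample += [("10.0.0.1", "rare typo")] * 9
    for count, query in publishable_counts(sample):
        print(count, query)
```

Note that the hit count and the distinct-IP count are tracked separately: a query repeated many times from a single address still gets dropped, which is the point of the third filter.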
