[Wikitech-l] Log of failed searches

2010-01-14 Thread Apoc 2400
Would it be possible to generate a log or statistics of searches on Wikipedia using the Go button that did not immediately reach an article? Properly anonymized of course. I think it would be useful for finding missing articles and redirects to create. There would be a lot of crap of course, but

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Magnus Manske
On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 apoc2...@gmail.com wrote: Would it be possible to generate a log or statistics of searches on Wikipedia using the Go button that did not immediately reach an article? Properly anonymized of course. I think it would be useful for finding missing

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Robert Stojnic
Magnus Manske wrote: On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 apoc2...@gmail.com wrote: Would it be possible to generate a log or statistics of searches on Wikipedia using the Go button that did not immediately reach an article? Properly anonymized of course. I think it would be useful

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Nikola Smolenski
Robert Stojnic wrote: Magnus Manske wrote: On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 apoc2...@gmail.com wrote: Would it be possible to generate a log or statistics of searches on Wikipedia using the Go button that did not immediately reach an article? Also, searches made using either button

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Magnus Manske
On Thu, Jan 14, 2010 at 3:27 PM, Nikola Smolenski smole...@eunet.rs wrote: Robert Stojnic wrote: Magnus Manske wrote: On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 apoc2...@gmail.com wrote: Would it be possible to generate a log or statistics of searches on Wikipedia using the Go button that did

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Robert Stojnic
This sounds like a good idea, although we could probably argue about cut-offs. However, since this needs to be done in-house (and not on the toolserver, etc., because I imagine we cannot distribute raw logs) I imagine it is going to go very slowly, as there is no-one working on it or planning to work on

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 10:47 AM, Magnus Manske magnusman...@googlemail.com wrote: Suggestion: * log search and SHA1 IP hash (anonymous!) *Any* mapping of the IP is not anonymous. Please see the AOL search results, where unique IDs were connected between searches to disclose information.

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread David Gerard
2010/1/14 Bryan Tong Minh bryan.tongm...@gmail.com: On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske magnusman...@googlemail.com wrote: * log search and SHA1 IP hash (anonymous!) There are only 2 billion unique addresses and they can all be found in half an hour probably. A count of search
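The enumeration point is easy to demonstrate: because the IPv4 space is so small, a SHA1 of an IP address can be reversed by simply hashing every candidate address and comparing. A minimal sketch (the addresses here are made up, and only a single /24 is scanned for brevity):

import hashlib
import ipaddress
from typing import Optional

def sha1_of_ip(ip: str) -> str:
    # The "anonymized" value a naive log would store.
    return hashlib.sha1(ip.encode("ascii")).hexdigest()

def recover_ip(target_hash: str, network: str = "198.51.100.0/24") -> Optional[str]:
    # Brute-force the preimage by hashing every candidate address.
    # Scanning the full IPv4 space is the same loop run a few billion times,
    # which is hours on commodity hardware, not a meaningful barrier.
    for addr in ipaddress.ip_network(network):
        if sha1_of_ip(str(addr)) == target_hash:
            return str(addr)
    return None

leaked = sha1_of_ip("198.51.100.42")   # value found in an "anonymized" log
print(recover_ip(leaked))              # -> 198.51.100.42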

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 11:01 AM, David Gerard dger...@gmail.com wrote: 2010/1/14 Bryan Tong Minh bryan.tongm...@gmail.com: On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske magnusman...@googlemail.com wrote: * log search and SHA1 IP hash (anonymous!) There are only 2 billion unique addresses

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell gmaxw...@gmail.com wrote: Here is what I would suggest disclosing:
#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits ...
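A minimal sketch of producing that aggregate form from raw (timestamp, query) records; the input shape and ordering here are assumptions for illustration, not the actual logging pipeline:

from collections import defaultdict

def aggregate(records):
    # records: iterable of (iso_timestamp, query) tuples (assumed shape).
    stats = defaultdict(lambda: {"first": None, "last": None, "hits": 0})
    for ts, query in records:
        s = stats[query]
        s["first"] = ts if s["first"] is None else min(s["first"], ts)
        s["last"] = ts if s["last"] is None else max(s["last"], ts)
        s["hits"] += 1
    # Most-searched strings first, as in the example above.
    for query, s in sorted(stats.items(), key=lambda kv: -kv[1]["hits"]):
        yield s["first"], s["last"], s["hits"], query

sample = [
    ("2010-01-01T00:00:04", "naked people"),
    ("2010-01-13T23:59:50", "naked people"),
    ("2010-01-02T12:00:00", "hot grits"),
]
for row in aggregate(sample):
    print("\t".join(map(str, row)))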

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Tei
2010/1/14 Gregory Maxwell gmaxw...@gmail.com: On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell gmaxw...@gmail.com wrote: Here is what I would suggest disclosing: #start_datetime end_datetime hits search_string 2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people 2010-01-01-0:0:4

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Conrad Irwin
* search queries are logged in a standardized fashion (for grouping), e.g. lowercase, single spaces, no leading/trailing spaces, special chars converted to spaces, etc. Wiktionary is case-sensitive and so case-folding there may not be appropriate; I personally would be interested in seeing
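A minimal sketch of the normalization being discussed (lowercase, collapsed whitespace, punctuation mapped to spaces), with a switch to skip case folding on case-sensitive wikis such as Wiktionary; the exact rules are an assumption, not a proposed implementation:

import re

_NON_WORD = re.compile(r"[^\w\s]")
_SPACES = re.compile(r"\s+")

def normalize_query(query: str, fold_case: bool = True) -> str:
    q = _NON_WORD.sub(" ", query)      # special characters -> spaces
    q = _SPACES.sub(" ", q).strip()    # single spaces, no leading/trailing
    return q.lower() if fold_case else q

print(normalize_query("  Foo-Bar,  baz!! "))          # "foo bar baz"
print(normalize_query("Łódź", fold_case=False))       # "Łódź" (case preserved)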

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Conrad Irwin
On 01/14/2010 05:51 PM, Aryeh Gregor wrote: On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin conrad.ir...@googlemail.com wrote: Wiktionary is case-sensitive and so case-folding there may not be appropriate; I personally would be interested in seeing these logs before even the NFC normalizers

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin conrad.ir...@googlemail.com wrote: Wiktionary is case-sensitive and so case-folding there may not be appropriate; I personally would be interested in seeing these logs before even the NFC normalizers get to them (given a lack of any other source

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Robert Stojnic
Such people would be able to deny searching for such terms; I don't see this as posing any more problems than the history dumps. Thinking further, though, it would be possible to tie a search to an IP address or User when a page is created with the search term (as it is highly likely if there

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Platonides
Aryeh Gregor wrote: The logs are taken from the Squids, long before MediaWiki touches them, so they shouldn't be normalized at all. Search isn't cached, so it may be easier to just log it at the backend. I expect many people use queries like "please tell me how many people live in China", as
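If the terms were taken from the Squid side instead, it would mean parsing the search parameter out of Special:Search request URLs. A rough sketch; the URL shapes below are assumptions about the request format, not the actual log layout:

from urllib.parse import urlsplit, parse_qs, unquote

def extract_search_term(url: str):
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    is_search = ("Special:Search" in unquote(parts.path)
                 or "Special:Search" in params.get("title", [""])[0])
    if not is_search:
        return None
    terms = params.get("search")
    return terms[0] if terms else None

print(extract_search_term(
    "http://en.wikipedia.org/w/index.php?title=Special:Search&search=hot+grits&go=Go"))
# -> hot grits
print(extract_search_term("http://en.wikipedia.org/wiki/Special:Search?search=naked+people"))
# -> naked people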

Re: [Wikitech-l] Log of failed searches

2010-01-14 Thread Gregory Maxwell
On Thu, Jan 14, 2010 at 6:32 PM, Platonides platoni...@gmail.com wrote: Sampled search logs are unlikely to reveal them though, since what they are repeating are the non-keywords, not the full query. Sampling is fine, but aggregated logs aren't likely to… that's the primary reason for reporting
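A minimal sketch of the sampling alternative: keep a random one-in-N slice of the failed queries rather than publishing full aggregates. The rate and record shape are arbitrary choices for illustration:

import random

def sample_queries(queries, rate: int = 1000, rng=random):
    # Keep roughly one query out of every `rate`, chosen independently.
    return [q for q in queries if rng.random() < 1.0 / rate]

failed = [f"query {i}" for i in range(100_000)]
print(len(sample_queries(failed)))   # roughly 100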