> * search queries are logged in a standardized fashion (for grouping),
> e.g. lowercase, single spaces, no leading/trailing spaces, special
> chars converted to spaces, etc.
Wiktionary is case-sensitive, so case-folding may not be appropriate
there. I personally would be interested in seeing these logs before even
the NFC normalizers get to them (there is no other source for finding out
how people type fun characters in the wild), though I can appreciate
that this is somewhat sadistic, and the logs are probably taken too late
for it. It would not be too much work to publish a set of
post-processing scripts that perform whichever normalisations people are
interested in. I don't think any two people will agree exactly on what
information is useful, and removing information unnecessarily is just
draconian.

> * display searches per week (?) that have been searched for at least
> 10 times from at least 5 different IP hashes (to avoid people
> searching their own name 100 times...)

I don't think the IP addresses should come into the analysis at all,
though a cut-off at 5 or 10 searches might be useful to prevent a huge
tail-end of probably useless information. It might also exclude cases
where people have typed things into the search box by accident (maybe
they got distracted while logging in).

> The logs are probably combined across wikis, so I'd change that to
>
> #start_datetime end_datetime projectcode hits search_string

If these files were to be provided regularly, it would make sense to
define the time period and the wiki in the file name, covering either a
month or a week at a time. This would leave the file contents very
simple: just the raw number of hits, followed by a space, followed by
what was typed into the Search box (or as close to that as is available).

$ cat enwiktionary-2010-01-failedsearches.lis
123919 MLIF
....
12873 mlif
...
103 MILF definition
...
1 what does M.I.L.F meen????

Conrad

(http://en.wiktionary.org/w/index.php?oldid=4055082 for MILF explanation)
_______________________________________________
Wikitech-l mailing list
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
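P.S. To make the post-processing idea a bit more concrete, here is a
rough sketch of what one such script might look like. It assumes the
"hits, space, search string" layout suggested above; the function names
(parse_line, normalise, regroup) and the min_hits and fold_case options
are made up for illustration and are not part of any existing tool.
Case-folding is optional precisely because it would be wrong for a
case-sensitive wiki.

```python
import re
from collections import Counter


def parse_line(line):
    """Split 'HITS QUERY' on the first space; return None if malformed."""
    hits, sep, query = line.rstrip("\n").partition(" ")
    if not sep or not (hits.isascii() and hits.isdigit()):
        return None  # e.g. blank lines or elided '...' rows
    return int(hits), query


def normalise(query, fold_case=True):
    """Apply the normalisations from the quoted proposal (all assumptions)."""
    if fold_case:
        query = query.casefold()
    # Special characters become spaces; \w keeps letters, digits and '_'
    # in any script, so non-ASCII words survive.
    query = re.sub(r"[^\w\s]", " ", query)
    # Collapse runs of whitespace and drop leading/trailing spaces.
    return " ".join(query.split())


def regroup(lines, min_hits=1, fold_case=True):
    """Re-aggregate raw counts under the normalised key, then apply a cut-off."""
    totals = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        hits, query = parsed
        totals[normalise(query, fold_case=fold_case)] += hits
    return [(q, n) for q, n in totals.most_common() if n >= min_hits]
```

Because the raw file keeps exactly what was typed, anyone who disagrees
with these choices can rerun it with fold_case=False or a different
cut-off, rather than having the information thrown away at logging time.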
