> * search queries are logged in a standardized fashion (for grouping),
> e.g. lowercase, single spaces, no leading/trailing spaces, special
>  chars converted to spaces, etc.

Wiktionary is case-sensitive, so case-folding may not be appropriate
there. I personally would be interested in seeing these logs before even
the NFC normalizers get to them, given the lack of any other source for
finding out how people type fun characters in the wild. I can appreciate
that this is somewhat sadistic, though, and the logs are probably taken
too late for it.

It would not be too much work to publish a set of post-processing
scripts that could perform those normalisations that people are
interested in; I don't think any two people will agree exactly on what
information is useful, and removing information unnecessarily is just
draconian.
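
As a rough sketch of what one such post-processing script might look
like (Python is my assumption here, and the names normalise_query and
regroup are made up for illustration), the normalisations from the
quoted proposal could be applied after the fact to raw counts, with
case-folding left optional for case-sensitive wikis:

```python
import re
import unicodedata
from collections import Counter


def normalise_query(raw: str, fold_case: bool = True) -> str:
    """Apply the normalisations listed in the quoted proposal (hypothetical helper)."""
    text = unicodedata.normalize("NFC", raw)
    if fold_case:
        # Skip this step for case-sensitive wikis such as Wiktionary.
        text = text.lower()
    # Treat anything that is not a letter, digit or underscore as a "special char".
    text = re.sub(r"[^\w]+", " ", text)
    # Collapse runs of whitespace and drop leading/trailing spaces.
    return " ".join(text.split())


def regroup(counts: dict, fold_case: bool = True) -> Counter:
    """Re-aggregate raw hit counts under the normalised form of each query."""
    grouped = Counter()
    for query, hits in counts.items():
        grouped[normalise_query(query, fold_case)] += hits
    return grouped
```

Because the raw strings are kept in the published file, anybody who
disagrees with one of these steps can simply leave it out or swap it
for their own.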

> * display searches per week (?) that have been searched for at least
> 10 times from at least 5 different IP hashes (to avoid people
> searching their own name 100 times...)

I don't think the IP addresses should come into the analysis at all,
though a cut-off at 5 or 10 searches might be useful to prevent a huge
tail of probably useless information. It might also exclude cases where
people have typed things into the search box by accident (maybe they
got distracted while logging in).
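
A cut-off of that kind needs nothing but the hit counts. A minimal
sketch, assuming Python and a hypothetical apply_cutoff helper (the
default of 5 is just one of the two values floated above):

```python
from collections import Counter


def apply_cutoff(counts: dict, min_hits: int = 5) -> Counter:
    """Keep only queries searched at least min_hits times; no IP data involved."""
    return Counter({query: hits for query, hits in counts.items() if hits >= min_hits})


# Illustrative counts only.
sample = {"MLIF": 123919, "MILF definition": 103, "what does M.I.L.F meen????": 1}
kept = apply_cutoff(sample, min_hits=5)
```

With these illustrative counts, the one-off query falls below the
threshold and is dropped, while the other two are kept.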

> The logs are probably combined across wikis, so I'd change that to
>
> #start_datetime end_datetime projectcode hits search_string

If these files were to be provided regularly, it would make sense to
have the time period (either a month or a week at a time) and the wiki
defined in the file name. This would leave the file contents very
simple: just the raw number of hits, followed by a space, followed by
what was typed into the search box (or as close to that as is
available).

$ cat enwiktionary-2010-01-failedsearches.lis

123919 MLIF
 ....
12873 mlif
  ...
103 MILF definition
  ...
1 what does M.I.L.F meen????
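
A format like the one above is easy to consume. As a sketch only,
assuming Python and assuming the monthly file-name pattern is exactly
as in my example (a weekly naming scheme is not defined here, and
parse_filename and parse_line are hypothetical names):

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <wiki>-<yyyy>-<mm>-failedsearches.lis
FILENAME_RE = re.compile(
    r"^(?P<wiki>[a-z0-9_]+)-(?P<year>\d{4})-(?P<month>\d{2})-failedsearches\.lis$"
)


def parse_filename(name: str) -> Tuple[str, int, int]:
    """Recover the wiki and the month covered from the file name."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match["wiki"], int(match["year"]), int(match["month"])


def parse_line(line: str) -> Optional[Tuple[int, str]]:
    """Split a line into (hits, search string) at the first space.

    Returns None for blank lines or anything that does not start with a
    hit count, such as the elision markers in the example listing.
    """
    hits, sep, query = line.rstrip("\n").partition(" ")
    if not sep or re.fullmatch(r"[0-9]+", hits) is None:
        return None
    # Everything after the first space is kept untouched, spaces and all.
    return int(hits), query
```

Splitting only at the first space means the search string itself can
contain any number of spaces without needing quoting or escaping.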

Conrad

( http://en.wiktionary.org/w/index.php?oldid=4055082 for MILF explanation)

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
