Okay, this is my final big blob of output on this topic. I'll put my
results on a wiki page and link to it in Phabricator.

—Trey

TL;DR We get some ridiculously long queries (up to 5K characters)—lots are
junk. We are also getting a Zerg Rush from bots:
   - DOI and "timestamp" queries are everywhere (automata).
   - We have lots of searches for energy-related articles (automaton?).
   - There's a likely automaton searching for term+term+term country.
   - Spam-looking queries for <manufacturing terms> ## de tel fax.
   - Paint bot: ""<artist>"" paint and <wikimedia commons file name> paint.
   - Chinese product descriptions and part numbers/phone numbers/etc.

• I reviewed two samples similar to my original one, from a week before and
from two weeks before, and found the distribution of zero-result queries
was generally similar, though one week dewiki got ~30K (~60%) more. DOI
searches ranged from ~15K to ~100K (!!) and "unix timestamp" searches
ranged from 26K to 42K.
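
For illustration, here is a minimal Python sketch of how the "unix timestamp" queries could be detected and auto-corrected. It relies on the <unix-timestamp-looking number>:<wiki title> format described in the earlier message quoted below, and it assumes that each # in "14###########:" stands for one digit. The names TIMESTAMP_PREFIX_RE and strip_timestamp_prefix are hypothetical; this is not the code used for the analysis.

```python
import re

# Assumption: the prefix is "14" followed by 11 digits and a colon,
# optionally followed by a single space before the title.
TIMESTAMP_PREFIX_RE = re.compile(r"^14\d{11}:\s?(?P<title>.+)$")


def strip_timestamp_prefix(query):
    """Return the query without a leading timestamp-like prefix.

    Queries that do not match the pattern are returned unchanged.
    """
    m = TIMESTAMP_PREFIX_RE.match(query)
    return m.group("title") if m else query
```

The stripped title could then be retried as a search or offered as a suggestion.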

• Boolean AND queries were also within a factor of two in the other
samples. quot queries and " film" queries were very similar.

• I mentioned very long queries before. Here's a breakdown by length (the
categories are cumulative, so each row includes the rows below it):

length count
150+ 2262
200+ 725
300+ 435
400+ 331
500+ 261
1000+ 120
2000+ 59
3000+ 24
4000+ 10
5000+ 1

Some of the DOI queries are over 150 characters, and make the list. The
really long ones often look like random bits of text.
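
A breakdown like the one above can be reproduced with a short sketch along these lines. The thresholds are copied from the table; the function name length_breakdown and the idea of passing the queries in as a list of strings are assumptions for illustration only.

```python
from typing import Dict, Iterable, List

# Thresholds copied from the table; each bucket counts queries that are
# at least that many characters long, so the buckets are cumulative.
THRESHOLDS: List[int] = [150, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000]


def length_breakdown(queries: Iterable[str],
                     thresholds: List[int] = THRESHOLDS) -> Dict[int, int]:
    """Count how many queries reach each length threshold."""
    counts = {t: 0 for t in thresholds}
    for q in queries:
        n = len(q)
        for t in thresholds:
            if n >= t:
                counts[t] += 1
    return counts
```

Note that len() counts Unicode code points, which may differ from byte length for non-ASCII queries.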


• A regular fixture in the top 100 zero-result queries per day is a
173-character string from a particular novel (Google Books found it right
away). That's just weird.


• I looked at crosswiki zero-results searches and broke out the individual
words in the queries to find recurring patterns. I've noted ones that have
~1000 instances. (Even 0.2% is a fair chunk for a single phenomenon.)

There are lots of DOI-related terms, of course, along with our old friend
"quot" and lots of URL bits. {searchTerms} shows up 1998 times (mostly in ru).
search_suggest_query shows up 440 times (en, de, fr, sv, nl and others).

There are lots of words related to searching for articles about energy:
wind, power, turbine, energy, etc. Lots of long titles are included in
several formats. I think this may also be a bot.
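
Breaking queries into individual words and looking for recurring ones can be sketched as follows. This is a simplified illustration, not the script used for the analysis: it assumes whitespace-separated words, lowercases everything, and uses hypothetical names (word_counts, recurring). The default cutoff of 1000 mirrors the ~1000-instance level mentioned above.

```python
from collections import Counter


def word_counts(queries):
    """Count every whitespace-separated word across all queries."""
    counts = Counter()
    for q in queries:
        counts.update(q.lower().split())
    return counts


def recurring(counts, min_count=1000):
    """Return (word, count) pairs at or above min_count, most common first."""
    return [(w, c) for w, c in counts.most_common() if c >= min_count]
```

Splitting on whitespace only keeps tokens such as {searchTerms} intact, at the cost of leaving punctuation attached to ordinary words.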

• There's a weird pattern, largely in eswiki, but some in enwiki and
frwiki, where there are a bunch of search terms joined by +, followed by
space and the name of a country. Australia, Austria, Bangladés, Bélgica,
and Argentina are most used, but there are ~90 different countries
(sometimes the same country with its name in different languages), for
>5600 total instances. The alphabetic skew may or may not be related to the
size of my sample.


   1719  Australia
   1682  Austria
    659  Bangladés
    537  Bélgica
    519  Argentina
    119  Bolivia
...


There are 2380 more instances in eswiki of a bunch of words mushed
together with +'s.

Looking at them in order in the logs, they are largely in alphabetical order.

A one-week-earlier sample had fewer; a two-week-earlier sample had even
more.

Another likely bot.
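
One way to flag this pattern is a regular expression for plus-joined terms, optionally followed by a space and a country name. The sketch below is an assumption-laden illustration: the COUNTRIES set only contains the six names from the list above (a real check would need the ~90 names in several languages), and classify and its labels are hypothetical.

```python
import re

# Two or more terms joined by "+", with no whitespace inside.
PLUS_TERMS = r"[^\s+]+(?:\+[^\s+]+)+"
PLUS_COUNTRY_RE = re.compile(rf"^({PLUS_TERMS}) (.+)$")
PLUS_ONLY_RE = re.compile(rf"^{PLUS_TERMS}$")

# Illustrative subset only, taken from the counts listed above.
COUNTRIES = {"Australia", "Austria", "Bangladés", "Bélgica", "Argentina", "Bolivia"}


def classify(query):
    """Label a query as 'plus+country', 'plus-only', or None."""
    q = query.strip()
    m = PLUS_COUNTRY_RE.match(q)
    if m and m.group(2) in COUNTRIES:
        return "plus+country"
    if PLUS_ONLY_RE.match(q):
        return "plus-only"
    return None
```

The 'plus-only' label corresponds to the words mushed together with +'s but without a trailing country.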

• There are lots of intitle searches. Many in nlwiki (out of 355) and
frwiki (out of 414) are for names that seem not to be in that wiki. Most
failed intitle searches in enwiki (out of 504) are in Spanish or
Portuguese. The rest are <100 instances, so I didn't investigate.

Similar patterns in the other two samples.

• There's a weird pattern (1293 instances), all in dewiki, like this:
<manufacturing terms> ## de tel fax

<manufacturing terms> includes injection molding, stone cutting, die
casting, etc.

Other samples have a similar pattern.

• Another weird pattern (953):
""<artist>"" paint <-- literally double double quoted.
and (~140)
<wikimedia commons file name> paint

All on enwiki, and the commons file names don't include the file type
(e.g., ".jpg").

The same or more in other samples.
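
The double-double-quoted form is easy to match mechanically. Here is a minimal sketch with hypothetical names (DOUBLE_QUOTED_PAINT_RE, paint_bot_artist). It only covers the ""<artist>"" paint form; the Commons file name variant would need a list of file names to check against, which is not attempted here.

```python
import re

# Matches queries like: ""Some Artist"" paint
DOUBLE_QUOTED_PAINT_RE = re.compile(r'^""(?P<artist>[^"]+)"" paint$')


def paint_bot_artist(query):
    """Return the artist name if the query has the paint-bot form, else None."""
    m = DOUBLE_QUOTED_PAINT_RE.match(query.strip())
    return m.group("artist") if m else None
```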

• 989 instances of this on enwiki:
<descriptions of products in chinese>
*###########*QQ########座机###########*<misc>.<misc>.xyz

座机 = "landline"
<misc> = letters, numbers, and transliterated Chinese (Pinyin?)

Online searches for parts of these reveal a similar pattern on
Chinese-language business/manufacturing sites.

The same in samples from other weeks.

• Finally, I reviewed the larger collections of zero-results (10K+ from a
given wiki). My ability to analyze languages I don't know is limited, but
here are some very brief impressions:

- dewiki has a few hundred OR'd together wildcard searches, some of which
seem to be trying to handle variations in declension.
- jawiki has lots of " film" searches.
- ruwiki has a few non-Cyrillic searches.
- itwiki has lots of queries that are multi-word phrases with underscores
instead of spaces.
- eswiki and frwiki have a fair number of build up searches and searches in
Arabic, and frwiki has a fair number of searches in Chinese.
- zhwiki has lots of non-Chinese searches in various languages.

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Tue, Jul 28, 2015 at 3:57 PM, Trey Jones <tjo...@wikimedia.org> wrote:

> Hi everyone,
>
> I've broadened my analysis from enwiki to the other larger wikis, looking
> at the same phenomena I found in enwiki.
>
> While the DOI searches are definitely an issue across 25 wikis, only some of
> the other earlier-identified issues are cross-wiki.
>
> *TL;DR: After DOI searches, "unix timestamp" searches are the biggest
> cross-wikipedia issue. Weird AND queries and quot queries are big
> contributors on enwiki, which make them important overall. We could easily
> fix the unix timestamp queries (either auto correct or make suggestions),
> and we could fix lots of the quot queries. All of these could be included
> in the category of "automata" that could potentially be separated from
> regular queries, and it wouldn't hurt to track down their sources and help
> people search better.*
>
> The <unix-timestamp-looking number>:<wiki title> format (with a small
> number with a space after the colon) is spread across 45 wikis, with 28,089
> instances out of 500K (~5.6%). More than half of the results are enwiki
> (15,961), but there are 3133 on ru, 2986 on it, 1889 on ja, and hundreds on
> tr, fa, nl, ar, he, hi, id, and cs. At a cursory glance, all seem to be
> largely named entities or queries in the appropriate language. Removing the
> "14###########:", tracking down the source, or putting this on the automata
> list would help a lot.
>
> The boolean AND queries are largely in enwiki (17607: ~3.5% overall, ~7.9%
> in enwiki), and they are a mixed bag, but many (626) appear with quot, and
> most (16657) are of the form
> "article_title_with_underscore" AND "article title without underscores"
> where the first half is repeated over and over and the second half is
> something linked to in the first article. Find the source and add to the
> automata list.
>
> In plwiki (263), the AND queries are all of the form
> *<musical thing>* AND (muzyk* OR Dyskografia)
> where <musical thing> seems to be an artist, band, album, or something
> similar. This looks like an automaton, but may not be worth pursuing.
> Similarly the ones from nl.
>
> Globally, OR queries are much more common. 46,035 (~9.2%), spread much
> more evenly over all the wikis. These are almost all the DOI queries.
>
> quot is totally an enwiki thing. It's ~1.2% overall and ~2.8% in enwiki in
> this sample, which is a lot for one small thing. We should either create a
> secondary search with filtered quot or track down the source and help them
> figure out how to do better.
>
> TV episodes and films ("<title> S#E#" film) are mostly on enwiki (~1.1%
> overall, ~2.4% of enwiki queries), with some on ja, fr, and de, and single
> digits on it and ru. I'd count this as automata, though finding a source
> would be nice.
>
> Strings of numbers do happen everywhere, but are only common on enwiki,
> with fewer on jawiki, and far fewer on de, fr, ru, vi, and nl.
>
> My last bit of analysis will be later this week, and I'll try to look at
> non-English and/or cross-wiki stuff, write it all up in Phabricator, and
> move on.
>
> On Tue, Jul 28, 2015 at 9:51 AM, Trey Jones <tjo...@wikimedia.org> wrote:
>
>> Okay, I have a slightly better sample this morning. (I accidentally left
>> out Wikipedias with abbreviations longer than 2 letters).
>>
>> My new sample:
>> 500K zero-result full_text queries (web and API) across the Wikipedias
>> with 100K+ articles
>> 383,433 unique search strings (that's a long, long tail)
>> The sample covers a little over an hour: 2015-07-23 07:51:29 to
>> 2015-07-23 08:55:42
>> The top 10 (en, de, pt, ja, ru, es, it, fr, zh, nl), account for >83% of
>> queries
>>
>> Top 10 counts, for reference:
>>  221618  enwiki
>>   51936  dewiki
>>   25500  ptwiki
>>   24206  jawiki
>>   21891  ruwiki
>>   19913  eswiki
>>   18303  itwiki
>>   14443  frwiki
>>   11730  zhwiki
>>    7685  nlwiki
>> -----
>> 417225
>>
>> The DOI searches that appear to come from Lagotto installations hit 25
>> wikis (as the Lagotto docs said they would), with en getting a lot more,
>> and ru getting fewer in this sample, and the rest *very* evenly
>> distributed. (I missed ceb and war before—apologies). The total is just
>> over 50K queries, or >10% of the full text queries against larger wikis
>> that result in zero results.
>>
>> ===DOI
>>    6050 enwiki
>>    1904 nlwiki
>>    1902 cebwiki
>>    1901 warwiki
>>    1900 viwiki
>>    1900 svwiki
>>    1900 jawiki
>>    1899 frwiki
>>    1899 eswiki
>>    1899 dewiki
>>    1898 zhwiki
>>    1898 ukwiki
>>    1898 plwiki
>>    1898 itwiki
>>    1897 ptwiki
>>    1897 nowiki
>>    1897 fiwiki
>>    1896 huwiki
>>    1896 fawiki
>>    1896 cswiki
>>    1896 cawiki
>>    1895 kowiki
>>    1895 idwiki
>>    1895 arwiki
>>     475 ruwiki
>> -----
>> 50181
>>
>> On Mon, Jul 27, 2015 at 5:04 PM, Trey Jones <tjo...@wikimedia.org> wrote:
>>
>>> I've started looking at a 500K sample from 7/24 across all wikis. I'll
>>> have more results tomorrow, but right now it's already clear that someone
>>> is spamming useless DOI searches across wikis—and it's 9% of the wiki
>>> zero-results queries.
>>>
>>>
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search