That is super awesome! :D -- brion
On Tue, Jul 19, 2016 at 6:41 PM, Deborah Tankersley < [email protected]> wrote:
> We're happy to announce that after numerous tests and analyses[1] and a
> fully operational demo[2], the Discovery Team is ready to release
> TextCat[3] into production on wiki.
>
> What is TextCat? It detects the language that the search query was written
> in, which allows us to look for results on a different wiki. TextCat is a
> language detection library based on n-grams[4]. During a search, TextCat
> will only kick in when the following three things occur:
> 1. fewer than 3 results are returned from the query on the current wiki
> 2. language detection is successful (meaning that TextCat is reasonably
>    certain what language the query is in, and that it is different from the
>    language of the current wiki)
> 3. the other wiki (in the detected language) has results
>
> Our analysis of the A/B test[5] (for English, French, Spanish, Italian and
> German Wikipedias) showed that:
>
> "...The test groups not only had a substantially lower zero results rate
> (57% in control group vs 46% in the two test groups), but they had a higher
> clickthrough rate (44% in the control group vs 49-50% in the two test
> groups), indicating that we may be providing users with relevant results
> that they would not have gotten otherwise."
>
> This update will be scheduled for production release during the week of
> July 25, 2016 on the following Wikipedias:
>
> - English [6]
> - German [7]
> - Spanish [8]
> - Italian [9]
> - French [10]
>
> TextCat will then be added to this next group of Wikipedias at a later
> date:
>
> - Portuguese [11]
> - Russian [12]
> - Japanese [13]
>
> This is a huge step forward in creating a search mechanism that is able to
> detect - with a high level of accuracy - the language that was used and
> produce results in that language. Another forward-looking aspect of TextCat
> is investigating a confidence-measuring algorithm[14], to ensure that the
> language detection results are the best they can be.
>
> We will also be doing more[15] A/B tests using TextCat on non-Wikipedia
> sites, such as Wikibooks and Wikivoyage. These new tests will give us
> insight into whether applying the same language detection configuration
> across projects would be helpful.
>
> Please let us know if you have any questions or concerns on the TextCat
> discussion page[16]. Also, for screenshots of what this update will look
> like, please see this one[17] showing an existing search typed in on enwiki
> in Russian "первым экспериментом" and this one[18] showing what it will
> look like once TextCat is in production on enwiki.
>
> Thanks!
>
> [1] https://phabricator.wikimedia.org/T118278
> [2] https://tools.wmflabs.org/textcatdemo/
> [3] https://www.mediawiki.org/wiki/TextCat
> [4] https://en.wikipedia.org/wiki/N-gram
> [5] https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_Test_-_Language_Detection_on_English,_French,_Spanish,_Italian,_and_German_Wikipedias.pdf
> [6] https://en.wikipedia.org/
> [7] https://de.wikipedia.org/
> [8] https://es.wikipedia.org/
> [9] https://it.wikipedia.org/
> [10] https://fr.wikipedia.org/
> [11] https://pt.wikipedia.org/
> [12] https://ru.wikipedia.org/
> [13] https://ja.wikipedia.org/
> [14] https://phabricator.wikimedia.org/T140289
> [15] https://phabricator.wikimedia.org/T140292
> [16] https://www.mediawiki.org/wiki/Talk:TextCat
> [17] https://commons.wikimedia.org/wiki/File:Existing-search_no-textcat.png
> [18] https://commons.wikimedia.org/wiki/File:New-search_with-textcat.png
>
> --
> Deb Tankersley
> Product Manager, Discovery
> IRC: debt
> Wikimedia Foundation
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
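
For anyone who wants the three "kick in" conditions from the announcement in concrete form, here is a minimal sketch in Python. It is only an illustration of the decision order as described in the email, not the actual CirrusSearch or TextCat code. The function name `cross_wiki_fallback` and the `detect_language` and `search_wiki` callables are hypothetical stand-ins; the only value taken from the announcement is the "fewer than 3 results" threshold.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# From the announcement: fall back only when fewer than 3 results come back.
MIN_RESULTS = 3


def cross_wiki_fallback(
    query: str,
    current_lang: str,
    current_results: Sequence[str],
    detect_language: Callable[[str], Optional[str]],  # hypothetical detector
    search_wiki: Callable[[str, str], Sequence[str]],  # hypothetical search call
) -> Optional[Tuple[str, List[str]]]:
    """Return (language, results) from another wiki, or None if no fallback applies."""
    # Condition 1: the current wiki returned fewer than 3 results.
    if len(current_results) >= MIN_RESULTS:
        return None

    # Condition 2: detection succeeded and differs from the current wiki's language.
    # The detector is assumed to return None when it is not reasonably certain.
    detected = detect_language(query)
    if detected is None or detected == current_lang:
        return None

    # Condition 3: the wiki in the detected language actually has results.
    other_results = search_wiki(detected, query)
    if not other_results:
        return None

    return detected, list(other_results)


# Toy stand-ins so the sketch runs on its own (not real detection or search).
def fake_detect(query: str) -> Optional[str]:
    # Pretend any Cyrillic character means Russian; otherwise "not certain".
    return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in query) else None


def fake_search(lang: str, query: str) -> Sequence[str]:
    return ["Эксперимент"] if lang == "ru" else []


if __name__ == "__main__":
    print(cross_wiki_fallback("первым экспериментом", "en", [], fake_detect, fake_search))
```

The checks are ordered so that the cheap test (result count) runs first and the second search is only issued once detection has succeeded, which matches the order the conditions are listed in the announcement.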
