TJones added a comment.

@Smalyshev, I think this covers the info you need. Let me know if I can give more info or help with anything else. :)

TL;DR: yep, text is useful compared to plain for ar, bg, ca, ckb, cs, da, de, el, en, en-ca, en-gb, es, eu, fa, fi, fr, ga, gl, hi, hu, hy, id, it, ja, ko, lt, lv, nb, nl, nn, pt, pt-br, ro, ru, simple, sv, th, and tr.

Also, if the standard plugins are installed, include pl, zh, he, and uk.

You should possibly note that bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan probably do better with the icu_tokenizer rather than the standard tokenizer.

For everything else, keep in mind that the difference between text and plain is that plain has word_break_helper enabled.

Details:

The default plain analyzer is the standard tokenizer, the ICU Normalizer (which does some folding but much less than full ICU Folding) and the "word break helper" (which breaks words on periods, underscores, and parens). So default below is the same as "standard + icu_normalizer + word_break_helper".
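
To make the word break helper concrete, here is a minimal Python sketch of the idea. It is not the actual CirrusSearch or Elasticsearch configuration: the function names `word_break_helper` and `plain_tokens` are made up for this example, whitespace splitting is only a stand-in for the standard tokenizer, and lowercasing is only a stand-in for the ICU Normalizer (both of which do more in reality).

```python
# Characters the word break helper is described as breaking on:
# periods, underscores, and parens.
BREAK_CHARS = "._()"


def word_break_helper(text):
    """Replace each break character with a space (hypothetical stand-in)."""
    return text.translate({ord(c): " " for c in BREAK_CHARS})


def plain_tokens(text):
    """Very rough approximation of the default plain analysis chain."""
    # Assumption: split() approximates the standard tokenizer and lower()
    # approximates the ICU Normalizer; the real components do more.
    return word_break_helper(text).lower().split()


print(plain_tokens("Foo_bar.baz (Qux)"))  # ['foo', 'bar', 'baz', 'qux']
```

Without the helper step, "Foo_bar.baz" would stay together as a single token in this simplified model.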

All of the analyzers except CJK, Persian, and Thai have stemmers, which I assume do something useful.

Persian and Thai have stop words (as do most of the others), which I also assume do something useful.

CJK has the CJK bigram filter (which gives overlapping bigrams as tokens) and, oddly, English stop words; that seems useful.
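
For anyone unfamiliar with overlapping bigrams, this small Python sketch shows what they look like for a run of CJK characters. The function name `cjk_bigrams` is hypothetical, and the handling of a lone character is an assumption; the real filter also has to deal with script boundaries and mixed text, which this ignores.

```python
def cjk_bigrams(text):
    """Return overlapping character bigrams for a run of CJK characters."""
    if len(text) < 2:
        # Assumption: a lone character is kept as a single token.
        return [text] if text else []
    # Slide a two-character window over the text, one character at a time.
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cjk_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```

Each character (other than the first and last) ends up in two tokens, which is why bigrams can match words without needing a dictionary-based segmenter.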

Also, if this is in an environment where the usual plugins are installed, you have custom analyzers for pl, zh, he, and uk, so I've included them below in their own little sub-table.

There is also a list of languages that have the icu_tokenizer enabled rather than the standard tokenizer: bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan. That might be worth having as another config option for those languages.

For all of the languages without a custom analyzer (including the ones using the icu_tokenizer), there is always a difference between text and plain: plain includes word_break_helper. Most of the language-specific analyzers do not include word_break_helper.
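
If it helps, the lists above could be turned into a per-language lookup along these lines. This is only a sketch: the set names and the `analysis_hints` function are invented for illustration, the language codes are copied from this comment, and the returned keys are not real config options.

```python
# Language codes copied from the lists in this comment.
TEXT_ANALYZER_LANGS = {
    "ar", "bg", "ca", "ckb", "cs", "da", "de", "el", "en", "en-ca", "en-gb",
    "es", "eu", "fa", "fi", "fr", "ga", "gl", "hi", "hu", "hy", "id", "it",
    "ja", "ko", "lt", "lv", "nb", "nl", "nn", "pt", "pt-br", "ro", "ru",
    "simple", "sv", "th", "tr",
}
# Only available when the usual plugins are installed.
PLUGIN_ANALYZER_LANGS = {"pl", "zh", "he", "uk"}
ICU_TOKENIZER_LANGS = {
    "bo", "dz", "gan", "ja", "km", "lo", "my", "th", "wuu", "zh",
    "zh-classical", "lzh", "zh-yue", "yue", "bug", "cdo", "cr", "hak", "jv",
    "zh-min-nan",
}


def analysis_hints(lang, plugins_installed=False):
    """Summarize what the lists above say about one language code."""
    has_custom = lang in TEXT_ANALYZER_LANGS or (
        plugins_installed and lang in PLUGIN_ANALYZER_LANGS
    )
    return {
        # True when text uses a language-specific analyzer.
        "custom_text_analyzer": has_custom,
        # Tokenizer suggested for the default (non-custom) chain.
        "default_tokenizer": (
            "icu_tokenizer" if lang in ICU_TOKENIZER_LANGS else "standard"
        ),
    }


print(analysis_hints("zh", plugins_installed=True))
```

For a code in none of the lists, this reports no custom analyzer and the standard tokenizer, in which case the only text/plain difference is word_break_helper.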

Default Elastic analyzers:

Code    Lg                    text        plain
ar      Arabic                arabic      default
bg      Bulgarian             bulgarian   default
ca      Catalan               catalan     default
ckb     Sorani                sorani      default
cs      Czech                 czech       default
da      Danish                danish      default
de      German                german      default
el      Greek                 greek       standard + icu_normalizer + icu_folding + word_break_helper
en      English               english     standard + icu_normalizer + icu_folding + word_break_helper
en-ca   Canadian English      english     standard + icu_normalizer + icu_folding + word_break_helper
en-gb   British English       english     standard + icu_normalizer + icu_folding + word_break_helper
es      Spanish               spanish     default
eu      Basque                basque      default
fa      Persian               persian     default
fi      Finnish               finnish     default
fr      French                french      standard + icu_normalizer + icu_folding + word_break_helper
ga      Irish                 irish       default
gl      Galician              galician    default
hi      Hindi                 hindi       default
hu      Hungarian             hungarian   default
hy      Armenian              armenian    default
id      Indonesian            indonesian  default
it      Italian               italian     standard + icu_normalizer + ascii_folding + dedupe_asciifolding
ja      Japanese              cjk         icu_tokenizer + icu_normalizer + word_break_helper
ko      Korean                cjk         default
lt      Lithuanian            lithuanian  default
lv      Latvian               latvian     default
nb      Norwegian Bokmål      norwegian   default
nl      Dutch                 dutch       default
nn      Norwegian Nynorsk     norwegian   default
pt      Portuguese            portuguese  default
pt-br   Brazilian Portuguese  brazilian   default
ro      Romanian              romanian    default
ru      Russian               russian     standard + icu_normalizer + russian_char_filter + word_break_helper
simple  Simple English        english     standard + icu_normalizer + icu_folding + word_break_helper
sv      Swedish               swedish     standard + icu_normalizer + icu_folding + word_break_helper
th      Thai                  thai        default
tr      Turkish               turkish     default

Analyzers with usual plugins:

Code    Lg                    text        plain
pl      Polish                polish      default
zh      Chinese               chinese     icu_tokenizer + smartcn_stop + icu_normalizer + word_break_helper
he      Hebrew                hebrew      standard + icu_normalizer + icu_folding + word_break_helper
uk      Ukrainian             ukrainian   default

ICU Tokenization languages:

Code          Lg
bo            Tibetan
dz            Dzongkha
gan           Gan
ja            Japanese
km            Khmer
lo            Lao
my            Burmese
th            Thai
wuu           Wu
zh            Chinese
zh-classical  Classical Chinese
zh-yue        Cantonese
bug           Buginese
cdo           Min Dong
cr            Cree
hak           Hakka
jv            Javanese
zh-min-nan    Min Nan

TASK DETAIL
https://phabricator.wikimedia.org/T180169

To: TJones
Cc: TJones, Aklapper, EBernhardson, Lydia_Pintscher, hoo, aude, Smalyshev, dcausse, Lahi, GoranSMilovanovic, QZanden, EBjune, Avner, debt, Gehel, Jdrewniak, FloNight, Wikidata-bugs, Mbch331