[Wikidata-bugs] [Maniphest] [Commented On] T180169: Make list of languages where using stemmed analyzer for Wikibase is beneficial

TJones Thu, 09 Nov 2017 15:11:31 -0800

TJones added a comment.

@Smalyshev, I think this covers the info you need. Let me know if I can give more info or help with anything else. :)

TL;DR: yep, text is useful compared to plain for ar, bg, ca, ckb, cs, da, de, el, en, en-ca, en-gb, es, eu, fa, fi, fr, ga, gl, hi, hu, hy, id, it, ja, ko, lt, lv, nb, nl, nn, pt, pt-br, ro, ru, simple, sv, th, and tr.

Also, if the standard plugins are installed, include pl, zh, he, and uk.

You should possibly note that bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan probably do better with the icu_tokenizer rather than the standard tokenizer.

For everything else, keep in mind that the difference between text and plain is that plain has word_break_helper enabled.

Details:

The default plain analyzer is the standard tokenizer, the ICU Normalizer (which does some folding but much less than full ICU Folding), and the "word break helper" (which breaks words on periods, underscores, and parens). So "default" in the tables below is shorthand for "standard + icu_normalizer + word_break_helper".
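
A rough Python sketch of what the word break helper does to input before tokenization. This is not the actual CirrusSearch or Elasticsearch configuration: the function names are made up for illustration, a whitespace split stands in for the standard tokenizer, and casefold() is only a crude stand-in for icu_normalizer.

```python
from __future__ import annotations

import re

# Assumption: the word break helper can be modeled as replacing periods,
# underscores, and parentheses with spaces before the tokenizer runs.
WORD_BREAK_CHARS = re.compile(r"[._()]")


def word_break_helper(text: str) -> str:
    """Replace periods, underscores, and parens with spaces."""
    return WORD_BREAK_CHARS.sub(" ", text)


def plain_tokens(text: str) -> list[str]:
    """Approximate the default plain chain on simple input.

    casefold() is a stand-in for icu_normalizer and split() is a stand-in
    for the standard tokenizer; both are simplifications.
    """
    return word_break_helper(text).casefold().split()


print(plain_tokens("Foo.bar_baz(qux)"))
```

Without the helper, a string like "Foo.bar_baz" could survive as a single token; with it, each piece becomes searchable on its own.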

All of the analyzers except CJK, Persian, and Thai have stemmers, which I assume do something useful.

Persian and Thai have stop words (as do most of the others), which I also assume do something useful.

CJK has the CJK bigram filter (which gives overlapping bigrams as tokens) and, oddly, English stop words; that seems useful.
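
To make "overlapping bigrams" concrete, here is a minimal Python sketch. It only illustrates the idea; the real Lucene filter also deals with script boundaries, mixed-script text, and other details that are ignored here, and the function name is hypothetical.

```python
from __future__ import annotations


def cjk_bigrams(text: str) -> list[str]:
    """Return overlapping two-character tokens from a run of CJK text.

    Simplification: whitespace is dropped and every remaining character is
    treated as part of one continuous run.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A lone character is kept as a single token.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(cjk_bigrams("東京都庁"))
```

Each character (except the first and last) appears in two tokens, which lets a query match any two-character substring without needing a dictionary-based word segmenter.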

Also, if this is in an environment where the usual plugins are installed, you also have custom analyzers for pl, zh, he, and uk, so I've included them below in their own little sub-table.

There is also a list of languages that have the icu_tokenizer enabled rather than the standard tokenizer: bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan. That might be worth having as another config option for those languages.

For all of the languages without a custom analyzer (including the ones using the icu_tokenizer), there is always a difference between text and plain: plain includes word_break_helper. Most of the language-specific analyzers do not include word_break_helper.

Default Elastic analyzers:

Code	Language	text	plain
ar	Arabic	arabic	default
bg	Bulgarian	bulgarian	default
ca	Catalan	catalan	default
ckb	Sorani	sorani	default
cs	Czech	czech	default
da	Danish	danish	default
de	German	german	default
el	Greek	greek	standard + icu_normalizer + icu_folding + word_break_helper
en	English	english	standard + icu_normalizer + icu_folding + word_break_helper
en-ca	Canadian English	english	standard + icu_normalizer + icu_folding + word_break_helper
en-gb	British English	english	standard + icu_normalizer + icu_folding + word_break_helper
es	Spanish	spanish	default
eu	Basque	basque	default
fa	Persian	persian	default
fi	Finnish	finnish	default
fr	French	french	standard + icu_normalizer + icu_folding + word_break_helper
ga	Irish	irish	default
gl	Galician	galician	default
hi	Hindi	hindi	default
hu	Hungarian	hungarian	default
hy	Armenian	armenian	default
id	Indonesian	indonesian	default
it	Italian	italian	standard + icu_normalizer + ascii_folding + dedupe_asciifolding
ja	Japanese	cjk	icu_tokenizer + icu_normalizer + word_break_helper
ko	Korean	cjk	default
lt	Lithuanian	lithuanian	default
lv	Latvian	latvian	default
nb	Norwegian Bokmål	norwegian	default
nl	Dutch	dutch	default
nn	Norwegian Nynorsk	norwegian	default
pt	Portuguese	portuguese	default
pt-br	Brazilian Portuguese	brazilian	default
ro	Romanian	romanian	default
ru	Russian	russian	standard + icu_normalizer + russian_char_filter + word_break_helper
simple	Simple English	english	standard + icu_normalizer + icu_folding + word_break_helper
sv	Swedish	swedish	standard + icu_normalizer + icu_folding + word_break_helper
th	Thai	thai	default
tr	Turkish	turkish	default

Analyzers with usual plugins:

Code	Language	text	plain
pl	Polish	polish	default
zh	Chinese	chinese	icu_tokenizer + smartcn_stop + icu_normalizer + word_break_helper
he	Hebrew	hebrew	standard + icu_normalizer + icu_folding + word_break_helper
uk	Ukrainian	ukrainian	default

ICU Tokenization languages:

Code	Language
bo	Tibetan
dz	Dzongkha
gan	Gan
ja	Japanese
km	Khmer
lo	Lao
my	Burmese
th	Thai
wuu	Wu
zh	Chinese
zh-classical	Classical Chinese
zh-yue	Cantonese
bug	Buginese
cdo	Min Dong
cr	Cree
hak	Hakka
jv	Javanese
zh-min-nan	Min Nan
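
If it helps, the lists above could be turned into a small lookup along these lines. This is only a sketch: the language codes are copied from the tables in this comment, while the constant names, the function name, and the shape of the returned dict are invented for illustration and are not an existing Wikibase or CirrusSearch API.

```python
from __future__ import annotations

# Languages where the text analyzer is useful compared to plain.
TEXT_ANALYZER_LANGS = frozenset(
    "ar bg ca ckb cs da de el en en-ca en-gb es eu fa fi fr ga gl hi hu hy "
    "id it ja ko lt lv nb nl nn pt pt-br ro ru simple sv th tr".split()
)

# Additional languages that get custom analyzers when the usual plugins
# are installed.
PLUGIN_LANGS = frozenset("pl zh he uk".split())

# Languages that have the icu_tokenizer rather than the standard tokenizer.
ICU_TOKENIZER_LANGS = frozenset(
    "bo dz gan ja km lo my th wuu zh zh-classical lzh zh-yue yue "
    "bug cdo cr hak jv zh-min-nan".split()
)


def analysis_hints(code: str, plugins_installed: bool = False) -> dict[str, bool]:
    """Return hypothetical per-language hints based on the lists above."""
    text_beneficial = code in TEXT_ANALYZER_LANGS or (
        plugins_installed and code in PLUGIN_LANGS
    )
    return {
        "text_beneficial": text_beneficial,
        "icu_tokenizer": code in ICU_TOKENIZER_LANGS,
    }


print(analysis_hints("de"))
print(analysis_hints("zh", plugins_installed=True))
```

Keeping the plugin-dependent languages in their own set means the same lookup works whether or not the plugins are available in a given environment.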

TASK DETAIL

https://phabricator.wikimedia.org/T180169


To: TJones
Cc: TJones, Aklapper, EBernhardson, Lydia_Pintscher, hoo, aude, Smalyshev, dcausse, Lahi, GoranSMilovanovic, QZanden, EBjune, Avner, debt, Gehel, Jdrewniak, FloNight, Wikidata-bugs, Mbch331
