[Wikidata-bugs] [Maniphest] [Edited] T180169: Make list of languages where using stemmed analyzer for Wikibase is beneficial

Smalyshev Thu, 09 Nov 2017 13:58:36 -0800

Smalyshev updated the task description.

CHANGES TO TASK DESCRIPTION

After talking with @dcausse, we decided that setting up two custom analyzers (a stemmed and a non-stemmed one) for every language in descriptions is wasteful, since the stemmed one is not useful for the Wikibase use case in every language. We want to create stemmed analyzers only for the languages where they are beneficial, and use only the plain (non-stemmed) analyzer for the others.

Here is the list of languages for which we have a "non-trivial" configuration for the stemming (`text`) analyzer:

```
ar
bg
ca
ckb
cs
da
de
el
en
en-ca
en-gb
es
eu
fa
fi
fr
ga
gl
hi
hu
hy
id
it
ja
ko
lt
lv
nb
nl
nn
pt
pt-br
ro
ru
simple
sv
th
tr
```

That includes languages with named analyzer types (e.g. 'bulgarian') and those with specialized filters or tokenizers.

Note that we are only concerned with whether the `text` analyzer adds value over the `plain` analyzer, since we are keeping the `plain` one anyway, and only in the context of common Wikibase/Wikidata usage on descriptions.
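
To make the intended outcome concrete, here is a minimal sketch of how the per-language choice could look once the list is settled. It is only an illustration: the names `STEMMED_CANDIDATES` and `analyzers_for_description` are hypothetical and are not part of CirrusSearch or Wikibase, and the set of languages that actually benefit from stemming has not been decided yet, so the candidate list above is used as a placeholder default.

```python
# Hypothetical sketch; none of these names exist in CirrusSearch or Wikibase.

# Candidate languages: those with a "non-trivial" `text` analyzer configuration.
# The final "beneficial" list is expected to be a subset of these.
STEMMED_CANDIDATES = frozenset({
    "ar", "bg", "ca", "ckb", "cs", "da", "de", "el", "en", "en-ca",
    "en-gb", "es", "eu", "fa", "fi", "fr", "ga", "gl", "hi", "hu",
    "hy", "id", "it", "ja", "ko", "lt", "lv", "nb", "nl", "nn",
    "pt", "pt-br", "ro", "ru", "simple", "sv", "th", "tr",
})


def analyzers_for_description(lang, beneficial=STEMMED_CANDIDATES):
    """Return the analyzer names to set up for descriptions in `lang`.

    `plain` is always kept; `text` (stemmed) is added only when the
    language is in the `beneficial` set.
    """
    analyzers = ["plain"]
    if lang in beneficial:
        analyzers.append("text")
    return analyzers
```

With the placeholder default, `analyzers_for_description("de")` returns both `plain` and `text`, while a language code outside the set gets only `plain`. Passing a narrower `beneficial` set models the outcome of this task.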

TASK DETAIL

https://phabricator.wikimedia.org/T180169

To: Smalyshev
Cc: TJones, Aklapper, EBernhardson, Lydia_Pintscher, hoo, aude, Smalyshev, dcausse, Lahi, GoranSMilovanovic, QZanden, EBjune, Avner, debt, Gehel, Jdrewniak, FloNight, Wikidata-bugs, Mbch331
