Hi,
I’m quite sure the standard tokenizer doesn’t support Unicode combining characters.
The question is how to process them.
For Russian, I think the best approach is simply to skip the character
(i.e. build the token text without it), since it is only used to mark
where the stress accent falls.
Hi,
yes, I ended up removing the accents before processing the text with CLucene.
https://unicode.org/reports/tr15/#Normalization_Forms_Table
QString unaccent(const QString &s)
{
    // Decompose precomposed characters (NFD), then drop the combining marks.
    const QString normalized = s.normalized(QString::NormalizationForm_D);
    QString out;
    for (const QChar &c : normalized)
        if (c.category() != QChar::Mark_NonSpacing)
            out += c;
    return out;
}
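For illustration, here is a stdlib-only sketch of the same idea without Qt. It assumes the input is already NFD-decomposed and that dropping only the Combining Diacritical Marks block (U+0300–U+036F) is enough, which covers the Russian stress accent U+0301 but not every combining mark defined by Unicode:

```cpp
#include <string>

// Remove characters from the Combining Diacritical Marks block
// (U+0300..U+036F) from a UTF-32 string that is assumed to
// already be in NFD form.
std::u32string strip_marks(const std::u32string &s)
{
    std::u32string out;
    for (char32_t c : s)
        if (c < 0x0300 || c > 0x036F)  // keep everything except combining marks
            out += c;
    return out;
}
```

For example, Russian "замо́к" written as "зам" + "о" + U+0301 + "к" comes back as plain "замок".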