Re: Multi-lingual Search & Accent Marks

Toke Eskildsen Sat, 31 Aug 2019 12:00:59 -0700

Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> Just wanting to test the waters here – for those of you with search engines
> that index multiple languages, do you use ASCII-folding in your schema?


Our primary search engine is for Danish users, with sources being bibliographic
records with titles and other metadata in many different languages. We
normalise to Danish, meaning that most ligatures are removed, but also that
letters such as Swedish ö become Danish ø. The rules for normalisation are
dictated by Danish library practice and were implemented by a resident librarian.
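
To illustrate the kind of normalisation described above, here is a minimal Python sketch. Only the ö to ø mapping comes from this mail; the other table entries and the function names are assumptions made up for the example, and the real rules follow Danish library practice and are far more extensive.

```python
# Hypothetical mapping table: only "ö" -> "ø" is taken from the mail,
# the remaining entries are illustrative assumptions.
NORM_TABLE = str.maketrans({
    "ö": "ø",    # Swedish ö becomes Danish ø
    "ä": "æ",    # assumed analogous rule
    "ﬁ": "fi",   # assumed ligature expansion (U+FB01)
    "ﬂ": "fl",   # assumed ligature expansion (U+FB02)
})


def normalise_light(text: str) -> str:
    """Very light normalisation: lowercase only."""
    return text.lower()


def normalise_heavy(text: str) -> str:
    """Heavy normalisation: lowercase, then apply the mapping table."""
    return text.lower().translate(NORM_TABLE)
```

In a Solr setup, the same effect would more likely be achieved in the field's analysis chain than in client code; the functions above only show the mapping idea.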

Whenever we do this normalisation, we index two versions in our index: a very
lightly normalised (lowercased) field and a heavily normalised field. If a
record has a title "Köket" (kitchen in Swedish), we store title_orig:köket and
title_norm:køket. edismax is used to ensure that both fields are searched by
default (plus an explicit field alias "title" is set to point to both
title_orig and title_norm for qualified searches) and that matches in
title_orig have more weight for relevance calculation.
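
The edismax setup above could be expressed as request parameters roughly like this. It is a sketch under assumptions: the boost value of 2.0 and the helper name are invented for the example, and the field analysis for title_norm is assumed to apply the heavy normalisation at query time as well. The `qf` and `f.<alias>.qf` parameters are standard edismax parameters.

```python
from urllib.parse import urlencode


def edismax_title_params(user_query: str, orig_boost: float = 2.0) -> dict:
    """Build Solr request parameters that search both title fields.

    orig_boost is a hypothetical value; it only needs to be higher than
    the (implicit) boost of 1 on title_norm.
    """
    fields = f"title_orig^{orig_boost} title_norm"
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": fields,          # fields searched by default
        "f.title.qf": fields,  # alias: title:foo searches both fields
    }


# Example: the query string to append to a /select request.
query_string = urlencode(edismax_title_params("köket"))
```

With this, an unqualified search for köket and a qualified search for title:köket both hit title_orig and title_norm, and an exact match in title_orig scores higher.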

> We are onboarding Spanish documents into our index right now and keep
> going back and forth on whether we should preserve accent marks.

Going with what we do, my answer would be: Yes, do preserve and also remove 
:-). You could even have 3 or more levels of normalisation, depending on how 
much time you have for polishing.
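
For the Spanish case, three levels could look like the following Python sketch. The choice of levels is an assumption made for illustration, not something we run: level 0 is lowercased only, level 1 removes acute accents and diaereses but keeps ñ as a distinct letter, and level 2 removes all combining marks.

```python
import unicodedata

ACUTE = "\u0301"      # combining acute accent (á, é, í, ó, ú)
DIAERESIS = "\u0308"  # combining diaeresis (ü)


def levels(text: str) -> tuple:
    """Return (level 0, level 1, level 2) normalisations of text."""
    lowered = unicodedata.normalize("NFC", text.lower())
    decomposed = unicodedata.normalize("NFD", lowered)

    # Level 1: drop acute and diaeresis only, so ñ keeps its tilde.
    level1 = unicodedata.normalize(
        "NFC",
        "".join(ch for ch in decomposed if ch not in (ACUTE, DIAERESIS)),
    )

    # Level 2: drop every combining mark, so ñ becomes n.
    level2 = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    return lowered, level1, level2
```

Each level would go into its own field, with the least normalised field given the highest weight.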

- Toke Eskildsen
