Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - audrey.lorberf...@ibm.com Tue, 03 Sep 2019 06:31:02 -0700

Toke,

Do you find that searching over both the original title field and the 
normalized title field increases the time it takes for your search engine to 
retrieve results?


-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/31/19, 3:01 PM, "Toke Eskildsen" <t...@kb.dk> wrote:

    Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    > Just wanting to test the waters here – for those of you with search engines
    > that index multiple languages, do you use ASCII-folding in your schema?
    
    Our primary search engine is for Danish users, with sources being
    bibliographic records with titles and other metadata in many different
    languages. We normalise to Danish, meaning that most ligatures are
    removed, but also that letters such as Swedish ö become Danish ø. The
    rules for normalisation are dictated by Danish library practice and were
    implemented by a resident librarian.
    
    Whenever we do this normalisation, we index two versions of the field: a
    very lightly normalised (lowercased) field and a heavily normalised
    field. If a record has a title "Köket" (kitchen in Swedish), we store
    title_orig:köket and title_norm:køket. edismax is used to ensure that
    both fields are searched by default (plus an explicit field alias "title"
    is set to point to both title_orig and title_norm for qualified searches)
    and that matches in title_orig have more weight in the relevance
    calculation.
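
    A minimal sketch of the two-field approach described above, in Python.
    The mapping table, the function names and the boost value are
    assumptions for illustration; only the ö to ø rule and the field names
    title_orig and title_norm come from the message. The edismax parameters
    use the standard qf and f.<alias>.qf request parameters.

```python
import unicodedata

# Illustrative mapping only: the real rules come from Danish library practice.
# "ö" -> "ø" is from the example above; "ä" -> "æ" is an assumed extra rule.
DANISH_FOLD = str.maketrans({"ö": "ø", "ä": "æ"})


def index_fields(title: str) -> dict:
    """Build the lightly and heavily normalised versions of a title."""
    # NFC first, so that "o" + combining diaeresis is treated as "ö".
    light = unicodedata.normalize("NFC", title).lower()
    heavy = light.translate(DANISH_FOLD)
    return {"title_orig": light, "title_norm": heavy}


def edismax_params(query: str) -> dict:
    """Request parameters that search both fields, favouring title_orig."""
    fields = "title_orig^2 title_norm"  # the boost of 2 is an assumption
    return {
        "defType": "edismax",
        "q": query,
        "qf": fields,          # fields searched by default
        "f.title.qf": fields,  # alias: title:foo is expanded to both fields
    }
```

    With this, index_fields("Köket") yields title_orig "köket" and
    title_norm "køket", so a search for either spelling can match, while
    the boost lets an exact-spelling hit in title_orig rank higher.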
    
    > We are onboarding Spanish documents into our index right now and keep
    > going back and forth on whether we should preserve accent marks.
    
    Going with what we do, my answer would be: Yes, do preserve and also
    remove :-). You could even have 3 or more levels of normalisation,
    depending on how much time you have for polishing.
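
    For the Spanish case, three levels could look roughly like the sketch
    below. The field names title_soft and title_ascii and the choice to
    keep ñ at the middle level are assumptions made for illustration, not
    something taken from our setup.

```python
import unicodedata


def strip_marks(text: str, keep: str = "") -> str:
    """Remove combining marks, leaving any character listed in `keep` intact."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def normalisation_levels(title: str) -> dict:
    """Return three increasingly aggressive versions of a title."""
    light = unicodedata.normalize("NFC", title).lower()
    return {
        "title_orig": light,                         # lowercased only
        "title_soft": strip_marks(light, keep="ñ"),  # accents removed, ñ kept
        "title_ascii": strip_marks(light),           # all marks removed
    }
```

    Each level would be indexed as its own field and given a lower weight
    the more aggressive the normalisation is.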
    
    - Toke Eskildsen
