Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - audrey.lorberf...@ibm.com Fri, 30 Aug 2019 10:55:18 -0700

Atita,

Thanks for that insight!


As the conversation has progressed, we are now leaning towards leaving the
ASCII-folding filter out of our pipelines so that we keep marks like umlauts
and tildes. Instead, we might add mappings for acute and grave accents to the
file that MappingCharFilterFactory points at, so that only those more common
accent marks are stripped...
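
A minimal Python sketch of that selective stripping, for illustration only.
It is not Solr code: the function names are hypothetical, and the generated
lines assume the quoted "source" => "target" layout used by Solr's stock
mapping files. It removes only the combining grave (U+0300) and acute
(U+0301) marks and leaves umlauts and tildes untouched.

```python
import unicodedata

# Combining marks to strip: grave (U+0300) and acute (U+0301).
# Diaeresis/umlaut (U+0308) and tilde (U+0303) are deliberately not listed.
STRIPPED_MARKS = {"\u0300", "\u0301"}


def strip_acute_and_grave(text: str) -> str:
    """Remove acute and grave accents, keeping every other diacritic."""
    # NFD splits each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in STRIPPED_MARKS)
    # NFC recombines what is left (e.g. u + diaeresis back into one char).
    return unicodedata.normalize("NFC", kept)


def mapping_file_lines(chars: str) -> list[str]:
    """Build mapping-file style lines for the characters that change.

    Assumes one rule per line, written as "source" => "target".
    """
    lines = []
    for ch in chars:
        folded = strip_acute_and_grave(ch)
        if folded != ch:
            lines.append(f'"{ch}" => "{folded}"')
    return lines


if __name__ == "__main__":
    for word in ["canci\u00f3n", "a\u00f1o", "M\u00fcller", "tr\u00e8s"]:
        print(word, "->", strip_acute_and_grave(word))
    print("\n".join(mapping_file_lines("\u00e1\u00e0\u00f1\u00fc")))
```

The same folding has to run on both the index and query side, otherwise a
query typed without accents would not match a document indexed with them.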

Any other opinions are welcome!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/30/19, 10:27 AM, "Atita Arora" <atitaar...@gmail.com> wrote:

    We work on a German index. We neutralize accents before indexing, i.e.
    umlauts become 'ae', 'ue', etc., and we do the same at query time so that
    queries match appropriately.
    
    On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
    <audrey.lorberf...@ibm.com> wrote:
    
    > Hi All,
    >
    > Just wanting to test the waters here – for those of you with search
    > engines that index multiple languages, do you use ASCII-folding in your
    > schema? We are onboarding Spanish documents into our index right now and
    > keep going back and forth on whether we should preserve accent marks. From
    > our query logs, it seems people generally do not include accents when
    > searching, but you never know…
    >
    > Thank you in advance for sharing your experiences!
    >
    > --
    > Audrey Lorberfeld
    > Data Scientist, w3 Search
    > Digital Workplace Engineering
    > CIO, Finance and Operations
    > IBM
    > audrey.lorberf...@ibm.com
    >
    >
