Re: Approaches to indexing indigenous languages?

Alexandre Rafalovitch Fri, 11 Jun 2021 07:20:51 -0700

Hi Peter,

This is a fascinating problem. I would love to see the eventual
solution reported back to the list.

I think your best bet lies in exploring the ICU4J library, which ships
with Solr but has to be enabled in solrconfig.xml. A little bit is
explained at
https://solr.apache.org/guide/8_8/language-analysis.html#unicode-collation
and
https://solr.apache.org/guide/8_8/charfilterfactories.html#solr-icunormalizer2charfilterfactory
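
Enabling it usually comes down to putting the ICU jars on the
classpath with <lib> directives. A rough sketch, assuming a stock
Solr 8.x install where those jars sit under contrib/analysis-extras
(the relative path is an assumption, so adjust it to your layout):

```xml
<!-- solrconfig.xml: load the ICU analysis jars.
     Assumes the default Solr 8.x directory layout; change "dir" to match your install. -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
```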

After that, it is basically "standing on the shoulders of giants". If
you want to trace where the support actually comes from: ICU4J is the
Java implementation of ICU (International Components for Unicode,
http://site.icu-project.org/), which implements the Unicode standard,
and Unicode seems to cover the scripts you discuss:
https://www.unicode.org/charts/#scripts
(Unified Canadian Aboriginal Syllabics). This seems to imply that word
and sentence boundaries (which is what I assume you are after) are
also defined in Unicode, therefore in ICU, therefore in ICU4J,
therefore in Solr.

And that brings us back to finding the right magical invocation. The
specifics depend on the exact search problem you are trying to solve
and on figuring out the language codes/names for your
languages/locales.
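
As a first experiment, something like the following might be a
reasonable starting point. It is only a sketch under assumptions: the
field type name "text_indigenous" is made up, and I have not tested it
against syllabics text. ICUFoldingFilterFactory strips diacritics,
which may be wrong for Americanist phonetic notation where they carry
meaning; solr.ICUNormalizer2FilterFactory would be the gentler
alternative to try there.

```xml
<!-- Hypothetical starting point, not a tested configuration -->
<fieldType name="text_indigenous" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <!-- Unicode-aware word segmentation from ICU -->
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- Case folding, normalization and accent removal in one step;
             swap for solr.ICUNormalizer2FilterFactory to keep diacritics -->
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI is the quickest way to see
what tokens actually come out for a sample of your text.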

I once built a demo of phonetic search against Thai text. It was a
very long time ago, so don't treat it as copy/paste-ready, but it is
still relevant. This is an excerpt from that demo:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55

        <!--
            During indexing:
            1) tokenize Thai text with built-in rules+dictionary
            2) map it to Latin characters (with special accents indicating tones)
            3) get rid of tone marks, as nobody uses them
            4) do some phonetic (Beider-Morse) broadening to match possible
               alternative spellings in English

            During querying, we don't want this field type matching Thai text
            (Beider-Morse is a little too aggressive for that). So, we are
            using an English-specific query chain.
        -->
        <fieldType name="thai_english" class="solr.TextField">
            <analyzer type="index">
                <tokenizer class="solr.ICUTokenizerFactory"/>
                <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
                <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC" />
                <filter class="solr.BeiderMorseFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.BeiderMorseFilterFactory" />
            </analyzer>
        </fieldType>

Hope this helps,
    Alex.
P.S. If you make progress but still get stuck, feel free to reach out
directly as well. I am in Montreal, so the question resonated with me.

On Thu, 10 Jun 2021 at 15:38, Peter Tyrrell <[email protected]> wrote:
>
> I'm quite familiar with indexing English and French languages in Solr, but 
> has anybody got any tips on indexing and querying (Canadian) indigenous First 
> Nations languages? Depending on the language, terms may be written in a 
> syllabic script (https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics) 
> or in Americanist phonetic notation 
> (https://en.wikipedia.org/wiki/Americanist_phonetic_notation).
>
>
> Peter
>
> Peter Tyrrell, MLIS
> Lead Developer at Andornot
> 1-866-266-2525 x706 / [email protected]
>
