Hi Peter,

This is a fascinating problem. I would not mind seeing the eventual solution posted back to the list.

I think your best bet is to explore the ICU4J library, which ships with Solr but needs to be enabled in solrconfig.xml. A little bit is explained at:

https://solr.apache.org/guide/8_8/language-analysis.html#unicode-collation
https://solr.apache.org/guide/8_8/charfilterfactories.html#solr-icunormalizer2charfilterfactory

After that, it is basically "the shoulders of giants". If you are trying to trace where the support really comes from: ICU4J is the Java implementation of ICU (International Components for Unicode, http://site.icu-project.org/), which implements the Unicode standard, and Unicode seems to cover the scripts you discuss: https://www.unicode.org/charts/#scripts (Unified Canadian Aboriginal Syllabics). This seems to imply that word and sentence boundaries (which is what I assume you are after) are also defined in Unicode, therefore in ICU, therefore in ICU4J, therefore in Solr.

And that brings us back to finding the valid magical invocation. The specific invocation will depend on the exact search issue you are trying to resolve, and on figuring out the language codes/names for your languages/locales.

I did do a Thai language demo of phonetic search against Thai text. That was a very long time ago, so it is not a copy/paste solution, but it is still relevant. This is an excerpt from my demo:

https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55

<!--
    During indexing:
    1) tokenize Thai text with built-in rules+dictionary
    2) map it to latin characters (with special accents indicating tones)
    3) get rid of tone marks, as nobody uses them
    4) do some phonetic (BMF) broadening to match possible alternative spellings in English

    During querying, we don't want this field type matching Thai text on query
    (BMFF is a little too aggressive for that).
    So, we are doing English-specific query chain
-->
<fieldType name="thai_english" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
        <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC" />
        <filter class="solr.BeiderMorseFilterFactory" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.BeiderMorseFilterFactory" />
    </analyzer>
</fieldType>

Hope this helps,
    Alex.

P.S. If you make progress but still get stuck, feel free to reach out directly as well. I am in Montreal, so the questions resonated with me.

On Thu, 10 Jun 2021 at 15:38, Peter Tyrrell <[email protected]> wrote:
>
> I'm quite familiar with indexing English and French languages in Solr, but
> has anybody got any tips on indexing and querying (Canadian) indigenous First
> Nations languages? Depending on the language, terms may be written in a
> syllabic script (https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics)
> or in Americanist phonetic notation
> (https://en.wikipedia.org/wiki/Americanist_phonetic_notation).
>
>
> Peter
>
> Peter Tyrrell, MLIS
> Lead Developer at Andornot
> 1-866-266-2525 x706 / [email protected]
>
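
For anyone who wants to see what the third index-time filter in the field type above ("NFD; [:Nonspacing Mark:] Remove; NFC") does without setting up Solr, here is a minimal sketch of the same three steps using Python's standard unicodedata module. This is an approximation for illustration only: it is not ICU, the function name strip_nonspacing_marks is made up, and the sample string is a hand-written romanization rather than real output of the Thai-Latin transform.

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Approximate the ICU transform 'NFD; [:Nonspacing Mark:] Remove; NFC'."""
    # Step 1 (NFD): decompose, so accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every character in general category Mn (Nonspacing Mark).
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Step 3 (NFC): recompose whatever is left.
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    # Hypothetical romanized input with tone accents on the vowels.
    print(strip_nonspacing_marks("s\u00e0-w\u00e0t-dii"))  # prints: sa-wat-dii
```

The same idea should carry over to Americanist phonetic notation only with care: there the diacritics may be meaningful for search, so removing all nonspacing marks is an assumption to test against real queries, not a recommendation.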
