[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356696#comment-17356696 ]

Ivan Provalov commented on LUCENE-7321:
---------------------------------------

[~marcussorealheis], I have been maintaining it (bug fixes, etc.), but have not upgraded it to version 8 yet. I could do that if there is any interest in integrating it.

> Character Mapping
> -----------------
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
> Reporter: Ivan Provalov
> Priority: Minor
> Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing
> variant. These cases can be as simple as lower/upper case in most languages,
> accented characters, or more complex morphological phenomena like prefix
> omission or constructing a character with a combining mark. This
> component addresses the cases that are not covered by the ASCII folding
> component or are more complex to design with other tools. The idea is that a
> linguist could provide the mappings in a tab-delimited file, which can then
> be used directly by Solr.
> The mappings are maintained in the tab-delimited file, which could be just a
> copy-paste from an Excel spreadsheet. This gives linguists the opportunity
> to create the mappings, and the developer then includes them in the Solr
> configuration. In a few cases, when the mappings grow complex,
> some additional debugging may be required. The mappings can map any
> sequence of characters to any other sequence of characters.
> Some of the cases I discuss in the detailed document are handling the voiced vowels
> for Japanese; common typing substitutions for Korean, Russian, Polish;
> transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding
> for Japanese.
> In the appendix, I give an example of implementing a Russian
> lightweight stemmer using this component.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

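
The mapping idea described in the issue (a tab-delimited file in which any character sequence maps to any other) can be illustrated roughly as follows. This is a minimal sketch and not the code in LUCENE-7321.patch: the class name `CharMappingSketch`, its two methods, the greedy longest-match policy, and the sample mappings are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the component attached to LUCENE-7321.
public class CharMappingSketch {

    /** Parses lines of the form "source<TAB>target" into a map. */
    static Map<String, String> parseMappings(String tsv) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String line : tsv.split("\\R")) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // limit -1 keeps an empty target, so "al<TAB>" means "delete al"
            String[] cols = line.split("\t", -1);
            if (cols.length != 2 || cols[0].isEmpty()) {
                throw new IllegalArgumentException("Expected source<TAB>target: " + line);
            }
            mappings.put(cols[0], cols[1]);
        }
        return mappings;
    }

    /** Rewrites input left to right, preferring the longest matching source (an assumed policy). */
    static String apply(Map<String, String> mappings, String input) {
        int maxLen = 0;
        for (String source : mappings.keySet()) {
            maxLen = Math.max(maxLen, source.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            // Try the longest candidate first, then shorter ones.
            for (int len = Math.min(maxLen, input.length() - i); len > 0; len--) {
                String target = mappings.get(input.substring(i, i + len));
                if (target != null) {
                    out.append(target);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(input.charAt(i++)); // no mapping: copy the character through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample mappings: U+0451 -> U+0435, and "sch" -> "sh".
        Map<String, String> m = parseMappings("\u0451\t\u0435\nsch\tsh\n");
        System.out.println(apply(m, "schema"));
    }
}
```

Note that an empty target deletes the source sequence wherever it occurs, so real prefix removal or suffix folding would need position-aware rules that this sketch does not model.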
[jira] [Commented] (LUCENE-7321) Character Mapping
[ https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356136#comment-17356136 ]

Marcus Eagan commented on LUCENE-7321:
--------------------------------------

Hi [~iprovalo] I'm curious if you have been maintaining this patch through version `8` for your company? If so, do you want to revive this discussion?