[jira] [Commented] (LUCENE-7321) Character Mapping

2021-06-03 Thread Ivan Provalov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356696#comment-17356696
 ] 

Ivan Provalov commented on LUCENE-7321:
---

[~marcussorealheis], I have been maintaining it (bug fixes, etc...), not 
upgraded to version 8 yet.  I could do that if there is any interest in 
integrating it.

 

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing 
> variant.  These cases can be as simple as lower/upper case in most languages, 
> accented characters, or more complex morphological phenomena like prefix 
> omitting, or constructing a character with some combining mark.  This 
> component addresses the cases, which are not covered by ASCII folding 
> component, or more complex to design with other tools.  The idea is that a 
> linguist could provide the mappings in a tab-delimited file, which then can 
> be directly used by Solr.
> The mappings are maintained in the tab-delimited file, which could be just a 
> copy paste from Excel spreadsheet.  This gives the linguists the opportunity 
> to create the mappings, then for the developer to include them in Solr 
> configuration.  There are a few cases, when the mappings grow complex, where 
> some additional debugging may be required.  The mappings can contain any 
> sequence of characters to any other sequence of characters.
> Some of the cases I discuss in detail document are handling the voiced vowels 
> for Japanese; common typing substitutions for Korean, Russian, Polish; 
> transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding 
> for Japanese.  In the appendix, I give an example of implementing a Russian 
> light weight stemmer using this component.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7321) Character Mapping

2021-06-02 Thread Marcus Eagan (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356136#comment-17356136
 ] 

Marcus Eagan commented on LUCENE-7321:
--

Hi [~iprovalo] I'm curious if you have been maintaining this patch through 
version `8` for your company? If so, do you want to revive this discussion?

> Character Mapping
> -
>
> Key: LUCENE-7321
> URL: https://issues.apache.org/jira/browse/LUCENE-7321
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.6.1, 5.4.1, 6.0, 6.0.1
>Reporter: Ivan Provalov
>Priority: Minor
>  Labels: patch
> Fix For: 6.0.1
>
> Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>
>
> One of the challenges in search is recall of an item with a common typing 
> variant.  These cases can be as simple as lower/upper case in most languages, 
> accented characters, or more complex morphological phenomena like prefix 
> omitting, or constructing a character with some combining mark.  This 
> component addresses the cases, which are not covered by ASCII folding 
> component, or more complex to design with other tools.  The idea is that a 
> linguist could provide the mappings in a tab-delimited file, which then can 
> be directly used by Solr.
> The mappings are maintained in the tab-delimited file, which could be just a 
> copy paste from Excel spreadsheet.  This gives the linguists the opportunity 
> to create the mappings, then for the developer to include them in Solr 
> configuration.  There are a few cases, when the mappings grow complex, where 
> some additional debugging may be required.  The mappings can contain any 
> sequence of characters to any other sequence of characters.
> Some of the cases I discuss in detail document are handling the voiced vowels 
> for Japanese; common typing substitutions for Korean, Russian, Polish; 
> transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding 
> for Japanese.  In the appendix, I give an example of implementing a Russian 
> light weight stemmer using this component.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org