[ https://issues.apache.org/jira/browse/SOLR-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717323#action_12717323 ]

Michael Ludwig commented on SOLR-1204:
--------------------------------------

FYI, there is a nice Unicode web tool here: http://rishida.net/scripts/uniview/

Java identifiers exclude dash and dot ( - and . ); they allow $, € and other 
currency symbols.

An XML NMTOKEN excludes currency symbols, but allows dash, dot, middle dot, 
underscore, and colon. It also allows Arabic numerals [0-9] at the beginning.

Colons must be excluded for Solr purposes. But I wouldn't exclude dash and dot.
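
A quick way to see the combination in Java: below is a sketch of a field-name 
check that accepts the NMTOKEN-style characters mentioned above (letters, 
digits, dash, dot, middle dot, underscore) while rejecting the colon. The 
class name and the character class are my own approximation, not the actual 
NMTOKEN production (which also admits combining marks and extenders).

```java
import java.util.regex.Pattern;

public class FieldNameCheck {

    // Rough approximation of NMTOKEN characters: letters, decimal digits,
    // dot, underscore, middle dot (\u00B7) and dash. The colon is
    // deliberately left out for Solr purposes.
    private static final Pattern NMTOKEN_NO_COLON =
        Pattern.compile("[\\p{L}\\p{Nd}._\\u00B7-]+");

    static boolean isValidFieldName(String name) {
        return NMTOKEN_NO_COLON.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidFieldName("123.Käse-A_Z")); // true
        System.out.println(isValidFieldName("with:colon"));   // false
        System.out.println(isValidFieldName("price$"));       // false
    }
}
```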

Fields are entered in XML (schema.xml), so why not base the type on an XML 
type? Validation would be easy:

<!DOCTYPE solr-test[
<!ELEMENT solr-test EMPTY >
<!ATTLIST solr-test field NMTOKEN #REQUIRED>
]>
<solr-test field="   123.Käse-A_Z      "/>

Note the leading and trailing spaces around the attribute value; the XML parser 
strips these when validating against an NMTOKEN type, so this class of user 
error can be excluded fairly simply. The absence of colons, however, would have 
to be guaranteed by some other means. Still, I think there are advantages.
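
For what it's worth, the stripping can be demonstrated with a plain JAXP 
validating parse. The class and method names below are made up for this 
demonstration, but the normalization itself is standard XML 1.0 behaviour for 
tokenized attribute types such as NMTOKEN.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class NmtokenNormalization {

    // The document from the comment above, spaces and all.
    static final String SAMPLE =
        "<!DOCTYPE solr-test[\n"
        + "<!ELEMENT solr-test EMPTY >\n"
        + "<!ATTLIST solr-test field NMTOKEN #REQUIRED>\n"
        + "]>\n"
        + "<solr-test field=\"   123.K\u00E4se-A_Z      \"/>";

    /** Parses with DTD validation on and returns the field attribute. */
    static String normalizedField(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true); // validation enables NMTOKEN normalization
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignore */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        return db.parse(new InputSource(new StringReader(xml)))
                 .getDocumentElement().getAttribute("field");
    }

    public static void main(String[] args) throws Exception {
        // Leading and trailing spaces are gone after normalization.
        System.out.println("[" + normalizedField(SAMPLE) + "]");
    }
}
```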

If ensuring the uniqueness of field names in schema.xml matters, one could 
also consider using the NAME type and declaring field/@name as ID in the DTD. 
This would exclude dash, dot, middle dot and Arabic numerals as start 
characters.
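
For illustration, a sketch of what that could look like in a DTD — the element 
and attribute names here are placeholders, not the actual schema.xml 
vocabulary. A validating parser would flag the second field as a validity 
error, since ID values must be unique within the document:

<!DOCTYPE schema [
<!ELEMENT schema (field*) >
<!ELEMENT field EMPTY >
<!ATTLIST field name ID #REQUIRED >
]>
<schema>
  <field name="title"/>
  <field name="title"/> <!-- validity error: duplicate ID value -->
</schema>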

I think I could supply a patch for NMTOKEN or NAME if this is found desirable.

> Enhance SpellingQueryConverter to handle UTF-8 instead of ASCII only
> --------------------------------------------------------------------
>
>                 Key: SOLR-1204
>                 URL: https://issues.apache.org/jira/browse/SOLR-1204
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>    Affects Versions: 1.3
>            Reporter: Michael Ludwig
>            Assignee: Shalin Shekhar Mangar
>            Priority: Trivial
>             Fix For: 1.4
>
>         Attachments: SpellingQueryConverter.java.diff, 
> SpellingQueryConverter.java.diff
>
>
> Solr - User - SpellCheckComponent: queryAnalyzerFieldType
> http://www.nabble.com/SpellCheckComponent%3A-queryAnalyzerFieldType-td23870668.html
> In the above thread, it was suggested to extend the SpellingQueryConverter to 
> cover the full UTF-8 range instead of handling US-ASCII only. This might be 
> as simple as changing the regular expression used to tokenize the input 
> string to accept a sequence of one or more Unicode letters ( \p{L}+ ) instead 
> of a sequence of one or more word characters ( \w+ ).
> See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html for 
> Java regular expression reference.
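
The difference the quoted suggestion describes is easy to reproduce — by 
default Java's \w is ASCII-only, so a word such as "Käse" is split at the 
umlaut, while \p{L} matches any Unicode letter. The tokens helper below is a 
throwaway written for this demonstration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeDemo {

    /** Collects all matches of the given regex in the input. */
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \w+ is [a-zA-Z_0-9]+ by default: the umlaut breaks the token.
        System.out.println(tokens("\\w+", "Käse"));    // [K, se]
        // \p{L}+ accepts any Unicode letter: the token stays whole.
        System.out.println(tokens("\\p{L}+", "Käse")); // [Käse]
    }
}
```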

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
