[
https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750439#action_12750439
]
Robert Muir commented on SOLR-1336:
-----------------------------------
bq. Can this be customized to accommodate those languages?
Maybe, but there is work to do first. The dictionary is limited to the GB2312
encoding, so we can't add support for new languages until this is fixed.
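To make the GB2312 limitation concrete, here is a minimal sketch (in Python for brevity; the analyzer itself is Java). It only checks whether text can be represented in GB2312 at all, using Python's standard "gb2312" codec. The helper name fits_gb2312 is hypothetical and is not part of Lucene or Solr.

```python
def fits_gb2312(text: str) -> bool:
    """Return True if every character in text can be encoded as GB2312."""
    try:
        text.encode("gb2312")
        return True
    except UnicodeEncodeError:
        # At least one character has no GB2312 code point.
        return False


# Simplified Chinese is representable; Hangul and Thai are not.
for sample in ("中文", "한국어", "ไทย"):
    print(sample, fits_gb2312(sample))
```

Any word that fails this check cannot be stored in a GB2312-keyed dictionary, which is why the encoding has to change before other languages can be considered.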
bq. Is there any wiki link or document to help us understand how this tool
works? Sort of behind the scenes....
There are some sparse javadocs and code comments. Also see the original JIRA
ticket: LUCENE-1629
bq. What exactly does the dictionary contain? Is it any ordinary Chinese
dictionary or some sort of a customized/lemmatized dictionary?
There are two dictionaries: word dictionary, and bigram dictionary.
These dictionaries contain words and bigrams respectively, along with
frequency, in a "trie"-like structure organized by Chinese character.
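As a rough illustration of what "words and bigrams with frequency, organized by character" can look like, here is a toy sketch in Python. It is not the analyzer's actual file format or in-memory layout: the class name FrequencyTrie, the SEP separator used to join the two words of a bigram, and all frequency values are made up for the example.

```python
class FrequencyTrie:
    """Toy character trie that stores a frequency at the end of each entry."""

    def __init__(self):
        self.root = {}

    def add(self, chars, frequency):
        node = self.root
        for ch in chars:
            # Descend one level per character, creating nodes as needed.
            node = node.setdefault(ch, {})
        node[None] = frequency  # the None key marks the end of an entry

    def frequency(self, chars):
        node = self.root
        for ch in chars:
            node = node.get(ch)
            if node is None:
                return 0  # no entry starts with this character sequence
        return node.get(None, 0)  # 0 if chars is only a prefix of an entry


SEP = "@"  # hypothetical separator between the two words of a bigram

# Word dictionary: word -> frequency (illustrative values only).
words = FrequencyTrie()
words.add("中国", 120)
words.add("中文", 80)

# Bigram dictionary: pair of adjacent words -> frequency (illustrative values only).
bigrams = FrequencyTrie()
bigrams.add("中国" + SEP + "人民", 15)

print(words.frequency("中国"), words.frequency("中"))
print(bigrams.frequency("中国" + SEP + "人民"))
```

Entries that share a leading character share the same branch, so lookups for all words starting with a given character walk the same first node.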
bq. Also, how can one add new words to the dictionary?
This is currently really difficult. Please see LUCENE-1817 for some background
information.
For the moment you will have to recompile your own custom jar file, and be
familiar with the file formats the analyzer uses.
Note that we added strong warnings because we would like to change the file
formats in an upcoming release to something based on Unicode.
This way, we can support more languages, and perhaps also make it easier to
customize the dictionary data.
> Add support for lucene's SmartChineseAnalyzer
> ---------------------------------------------
>
> Key: SOLR-1336
> URL: https://issues.apache.org/jira/browse/SOLR-1336
> Project: Solr
> Issue Type: New Feature
> Components: Analysis
> Reporter: Robert Muir
> Attachments: SOLR-1336.patch, SOLR-1336.patch
>
>
> SmartChineseAnalyzer was contributed to Lucene; it indexes Simplified Chinese
> text as words.
> If the factories for the tokenizer and word token filter are added to Solr, it
> can be used, although there should be a sample config or wiki entry showing
> how to apply the built-in stopwords list.
> This is because the list doesn't contain actual stopwords, but must be used to
> prevent indexing punctuation...
> Note: we did some refactoring/cleanup on this analyzer recently, so it would
> be much easier to do this after the next Lucene update.
> It has also been moved out of -analyzers.jar due to size, and now builds in
> its own smartcn jar file, so that would need to be added if this feature is
> desired.