[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

Robert Muir (JIRA) Sat, 08 Aug 2009 15:25:40 -0700

    [ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740988#action_12740988 ]


Robert Muir commented on SOLR-1336:
-----------------------------------

{quote}
Are the stopwords (words="org/apache/lucene/analysis/cn/stopwords.txt") being 
loaded directly from the jar? If so, a comment to that effect might prevent 
some confusion. 
{quote}

Yes, good idea.
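
For reference, a minimal sketch of the kind of schema.xml entry in question, with such a comment added. The field type name text_zh is hypothetical, and the factory class names are assumed to match the attached patch; the words path is the one quoted above.

```xml
<!-- Sketch only: "text_zh" is a hypothetical name, and the two SmartChinese
     factory class names are assumed to match those in SOLR-1336.patch. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <!-- stopwords.txt is not in conf/. It is resolved from the classpath,
         i.e. loaded directly from the smartcn jar. The list contains
         punctuation rather than real stopwords. -->
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/stopwords.txt"/>
  </analyzer>
</fieldType>
```

The stop filter is kept as a separate step so that punctuation tokens emitted by the word filter are dropped before indexing.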

{quote}
Do you happen to know what the memory footprint of this analyzer is if it's 
used? I assume the dictionaries will get loaded on the first use.
{quote}

No, I am not sure of the exact footprint, but it is probably quite large (a few MB). 
And yes, the dictionaries are loaded on first use. The smartcn jar file itself is 
also large because of those dictionaries. You may have noticed that solr.war is 
much smaller after the last lucene update, since the analyzer was factored out of 
analyzers.jar. 

{quote}
Might be cool to add a chinese field to example/exampledocs/solr.xml... or 
maybe there should be an international.xml doc where we could add a few 
different languages?
{quote}

I figured solr.xml wasn't the best place to have an example... I like the idea of 
international.xml, with some examples for other languages too.

If there is some concern about the size of this (monster) analyzer, one option 
is to put these factories/examples elsewhere, to keep the size of solr smaller. 


> Add support for lucene's SmartChineseAnalyzer
> ---------------------------------------------
>
>                 Key: SOLR-1336
>                 URL: https://issues.apache.org/jira/browse/SOLR-1336
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Robert Muir
>         Attachments: SOLR-1336.patch
>
>
> SmartChineseAnalyzer was contributed to Lucene; it indexes Simplified Chinese 
> text as words.
> If the factories for the tokenizer and word token filter are added to Solr, it 
> can be used, although there should be a sample config or wiki entry showing 
> how to apply the built-in stopwords list.
> This is needed because the list doesn't contain actual stopwords, but must be 
> used to prevent indexing punctuation. 
> Note: we did some refactoring/cleanup on this analyzer recently, so it would 
> be much easier to do this after the next Lucene update.
> It has also been moved out of -analyzers.jar due to size, and now builds in 
> its own smartcn jar file, so that jar would need to be added if this feature 
> is desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
