I want to crawl and index Thai-language pages using nutch-1.0. (Thai has no
spaces between words!)
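To illustrate the problem, here is a minimal sketch in plain Java (no Nutch or Lucene classes). It compares whitespace splitting with the JDK's java.text.BreakIterator for the Thai locale, which, as far as I can tell from the Lucene Javadoc, is what ThaiAnalyzer's word filter relies on. The class name ThaiSegmentationSketch and the sample string are my own, made up for illustration, and how well the text is segmented depends on the Thai locale data available in the JVM.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only (not part of Nutch or Lucene).
class ThaiSegmentationSketch {

    // Splits text at the word boundaries reported by the JDK for the Thai locale.
    static List<String> segment(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    // What a whitespace-based tokenizer would see.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // "phasa thai" (Thai language) written in Thai script, without any spaces.
        String sample = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println("whitespace tokens: " + splitOnWhitespace(sample).length);
        System.out.println("BreakIterator segments: " + segment(sample));
    }
}
```

Whitespace splitting sees the whole string as a single token, which is why a default analyzer is not enough for Thai.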
I looked at plugins/lib-lucene-analyzers, and it contains ThaiAnalyzer. So I
tried setting the plugin.includes property in nutch-site.xml as below.
<property>
<name>plugin.includes</name>
<value>language-identifier|nutch-extensionpoints|lib-lucene-analyzers|scoring-opic|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Plugin</description>
</property>
This does not work; it cannot index Thai text. So I checked hadoop.log. It shows
something like
2010-05-27 12:21:07,029 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2010-05-27 12:21:07,030 INFO plugin.PluginRepository - Registered Plugins:
2010-05-27 12:21:07,030 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2010-05-27 12:21:07,030 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
2010-05-27 12:21:07,030 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
I don't know if this means the plugin was loaded.
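As far as I understand, Nutch treats plugin.includes as a regular expression and matches it against each plugin id in full, so one thing I can at least check is which ids my value accepts. Below is a minimal sketch using only java.util.regex; the class name PluginIncludesCheck is made up for illustration, and analysis-de is an analyzer plugin id that I believe ships with Nutch, used here only as an example of an id my value leaves out.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not part of Nutch).
class PluginIncludesCheck {

    // The plugin.includes value from the nutch-site.xml above.
    static final String INCLUDES =
        "language-identifier|nutch-extensionpoints|lib-lucene-analyzers|scoring-opic"
        + "|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)";

    // Assumption: a plugin is included only if the whole id matches the pattern.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"lib-lucene-analyzers", "language-identifier", "parse-html", "analysis-de"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

This only tells me whether an id passes the include filter. It does not tell me whether any analyzer from lib-lucene-analyzers is actually used during indexing.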
How can I make use of the Thai analyzer? Is the property tag above correct?
And how can I check whether the crawler uses ThaiAnalyzer for indexing?
I need your help. I've been stuck on this problem for many days.
Thank you.
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-can-I-use-multi-language-analyzer-tp847380p847380.html
Sent from the Nutch - User mailing list archive at Nabble.com.