Mark,
I think you can write your own tokenizer that calls Sen to tokenize
Japanese sentences.
For reference, look at the source code of lucene-ja.
I think the source code is included in lucene-ja.jar, but I'm not sure.
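To illustrate the idea, here is a rough sketch of the adapter shape only.
It is not the lucene-ja code and it uses neither Sen nor Lucene classes:
Segmenter, Term and SegmenterTokenizer are hypothetical names I made up, and
the whitespace splitter is a toy stand-in for Sen. In a real version you
would probably extend Lucene's Tokenizer and call Sen's tagger where the
sketch calls Segmenter.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class SenAdapterSketch {

    /** Hypothetical stand-in for a morphological analyzer such as Sen. */
    interface Segmenter {
        /** Returns the surface forms of the text, in order of appearance. */
        List<String> segment(String text);
    }

    /** One emitted term with character offsets into the original text. */
    static final class Term {
        final String text;
        final int start;
        final int end;

        Term(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Adapter: hands out the segmenter's output one Term at a time. */
    static final class SegmenterTokenizer implements Iterator<Term> {
        private final String input;
        private final Iterator<String> surfaces;
        private int cursor = 0; // where the search for the next surface begins

        SegmenterTokenizer(String input, Segmenter segmenter) {
            this.input = input;
            this.surfaces = segmenter.segment(input).iterator();
        }

        @Override
        public boolean hasNext() {
            return surfaces.hasNext();
        }

        @Override
        public Term next() {
            String surface = surfaces.next();
            // Recover offsets by locating the surface form in the input.
            int start = input.indexOf(surface, cursor);
            if (start < 0) {
                throw new IllegalStateException("surface not found: " + surface);
            }
            cursor = start + surface.length();
            return new Term(surface, start, cursor);
        }
    }

    /** Toy segmenter for the demo only: splits on whitespace. */
    static final Segmenter WHITESPACE = text -> {
        List<String> out = new ArrayList<>();
        for (String s : text.trim().split("\\s+")) {
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    };

    /** Drains the tokenizer into a list, for convenience. */
    static List<Term> tokenize(String text, Segmenter segmenter) {
        List<Term> terms = new ArrayList<>();
        Iterator<Term> it = new SegmenterTokenizer(text, segmenter);
        while (it.hasNext()) {
            terms.add(it.next());
        }
        return terms;
    }

    public static void main(String[] args) {
        for (Term t : tokenize("kyou wa hare", WHITESPACE)) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

The point of the sketch is only that the adapter keeps the analyzer behind a
small interface, so the Lucene-facing class does not care which Sen version
sits underneath.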
Koji
Mark Bennett wrote:
I've been reading through the SEN project doc and various Japanese blogs,
but I'm still having some issues.
In particular, it seems like perhaps you're supposed to have BOTH
sen-1.2.2.1 and lucene-ja-2.0test2 installed?
I guess the lucene-ja is an adapter layer between the org.apache.lucene
analyzers and base net.java Tokenizers, whereas sen-1.2.2.1 is the base SEN
package, and is not aware of Lucene/Solr. So I guess you need both.
But both packages include Lucene classes, and the lucene-ja stuff seems to be
using a very old Lucene. I'm not sure how you layer this all together with a
more recent Solr implementation (I'm using the nightly stable build).
Or perhaps the older lucene-ja is intended to already include SEN? It does
have some SEN files, but they are quite a bit older than the sen-1.2.2.1
stuff, and you've still got the old Lucene version issue.
Any input would be appreciated.
--
Mark Bennett / New Idea Engineering, Inc. / [email protected]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513