Thanks Robin. It works. After dumping the content, the tokenized documents look correct. However, the word count doesn't contain the term that has only 1 occurrence (the same is true for the dictionary file). Is that expected behavior?
Thanks,
Weide

On Mon, Sep 5, 2011 at 10:51 PM, Robin Anil <[email protected]> wrote:

> Use the sequence file dumper to inspect the files:
>
> bin/mahout seqdumper --help
>
> On Tue, Sep 6, 2011 at 10:03 AM, Walter Chang <[email protected]> wrote:
>
> > I ended up adding a default SmartChineseAnalyzer constructor to get around
> > the issue. I have another question. Right now, I can see the following
> > directories created, but they seem to be encoded in some binary format.
> > Is there any tool to double-check the generated contents, as well as the
> > TF-IDF scores calculated?
> >
> > df-count  dictionary.file-0  frequency.file-0  tfidf-vectors  tf-vectors
> > tokenized-documents  wordcount
> >
> > Thanks a lot,
> >
> > Weide
> >
> > On Mon, Sep 5, 2011 at 9:03 PM, Jake Mannix <[email protected]> wrote:
> >
> > > On Mon, Sep 5, 2011 at 8:36 PM, Lance Norskog <[email protected]> wrote:
> > >
> > > > A Lucene expert could change SparseVectors to handle this case. (There
> > > > might be other problems.)
> > >
> > > I don't think we need a Lucene expert; we just need to change the logic
> > > from "instantiate the Analyzer via the no-arg constructor" to "if a
> > > no-arg constructor exists for the Analyzer, use it, else try the
> > > single-arg constructor which takes a LuceneUtil.VERSION as the
> > > argument". And possibly let the client specify the Lucene version on
> > > the command line (making sure to swap in all the Lucene jars that might
> > > be needed for that exact version).
> > >
> > > -jake
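For what it's worth, the constructor-fallback logic Jake describes in the quoted thread can be sketched with plain reflection. This is only a hypothetical illustration (the class name `AnalyzerFactory` and the generic `Object versionArg` parameter are made up for the example, not Mahout's actual code):

```java
import java.lang.reflect.Constructor;

public class AnalyzerFactory {

  // Try the no-arg constructor first; if the class doesn't have one,
  // fall back to a single-arg constructor matching the type of the
  // supplied argument (e.g. a Lucene Version for newer Analyzers).
  public static Object instantiate(Class<?> clazz, Object versionArg)
      throws Exception {
    try {
      // Preferred path: a public no-arg constructor exists.
      Constructor<?> noArg = clazz.getConstructor();
      return noArg.newInstance();
    } catch (NoSuchMethodException e) {
      // Fallback path: a public single-arg constructor taking versionArg.
      Constructor<?> oneArg = clazz.getConstructor(versionArg.getClass());
      return oneArg.newInstance(versionArg);
    }
  }
}
```

With this shape, classes that expose a no-arg constructor keep working unchanged, while version-parameterized Analyzers get constructed through the fallback branch.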
