Thanks Robin. It works. After dumping the content, the tokenized documents
look correct. However, the word count doesn't contain the term that has only
1 occurrence (the same is true of the dictionary file). Is that expected
behavior?

Thanks,

Weide

On Mon, Sep 5, 2011 at 10:51 PM, Robin Anil <[email protected]> wrote:

> Use the sequence file dumper to inspect the files:
>
> bin/mahout seqdumper --help
>
> > On Tue, Sep 6, 2011 at 10:03 AM, Walter Chang <[email protected]> wrote:
>
> > I ended up adding a default SmartChineseAnalyzer constructor to work
> > around the issue. I have another question: I can see the following
> > directories created, but they seem to be encoded in some binary format.
> > Is there any tool to double-check the generated contents, as well as the
> > calculated TF-IDF scores?
> >
> > df-count  dictionary.file-0  frequency.file-0  tfidf-vectors  tf-vectors
> >  tokenized-documents  wordcount
> >
> > Thanks a lot,
> >
> > Weide
> >
> > On Mon, Sep 5, 2011 at 9:03 PM, Jake Mannix <[email protected]> wrote:
> >
> > > On Mon, Sep 5, 2011 at 8:36 PM, Lance Norskog <[email protected]> wrote:
> > > >
> > > >
> > > > A Lucene expert could change SparseVectors to handle this case.
> > > > (There might be other problems.)
> > > >
> > >
> > > I don't think we need a Lucene expert; we just need to change the logic
> > > from "instantiate the Analyzer via the no-arg constructor" to "if a
> > > no-arg constructor exists for the Analyzer, use it; otherwise try the
> > > single-arg constructor that takes a LuceneUtil.VERSION as the argument".
> > > And possibly let the client specify the Lucene version on the command
> > > line (making sure to swap in all the Lucene jars needed for that exact
> > > version).
> > >
> > >  -jake
> > >
> >
>
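The constructor-fallback logic Jake describes can be sketched with plain Java
reflection. The `Version` enum and analyzer classes below are hypothetical
stand-ins, not Mahout's or Lucene's actual types (Lucene's real ones are
`org.apache.lucene.util.Version` and `org.apache.lucene.analysis.Analyzer`);
only the "try the no-arg constructor, else the Version constructor" pattern is
the point.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for Lucene's Version enum.
enum Version { LUCENE_31 }

// Hypothetical stand-in for Lucene's Analyzer base class.
abstract class Analyzer {}

// An analyzer with only a no-arg constructor.
class NoArgAnalyzer extends Analyzer {
    public NoArgAnalyzer() {}
}

// An analyzer that requires a Version argument, like SmartChineseAnalyzer
// variants that take a Lucene version.
class VersionedAnalyzer extends Analyzer {
    final Version version;
    public VersionedAnalyzer(Version v) { this.version = v; }
}

public class AnalyzerLoader {
    // Try the no-arg constructor first; if the class doesn't have one,
    // fall back to the single-arg constructor that takes a Version.
    static Analyzer createAnalyzer(Class<? extends Analyzer> cls, Version v)
            throws Exception {
        try {
            Constructor<? extends Analyzer> c = cls.getConstructor();
            return c.newInstance();
        } catch (NoSuchMethodException e) {
            Constructor<? extends Analyzer> c = cls.getConstructor(Version.class);
            return c.newInstance(v);
        }
    }

    public static void main(String[] args) throws Exception {
        // Both analyzer styles are instantiated through the same entry point.
        System.out.println(
            createAnalyzer(NoArgAnalyzer.class, Version.LUCENE_31)
                .getClass().getSimpleName());
        System.out.println(
            createAnalyzer(VersionedAnalyzer.class, Version.LUCENE_31)
                .getClass().getSimpleName());
    }
}
```

With this in place, the command line could accept a version flag and pass it
through to `createAnalyzer`, as suggested above.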
