On Mon, Dec 5, 2011 at 12:57 PM, Isabel Drost <[email protected]> wrote:

> On 02.12.2011 Chris Grier wrote:
> > Caused by: java.io.IOException: Cannot open filename
> > /tmp/mahout-work-hadoop/reuters-out-seqdir-sparse-lda/tf-vectors/_logs
>
> Are you providing the correct input directory here? On first sight it
> seems to
> think that the logs dir contains the tf-vectors.
>

No, what I think it's doing is looping over all subdirectories of
tf-vectors and *not* skipping the _logs directory. I'm surprised this is
happening; I thought this had been fixed long ago. In fact, at least on
trunk, the line where this would be coming from is:

    SequenceFileDirValueIterator<VectorWritable> it =
        new SequenceFileDirValueIterator<VectorWritable>(getInputPath(),
                                                         PathType.LIST,
                                                         PathFilters.logsCRCFilter(),
                                                         null,
                                                         true,
                                                         getConf());

which uses the PathFilters.logsCRCFilter() which skips over _logs.
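For anyone following along, here is a minimal sketch of the kind of
filtering a filter like that performs — the class and method below are my
own illustration, not Mahout's actual source: when listing the contents of
a MapReduce output directory, Hadoop bookkeeping entries such as _logs,
_SUCCESS, and .crc checksum side files should be rejected so that only
real data files (e.g. part-r-00000) are opened as sequence files.

```java
// Illustrative sketch only -- not the actual PathFilters.logsCRCFilter()
// implementation. Shows the filtering rule for MapReduce output listings:
// keep data files, skip Hadoop bookkeeping entries.
public class LogsCrcFilterSketch {

    /** Returns true if the entry looks like data rather than bookkeeping. */
    public static boolean accept(String name) {
        return !name.startsWith("_")    // _logs, _SUCCESS, _temporary, ...
            && !name.startsWith(".")    // hidden files
            && !name.endsWith(".crc");  // checksum side files
    }

    public static void main(String[] args) {
        System.out.println(accept("part-r-00000")); // true: data, kept
        System.out.println(accept("_logs"));        // false: skipped
        System.out.println(accept("data.crc"));     // false: skipped
    }
}
```

If a directory iterator applies a rule like this before opening each
entry, the "Cannot open filename .../_logs" failure above can't happen.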

> On a related note: If you are working with LDA - did you try out Jake's
> new implementation? Would be great to get more feedback on that one.
>

Yeah, that would indeed be awesome. I should see if I can hack on
cluster-reuters.sh to add an option to use the new LDA as well. Should not
be hard; I've run that code against reuters and it does very nicely.

  -jake
