On Mon, Dec 5, 2011 at 12:57 PM, Isabel Drost <[email protected]> wrote:
> On 02.12.2011 Chris Grier wrote:
> > Caused by: java.io.IOException: Cannot open filename
> > /tmp/mahout-work-hadoop/reuters-out-seqdir-sparse-lda/tf-vectors/_logs
>
> Are you providing the correct input directory here? On first sight it
> seems to think that the logs dir contains the tf-vectors.
>
No, what I think it's doing is looping over all subdirectories of
tf-vectors and *not* skipping the _logs directory. I'm surprised this
is happening; I thought this had been fixed long ago. In fact, at least
on trunk, the line this would be coming from is:
    SequenceFileDirValueIterator<VectorWritable> it =
        new SequenceFileDirValueIterator<VectorWritable>(getInputPath(),
                                                         PathType.LIST,
                                                         PathFilters.logsCRCFilter(),
                                                         null,
                                                         true,
                                                         getConf());
which uses PathFilters.logsCRCFilter(), and that filter skips over _logs.
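For anyone curious, the filtering behaves roughly like the predicate below. This is a minimal illustrative sketch, not Mahout's actual implementation; the class and method names here are made up, and the real filter operates on Hadoop Path objects rather than plain strings:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of what a logs/CRC-style filter does: accept ordinary part
// files, reject Hadoop bookkeeping entries like _logs, _SUCCESS, and
// hidden .crc checksum files.
public class LogsCrcFilterSketch {
    // Hypothetical predicate mirroring the intent of logsCRCFilter().
    static boolean accept(String name) {
        return !name.startsWith("_")
            && !name.startsWith(".")
            && !name.endsWith(".crc");
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
            "part-00000", "_logs", "_SUCCESS", ".part-00000.crc");
        for (String e : entries) {
            System.out.println(e + " -> " + accept(e));
        }
    }
}
```

With a filter like this in the directory listing, only part-00000 survives, which is why the original IOException on .../tf-vectors/_logs suggests the filter is not being applied on that code path.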
> On a related note: If you are working with LDA - did you try out Jake's new
> implementation? Would be great to get more feedback on that one.
>
Yeah, that would indeed be awesome. I should see if I can hack on
cluster-reuters.sh to add an option to use the new LDA as well. Should
not be hard; I've run that code against reuters and it does very nicely.
-jake