Hmm... then it looks like we're spitting out Long ids from the lucene.vector
output.

Take a look at RowIdJob - it takes a SequenceFile<Text,VectorWritable> and
converts it to a pair of sequence files: SequenceFile<IntWritable,VectorWritable>
and SequenceFile<IntWritable,Text> (the latter being the "dictionary" of which
int ids correspond to which original text ids).  This job could be modified
trivially by replacing every reference to Text with LongWritable.
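
To make the idea concrete, here is a minimal sketch of the id assignment, under loud assumptions: plain Java collections stand in for the sequence files, double[] stands in for VectorWritable, and the class and method names (RowIdSketch, assignRowIds) are made up for illustration. It is not the RowIdJob source, only the shape of what such a job would do with long keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: plain collections stand in for the
// SequenceFiles, and this is not Mahout's RowIdJob code.
final class RowIdSketch {

    /** Result pair: int-keyed rows plus the int -> original-id dictionary. */
    static final class Result {
        final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        final Map<Integer, Long> dictionary = new LinkedHashMap<>();
    }

    /** Assigns row ids 0, 1, 2, ... to the rows in iteration order. */
    static Result assignRowIds(Map<Long, double[]> rowsByLongId) {
        Result result = new Result();
        int nextRowId = 0;
        for (Map.Entry<Long, double[]> row : rowsByLongId.entrySet()) {
            // The int key is what int-keyed matrix operations would consume.
            result.matrix.put(nextRowId, row.getValue());
            // The dictionary remembers which original long id got which row.
            result.dictionary.put(nextRowId, row.getKey());
            nextRowId++;
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so row ids are deterministic.
        Map<Long, double[]> input = new LinkedHashMap<>();
        input.put(9000000001L, new double[] {0.5, 0.0});
        input.put(42L, new double[] {0.0, 1.5});
        Result r = assignRowIds(input);
        System.out.println(r.dictionary);
    }
}
```

The dictionary is what lets you map rows of the transposed or multiplied result back to the original document ids afterwards.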

On Wed, May 25, 2011 at 8:00 AM, Stefan Wienert <[email protected]> wrote:

> Yes, with:
> bin/mahout lucene.vector \
>        --dir /home/hadoop/MahoutStatements/tf_index \
>        --field fulltext \
>        --dictOut /home/hadoop/MahoutStatements/dict.txt \
>        --output /home/hadoop/MahoutStatements/tfidf-vectors  \
>        --idField id \
>        --weight TFIDF
>
> 2011/5/25 Jake Mannix <[email protected]>:
> > Did you rebuild your tfidf-vectors with trunk as well?
> >
> > On Wed, May 25, 2011 at 6:59 AM, Stefan Wienert <[email protected]> wrote:
> >
> >> First, I use http://svn.apache.org/repos/asf/mahout/trunk, tested some
> >> minutes ago with the newest version.
> >>
> >> And still:
> >> bin/mahout transpose \
> >> --input /home/hadoop/MahoutStatements/tfidf-vectors \
> >> --numRows 227 \
> >> --numCols 107909 \
> >> --tempDir /home/hadoop/MahoutStatements/tfidf-matrix/transpose
> >> produces:
> >> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> >> be cast to org.apache.hadoop.io.IntWritable
> >>
> >> My first idea, changing "lucene.vector", does not work: there is too
> >> much to change.
> >>
> >> So... Ideas? What about changing "transpose" and "matrixmult" to use
> >> LongWritable instead of IntWritable? Would that be problematic?
> >>
> >> 2011/5/25 Jake Mannix <[email protected]>:
> >> > On Wed, May 25, 2011 at 6:14 AM, Stefan Wienert <[email protected]> wrote:
> >> >
> >> >> So the real problem is that "transpose" and "matrixmult" (maybe)
> >> >> still use IntWritable instead of LongWritable.
> >> >>
> >> >
> >> > It's the other way around: matrix operations use keys which are ints, and
> >> > the lucene.vector class needs to respect this.  It doesn't on current trunk?
> >> >
> >> >  -jake
> >> >
> >>
> >>
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> [email protected]
> >>
> >> Telefon: +495251-2026838 (neue Nummer seit 20.06.10)
> >> Mobil: +49176-40170270
> >>
> >
>
