(I suggest we not use IntWritable or LongWritable, but favor VarIntWritable
and VarLongWritable, their variable-length-encoded counterparts, where
possible. Saving a couple of bytes per key adds up.)
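
To make the size difference concrete, here is a minimal sketch of
variable-length int encoding (zig-zag, then 7 bits per byte). It is written
against plain java.io so that it runs without Hadoop or Mahout on the
classpath. The class and method names are chosen for illustration only; this
is not the VarIntWritable source, just the general technique as I understand
it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only: names are hypothetical, not Mahout's API.
class VarIntSketch {

    // Zig-zag maps small magnitudes (positive or negative) to small
    // unsigned values, then 7 bits are emitted per byte, low group first.
    // The high bit of each byte signals that more bytes follow.
    static void writeSignedVarInt(int value, DataOutputStream out) throws IOException {
        int zigzag = (value << 1) ^ (value >> 31);
        while ((zigzag & ~0x7F) != 0) {
            out.writeByte((zigzag & 0x7F) | 0x80);
            zigzag >>>= 7;
        }
        out.writeByte(zigzag);
    }

    // Reverses writeSignedVarInt: reassemble the 7-bit groups, undo zig-zag.
    static int readSignedVarInt(DataInputStream in) throws IOException {
        int raw = 0;
        int shift = 0;
        int b;
        do {
            b = in.readByte() & 0xFF;
            raw |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    // Convenience wrapper: encode one int into a fresh byte array.
    static byte[] encode(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeSignedVarInt(value, new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: decode one int from a byte array.
    static int decode(byte[] encoded) {
        try {
            return readSignedVarInt(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Keys in the range of the 227 x 107909 matrix from this thread,
        // plus one large value to show the worst case.
        int[] samples = {5, 226, 107908, Integer.MAX_VALUE};
        for (int key : samples) {
            System.out.println(key + " -> " + encode(key).length
                + " byte(s) variable-length, 4 bytes fixed-width");
        }
    }
}
```

With the matrix sizes from this thread, a row key such as 226 takes 2 bytes
and a column key such as 107908 takes 3, against the fixed 4 bytes that
IntWritable writes. The trade-off is that values near Integer.MAX_VALUE take
5 bytes, so this only pays off when most keys are small.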

On Wed, May 25, 2011 at 5:38 PM, Jake Mannix <[email protected]> wrote:

> Hmm... then it looks like we're spitting out Long ids from the
> lucene.vector output.
>
> Take a look at RowIdJob - it takes SequenceFile<Text,VectorWritable> and
> converts it to a pair of sequence files: SequenceFile<IntWritable,VectorWritable>
> and SequenceFile<IntWritable,Text> (the latter being the "dictionary" of what
> int ids correspond to what original text ids).  This job could be modified
> trivially by replacing every reference to Text with LongWritable.
>
> On Wed, May 25, 2011 at 8:00 AM, Stefan Wienert <[email protected]> wrote:
>
> > Yes, with:
> > bin/mahout lucene.vector \
> >        --dir /home/hadoop/MahoutStatements/tf_index \
> >        --field fulltext \
> >        --dictOut /home/hadoop/MahoutStatements/dict.txt \
> >        --output /home/hadoop/MahoutStatements/tfidf-vectors  \
> >        --idField id \
> >        --weight TFIDF
> >
> > 2011/5/25 Jake Mannix <[email protected]>:
> > > Did you rebuild your tfidf-vectors with trunk as well?
> > >
> > > On Wed, May 25, 2011 at 6:59 AM, Stefan Wienert <[email protected]> wrote:
> > >
> > >> First, I use http://svn.apache.org/repos/asf/mahout/trunk, tested a few
> > >> minutes ago with the newest version.
> > >>
> > >> And still:
> > >> bin/mahout transpose \
> > >> --input /home/hadoop/MahoutStatements/tfidf-vectors \
> > >> --numRows 227 \
> > >> --numCols 107909 \
> > >> --tempDir /home/hadoop/MahoutStatements/tfidf-matrix/transpose
> > >> produces:
> > >> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> > >> be cast to org.apache.hadoop.io.IntWritable
> > >>
> > >> My first idea, changing "lucene.vector", does not work: there is too
> > >> much to change.
> > >>
> > >> So... ideas? What about changing "transpose" and "matrixmult" to use
> > >> LongWritable instead of IntWritable? Would that be problematic?
> > >>
> > >> 2011/5/25 Jake Mannix <[email protected]>:
> > >> > On Wed, May 25, 2011 at 6:14 AM, Stefan Wienert <[email protected]> wrote:
> > >> >
> > >> >> So the real problem is that "transpose" and "matrixmult" (maybe)
> > >> >> still use IntWritable instead of LongWritable.
> > >> >>
> > >> >
> > >> > It's the other way around: matrix operations use keys which are ints,
> > >> > and the lucene.vector class needs to respect this.  It doesn't on
> > >> > current trunk?
> > >> >
> > >> >  -jake
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Stefan Wienert
> > >>
> > >> http://www.wienert.cc
> > >> [email protected]
> > >>
> > >> Phone: +495251-2026838 (new number since 20.06.10)
> > >> Mobile: +49176-40170270
> > >>
> > >
> >
>
