Hmm... then it looks like we're spitting out Long ids from the lucene.vector output.
Take a look at RowIdJob - it takes SequenceFile<Text,VectorWritable> and converts it to a pair of sequence files: SequenceFile<IntWritable, VectorWritable>, and SequenceFile<IntWritable,Text> (the latter being the "dictionary" of what int ids correspond to what original text ids). This job could be modified trivially by swapping every reference to Text to LongWritable.

On Wed, May 25, 2011 at 8:00 AM, Stefan Wienert <[email protected]> wrote:

> Yes, with:
> bin/mahout lucene.vector \
> --dir /home/hadoop/MahoutStatements/tf_index \
> --field fulltext \
> --dictOut /home/hadoop/MahoutStatements/dict.txt \
> --output /home/hadoop/MahoutStatements/tfidf-vectors \
> --idField id \
> --weight TFIDF
>
> 2011/5/25 Jake Mannix <[email protected]>:
> > Did you rebuild your tfidf-vectors with trunk as well?
> >
> > On Wed, May 25, 2011 at 6:59 AM, Stefan Wienert <[email protected]> wrote:
> >
> >> First, I use http://svn.apache.org/repos/asf/mahout/trunk, tested some
> >> minutes ago with the newest version.
> >>
> >> And still:
> >> bin/mahout transpose \
> >> --input /home/hadoop/MahoutStatements/tfidf-vectors \
> >> --numRows 227 \
> >> --numCols 107909 \
> >> --tempDir /home/hadoop/MahoutStatements/tfidf-matrix/transpose
> >> produces:
> >> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> >> be cast to org.apache.hadoop.io.IntWritable
> >>
> >> My first idea to change "lucene.vector" does not work, there is too
> >> much to change.
> >>
> >> So... Ideas? What about changing "transpose" and "matrixmult" to use
> >> LongWritable instead of IntWritable? Is this problematically?
> >>
> >> 2011/5/25 Jake Mannix <[email protected]>:
> >> > On Wed, May 25, 2011 at 6:14 AM, Stefan Wienert <[email protected]> wrote:
> >> >
> >> >> So the real problem is, that "transpose" and "matrixmult" (maybe)
> >> >> still uses IntWritable instead of LongWritable".
> >> >>
> >> >
> >> > It's the other way around: matrix operations use keys which are ints, and
> >> > the lucene.vector class needs to respect this. It doesn't on current trunk?
> >> >
> >> > -jake
> >> >
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> [email protected]
> >>
> >> Telefon: +495251-2026838 (neue Nummer seit 20.06.10)
> >> Mobil: +49176-40170270
> >>
> >
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> [email protected]
>
> Telefon: +495251-2026838 (neue Nummer seit 20.06.10)
> Mobil: +49176-40170270
>
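
P.S. To make the re-keying idea from the top of this message concrete, here is a minimal, Hadoop-free sketch in plain Java. It is not RowIdJob itself and uses no Mahout or Hadoop classes: RowIdSketch, Result and assignRowIds are hypothetical names, plain maps stand in for the sequence files, and a double[] stands in for VectorWritable. It only shows the shape of the output: one map keyed by dense int row ids, plus a "dictionary" from each int id back to the original long id.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hadoop-free illustration (hypothetical names throughout):
 * long document ids are replaced by dense int row ids, and a
 * dictionary records which int id stands for which original long id.
 */
class RowIdSketch {

    /** Holds the two outputs: re-keyed rows and the int -> long dictionary. */
    static final class Result {
        final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        final Map<Integer, Long> dictionary = new LinkedHashMap<>();
    }

    /** Assigns row ids 0, 1, 2, ... in the iteration order of the input map. */
    static Result assignRowIds(Map<Long, double[]> vectorsByLongId) {
        Result result = new Result();
        int nextRow = 0;
        for (Map.Entry<Long, double[]> entry : vectorsByLongId.entrySet()) {
            result.matrix.put(nextRow, entry.getValue());   // int-keyed row
            result.dictionary.put(nextRow, entry.getKey()); // remember the original id
            nextRow++;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, double[]> input = new LinkedHashMap<>();
        input.put(4000000000L, new double[] {0.5, 0.0}); // id too large for an int
        input.put(17L, new double[] {0.0, 1.25});

        Result result = assignRowIds(input);
        System.out.println("dictionary: " + result.dictionary);
        System.out.println("rows: " + result.matrix.size());
    }
}
```

The int-keyed map is what int-keyed matrix operations such as transpose would consume in this picture; the dictionary is only needed afterwards, to translate row numbers back to the original ids.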
