(I suggest that, where possible, we favor VarIntWritable and VarLongWritable, the variable-length-encoded versions, over IntWritable and LongWritable. Saving a couple of bytes per key adds up.)
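To make the size argument concrete, here is a minimal, self-contained sketch of variable-length integer encoding (zig-zag plus 7-bit groups). It is not Mahout's code and may not match the exact byte layout VarIntWritable uses; the class name VarIntSketch and its methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only: NOT the Mahout VarIntWritable implementation.
public class VarIntSketch {

    /** Encodes an int as zig-zag + base-128 varint (1 to 5 bytes). */
    static byte[] encode(int value) {
        // Zig-zag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small
        // magnitudes of either sign become small unsigned numbers.
        int v = (value << 1) ^ (value >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        // Emit 7 bits at a time, low bits first; a set high bit means
        // "more bytes follow".
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    /** Decodes a byte array produced by encode(). */
    static int decode(byte[] bytes) {
        int v = 0;
        int shift = 0;
        for (byte b : bytes) {
            v |= (b & 0x7F) << shift;
            shift += 7;
        }
        // Undo the zig-zag mapping.
        return (v >>> 1) ^ -(v & 1);
    }

    public static void main(String[] args) {
        // 226 and 107908 are the largest row and column indices for the
        // 227 x 107909 matrix mentioned in the quoted thread.
        int[] samples = {0, 63, 226, 107908, Integer.MAX_VALUE};
        for (int s : samples) {
            byte[] enc = encode(s);
            System.out.println(s + " -> " + enc.length
                + " byte(s) variable-length, 4 bytes fixed-width");
        }
    }
}
```

Under this scheme small keys shrink and very large ones grow to five bytes, so the saving depends on the keys being mostly small, which is typical for row and column indices.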
On Wed, May 25, 2011 at 5:38 PM, Jake Mannix <[email protected]> wrote:

> Hmm... then it looks like we're spitting out Long ids from the
> lucene.vector output.
>
> Take a look at RowIdJob - it takes SequenceFile<Text,VectorWritable> and
> converts it to a pair of sequence files:
> SequenceFile<IntWritable,VectorWritable> and SequenceFile<IntWritable,Text>
> (the latter being the "dictionary" of what int ids correspond to what
> original text ids). This job could be modified trivially by replacing
> every reference to Text with LongWritable.
>
> On Wed, May 25, 2011 at 8:00 AM, Stefan Wienert <[email protected]> wrote:
>
> > Yes, with:
> > bin/mahout lucene.vector \
> > --dir /home/hadoop/MahoutStatements/tf_index \
> > --field fulltext \
> > --dictOut /home/hadoop/MahoutStatements/dict.txt \
> > --output /home/hadoop/MahoutStatements/tfidf-vectors \
> > --idField id \
> > --weight TFIDF
> >
> > 2011/5/25 Jake Mannix <[email protected]>:
> > > Did you rebuild your tfidf-vectors with trunk as well?
> > >
> > > On Wed, May 25, 2011 at 6:59 AM, Stefan Wienert <[email protected]> wrote:
> > >
> > >> First, I use http://svn.apache.org/repos/asf/mahout/trunk, tested a
> > >> few minutes ago with the newest version.
> > >>
> > >> And still:
> > >> bin/mahout transpose \
> > >> --input /home/hadoop/MahoutStatements/tfidf-vectors \
> > >> --numRows 227 \
> > >> --numCols 107909 \
> > >> --tempDir /home/hadoop/MahoutStatements/tfidf-matrix/transpose
> > >> produces:
> > >> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> > >> be cast to org.apache.hadoop.io.IntWritable
> > >>
> > >> My first idea, changing "lucene.vector", does not work; there is too
> > >> much to change.
> > >>
> > >> So... Ideas? What about changing "transpose" and "matrixmult" to use
> > >> LongWritable instead of IntWritable? Is this problematic?
> > >>
> > >> 2011/5/25 Jake Mannix <[email protected]>:
> > >> > On Wed, May 25, 2011 at 6:14 AM, Stefan Wienert <[email protected]> wrote:
> > >> >
> > >> >> So the real problem is that "transpose" and "matrixmult" (maybe)
> > >> >> still use IntWritable instead of LongWritable.
> > >> >
> > >> > It's the other way around: matrix operations use keys which are ints,
> > >> > and the lucene.vector class needs to respect this. It doesn't on
> > >> > current trunk?
> > >> >
> > >> > -jake
> > >>
> > >> --
> > >> Stefan Wienert
> > >>
> > >> http://www.wienert.cc
> > >> [email protected]
> > >>
> > >> Telefon: +495251-2026838 (neue Nummer seit 20.06.10)
> > >> Mobil: +49176-40170270
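For reference, the re-keying idea Jake describes (sequential int row ids plus a "dictionary" back to the original ids) might look roughly like the sketch below once the original ids are longs. Plain Java collections stand in for the sequence files here; RowIdSketch and assignRowIds are hypothetical names, and this is not the actual RowIdJob code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: in-memory stand-in for the RowIdJob idea.
public class RowIdSketch {

    /**
     * Assigns consecutive int row ids (0, 1, 2, ...) in input order and
     * returns the dictionary mapping each new row id to its original long id.
     */
    static Map<Integer, Long> assignRowIds(List<Long> originalIds) {
        Map<Integer, Long> dictionary = new LinkedHashMap<>();
        int nextRow = 0;
        for (Long id : originalIds) {
            dictionary.put(nextRow++, id);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        // Made-up document ids, including one that does not fit in an int.
        List<Long> ids = Arrays.asList(900L, 17L, 5_000_000_000L);
        Map<Integer, Long> dictionary = assignRowIds(ids);
        for (Map.Entry<Integer, Long> e : dictionary.entrySet()) {
            System.out.println("row " + e.getKey() + " <- original id " + e.getValue());
        }
    }
}
```

The matrix jobs would then see only the int row ids, and the dictionary lets results be mapped back to the original ids afterwards.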
