Hi Jake,
When I run

$ mahout vectordump --seqFile part-00000 --dictionary dict.out --printKey

I get:
Input Path: part-00000
0 elts: {0:c}
1 elts: {1:d}
2 elts: {2:e}
Dumped 3 Vectors
Given that my original data was
id1: A A B C
id2: B D D
id3: A B B E
how am I to interpret this? Is it printing out the characters that are
unique for a given doc id? I was expecting to see something that would
allow me to see how similar documents were to one another.
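
For concreteness, here is a minimal sketch of the kind of document-by-document
output described above, written in plain Python rather than Mahout. It assumes
raw term counts as vector entries and takes "similarity" to mean the dot
products in A * A^T, where A is the document-by-term matrix. The names docs,
vocab, rows and gram are made up for illustration, and none of this is Mahout
API or necessarily what the Mahout job computes.

```python
from collections import Counter

# Toy corpus from the message above.
docs = {
    "id1": "A A B C".split(),
    "id2": "B D D".split(),
    "id3": "A B B E".split(),
}

# Vocabulary in a fixed order so every document vector has the same layout.
vocab = sorted({term for terms in docs.values() for term in terms})

# Rows of A: raw term counts per document (an assumption, not TF-IDF).
rows = {}
for doc_id, terms in docs.items():
    counts = Counter(terms)
    rows[doc_id] = [counts[term] for term in vocab]


def dot(u, v):
    """Dot product of two equal-length count vectors."""
    return sum(x * y for x, y in zip(u, v))


# Entries of A * A^T: one number per pair of documents.
gram = {(a, b): dot(rows[a], rows[b]) for a in rows for b in rows}

for (a, b), value in sorted(gram.items()):
    print(a, b, value)
```

With these counts, id1 and id3 share the most weight (they overlap on A and B),
while id2 overlaps with the others only on B.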
Thanks,
Kris
2010/6/10 Jake Mannix <[email protected]>
> On Thu, Jun 10, 2010 at 10:28 AM, Kris Jack <[email protected]> wrote:
> >
> > Thanks very much for the help. I looked into the problem a little deeper
> > and found that the org.apache.mahout.utils.vectors.lucene.Driver was
> > writing out LongWritables instead of IntWritables, so I just changed the
> > code in there. Should this code be using IntWritables or LongWritables?
> >
>
> The reason the Lucene Driver uses long is that Solr encodes uids as
> long. Kinda backwards that Mahout wants ints and Solr wants longs, but
> that's the way it is.
>
> Maybe the Lucene Driver could take a boolean flag on whether to encode
> the keys as long or int? Anyone have opinions on this?
>
>
> > After writing them to a sequence file and running your matrix
> > transposition and multiplication, I get an output called part-00000. If I
> > read it using $ mahout seqdumper --seqFile part-00000 then it outputs:
> >
>
> I would use "mahout vectordump" instead of "mahout seqdumper" and
> you'll get nicer output.
>
> -jake
>
--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/