Indeed, it has changed quite a bit recently. Vectors formerly had a name field, which allowed documentIds to be carried along in their term vectors; that field has been removed. The refactoring introduced NamedVector, which wraps a normal Vector to carry such a name. Since a NamedVector is also a Vector, names flow through the various jobs transparently.
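
To illustrate why names flow through unchanged, here is a minimal standalone sketch of the delegation idea. SimpleVector, SparseVector, and SimpleNamedVector are hypothetical, simplified stand-ins for Mahout's Vector, RandomAccessSparseVector, and NamedVector, not the real classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for Mahout's Vector interface (hypothetical).
interface SimpleVector {
    double get(int index);
    void set(int index, double value);
}

// Sparse map-backed implementation, loosely modeled on RandomAccessSparseVector.
class SparseVector implements SimpleVector {
    private final Map<Integer, Double> values = new HashMap<>();
    public double get(int index) { return values.getOrDefault(index, 0.0); }
    public void set(int index, double value) { values.put(index, value); }
}

// Wrapper that carries a name (e.g. a documentId) while delegating all vector
// operations. Because it is itself a SimpleVector, downstream code that only
// knows about SimpleVector passes the name along without noticing it.
class SimpleNamedVector implements SimpleVector {
    private final SimpleVector delegate;
    private final String name;
    SimpleNamedVector(SimpleVector delegate, String name) {
        this.delegate = delegate;
        this.name = name;
    }
    public String getName() { return name; }
    public double get(int index) { return delegate.get(index); }
    public void set(int index, double value) { delegate.set(index, value); }
}

public class NamedVectorSketch {
    public static void main(String[] args) {
        SimpleVector v = new SparseVector();
        v.set(42, 3.5);
        SimpleVector named = new SimpleNamedVector(v, "doc-10075717");
        // Code written against SimpleVector still works unchanged...
        System.out.println(named.get(42));
        // ...and the name can be recovered later with an instanceof check.
        if (named instanceof SimpleNamedVector) {
            System.out.println(((SimpleNamedVector) named).getName());
        }
    }
}
```

The same instanceof pattern is how a consumer at the end of a pipeline would recover the documentId from a named vector.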

When you run seq2sparse on a set of text documents, it produces an output sequence file of <Text, VectorWritable> pairs with the documentId in the key field but not in the value field. It looks like it should instead wrap each vector in a NamedVector carrying the documentId. The following patch seems to correct the problem, though it needs more testing:

Index: utils/src/main/java/org/apache/mahout/utils/vectors/text/term/TFPartialVectorReducer.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/text/term/TFPartialVectorReducer.java (revision 948493)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/text/term/TFPartialVectorReducer.java (working copy)
@@ -35,6 +35,7 @@
 import org.apache.lucene.analysis.shingle.ShingleFilter;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 import org.apache.mahout.common.StringTuple;
+import org.apache.mahout.math.NamedVector;
 import org.apache.mahout.math.RandomAccessSparseVector;
 import org.apache.mahout.math.SequentialAccessSparseVector;
 import org.apache.mahout.math.Vector;
@@ -102,7 +103,7 @@
     }
     // if the vector has no nonZero entries (nothing in the dictionary), let's not waste space sending it to disk.
     if(vector.getNumNondefaultElements() > 0) {
-      VectorWritable vectorWritable = new VectorWritable(vector);
+      VectorWritable vectorWritable = new VectorWritable(new NamedVector(vector, key.toString()));
       output.collect(key, vectorWritable);
     } else {
       reporter.incrCounter("TFParticalVectorReducer", "emptyVectorCount", 1);
Index: utils/src/main/java/org/apache/mahout/utils/vectors/tfidf/TFIDFPartialVectorReducer.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/tfidf/TFIDFPartialVectorReducer.java (revision 948493)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/tfidf/TFIDFPartialVectorReducer.java (working copy)
@@ -33,11 +33,12 @@
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reducer;
 import org.apache.hadoop.mapred.Reporter;
+import org.apache.mahout.math.NamedVector;
 import org.apache.mahout.math.RandomAccessSparseVector;
 import org.apache.mahout.math.SequentialAccessSparseVector;
 import org.apache.mahout.math.Vector;
+import org.apache.mahout.math.VectorWritable;
 import org.apache.mahout.math.Vector.Element;
-import org.apache.mahout.math.VectorWritable;
 import org.apache.mahout.math.map.OpenIntLongHashMap;
 import org.apache.mahout.utils.vectors.TFIDF;
 import org.apache.mahout.utils.vectors.common.PartialVectorMerger;
@@ -85,7 +86,7 @@
     if (sequentialAccess) {
       vector = new SequentialAccessSparseVector(vector);
     }
-    VectorWritable vectorWritable = new VectorWritable(vector);
+    VectorWritable vectorWritable = new VectorWritable(new NamedVector(vector, key.toString()));
     output.collect(key, vectorWritable);
   }



On 5/26/10 11:20 AM, Delroy Cameron wrote:
yeah Jeff,
the implementation for printing the points has changed. Instead of a list of
strings for each point, we now have a list of WeightedVectorWritable objects.
The problem is that in the previous implementation, getting the point id
(i.e. the document id for each document in the cluster) was straightforward;
see below.

After looking at the API and testing a few variations of the points output, I
am forced to ask: are the ids for the points in the WeightedVectorWritable
object at all?

List<String> points = clusterIdToPoints.get(String.valueOf(cluster.getId()));
if (points != null) {
  writer.write("\tPoints: ");
  for (Iterator<String> iterator = points.iterator(); iterator.hasNext();) {
    String point = iterator.next();
    writer.append(point);
    if (iterator.hasNext()) {
      writer.append(", ");
    }
  }
  writer.write('\n');
}

Top Terms:
                were                                    =>    32.23076923076923
                expression                              =>   27.333333333333332
                gene                                    =>   23.076923076923077
                from                                    =>   19.641025641025642
                cells                                   =>    17.76923076923077
                c                                       =>    16.23076923076923
                1                                       =>    14.76923076923077
                human                                   =>   14.487179487179487
                5                                       =>   13.820512820512821
                we                                      =>   13.179487179487179
        Points: 10075717, 10330009, 10419905, 10811945, 11116137, 11222753, 11691919
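As an aside, the hand-rolled separator loop in the snippet above can be collapsed into a single call with String.join (available from Java 8 onward); the points list here is hypothetical sample data:

```java
import java.util.Arrays;
import java.util.List;

public class PointJoinSketch {
    public static void main(String[] args) {
        // Hypothetical document ids standing in for one cluster's points.
        List<String> points = Arrays.asList("10075717", "10330009", "10419905");
        // Produces the same "\tPoints: a, b, c" line as the iterator loop.
        String line = "\tPoints: " + String.join(", ", points);
        System.out.println(line);
    }
}
```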

List<WeightedVectorWritable> points = clusterIdToPoints.get(cluster.getId());
if (points != null) {
  writer.write("\tWeight:  Point:\n\t");
  for (Iterator<WeightedVectorWritable> iterator = points.iterator(); iterator.hasNext();) {
    WeightedVectorWritable point = iterator.next();
    writer.append(Double.toString(point.getWeight())).append(": ");
    writer.append(ClusterBase.formatVector(point.getVector().get(), dictionary));
    if (iterator.hasNext()) {
      writer.append("\n\t");
    }
  }
  writer.write('\n');
}

Top Terms:
                 riele                                   =>    14.00426959991455
                 meredith                                =>   12.727957301669651
                 lysine-6                                =>   11.388569796526873
                 amores                                  =>   10.307115837379738
                 mashimo                                 =>    9.840165774027506
                 halks                                   =>    9.598452267823395
                 maseki                                  =>    8.773765140109592
                 lysine-63                               =>    8.496143341064453
                 saporita                                =>    8.167389004318803
                 a94                                     =>    8.119972387949625
         Weight:  Point:
         1.0: [265:1.016, 1753:3.503, 2087:2.217, 2162:2.396, 2217:1.347, 2702:1.054, 2886:1.125, 2974:2.472, 3197:1.603, 3472:1.902, 3714:1.658, 3789:1.735, 4003:1.538, 4168:3.849, 4387:6.602, 4399:3.800, 4513:1.717, 4640:1.387, ...]


-----
--cheers
Delroy
