You may have to change the k-means job code itself to accomplish this,
because once k-means has produced its k centers, the clustering step runs
over the TF-IDF vector file, which looks like this:

id : vectors
/10000001
{11133:0.33407269965183145,4179:0.6294628642719677,11147:0.47122968104183194,3428:0.5197254290056249}
/10000002
{2693:0.1914665765512973,1078:0.12018772808991451,12357:0.4096048435885022,3428:0.23087411002109504,1590:0.19757134430454687,4912:0.21339950481825154,1621:0.3342897624454898,4781:0.39371810276427055,11848:0.44143150752208343,10170:0.42616584472568675}
/10000003

The Mahout job fails to save the corresponding document id; it parses each entry into:

category      weight:vector
31770   1.0: [2187:0.324, 6168:0.592, 9571:0.445, 10840:0.507, 11032:0.299]


If you are running in Hadoop MapReduce mode, the relevant code is:

KMeansClusterMapper:

  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
    throws IOException, InterruptedException {
    // Input: key = document id, value = the document's vector.
    // Only the vector is passed on; the key is not used here.
    clusterer.outputPointWithClusterInfo(point.get(), clusters, context);
  }

which calls:

  public void outputPointWithClusterInfo(Vector vector,
                                         Iterable<Cluster> clusters,
                                         Mapper<?,?,IntWritable,WeightedVectorWritable>.Context context)
    throws IOException, InterruptedException {
    AbstractCluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    for (AbstractCluster cluster : clusters) {
      Vector clusterCenter = cluster.getCenter();
      double distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, vector);
      if (distance < nearestDistance || nearestCluster == null) {
        nearestCluster = cluster;
        nearestDistance = distance;
      }
    }
    context.write(new IntWritable(nearestCluster.getId()), new WeightedVectorWritable(1, vector));
  }

But the output fails to record the document id; it records only the
category (cluster) the vector belongs to and the vector's values.
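
To illustrate the idea of keeping the id, here is a minimal sketch in plain
Java. It does not use the Mahout or Hadoop classes at all: the class name
DocIdClusterSketch, its methods, and the use of Map<Integer, Double> as a
sparse vector are all my own assumptions for illustration, and plain squared
Euclidean distance stands in for whatever distance measure the job is
configured with. It runs the same nearest-center loop as above, but keeps the
document id as the key of the result instead of dropping it.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: plain Java only, no Mahout or Hadoop types.
class DocIdClusterSketch {

  // Squared Euclidean distance between two sparse vectors (index -> weight).
  static double squaredDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      double diff = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
      sum += diff * diff;
    }
    // Indices present only in b contribute their full squared weight.
    for (Map.Entry<Integer, Double> e : b.entrySet()) {
      if (!a.containsKey(e.getKey())) {
        sum += e.getValue() * e.getValue();
      }
    }
    return sum;
  }

  // Same loop shape as outputPointWithClusterInfo: pick the closest center.
  static int nearestCluster(Map<Integer, Double> vector,
                            Map<Integer, Map<Integer, Double>> centers) {
    Integer nearestId = null;
    double nearestDistance = Double.MAX_VALUE;
    for (Map.Entry<Integer, Map<Integer, Double>> c : centers.entrySet()) {
      double distance = squaredDistance(c.getValue(), vector);
      if (nearestId == null || distance < nearestDistance) {
        nearestId = c.getKey();
        nearestDistance = distance;
      }
    }
    if (nearestId == null) {
      throw new IllegalArgumentException("no cluster centers given");
    }
    return nearestId;
  }

  // Unlike the mapper above, the document id is kept as the key of the result.
  static Map<String, Integer> assign(Map<String, Map<Integer, Double>> docs,
                                     Map<Integer, Map<Integer, Double>> centers) {
    Map<String, Integer> docToCluster = new LinkedHashMap<>();
    for (Map.Entry<String, Map<Integer, Double>> d : docs.entrySet()) {
      docToCluster.put(d.getKey(), nearestCluster(d.getValue(), centers));
    }
    return docToCluster;
  }

  public static void main(String[] args) {
    // Made-up toy data, only to show the shape of the input and output.
    Map<String, Map<Integer, Double>> docs = new LinkedHashMap<>();
    docs.put("/10000001", Collections.singletonMap(1, 1.0));
    docs.put("/10000002", Collections.singletonMap(2, 1.0));

    Map<Integer, Map<Integer, Double>> centers = new LinkedHashMap<>();
    centers.put(0, Collections.singletonMap(1, 0.9));
    centers.put(1, Collections.singletonMap(2, 0.8));

    // Prints one "document id -> cluster id" line per document.
    assign(docs, centers).forEach((id, cluster) -> System.out.println(id + " -> " + cluster));
  }
}
```

In the real job the equivalent change would presumably be to hand the map
key on together with the vector when writing the output, but I have not
worked that out against the Mahout classes here.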

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-To-get-the-Documents-from-generated-Cluster-tp960031p3245288.html
Sent from the Mahout User List mailing list archive at Nabble.com.
