You may have to change the k-means job code itself to achieve this,
because after k-means has computed the k centers, the clustering step runs
over the TF-IDF vector file, which looks like this:
id : vectors
/10000001
{11133:0.33407269965183145,4179:0.6294628642719677,11147:0.47122968104183194,3428:0.5197254290056249}
/10000002
{2693:0.1914665765512973,1078:0.12018772808991451,12357:0.4096048435885022,3428:0.23087411002109504,1590:0.19757134430454687,4912:0.21339950481825154,1621:0.3342897624454898,4781:0.39371810276427055,11848:0.44143150752208343,10170:0.42616584472568675}
/10000003
The Mahout job does not save the corresponding document id. It turns each
point into a record of the form "category weight: vector", for example:
31770 1.0: [2187:0.324, 6168:0.592, 9571:0.445, 10840:0.507, 11032:0.299]
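To make concrete what goes missing, here is a minimal sketch that reads one line of the input dump shown above into an index-to-weight map, so that it can be kept next to its document id. It assumes the plain-text form `{index:weight,...}` exactly as printed in this message; the class name `SparseVectorDumpParser` is hypothetical and is not part of Mahout, and the real job reads `VectorWritable` values rather than text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Mahout class: parses the text dump of one sparse vector.
class SparseVectorDumpParser {

    // Turns "{11133:0.33,4179:0.62}" into an ordered map of term index -> TF-IDF weight.
    static Map<Integer, Double> parse(String line) {
        Map<Integer, Double> vector = new LinkedHashMap<>();
        String body = line.trim();
        if (body.startsWith("{")) {
            body = body.substring(1);
        }
        if (body.endsWith("}")) {
            body = body.substring(0, body.length() - 1);
        }
        if (body.isEmpty()) {
            return vector;
        }
        for (String entry : body.split(",")) {
            String[] parts = entry.split(":");
            vector.put(Integer.parseInt(parts[0].trim()),
                       Double.parseDouble(parts[1].trim()));
        }
        return vector;
    }

    public static void main(String[] args) {
        // The document id is the record key; the parsed map is the record value.
        String id = "/10000001";
        Map<Integer, Double> v = parse("{11133:0.33407269965183145,4179:0.6294628642719677}");
        System.out.println(id + " -> " + v);
    }
}
```

The point is only that each input record is an (id, vector) pair, while the output record above carries a cluster id, a weight, and the vector, with no document id.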
If you are running in Hadoop MapReduce mode, the relevant code is:
KMeansClusterMapper:
  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
      throws IOException, InterruptedException {
    // The input is the document key and its value (the vector); the key is not passed on.
    clusterer.outputPointWithClusterInfo(point.get(), clusters, context);
  }
which calls:
  public void outputPointWithClusterInfo(Vector vector,
                                         Iterable<Cluster> clusters,
                                         Mapper<?,?,IntWritable,WeightedVectorWritable>.Context context)
      throws IOException, InterruptedException {
    AbstractCluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    for (AbstractCluster cluster : clusters) {
      Vector clusterCenter = cluster.getCenter();
      double distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, vector);
      if (distance < nearestDistance || nearestCluster == null) {
        nearestCluster = cluster;
        nearestDistance = distance;
      }
    }
    context.write(new IntWritable(nearestCluster.getId()), new WeightedVectorWritable(1, vector));
  }
But the output does not record the document id; it records only the
category (cluster) the point belongs to and its vector.
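
As a rough illustration of the kind of change meant here, the sketch below repeats the nearest-center loop but keeps the document id and pairs it with the chosen cluster. It is plain Java with no Hadoop or Mahout types, so every name in it (`ClusterAssignmentSketch`, `assign`, `nearestCluster`) is hypothetical, and squared Euclidean distance is only an assumption standing in for whatever DistanceMeasure the job is configured with.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the mapper logic; not Mahout code.
class ClusterAssignmentSketch {

    // Squared Euclidean distance between two dense vectors of equal length
    // (an assumption; the real job uses its configured DistanceMeasure).
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Same shape as the loop in outputPointWithClusterInfo:
    // returns the index of the closest center.
    static int nearestCluster(double[] point, double[][] centers) {
        int nearest = -1;
        double nearestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double distance = squaredDistance(centers[c], point);
            if (distance < nearestDistance || nearest == -1) {
                nearest = c;
                nearestDistance = distance;
            }
        }
        return nearest;
    }

    // Unlike the original mapper, the document id (the input key) is kept
    // and written out together with the cluster it was assigned to.
    static Map<String, Integer> assign(Map<String, double[]> docs, double[][] centers) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : docs.entrySet()) {
            result.put(e.getKey(), nearestCluster(e.getValue(), centers));
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] centers = { {0.0, 0.0}, {1.0, 1.0} };
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("/10000001", new double[] {0.1, 0.2});
        docs.put("/10000002", new double[] {0.9, 0.8});
        System.out.println(assign(docs, centers));
    }
}
```

In the real job the equivalent step would be to pass the mapper's key along to the write call, either as part of the output value or as the output key, which means changing the output types of the mapper as well.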
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-To-get-the-Documents-from-generated-Cluster-tp960031p3245288.html
Sent from the Mahout User List mailing list archive at Nabble.com.