Hi Valerio,
All the Mahout clustering implementations operate on Hadoop sequence
files of the Mahout type VectorWritable. These can represent dense or
sparse numeric data, and each vector may additionally be wrapped in a
NamedVector to carry an identifier through the data set. If you can run
Hadoop jobs or call Java from Weka, then you may be able to use our
code directly; look at the driver class under each algorithm for entry
points. If all else fails, we also have a command-line interface.
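To make the input format concrete, here is a minimal sketch (using Hadoop-era SequenceFile.Writer and Mahout's math classes; the output path and point values are just made up for illustration) of writing a few dense points as NamedVector-wrapped VectorWritables, which is the shape the clustering jobs expect:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectors {
  public static void main(String[] args) throws Exception {
    // Toy 2-D points; in practice these come from your own data set.
    double[][] points = { {1.0, 1.0}, {2.0, 1.0}, {8.0, 8.0}, {9.0, 8.0} };

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("testdata/points/file1"); // hypothetical location

    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
    try {
      for (int i = 0; i < points.length; i++) {
        // NamedVector optionally annotates each vector with a name
        NamedVector vec = new NamedVector(new DenseVector(points[i]), "point-" + i);
        writer.append(new Text(vec.getName()), new VectorWritable(vec));
      }
    } finally {
      writer.close();
    }
  }
}
```

For sparse data you would substitute RandomAccessSparseVector for DenseVector; the wrapping in VectorWritable is the same either way.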
All the clustering jobs accept VectorWritable input files and produce
Hadoop directories (clusters-i) containing the Clusters produced by
iteration i, plus an optional directory (clusteredPoints) containing
sequence files of the clustered points. These are keyed by clusterId,
and each value is a WeightedVectorWritable wrapper around the original
input vector, whose weight encodes the pdf of the cluster assignment.
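As a rough sketch of consuming that output (the part-file path below is hypothetical; this assumes the clusteredPoints layout described above, with IntWritable keys and WeightedVectorWritable values), you can read the assignments back with an ordinary SequenceFile.Reader:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class ReadClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("output/clusteredPoints/part-m-00000"); // hypothetical

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    try {
      while (reader.next(clusterId, point)) {
        // Each record: cluster id, assignment weight (pdf), original vector
        System.out.println("cluster " + clusterId.get()
            + " weight " + point.getWeight()
            + " vector " + point.getVector());
      }
    } finally {
      reader.close();
    }
  }
}
```

If you wrote NamedVectors on the way in, the vector printed here will still carry its name, so you can map assignments back to your original records.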
Hope this helps,
Jeff
On 8/27/10 12:06 PM, Valerio wrote:
Hi all,
I need some guides that explain how to use Mahout with the k-means
algorithm, and first of all, what type of dataset does Mahout use?
I'm doing my thesis and I must run k-means clustering from Weka, but
Weka must call Hadoop in the background to parallelize the job. I
discovered that Mahout runs k-means on Hadoop, so I will call it from
Weka, but I don't understand what type of files Mahout's k-means reads
as input and how it works.
Can someone help me?
Thanks all,
Valerio Ceraudo