2012/3/23 戴睿 <[email protected]>:
> Hello,
> I'm new to Mahout, and I've read Support of HBase
> <http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/ajax/%3CCACpbbiJHP3JVmwU1GURL2nW9obw236Ju33Xt9%2BDtnLtzDyCVQg%40mail.gmail.com%3E>
> before, but I still don't get it.
> Input and output of Mahout are stored in HDFS,
> and I'm wondering: is there any way to cluster input data from
> HBase directly and write the output to an HTable instead of HDFS?
> That might save much time in transformation between Hadoop and Mahout.
>
> Really looking forward to your answer. Thank you!
Hi genius33232,

The best approach is to use a custom vectorization step that transforms your data into Vectors the way Mahout wants them. Take a look at the DictionaryVectorizer source code (https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/DictionaryVectorizer.html) and SparseVectorsFromSequenceFiles (http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java?view=markup).

You can write your own MR job that reads data from HBase, creates vectors, and then saves them to HDFS so they can be provided as input for the clustering job. You will need to provide the dictionary sequence file and the vector sequence files for the next step.

You can import the clustered data back into HBase and delete it afterwards, or hack the clustering job to spit data into HBase instead of HDFS. I suggest you take the first approach until you are getting what you need, and then move to the next step. You can also index the clustered data with Solr, but this depends on your use case and data size.

From my experience with Mahout, it's not very easy to modify those jobs, but the devs know this and it's on the wishlist (making Mahout more like a library).

Happy hacking,
--
Ioan Eugen Stan
http://ieugen.blogspot.com/
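P.S. To make the "read from HBase, create vectors, save to HDFS" step concrete, here is a rough sketch of such a map-only job. It uses an HBase TableMapper to scan rows and emit Mahout VectorWritables into a SequenceFile on HDFS, ready to feed a clustering job. The table name ("mytable"), column family ("features"), and cell encoding (int qualifier = dictionary index, double value = feature weight) are all hypothetical assumptions; adapt them to your own schema and dictionary:

```java
// Sketch only: reads HBase rows and writes Mahout vectors to HDFS.
// Assumes each cell in the "features" family stores a double, keyed by
// an int qualifier that is the dimension index from your dictionary.
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class HBaseToVectors {

  public static class VectorizeMapper
      extends TableMapper<Text, VectorWritable> {

    @Override
    protected void map(ImmutableBytesWritable row, Result result,
                       Context context)
        throws IOException, InterruptedException {
      Vector vector = new RandomAccessSparseVector(Integer.MAX_VALUE);
      NavigableMap<byte[], byte[]> features =
          result.getFamilyMap(Bytes.toBytes("features"));
      for (Map.Entry<byte[], byte[]> cell : features.entrySet()) {
        int index = Bytes.toInt(cell.getKey());        // dictionary index
        vector.set(index, Bytes.toDouble(cell.getValue())); // weight
      }
      // Row key becomes the vector name/key in the SequenceFile.
      context.write(new Text(Bytes.toString(result.getRow())),
                    new VectorWritable(vector));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-to-mahout-vectors");
    job.setJarByClass(HBaseToVectors.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches per RPC
    scan.setCacheBlocks(false);  // don't pollute the block cache in MR scans

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, VectorizeMapper.class,
        Text.class, VectorWritable.class, job);

    job.setNumReduceTasks(0);    // map-only: just emit vectors
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The resulting SequenceFile of Text/VectorWritable pairs can then be handed to the clustering driver, alongside the dictionary file, exactly as SparseVectorsFromSequenceFiles would produce them.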
