2012/3/23 戴睿 <[email protected]>:
> Hello,
> I'm new to Mahout, and I've read Support of HBase
> <http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/ajax/%3CCACpbbiJHP3JVmwU1GURL2nW9obw236Ju33Xt9%2BDtnLtzDyCVQg%40mail.gmail.com%3E>
> before, but I still don't get it.
> Input and output of Mahout are stored in HDFS,
> and I'm wondering: is there any way to cluster input data from
> HBase directly and write the output to an HTable instead of HDFS?
> That might save much time in transformation between Hadoop and Mahout.
>
> Really looking forward to your answer. Thank you!
Hi genius33232,

The best approach is to use a custom vectorization step that transforms your data into Vectors the way Mahout wants them. Take a look at the DictionaryVectorizer source code (https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/DictionaryVectorizer.html) and SparseVectorsFromSequenceFiles (http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java?view=markup).

You can write your own MR job that reads data from HBase, creates vectors, and then saves them to HDFS so they can be provided as input for the clustering job. You will need to provide the dictionary sequence file and the vector sequence files for the next step.

You can import the clustered data back into HBase and delete it afterwards, or hack the clustering job to spit data into HBase instead of HDFS. I suggest you take the first approach until you are getting what you need, and then move to the next step. You can also index the clustered data with Solr, but this depends on your use case and data size.

From my experience with Mahout, it's not very easy to modify those jobs, but the devs know this and it's on the wishlist (making Mahout more like a library).

Happy hacking,
--
Ioan Eugen Stan
http://ieugen.blogspot.com/
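P.S. To make the "read from HBase, create vectors, save to HDFS" step concrete, here is a rough sketch of such a map-only job. It uses an HBase TableMapper to scan rows and emit Mahout VectorWritables into a SequenceFile on HDFS, ready to feed a clustering job. The table name ("mytable"), column family ("features"), and cell encoding (int qualifier = dictionary index, double value = feature weight) are all hypothetical assumptions; adapt them to your own schema and dictionary:

```java
// Sketch only: reads HBase rows and writes Mahout vectors to HDFS.
// Assumes each cell in the "features" family stores a double, keyed by
// an int qualifier that is the dimension index from your dictionary.
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class HBaseToVectors {

  public static class VectorizeMapper
      extends TableMapper<Text, VectorWritable> {

    @Override
    protected void map(ImmutableBytesWritable row, Result result,
                       Context context)
        throws IOException, InterruptedException {
      Vector vector = new RandomAccessSparseVector(Integer.MAX_VALUE);
      NavigableMap<byte[], byte[]> features =
          result.getFamilyMap(Bytes.toBytes("features"));
      for (Map.Entry<byte[], byte[]> cell : features.entrySet()) {
        int index = Bytes.toInt(cell.getKey());        // dictionary index
        vector.set(index, Bytes.toDouble(cell.getValue())); // weight
      }
      // Row key becomes the vector name/key in the SequenceFile.
      context.write(new Text(Bytes.toString(result.getRow())),
                    new VectorWritable(vector));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-to-mahout-vectors");
    job.setJarByClass(HBaseToVectors.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches per RPC
    scan.setCacheBlocks(false);  // don't pollute the block cache in MR scans

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, VectorizeMapper.class,
        Text.class, VectorWritable.class, job);

    job.setNumReduceTasks(0);    // map-only: just emit vectors
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The resulting SequenceFile of Text/VectorWritable pairs can then be handed to the clustering driver, alongside the dictionary file, exactly as SparseVectorsFromSequenceFiles would produce them.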
