Hello,
I'd like to ask about the rationale behind the design of the Mahout Bayes
algorithm. I found that the Mahout Bayes implementation is tightly coupled
to Hadoop MapReduce classes, for instance:
------------------------------
/**
 * Reads the input train set (preprocessed using the
 * {@link org.apache.mahout.classifier.BayesFileFormatter}).
 */
public class BayesFeatureMapper extends MapReduceBase
    implements Mapper<Text, Text, StringTuple, DoubleWritable>
------------------------------
I would have expected the Bayes algorithm to be implemented as plain POJO
classes and tested with simple unit tests that know nothing about Hadoop,
datastores, filesystems, etc. A Hadoop wrapper around the POJO Bayes
implementation could then be provided on top.
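To make the suggestion concrete, here is a minimal sketch of what such a Hadoop-free core could look like. This is my own illustration, not Mahout's actual API: the class name PlainNaiveBayes and its methods are hypothetical, it uses a uniform class prior, and it does plain multinomial naive Bayes with Laplace smoothing.

```java
import java.util.*;

// Hypothetical sketch of a Hadoop-free Bayes core; names are my own,
// not Mahout's. Multinomial naive Bayes with Laplace smoothing and a
// uniform class prior, trainable and testable in a plain unit test.
public class PlainNaiveBayes {
    // per-label token counts
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // total token count per label
    private final Map<String, Integer> labelTotals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    public void train(String label, String[] tokens) {
        Map<String, Integer> c = counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            c.merge(t, 1, Integer::sum);
            vocabulary.add(t);
        }
        labelTotals.merge(label, tokens.length, Integer::sum);
    }

    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> c = e.getValue();
            int total = labelTotals.get(e.getKey());
            double score = 0.0; // log-likelihood, uniform prior assumed
            for (String t : tokens) {
                int n = c.getOrDefault(t, 0);
                // Laplace-smoothed P(token | label)
                score += Math.log((n + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A class like this could be exercised directly from JUnit with in-memory strings, and a thin Mapper/Reducer wrapper could delegate to it for the distributed case.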
With the current design, it is, first, quite difficult to understand how the
algorithm works, as you need to go through several Hadoop-related classes.
Second, you can't simply run the BayesClassifierSelfTest test, which takes
input text in this format:
----------------------------
public static final String[][] DATA = {
    {"mahout",
     "Mahout's goal is to build scalable machine learning libraries. With scalable we mean: "
         + "Scalable to reasonably large data sets. Our core algorithms for clustering,"
         + " classfication and batch based collaborative filtering are implemented on top "
         + "of Apache Hadoop using the map/reduce paradigm. However we do not restrict "
         + "contributions to Hadoop based implementations: Contributions that run on"},
    {"mahout",
     " a single node or on a non-Hadoop cluster are welcome as well. The core"
         + " libraries are highly optimized to allow for good performance also for"
         + " non-distribu
-------------------------------
and classifies it, because this test depends on Hadoop and fails on Windows
when run from Eclipse. Third, I don't always want to run this algorithm in a
Hadoop world; I may want to use some other MapReduce provider. Also, when I
run the Bayes classifier with:

./mahout trainclassifier -i /mnt/hgfs/C/daniel/my_fav_data/test -o model -type bayes -ng 1 -source hdfs

it takes 40 seconds to train a model on a file with 6 lines, even though
Hadoop is not really used. Does it take that long because of all those
Hadoop-related abstractions?
Regards.
Daniel
--
Daniel Korzekwa
Software Engineer
priv: http://danmachine.com
blog: http://blog.danmachine.com