Thanks, Sean, for your response. I fully understand the rationale behind Mahout, and yes, I really like its scaling approach. I asked this question because I was having some problems understanding how the Bayes classifier works in Mahout (my question is here: http://www.manning-sandbox.com/thread.jspa?threadID=48160&tstart=0). Eventually I found in the Mahout code that priors are not used and that this Bayes implementation is not the same simple approach I described in my question, but I spent quite a lot of time going through all the Mahout Bayes classes.
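To make the point concrete, here is roughly the kind of POJO implementation I had in mind - a hypothetical sketch of my own, not Mahout's actual API: a plain multinomial naive Bayes with Laplace smoothing, with a `usePrior` flag to mark the log-prior term that, as far as I can tell, Mahout's implementation leaves out. Something like this is trivially unit-testable with no Hadoop on the classpath:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch (not Mahout's real API): a plain-Java multinomial
 * naive Bayes, kept free of Hadoop so it can be unit-tested directly.
 */
public class PojoNaiveBayes {

    private final Map<String, Integer> docCount = new HashMap<>();   // docs per label
    private final Map<String, Map<String, Integer>> wordCount = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>(); // word occurrences per label
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    /** Accumulate counts for one training document. */
    public void train(String label, String[] words) {
        totalDocs++;
        docCount.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCount.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocab.add(w);
        }
    }

    /**
     * Log-score of a document under one label, with Laplace smoothing.
     * The log-prior term is the one the thread notes Mahout omits.
     */
    public double score(String label, String[] words, boolean usePrior) {
        double s = usePrior
                ? Math.log((double) docCount.getOrDefault(label, 0) / totalDocs)
                : 0.0;
        Map<String, Integer> counts = wordCount.getOrDefault(label, Collections.emptyMap());
        int total = totalWords.getOrDefault(label, 0);
        for (String w : words) {
            s += Math.log((counts.getOrDefault(w, 0) + 1.0) / (total + vocab.size()));
        }
        return s;
    }

    /** Pick the label with the highest log-score. */
    public String classify(String[] words, boolean usePrior) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCount.keySet()) {
            double s = score(label, words, usePrior);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PojoNaiveBayes nb = new PojoNaiveBayes();
        nb.train("mahout", new String[] {"scalable", "machine", "learning", "hadoop"});
        nb.train("cooking", new String[] {"recipe", "oven", "flour"});
        System.out.println(nb.classify(new String[] {"scalable", "hadoop"}, true)); // prints "mahout"
    }
}
```

A Hadoop Mapper could then be a thin wrapper that feeds tokenized `Text` values into `train`, and the algorithm itself would stay readable and testable in isolation.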
When I first jumped into the Mahout code base, I was expecting to see mahout-algorithms (implemented as pure functions) and then mahout-hadoop, which takes those pure functions and reuses them in the context of Hadoop. Or something like:
- Bayes functions.
- Bayes reference impl - the simplest one, depends on the Bayes functions.
- Bayes Hadoop impl - depends on the Bayes functions and Hadoop.

Regarding the long running time of Bayes against a small training data set: I know that this problem disappears when you process a bigger data file, e.g. more than 100K text records. But when you want to play with Mahout, it's nice to start with some small files, and then waiting 40 seconds for every run is a waste of time.

PS. I'm in the middle of the Mahout in Action book. Good stuff from you, Sean, and the other authors - it's a really enjoyable read.

Regards,
Daniel Korzekwa

2012/1/19 Sean Owen <[email protected]>

> The test is just a tiny test. What you are observing is 99.9% Hadoop
> overhead. It's not as if a million times more text takes 40 million
> seconds.
>
> If you don't have a very large pile of data, there's no need to use
> Hadoop and indeed it is just complexity and overhead for no gain. A
> simple Java implementation would be better.
>
> Turning it around, if you can't fit your information onto one machine,
> a POJO won't work.
>
> The whole raison d'ĂȘtre of the project is "scale", so it's not
> surprising that you'll find Hadoop implementations first before
> non-Hadoop here.
>
> If you browse the recommender bits, in contrast, you'll see both
> distributed and non-distributed versions indeed, and both have their
> place.
>
> On Thu, Jan 19, 2012 at 8:07 AM, Daniel Korzekwa
> <[email protected]> wrote:
> > Hello,
> >
> > I'd like to ask about the rationale behind the design of the Mahout Bayes
> > algorithm.
> > I found that the Mahout Bayes implementation is tightly coupled with
> > Hadoop MapReduce classes, for instance:
> >
> > ------------------------------
> > /**
> >  * Reads the input train set (preprocessed using the {@link
> >  * org.apache.mahout.classifier.BayesFileFormatter}).
> >  */
> > public class BayesFeatureMapper extends MapReduceBase implements
> >     Mapper<Text,Text,StringTuple,DoubleWritable>
> > ------------------------------
> >
> > I would have expected the Bayes algorithm to be implemented as POJO class(es)
> > and tested with a simple unit test, which knows nothing about Hadoop,
> > datastores, filesystems, etc. And then a Hadoop wrapper around the POJO Bayes
> > implementation should be provided.
> >
> > With the current design, first, it's quite difficult to understand how the
> > algorithm works, as you need to go through several Hadoop-related classes.
> > Second, you can't simply run the BayesClassifierSelfTest test, which takes
> > input text in this format:
> >
> > ----------------------------
> > public static final String[][] DATA = {
> >   {
> >     "mahout",
> >     "Mahout's goal is to build scalable machine learning libraries. With scalable we mean: "
> >       + "Scalable to reasonably large data sets. Our core algorithms for clustering,"
> >       + " classfication and batch based collaborative filtering are implemented on top "
> >       + "of Apache Hadoop using the map/reduce paradigm. However we do not restrict "
> >       + "contributions to Hadoop based implementations: Contributions that run on"},
> >   {
> >     "mahout",
> >     " a single node or on a non-Hadoop cluster are welcome as well. The core"
> >       + " libraries are highly optimized to allow for good performance also for"
> >       + " non-distribu
> > -------------------------------
> >
> > and classifies it, as this test depends on Hadoop and fails on Windows when
> > running from Eclipse. Third, I don't always want to run this algorithm in a
> > Hadoop world.
> > I may want to use some other map reduce provider. Also, when I
> > run the Bayes classifier with:
> >
> > ./mahout trainclassifier -i /mnt/hgfs/C/daniel/my_fav_data/test -o model
> > -type bayes -ng 1 -source hdfs
> >
> > it takes 40 seconds to train a model for a file with 6 lines, even though
> > Hadoop is not really used. Is it so long because of all those
> > Hadoop-related abstractions?
> >
> > Regards,
> > Daniel
> >
> > --
> > Daniel Korzekwa
> > Software Engineer
> > priv: http://danmachine.com
> > blog: http://blog.danmachine.com

--
Daniel Korzekwa
Software Engineer
priv: http://danmachine.com
blog: http://blog.danmachine.com
