On the other hand, if Hadoop (or something like it) were based on a 
storage abstraction that had multiple implementations, say one in the 
client's memory and one in a cluster's disks (and maybe also others at 
other interesting points in between), and placement of computation were 
deferred to that store, then we could make Daniel happy both when 
developing and when doing real work on really large datasets.
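
A minimal sketch of that idea, not anything that exists in Hadoop or Mahout: the names `RecordStore`, `InMemoryStore`, and `mapReduce` are hypothetical, chosen only for illustration. The caller hands the computation to the store, and the store decides where it runs. Only the in-memory variant is shown; a cluster-backed variant would implement the same interface and ship the functions to the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical abstraction: the store, not the caller, decides where work runs.
interface RecordStore<T> {
    <R> R mapReduce(Function<T, R> map, R identity, BinaryOperator<R> reduce);
}

// Development-time implementation: all records live in the client's memory
// and the computation runs in the calling thread, with no job startup cost.
final class InMemoryStore<T> implements RecordStore<T> {
    private final List<T> records;

    InMemoryStore(List<T> records) {
        this.records = new ArrayList<>(records);
    }

    @Override
    public <R> R mapReduce(Function<T, R> map, R identity, BinaryOperator<R> reduce) {
        R acc = identity;
        for (T record : records) {
            acc = reduce.apply(acc, map.apply(record));
        }
        return acc;
    }
}
```

Code written against `RecordStore` would not change when a small in-memory store is swapped for one backed by a cluster's disks.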

Regards,
Mike



From:   Sean Owen <[email protected]>
To:     [email protected]
Date:   01/19/2012 04:44 AM
Subject:        Re: Why Mahout bayes implementation is tightly coupled with Hadoop?



It's not possible to write a pure Java algorithm, and then wrap or
season it with Hadoop to parallelize. It's just totally different when
parallelized, and more so when ported to Hadoop.

That's not to say some small bits like key formulas can't be factored
out, or that there could not be a separate non-Hadoop implementation.
And of everything in the project, I am not sure Bayes is a great
example... I do not know how much it's been loved in the last year or
more. (Is anyone actually looking at it anymore?)

There's an open invite to improve things.

I do agree that writing and reading Hadoop code is very hard.

On Thu, Jan 19, 2012 at 9:37 AM, Daniel Korzekwa
<[email protected]> wrote:
> Thanks Sean for your response.
>
> I fully understand the rationale behind Mahout, and yes, I really like its
> approach to scale. I asked this question because I was having trouble
> understanding how Bayes works in Mahout (my question is here:
> http://www.manning-sandbox.com/thread.jspa?threadID=48160&tstart=0).
> Eventually I found in the Mahout code that priors are not used and that this
> Bayes is not the same simple approach I described in my question, but I
> spent quite a lot of time going through all the Mahout Bayes classes.
>
> When I first jumped into the Mahout code base I was expecting to see
> mahout-algorithms (implemented as pure functions) and then mahout-hadoop,
> which takes those pure functions and reuses them in the context of Hadoop.
>
> or something like:
> - Bayes functions.
> - Bayes ref impl - the simplest one, depends on Bayes functions.
> - Bayes hadoop impl - depends on Bayes functions and hadoop.
>
> Regarding the long running time of Bayes on a small training data set, I
> know that this problem disappears when you process a bigger data file, e.g.
> more than 100K text records. But when you want to play with Mahout, it's
> nice to start with some small files, and then waiting 40 sec for every run
> is a waste of time.
>
> PS. I'm in the middle of the Mahout in Action book. Good stuff to you, Sean,
> and the other authors, it's a really enjoyable read.
>
> Regards.
> Daniel Korzekwa
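
The first two layers of the split Daniel describes could look roughly like the sketch below. It is a textbook multinomial naive Bayes with add-one smoothing and explicit priors, not the algorithm Mahout implements (which, as noted above, does not use priors). The class and method names (`BayesFunctions`, `InMemoryBayes`, `logLikelihood`, `logPrior`) are hypothetical. Whether a Hadoop implementation could reuse more than such small formulas is exactly the point in dispute in this thread, so no Hadoop layer is shown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Layer 1 (hypothetical): pure functions with no I/O and no Hadoop types.
final class BayesFunctions {
    private BayesFunctions() {}

    // log P(term | label) with add-one (Laplace) smoothing.
    static double logLikelihood(long termCount, long totalTermsInLabel, long vocabularySize) {
        return Math.log((termCount + 1.0) / (totalTermsInLabel + vocabularySize));
    }

    // log P(label), estimated from document counts.
    static double logPrior(long docsInLabel, long totalDocs) {
        return Math.log((double) docsInLabel / totalDocs);
    }
}

// Layer 2 (hypothetical): the simplest in-memory reference implementation.
final class InMemoryBayes {
    private final Map<String, Map<String, Long>> termCounts = new HashMap<>();
    private final Map<String, Long> docCounts = new HashMap<>();
    private final Map<String, Long> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private long totalDocs;

    void train(String label, String[] terms) {
        totalDocs++;
        docCounts.merge(label, 1L, Long::sum);
        Map<String, Long> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String term : terms) {
            counts.merge(term, 1L, Long::sum);
            totalTerms.merge(label, 1L, Long::sum);
            vocabulary.add(term);
        }
    }

    // Log-score of a label that has been seen during training.
    double score(String label, String[] terms) {
        Map<String, Long> counts = termCounts.get(label);
        long total = totalTerms.getOrDefault(label, 0L);
        double score = BayesFunctions.logPrior(docCounts.get(label), totalDocs);
        for (String term : terms) {
            score += BayesFunctions.logLikelihood(
                    counts.getOrDefault(term, 0L), total, vocabulary.size());
        }
        return score;
    }

    // Returns the highest-scoring trained label, or null if nothing was trained.
    String classify(String[] terms) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double s = score(label, terms);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }
}
```

A reference implementation like this runs on a handful of records with no job startup delay, which is the experimentation case Daniel raises.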

