So here is a better description of the decision forest classification
implementation I'm working on. This is for large scale classification
after training.
We have many attributes being classified, and each attribute has its
own forest. The forests are big enough when loaded into RAM that you
get only one JVM per host. But you really want one thread per processor
on the host, so we ended up threading the mappers. We have a lot of
feature vectors, so we send the features to the mappers.
This seems a bit awkward. I've been thinking about spreading the trees
out across mappers to reduce the RAM per JVM, with the goal of getting
closer to one JVM per core. But then we'd need a more complex join
between forests and feature vectors. Right now we are essentially
doing a replicated join, with the forest being the replicated set.
Has anyone tried this? Is there support for it in Mahout?
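For context, the current setup (one forest shared in RAM, one worker thread per core, feature vectors streamed through the pool) might look roughly like the sketch below. The Tree and Forest classes here are simplified stand-ins, not the actual Mahout model classes, and the majority-vote logic and pool sizing are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the threaded-mapper pattern: a single forest is
// loaded once per JVM and shared read-only across a thread pool sized to
// the number of cores. Not the real Mahout API.
public class ThreadedForestClassifier {

    // Stand-in for a decision tree: returns a class vote for a feature vector.
    interface Tree {
        int classify(double[] features);
    }

    // Stand-in for a forest: majority vote over its trees.
    static class Forest {
        private final List<Tree> trees;

        Forest(List<Tree> trees) {
            this.trees = trees;
        }

        int classify(double[] features) {
            int votes = 0;
            for (Tree t : trees) {
                votes += t.classify(features);
            }
            // Majority vote between classes 0 and 1.
            return votes * 2 > trees.size() ? 1 : 0;
        }
    }

    // Classify all vectors with one task per vector on a fixed-size pool,
    // mirroring "one thread per processor" inside a single JVM.
    static int[] classifyAll(Forest forest, List<double[]> vectors, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (double[] v : vectors) {
                futures.add(pool.submit(() -> forest.classify(v)));
            }
            int[] labels = new int[vectors.size()];
            for (int i = 0; i < labels.length; i++) {
                labels[i] = futures.get(i).get();
            }
            return labels;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this pattern the forest is the replicated side of the join: every worker sees the whole model, and only the feature vectors are partitioned.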
On 12/06/2012 09:32 PM, Marty Kube wrote:
Yes, I'm on a project in which we classify a large data set. We do use
MapReduce to do the classification, as the data set is much larger than
the working memory. We have a non-Mahout implementation...
So we put the decision forest in memory via the distributed cache,
partition the data set, and run it past the models. The models are
getting pretty big, and keeping them in memory is a challenge. I guess
I was looking for an implementation that doesn't require keeping the
decision forest in memory. I'll have a look at the TestForest
implementation.
On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
You mean you want to classify a large dataset?
The partial implementation is useful when the training dataset is too
large to fit in memory. If it does fit, then you're better off training
the forest using the in-memory implementation.
If you want to classify a large number of rows, you can add the
parameter -mr to TestForest to classify the data using MapReduce. An
example of this can be found in the wiki:
https://cwiki.apache.org/MAHOUT/partial-implementation.html
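A rough example invocation might look like the following. All paths are placeholders, and the driver class name and exact options should be checked against your Mahout version's TestForest usage output:

```shell
# Classify a large dataset with MapReduce (-mr); all paths are placeholders.
mahout org.apache.mahout.classifier.df.mapreduce.TestForest \
  -i /path/to/input/data \
  -ds /path/to/dataset.info \
  -m /path/to/forest.seq \
  -mr \
  -o /path/to/predictions
```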
On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
[email protected]> wrote:
Hi,
I'm working on improving classification throughput for a decision
forest. I was wondering about the use case for the Partial
Implementation.
The quick start guide suggests that the Partial Implementation is
designed for building forests on large datasets.
My problem is classification after training. Is Partial Implementation
helpful for this use case?