So here is a better description of the decision forest classification
implementation I'm working on. This is for large scale classification
after training.
We have many attributes being classified, and each attribute has its
own forest. The forests are big enough when loaded into RAM that you
get only one JVM per host. But you really want one thread per processor
on the host, so we ended up threading the mappers. We have a lot of
feature vectors, so we send the features to the mappers.
This seems a bit awkward. I've been thinking about spreading the trees
out across mappers to reduce the RAM per JVM, with the goal of getting
closer to one JVM per core. But then we'd need a more complex join
between forests and feature vectors. Right now we are essentially
doing a replicated join, with the forest being the replicated set.
Has anyone tried this? Is there support for it in Mahout?
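For context, the current setup (one forest shared in RAM, one worker thread per core, feature vectors streamed through the pool) might look roughly like the sketch below. The Tree and Forest classes here are simplified stand-ins, not the actual Mahout model classes, and the majority-vote logic and pool sizing are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the threaded-mapper pattern: a single forest is
// loaded once per JVM and shared read-only across a thread pool sized to
// the number of cores. Not the real Mahout API.
public class ThreadedForestClassifier {

    // Stand-in for a decision tree: returns a class vote for a feature vector.
    interface Tree {
        int classify(double[] features);
    }

    // Stand-in for a forest: majority vote over its trees.
    static class Forest {
        private final List<Tree> trees;

        Forest(List<Tree> trees) {
            this.trees = trees;
        }

        int classify(double[] features) {
            int votes = 0;
            for (Tree t : trees) {
                votes += t.classify(features);
            }
            // Majority vote between classes 0 and 1.
            return votes * 2 > trees.size() ? 1 : 0;
        }
    }

    // Classify all vectors with one task per vector on a fixed-size pool,
    // mirroring "one thread per processor" inside a single JVM.
    static int[] classifyAll(Forest forest, List<double[]> vectors, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (double[] v : vectors) {
                futures.add(pool.submit(() -> forest.classify(v)));
            }
            int[] labels = new int[vectors.size()];
            for (int i = 0; i < labels.length; i++) {
                labels[i] = futures.get(i).get();
            }
            return labels;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this pattern the forest is the replicated side of the join: every worker sees the whole model, and only the feature vectors are partitioned.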
On 12/06/2012 09:32 PM, Marty Kube wrote:
Yes, I'm on a project in which we classify a large data set. We do use
MapReduce to do the classification, as the data set is much larger than
the working memory. We have a non-Mahout implementation...
So we put the decision forest in memory via the distributed cache,
partition the data set, and run it past the models. The models are
getting pretty big, and keeping them in memory is a challenge. I guess
I was looking for an implementation that doesn't require keeping the
decision forest in memory. I'll have a look at the TestForest
implementation.
On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
You mean you want to classify a large dataset?
The partial implementation is useful when the training dataset is too
large to fit in memory. If it does fit, then you're better off training
the forest using the in-memory implementation.
If you want to classify a large number of rows, you can add the
parameter -mr to TestForest to classify the data using MapReduce. An
example of this can be found in the wiki:
https://cwiki.apache.org/MAHOUT/partial-implementation.html
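A rough example invocation might look like the following. All paths are placeholders, and the driver class name and exact options should be checked against your Mahout version's TestForest usage output:

```shell
# Classify a large dataset with MapReduce (-mr); all paths are placeholders.
mahout org.apache.mahout.classifier.df.mapreduce.TestForest \
  -i /path/to/input/data \
  -ds /path/to/dataset.info \
  -m /path/to/forest.seq \
  -mr \
  -o /path/to/predictions
```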
On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
[email protected]> wrote:
Hi,
I'm working on improving classification throughput for a decision
forest. I was wondering about the use case for the Partial
Implementation.
The quick start guide suggests that the Partial Implementation is
designed for building forests on large datasets.
My problem is classification after training. Is Partial Implementation
helpful for this use case?