Yes, I'm on a project in which we classify a large data set. We use
MapReduce to do the classification, as the data set is much larger than
the working memory. We have a non-Mahout implementation...
So we put the decision forest in memory via the distributed cache,
partition the data set, and run it past the models. The models are
getting pretty big, and keeping them in memory is a challenge. I guess I
was looking for an implementation that doesn't require keeping the
decision forest in memory. I'll have a look at the TestForest
implementation.
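(As an aside, the pattern described above can be sketched in a few lines. This is a hypothetical, language-agnostic illustration of map-side classification, not Mahout code: `load_model`, `classify`, and `map_partition` are stand-ins, and the "model" here is a trivial threshold rather than a decision forest.)

```python
# Hypothetical sketch of the map-side pattern described above: each mapper
# loads the model once from the distributed cache, then streams its input
# split through it, so only the model -- never the full data set -- must
# fit in memory.

def load_model():
    # Stand-in for deserializing the forest from the distributed cache.
    # Here the "model" is just a threshold classifier.
    return {"threshold": 5.0}

def classify(model, record):
    # Stand-in for running one record through the forest: one label each.
    return 1 if record > model["threshold"] else 0

def map_partition(records):
    """Classify one input split, as a single mapper would."""
    model = load_model()  # loaded once per mapper, held in memory
    return [classify(model, r) for r in records]

# Partitions are independent, so they can run in parallel across mappers
# while every mapper shares the same cached model.
partitions = [[1.0, 7.5], [9.9, 2.2, 6.1]]
results = [map_partition(p) for p in partitions]
print(results)  # [[0, 1], [1, 0, 1]]
```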
On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
You mean you want to classify a large dataset?
The partial implementation is useful when the training dataset is too large
to fit in memory. If it does fit, then you'd be better off training the
forest using the in-memory implementation.
If you want to classify a large number of rows, then you can add the
parameter -mr to TestForest to classify the data using MapReduce. An
example of this can be found in the wiki:
https://cwiki.apache.org/MAHOUT/partial-implementation.html
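For reference, an invocation would look roughly like this (a sketch only: the jar version, input paths, and model directory are placeholders you'd replace with your own):

```shell
# Classify a dataset with a previously trained forest, using MapReduce (-mr).
# -i  : input data to classify
# -ds : dataset descriptor generated during training
# -m  : path to the stored decision forest
# -a  : analyze the results (confusion matrix)
# -o  : output directory for the predictions
hadoop jar $MAHOUT_HOME/mahout-core-<version>-job.jar \
  org.apache.mahout.classifier.df.mapreduce.TestForest \
  -i testdata/mydata.arff -ds traindata/mydata.info \
  -m myforest -a -mr -o predictions
```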
On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
[email protected]> wrote:
Hi,
I'm working on improving classification throughput for a decision forest. I
was wondering about the use case for the Partial Implementation.
The quick start guide suggests that the Partial Implementation is designed
for building a forest on large datasets.
My problem is classification after training. Is the Partial Implementation
helpful for this use case?