Hi Ted,
I've been looking at the mmap suggestion some. When you said:
1) use shared memory via mmap to store the forest. This allows multiple
mapper threads to access the same forest. The current Mahout in-memory
structure for this is not suitable for shared memory, however.
Can you be a little more specific about why the current in-memory
structure is not suitable for shared memory?
I'm finding that Java does not support shared memory so one would need
to run the forest cache through JNI in order to use mmap and shared memory.
The other track I came up with is to use a distributed cache like
memcached or Hazelcast. To me those solutions seem targeted at cross-host
caches, so I worry about performance. What I really want is a within-host
shared cache across JVMs.
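One data point on the Java side, offered as a sketch rather than a tested design: java.nio can map a file with FileChannel.map, and read-only mappings of the same file from several JVMs on one host are served from the same OS page cache, so the read path may not need JNI. The flat node layout below (16-byte records with a feature index, a threshold or leaf label, and two child indices) and the class name MappedTree are hypothetical. This is not Mahout's forest format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedTree {
    // Hypothetical record layout, 16 bytes per node:
    // int feature (-1 marks a leaf), float value (threshold, or label for a leaf),
    // int left child index, int right child index.
    static final int NODE_BYTES = 16;

    /** Writes one node record at the given node index using absolute puts. */
    public static void putNode(ByteBuffer buf, int node, int feature, float value,
                               int left, int right) {
        int base = node * NODE_BYTES;
        buf.putInt(base, feature);
        buf.putFloat(base + 4, value);
        buf.putInt(base + 8, left);
        buf.putInt(base + 12, right);
    }

    /** Maps the file read-only; the mapping stays valid after the channel is closed. */
    public static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Walks the tree from node 0 with absolute reads, which do not move the buffer position. */
    public static float classify(ByteBuffer tree, float[] x) {
        int node = 0;
        while (true) {
            int base = node * NODE_BYTES;
            int feature = tree.getInt(base);
            float value = tree.getFloat(base + 4);
            if (feature < 0) {
                return value; // leaf: value holds the label
            }
            node = x[feature] < value ? tree.getInt(base + 8) : tree.getInt(base + 12);
        }
    }

    /** Builds a tiny three-node tree file for demonstration. */
    public static Path writeDemoTree() throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(3 * NODE_BYTES);
        putNode(buf, 0, 0, 0.5f, 1, 2);   // root: split on x[0] < 0.5
        putNode(buf, 1, -1, 0.0f, 0, 0);  // leaf with label 0
        putNode(buf, 2, -1, 1.0f, 0, 0);  // leaf with label 1
        Path file = Files.createTempFile("tree", ".bin");
        Files.write(file, buf.array());
        return file;
    }

    public static void main(String[] args) throws IOException {
        MappedByteBuffer tree = map(writeDemoTree());
        System.out.println(classify(tree, new float[] {0.2f})); // prints 0.0
        System.out.println(classify(tree, new float[] {0.9f})); // prints 1.0
    }
}
```

The catch is that the forest has to be laid out as offsets in a flat buffer rather than as a graph of Java objects, which would mean a different representation from an ordinary in-heap tree.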
On 12/08/2012 03:43 AM, Ted Dunning wrote:
There are several approaches that might help:
1) use shared memory via mmap to store the forest. This allows multiple
mapper threads to access the same forest. The current Mahout in-memory
structure for this is not suitable for shared memory, however.
2) split the forests across many mappers (as you suggest). You would have
to tag your outputs cleverly so that they wind up at the right reducer.
Tags would include input data segment and forest segment. Mahout doesn't
support this, but it should be easily doable.
3) thin the forests. There isn't a lot of literature on this, but I am
pretty sure that I have seen some articles where less informative trees in
the random forest were removed. Another option with a similar effect is to
use the random forest as an oracle so that you can generate a huge amount
of training data for some other technique that may be prone to
over-fitting. This alternative model can be trained to fit the output of
the random forest very precisely. Over-fitting isn't an issue because you
can generate as much training data as you like. This isn't supported in
Mahout.
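To make the oracle idea in 3) concrete, here is a minimal sketch: draw synthetic feature vectors, label each one with the existing forest, and hand the labeled rows to whatever smaller model is being trained. The names OracleSampler and Row are hypothetical, the forest is represented only as a function from features to a class index, and uniform sampling inside per-feature bounds is an assumption made for brevity. Sampling close to the real data distribution would probably work better.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToIntFunction;

public class OracleSampler {
    /** One synthetic training row: features plus the label the oracle assigned. */
    public static final class Row {
        public final double[] features;
        public final int label;

        Row(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    /**
     * Draws n points uniformly in [lo[i], hi[i]) per feature and labels each
     * one by asking the oracle (the trained forest) for its prediction.
     */
    public static List<Row> sample(ToIntFunction<double[]> oracle,
                                   double[] lo, double[] hi, int n, long seed) {
        Random rnd = new Random(seed);
        List<Row> rows = new ArrayList<>(n);
        for (int r = 0; r < n; r++) {
            double[] x = new double[lo.length];
            for (int i = 0; i < x.length; i++) {
                x[i] = lo[i] + rnd.nextDouble() * (hi[i] - lo[i]);
            }
            rows.add(new Row(x, oracle.applyAsInt(x)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for a trained forest; a real one would be loaded from disk.
        ToIntFunction<double[]> forest = x -> x[0] + x[1] > 1.0 ? 1 : 0;
        List<Row> rows = sample(forest, new double[] {0, 0}, new double[] {1, 1}, 1000, 42L);
        System.out.println(rows.size() + " labeled rows generated");
    }
}
```

Because n can be as large as you like, the replacement model can be fit tightly to the forest's output and then deployed in place of the full forest.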
On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube wrote:
So here is a better description of the decision forest classification
implementation I'm working on. This is for large scale classification
after training.
We have many attributes being classified, and each attribute has its own
forest. The forests are big enough when loaded into RAM that you get only
one JVM per host. But you really want one thread per processor on the
host, so we ended up threading the mappers. We have a lot of feature
vectors so we send the features to the mappers.
This seems a bit awkward. I've been thinking about spreading the trees
out across mappers to reduce the RAM per JVM with the goal of getting
closer to one JVM per core. But then we'll need to do a more complex join
between forests and feature vectors. Right now we are essentially doing a
replicated join with the forest being the replicated set.
Has anyone tried this? Is there support for this in Mahout?
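To illustrate the join I have in mind, here is a rough sketch in plain Java with no Hadoop types. Each "mapper" holds one segment of the forest and emits per-class vote counts keyed by row id, and the "reducer" sums the partial votes for a row and takes the majority. The class name SplitForestVote, the representation of a tree as a function, and the use of a row id as the join key are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

public class SplitForestVote {
    /** Map side: one forest segment votes on one row and returns per-class counts. */
    public static int[] partialVotes(List<ToIntFunction<double[]>> forestSegment,
                                     double[] row, int numClasses) {
        int[] votes = new int[numClasses];
        for (ToIntFunction<double[]> tree : forestSegment) {
            votes[tree.applyAsInt(row)]++;
        }
        return votes;
    }

    /** Reduce side: sums the partial votes for one row id and returns the majority class. */
    public static int combine(List<int[]> partials) {
        int[] total = new int[partials.get(0).length];
        for (int[] p : partials) {
            for (int c = 0; c < total.length; c++) {
                total[c] += p[c];
            }
        }
        int best = 0;
        for (int c = 1; c < total.length; c++) {
            if (total[c] > total[best]) {
                best = c; // ties keep the lower class index
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three stand-in trees, split into two forest segments.
        ToIntFunction<double[]> t1 = x -> x[0] > 0.5 ? 1 : 0;
        ToIntFunction<double[]> t2 = x -> x[0] > 0.3 ? 1 : 0;
        ToIntFunction<double[]> t3 = x -> x[0] > 0.7 ? 1 : 0;
        List<ToIntFunction<double[]>> segA = Arrays.asList(t1, t2);
        List<ToIntFunction<double[]>> segB = Arrays.asList(t3);

        Map<Long, double[]> rows = new HashMap<>();
        rows.put(1L, new double[] {0.6});
        rows.put(2L, new double[] {0.1});

        // Shuffle stand-in: group partial votes by row id.
        Map<Long, List<int[]>> grouped = new HashMap<>();
        for (List<ToIntFunction<double[]>> seg : Arrays.asList(segA, segB)) {
            for (Map.Entry<Long, double[]> e : rows.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(partialVotes(seg, e.getValue(), 2));
            }
        }
        for (Map.Entry<Long, List<int[]>> e : grouped.entrySet()) {
            System.out.println("row " + e.getKey() + " -> class " + combine(e.getValue()));
        }
    }
}
```

The cost relative to the replicated join is that every feature vector is sent to every forest segment, so the shuffle grows with the number of segments.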
On 12/06/2012 09:32 PM, Marty Kube wrote:
Yes, I'm on a project in which we classify a large data set. We do use
MapReduce to do the classification as the data set is much larger than the
working memory. We have a non-Mahout implementation...
So we put the decision forest in memory via a distributed cache and
partition the data set and run it past the models. The models are getting
pretty big and keeping them in memory is a challenge. I guess I was looking
for an implementation that doesn't require keeping the decision forest in
memory. I'll have a look at the TestForest implementation.
On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
You mean you want to classify a large dataset?
The partial implementation is useful when the training dataset is too large
to fit in memory. If it does fit, then you had better train the forest using
the in-memory implementation.
If you want to classify a large amount of rows then you can add the
parameter -mr to TestForest to classify the data using mapreduce. An
example of this can be found in the wiki:
https://cwiki.apache.org/MAHOUT/partial-implementation.html
On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube wrote:
Hi,
I'm working on improving classification throughput for a decision forest. I
was wondering about the use case for Partial Implementation.
The quick start guide suggests that Partial Implementation is designed for
building forests on large datasets.
My problem is classification after training. Is Partial Implementation
helpful for this use case?