On Sun, Dec 25, 2011 at 4:08 PM, Lingxiang Cheng <[email protected]> wrote:
> Thanks for the answer. I am having some difficulty understanding why
> running random forest on top of Hadoop "does not produce arbitrary
> scalability". Could you elaborate?

The problem is that random forest learning is hard to decompose in a way
that gives linear scaling. For instance, if you shard the data by
features, you want overlap between the features in different shards, so
the total amount of data processed during learning grows super-linearly
with the number of shards. Sharding by training records, on the other
hand, leaves you with the problem of how to combine the per-shard models
and whether you actually get the improved training that you want. Just
taking the union of the trees in each shard's ensemble probably isn't
that effective (by analogy with other types of learning).

> Also, are you aware of any work that involved developing random forest
> using map-reduce?

Well, we have it. There are fancier efforts as well. Have you done a web
search?
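To make the "shard by records and pool the trees" idea concrete, here is a
minimal sketch of that combination strategy. It is only an illustration,
not Mahout's MapReduce implementation: the use of scikit-learn, the shard
count, and the synthetic data set are all assumptions introduced here for
the example.

# Sketch of the naive "shard by records, union the per-shard trees" scheme.
# scikit-learn, n_shards, and the synthetic data are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_shards = 4  # hypothetical number of record shards (one per "mapper")
shards = np.array_split(np.arange(len(X_train)), n_shards)

# Each shard trains an independent forest on its own slice of the records.
forests = [
    RandomForestClassifier(n_estimators=25, random_state=i)
    .fit(X_train[idx], y_train[idx])
    for i, idx in enumerate(shards)
]

# Pooling step: since every per-shard forest has the same number of trees,
# averaging their class probabilities is equivalent to predicting with the
# union of all trees in one big ensemble.
proba = np.mean([f.predict_proba(X_test) for f in forests], axis=0)
pooled_pred = forests[0].classes_[np.argmax(proba, axis=1)]

# Baseline: a single forest trained on all records with the same total
# number of trees.
full = RandomForestClassifier(n_estimators=n_shards * 25, random_state=0)
full.fit(X_train, y_train)

print("pooled accuracy:   ", (pooled_pred == y_test).mean())
print("full-data accuracy:", full.score(X_test, y_test))
# Each pooled tree only ever saw 1/n_shards of the records, which is why
# the union of per-shard trees may train less effectively than a forest
# grown over all of the data.

The point of the sketch is just that pooling trees is easy to express as a
map-reduce, but each tree's view of the data shrinks as you add shards.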
