Because we don't have random access to the records, sampling still needs
to go through them sequentially. It does save some computation, but the
saving is perhaps noticeable only if you have the data cached in memory.
A different random seed is used for each tree. -Xiangrui
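
To illustrate the point about sequential access, here is a minimal sketch in plain Python. It is not Spark's implementation; the function name bernoulli_sample, the record count, and the seeds are made up for the example. It only shows that, without random access, a subsampling pass reads every record whatever the rate is, so the saving is limited to the work done on the records that are kept.

```python
import random


def bernoulli_sample(records, rate, seed):
    """Scan an iterator once, keeping each record with probability `rate`.

    Hypothetical helper for illustration only; not part of Spark MLlib.
    Returns the kept records and the number of records that were read.
    """
    rng = random.Random(seed)
    visited = 0
    kept = []
    for record in records:  # no random access: every record is read
        visited += 1
        if rng.random() < rate:
            kept.append(record)  # only kept records incur further work
    return kept, visited


if __name__ == "__main__":
    # One pass per "tree", each with its own seed (assumed seeding scheme).
    for tree_index, rate in enumerate([1.0, 0.1]):
        kept, visited = bernoulli_sample(iter(range(100_000)), rate, seed=42 + tree_index)
        print(f"rate={rate}: visited {visited} records, kept {len(kept)}")
```

In both passes the number of visited records is the same; only the number of kept records changes. If reading the records dominates the cost (for example, when the data is not cached in memory), lowering the rate changes the runtime very little.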
On Wed, Jun 3, 2015 at 4:40 PM,
When training a RandomForest model, the Strategy class (in
mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping
to use this to cut down on processing time for large datasets (more than 2MM
rows and 9K predictors), but I've found that the runtime stays approximately