Hi Suraj,

I can't answer 1) without knowing the data. However, the results for 2) are
indeed surprising. We have tested with a billion samples for regression
tasks, so I am perplexed by this behavior.

Could you try the latest Spark master to see whether this problem goes
away? It has code that limits memory consumption at the master and worker
nodes to 128 MB by default, which ideally should not be needed given the
amount of RAM on your cluster.
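
For reference, here is a minimal sketch of how that limit could be raised
explicitly when building the tree. It assumes the Strategy class on master
exposes a maxMemoryInMB field; the exact field name and its default may
differ in your build:

  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.tree.DecisionTree
  import org.apache.spark.mllib.tree.configuration.Algo.Classification
  import org.apache.spark.mllib.tree.configuration.Strategy
  import org.apache.spark.mllib.tree.impurity.Gini
  import org.apache.spark.rdd.RDD

  // trainData: RDD[LabeledPoint], cached in memory beforehand.
  def buildTree(trainData: RDD[LabeledPoint]) = {
    // maxMemoryInMB caps the per-node aggregate size; raising it trades
    // memory for fewer passes over the data.
    val strategy = new Strategy(
      algo = Classification,
      impurity = Gini,
      maxDepth = 5,
      maxMemoryInMB = 512)
    DecisionTree.train(trainData, strategy)
  }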

Also, feel free to send the DEBUG logs. They might give me a better idea of
where the algorithm is getting stuck.
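
In case it is useful, a quick way to get DEBUG output from the tree code in
the spark-shell (this only affects the driver's logs; the package name below
is the usual one, adjust it if the logs of interest live elsewhere):

  import org.apache.log4j.{Level, Logger}

  // Raise log verbosity for the MLlib tree package on the driver.
  Logger.getLogger("org.apache.spark.mllib.tree").setLevel(Level.DEBUG)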

-Manish



On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH <shet...@gmail.com> wrote:

> Hi Filipus,
> The train data is already oversampled.
> The number of positives I mentioned above is for the test dataset: 12,028
> (apologies for not making this clear earlier).
> The train dataset has 61,264 positives out of 689,763 total rows; the
> number of negatives is 628,499.
> Oversampling was done on the train dataset to ensure that we have at least
> 9-10% positives in the train part.
> No oversampling was done for the test dataset.
>
> So, the only difference that remains is the amount of data used for
> building a tree.
>
> But I have a few more questions:
> Has anyone tested how much data can, at most, be used to build a single
> Decision Tree?
> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
> train data and 30x3 GB of RAM), I would expect it to build a single
> Decision Tree with all the data without any issues. But for maxDepth >= 5,
> it is not able to. I confirmed that while it keeps running for hours, the
> amount of free memory available is more than 70%, so it doesn't seem to be
> a memory issue either.
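>
> For concreteness, a minimal sketch of the kind of run I mean, assuming the
> 1.0-style DecisionTree.train signature (the impurity choice here is just
> illustrative):
>
>   import org.apache.spark.mllib.tree.DecisionTree
>   import org.apache.spark.mllib.tree.configuration.Algo.Classification
>   import org.apache.spark.mllib.tree.impurity.Entropy
>
>   // trainData: RDD[LabeledPoint], ~1.3 GB, fully cached in memory.
>   for (depth <- 3 to 6) {
>     val t0 = System.nanoTime()
>     val model = DecisionTree.train(trainData, Classification, Entropy, depth)
>     println(s"maxDepth=$depth took ${(System.nanoTime() - t0) / 1e9} s")
>   }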
>
>
> Thanks and Regards,
> Suraj Sheth
>
>
> On Wed, Jun 11, 2014 at 10:19 PM, filipus <floe...@gmail.com> wrote:
>
>> Well, I guess your problem is quite unbalanced, and with information value
>> as the splitting criterion the algorithm probably stops after very few
>> splits.
>>
>> A workaround is oversampling:
>>
>> Build many training datasets, e.g. take 50% of the positives at random and
>> the same number of negatives, or, say, twice as many
>>
>> => 6000 positives and 12000 negatives
>>
>> and build a tree on each one.
>>
>> Do this many times => many models (agents),
>>
>> and then combine them into an ensemble model, i.e. let all the models vote.
>>
>> This is similar in spirit to a random forest, but arrived at in a
>> completely different way.
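>>
>> A rough sketch of what I mean (assuming MLlib's 1.0-style DecisionTree.train
>> API; the sampling fractions and helper names are just illustrative):
>>
>>   import org.apache.spark.mllib.linalg.Vector
>>   import org.apache.spark.mllib.regression.LabeledPoint
>>   import org.apache.spark.mllib.tree.DecisionTree
>>   import org.apache.spark.mllib.tree.configuration.Algo.Classification
>>   import org.apache.spark.mllib.tree.impurity.Gini
>>   import org.apache.spark.mllib.tree.model.DecisionTreeModel
>>   import org.apache.spark.rdd.RDD
>>
>>   // Build several trees on balanced subsamples and let them vote.
>>   def balancedEnsemble(data: RDD[LabeledPoint],
>>                        numModels: Int): Seq[DecisionTreeModel] = {
>>     val positives = data.filter(_.label == 1.0)
>>     val negatives = data.filter(_.label == 0.0)
>>     (1 to numModels).map { seed =>
>>       // ~50% of the positives, and a negative fraction tuned so that
>>       // |negatives| is roughly twice |positives| in each subsample.
>>       val pos = positives.sample(withReplacement = false, 0.5, seed)
>>       val neg = negatives.sample(withReplacement = false, 0.02, seed)
>>       DecisionTree.train(pos.union(neg), Classification, Gini, 5)
>>     }
>>   }
>>
>>   // Majority vote over the ensemble for a single feature vector.
>>   def predictVote(models: Seq[DecisionTreeModel], features: Vector): Double = {
>>     val votes = models.map(_.predict(features))
>>     if (votes.sum >= models.size / 2.0) 1.0 else 0.0
>>   }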
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
