Hi Suraj,

I don't see any logs from MLlib. You might need to explicitly set the logging
level to DEBUG for MLlib. Adding this line to log4j.properties might fix the
problem:
log4j.logger.org.apache.spark.mllib.tree=DEBUG
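For reference, the surrounding file might look like the sketch below (based on
the conf/log4j.properties.template bundled with Spark; the exact file on your
cluster may differ, so treat this as an example rather than a drop-in config):

```properties
# Keep general Spark output at the default level, printed to the console.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Turn on DEBUG output for the decision tree code only.
log4j.logger.org.apache.spark.mllib.tree=DEBUG
```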

Also, please let me know whether you encounter similar problems with the
Spark master.

-Manish


On Sat, Jun 14, 2014 at 3:19 AM, SURAJ SHETH <shet...@gmail.com> wrote:

> Hi Manish,
> Thanks for your reply.
>
> I am attaching the logs here (regression, 5 levels). They contain the last
> few hundred lines. I am also attaching a screenshot of the Spark UI. The first 4
> levels complete in less than 6 seconds, while the 5th level doesn't
> complete even after several hours.
> Since this is somebody else's data, I can't share it.
>
> Could you check the code snippet attached in my first email and see whether
> it needs any changes to work with large data and >= 5 levels? It works for
> 3 levels on the same dataset, but not for 5 levels.
>
> In the meantime, I will try to run it on the latest master and let you
> know the results. If it runs fine there, then it may be related to the
> 128 MB limit issue that you mentioned.
>
> Thanks and Regards,
> Suraj Sheth
>
>
>
> On Sat, Jun 14, 2014 at 12:05 AM, Manish Amde <manish...@gmail.com> wrote:
>
>> Hi Suraj,
>>
>> I can't answer 1) without knowing the data. However, the results for 2)
>> are surprising indeed. We have tested with a billion samples for regression
>> tasks, so I am perplexed by this behavior.
>>
>> Could you try the latest Spark master to see whether this problem goes
>> away? It has code that limits memory consumption at the master and worker
>> nodes to 128 MB by default, which ideally should not be needed given the
>> amount of RAM on your cluster.
>>
>> Also, feel free to send the DEBUG logs. It might give me a better idea of
>> where the algorithm is getting stuck.
>>
>> -Manish
>>
>>
>>
>> On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH <shet...@gmail.com> wrote:
>>
>>> Hi Filipus,
>>> The train data is already oversampled.
>>> The number of positives I mentioned above is for the test dataset:
>>> 12,028 (apologies for not making this clear earlier).
>>> The train dataset has 61,264 positives out of 689,763 total rows. The
>>> number of negatives is 628,499.
>>> Oversampling was done for the train dataset to ensure that we have
>>> at least 9-10% positives in the train part.
>>> No oversampling is done for the test dataset.
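The class balance described above can be illustrated with a small
random-oversampling sketch in plain Python (the function and variable names
are illustrative only, not taken from the attached snippet):

```python
import random

def oversample(pos, neg, target_frac, seed=0):
    """Randomly duplicate positive rows until they make up roughly
    target_frac of the combined dataset (simple random oversampling)."""
    rng = random.Random(seed)
    p, n = len(pos), len(neg)
    # Solve (p + k) / (p + n + k) = target_frac for the number of extra copies k.
    k = max(0, round((target_frac * (p + n) - p) / (1 - target_frac)))
    extra = [rng.choice(pos) for _ in range(k)]
    return pos + extra, neg

# Toy numbers: 100 positives, 1900 negatives, aiming for ~10% positives.
pos, neg = oversample(list(range(100)), list(range(1900)), target_frac=0.10)
print(len(pos), len(neg))  # 211 1900 -> 211 / 2111 is roughly 10%
```

With the figures quoted above, 61,264 positives out of 689,763 rows is about
8.9%, consistent with the stated 9-10% target.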
>>>
>>> So, the only difference that remains is the amount of data used for
>>> building a tree.
>>>
>>> But I have a few more questions:
>>> Has anyone tested how much data can be used, at most, to build a single
>>> Decision Tree?
>>> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
>>> train data and 30x3 GB of RAM), I would expect it to build a single
>>> Decision Tree with all the data without any issues. But for maxDepth >= 5,
>>> it is not able to. I confirmed that while it keeps running for hours, the
>>> amount of free memory available is more than 70%. So it doesn't seem to be
>>> a memory issue either.
>>>
>>>
>>> Thanks and Regards,
>>> Suraj Sheth
>>>
>>>
>>> On Wed, Jun 11, 2014 at 10:19 PM, filipus <floe...@gmail.com> wrote:
>>>
>>>> Well, I guess your problem is quite unbalanced, and with information
>>>> value as the splitting criterion I guess the algorithm stops after very
>>>> few splits.
>>>>
>>>> A workaround is oversampling:
>>>>
>>>> build many training datasets, like
>>>>
>>>> take randomly 50% of the positives and the same number of negatives, or
>>>> let's say double that
>>>>
>>>> => 6000 positives and 12000 negatives
>>>>
>>>> build a tree
>>>>
>>>> do this many times => many models (agents)
>>>>
>>>> and then make an ensemble model, i.e. let all the models vote
>>>>
>>>> similar in a way to a random forest, but with completely different
>>>> sampling
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
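The balanced-bagging workaround described in the last message can be sketched
in plain Python. The stub lambdas stand in for trained decision trees, and all
names are hypothetical; this is a toy illustration of the sampling and voting
scheme, not Spark MLlib code:

```python
import random

def balanced_samples(pos, neg, n_models, neg_ratio=2, seed=0):
    """Build n_models training sets: each takes a random 50% of the
    positives plus neg_ratio times as many randomly drawn negatives."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_models):
        p = rng.sample(pos, len(pos) // 2)
        n = rng.sample(neg, min(len(neg), neg_ratio * len(p)))
        sets.append((p, n))
    return sets

def ensemble_vote(models, x):
    """Majority vote over the per-sample models (each returns 0 or 1)."""
    votes = sum(m(x) for m in models)
    return 1 if 2 * votes > len(models) else 0

# 12000 positives and 100000 negatives -> five sets of 6000 positives plus
# 12000 negatives each, matching the "6000 positives and 12000 negatives"
# example in the message above.
sets = balanced_samples(list(range(12000)), list(range(100000)), n_models=5)
print([(len(p), len(n)) for p, n in sets])

# Stub models standing in for trained trees: two vote 1, one votes 0.
models = [lambda x: 1, lambda x: 0, lambda x: 1]
print(ensemble_vote(models, None))  # 1
```

One tree per balanced sample plus a majority vote gives an ensemble that, as
noted above, resembles a random forest except that the per-model training sets
are balanced by construction rather than bootstrapped from the full data.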
>>>>
>>>
>>>
>>
>
