Re: Random Forest hangs without trace of error

Marco Mistroni Sun, 11 Dec 2016 03:10:59 -0800

OK. Did you change the Spark version? Or the Java/Scala/Python version?
Have you tried different versions of any of the above?
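
To make the versions easy to compare between runs, here is a minimal sketch that collects them in one place. It assumes Python is the driver language; the helper name collect_versions is made up for illustration, and both pyspark and the java binary are treated as optional since they may not be present on the machine running it.

```python
import platform
import subprocess


def collect_versions():
    """Return version strings worth quoting in a bug report."""
    versions = {"python": platform.python_version()}

    # pyspark may not be importable on this machine.
    try:
        import pyspark
        versions["pyspark"] = pyspark.__version__
    except ImportError:
        versions["pyspark"] = "not installed"

    # `java -version` normally writes to stderr; the binary may be missing.
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, timeout=10
        )
        lines = (result.stderr or result.stdout).strip().splitlines()
        versions["java"] = lines[0] if lines else "unknown"
    except (OSError, subprocess.SubprocessError):
        versions["java"] = "not found"

    return versions


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Running this on the driver and on each executor host would show whether the nodes agree with each other.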
Hope this helps
Kr


On 10 Dec 2016 10:37 pm, "Morten Hornbech" <mor...@datasolvr.com> wrote:

> I haven’t actually experienced any non-determinism. We have nightly
> integration tests comparing output from random forests with no variations.
>
> The workaround we will probably try is to split the dataset, either
> randomly or on one of the variables, and then train a forest on each
> partition, which should then be sufficiently small.
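
The workaround described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: rows are held in a local list, the function names are hypothetical, and `train` stands in for whatever fits a forest on one partition. On a Spark DataFrame the random split would more likely be done with DataFrame.randomSplit.

```python
import random
from collections import defaultdict


def split_randomly(rows, n_parts, seed=42):
    """Assign each row to one of n_parts partitions, reproducibly."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[rng.randrange(n_parts)].append(row)
    return parts


def split_on_variable(rows, key):
    """Group rows by the value of one variable, given as a key function."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


def train_per_partition(parts, train):
    """Call the (hypothetical) train function once per partition."""
    return [train(part) for part in parts]
```

For example, train_per_partition(split_randomly(rows, 4), train=fit_forest) would return four models, where fit_forest is a placeholder for the actual training call.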
>
> I hope to be able to provide a good repro case in some weeks. If the
> problem turns out to be in our own code I will also post it in this thread.
>
> Morten
>
> On 10 Dec 2016 at 23:25, Marco Mistroni <mmistr...@gmail.com> wrote:
>
> Hello Morten,
> OK.
> AFAIK there is a tiny bit of randomness in these ML algorithms (please,
> anyone, correct me if I'm wrong).
> In fact, if you run your RDF code multiple times, it will not give you
> EXACTLY the same results (though accuracy and errors should be more or less
> similar). At least this is what I found when playing around with
> RDF, decision trees and other ML algorithms.
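
Whether repeated runs differ usually comes down to how the random number generator is seeded. The sketch below is plain Python, not Spark code, and the function name subsample is made up for illustration. It shows row subsampling at a given rate: the same seed selects the same rows every time, while a different seed selects a different subset. Spark's random forest estimators also take a seed parameter, so a fixed seed there would be the first thing to check when comparing runs.

```python
import random


def subsample(n_rows, rate, seed):
    """Pick row indices independently with probability `rate`."""
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < rate]


# Same seed, same subset; a different seed gives a different subset.
first = subsample(10_000, 0.05, seed=1)
again = subsample(10_000, 0.05, seed=1)
other = subsample(10_000, 0.05, seed=2)
print(first == again, first == other)
```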
>
> If RDF is not a must for your use case, could you try to 'scale back' to
> Decision Trees and see if you still get intermittent failures?
> This would at least exclude issues with the data.
>
> hth
>  marco
>
> On Sat, Dec 10, 2016 at 5:20 PM, Morten Hornbech <mor...@datasolvr.com>
> wrote:
>
>> Already did. There are no issues with smaller samples. I am running this
>> in a cluster of three t2.large instances on AWS.
>>
>> I have tried to find the threshold where the error occurs, but it is not
>> a single factor causing it. Input size and subsampling rate seem to be
>> the most significant, and number of trees the least.
>>
>> I have also tried running on a test frame of randomized numbers with the
>> same number of rows, and could not reproduce the problem there.
>>
>> By the way maxDepth is 5 and maxBins is 32.
>>
>> I will probably need to leave this for a few weeks to focus on more
>> short-term stuff, but I will write here if I solve it or reproduce it more
>> consistently.
>>
>> Morten
>>
>> On 10 Dec 2016 at 17:29, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>> Hi
>> Bring the sample size back to the 1k range to debug, or, as suggested,
>> reduce tree and bins. I had rdd running on data of the same size with no
>> issues. Or send me some sample code and data and I will try it out on my
>> EC2 instance.
>> Kr
>>
>> On 10 Dec 2016 3:16 am, "Md. Rezaul Karim" <rezaul.karim@insight-centre.org> wrote:
>>
>>> I had a similar experience last week. I could not find any error
>>> trace either.
>>>
>>> Later on, I did the following to get rid of the problem:
>>> i) I downgraded to Spark 2.0.0
>>> ii) I decreased the values of maxBins and maxDepth
>>>
>>> Additionally, make sure that you set featureSubsetStrategy to "auto" to
>>> let the algorithm choose the feature subset strategy for your
>>> data. Finally, set the impurity to "gini" for the information gain.
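
For reference, the "gini" setting names the impurity measure used when scoring candidate splits. Below is a minimal pure-Python sketch of Gini impurity and the gain of a single split. The function names are illustrative and this is not Spark's implementation, only the textbook formula.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def information_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted
```

A perfectly mixed two-class node such as [0, 0, 1, 1] has impurity 0.5, and splitting it into [0, 0] and [1, 1] removes all of it.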
>>>
>>> However, setting the number of trees to just 1 gives you neither the
>>> real advantage of a forest nor better predictive performance.
>>>
>>>
>>>
>>> Best,
>>> Karim
>>>
>>>
>>> On Dec 9, 2016 11:29 PM, "mhornbech" <mor...@datasolvr.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I have spent quite some time trying to debug an issue with the Random
>>>> Forest algorithm on Spark 2.0.2. The input dataset is relatively large at
>>>> around 600k rows and 200MB, but I use subsampling to make each tree
>>>> manageable. However, even with only 1 tree and a low sample rate of 0.05
>>>> the job hangs at one of the final stages (see attached). I have checked
>>>> the logs on all executors and the driver and find no traces of error.
>>>> Could it be a memory issue even though no error appears? The error does
>>>> seem sporadic to some extent, so I also wondered whether it could be a
>>>> data issue that only occurs if the subsample includes the bad data rows.
>>>>
>>>> Please comment if you have a clue.
>>>>
>>>> Morten
>>>>
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
