Yes, a thread dump plus the log would be helpful for debugging. Thanks
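
A thread dump can be captured with jstack <executor-pid> on the worker host (or kill -3 <pid>, which writes it to the executor's stdout). If shell access to the nodes is awkward, a minimal sketch like the following, built on the standard Thread.getAllStackTraces API, prints the same information from inside the application; the ThreadDump helper itself is hypothetical, not something from this thread:

import scala.collection.JavaConverters._

// Hypothetical helper: print every live thread's stack to stderr so a
// hung stage can be inspected in the executor logs.
object ThreadDump {
  def print(): Unit =
    for ((t, frames) <- Thread.getAllStackTraces.asScala) {
      System.err.println()
      System.err.println(s"${t.getName} (state=${t.getState})")
      frames.foreach(f => System.err.println(s"    at $f"))
    }
}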
> On Jan 25, 2016, at 5:59 AM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>
> Is the thread dump the stack trace you are talking about? If so, I will see if I can capture the few different stages I have seen it in.
>
> Thanks for the help, I was able to do it for 0.1% of my data. I will create the JIRA.
>
> Thanks,
> Isaac
>
> On Jan 25, 2016, at 8:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Opening a JIRA is fine.
>>
>> See if you can capture a stack trace during the hung stage and attach it to the JIRA so that we have more of a clue.
>>
>> Thanks
>>
>> On Jan 25, 2016, at 4:25 AM, Darren Govoni <dar...@ontrenet.com> wrote:
>>
>>> Probably we should open a ticket for this. There's definitely a deadlock situation occurring in Spark under certain conditions.
>>>
>>> The only clue I have is that it always happens on the last stage, and it does seem sensitive to scale. If my job has 300 MB of data I'll see the deadlock, but if I only run 10 MB of it, it will succeed. This suggests a serious fundamental scaling problem.
>>>
>>> Workers have plenty of resources.
>>>
>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>
>>> -------- Original message --------
>>> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
>>> Date: 01/24/2016 2:54 PM (GMT-05:00)
>>> To: Renu Yadav <yren...@gmail.com>
>>> Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar <bablo...@gmail.com>, Ted Yu <yuzhih...@gmail.com>, user@spark.apache.org
>>> Subject: Re: 10hrs of Scheduler Delay
>>>
>>> I am not getting anywhere with any of the suggestions so far. :(
>>>
>>> Trying some more outlets; I will share any solution I find.
>>>
>>> - Isaac
>>>
>>>> On Jan 23, 2016, at 1:48 AM, Renu Yadav <yren...@gmail.com> wrote:
>>>>
>>>> If you turn spark.speculation on, that might help. It worked for me.
>>>>
>>>>> On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni <dar...@ontrenet.com> wrote:
>>>>> Thanks for the tip. I will try it. But this is the kind of thing Spark is supposed to figure out and handle, or at least not get stuck on forever.
>>>>>
>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>
>>>>> -------- Original message --------
>>>>> From: Muthu Jayakumar <bablo...@gmail.com>
>>>>> Date: 01/22/2016 3:50 PM (GMT-05:00)
>>>>> To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" <sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>
>>>>> Cc: user@spark.apache.org
>>>>> Subject: Re: 10hrs of Scheduler Delay
>>>>>
>>>>> Does increasing the number of partitions help? You could try something like 3 times what you currently have.
>>>>> Another trick I used was to partition the problem into multiple dataframes, run them sequentially, persist the results, and then run a union on the results.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>>> On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:
>>>>>> Me too. I had to shrink my dataset to get it to work. For us, at least, Spark seems to have scaling issues.
>>>>>>
>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>
>>>>>> -------- Original message --------
>>>>>> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
>>>>>> Date: 01/21/2016 11:18 PM (GMT-05:00)
>>>>>> To: Ted Yu <yuzhih...@gmail.com>
>>>>>> Cc: user@spark.apache.org
>>>>>> Subject: Re: 10hrs of Scheduler Delay
>>>>>>
>>>>>> I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am using more resources on this one.
>>>>>>
>>>>>> - Isaac
>>>>>>
>>>>>>> On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>
>>>>>>> You may have seen the following on the GitHub page:
>>>>>>>
>>>>>>> Latest commit 50fdf0e on Feb 22, 2015
>>>>>>>
>>>>>>> That was 11 months ago.
>>>>>>>
>>>>>>> Can you search for a similar algorithm which runs on Spark and is newer?
>>>>>>>
>>>>>>> If nothing is found, consider running the tests that come with the project to determine whether the delay is intrinsic.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>> On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>>>>>>>> That thread seems to be moving; it oscillates between a few different traces… Maybe it is working. It seems odd that it would take that long.
>>>>>>>>
>>>>>>>> This is 3rd-party code, and after looking at some of it, I think it might not be as Spark-y as it could be.
>>>>>>>>
>>>>>>>> I linked it below. I don’t know a lot about Spark, so it might be fine, but I have my suspicions.
>>>>>>>>
>>>>>>>> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>>>>>>>>
>>>>>>>> - Isaac
>>>>>>>>
>>>>>>>>> On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> You may have noticed the following - did this indicate prolonged computation in your code?
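
For completeness, a rough sketch of the two workarounds suggested upthread (Renu's spark.speculation tip and Muthu's repartition / split-persist-union trick), written against the Spark 1.x DataFrame API current at the time; the input path, app name, partition count, split weights, and the process function are all placeholders, not taken from Isaac's job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Re-launch suspiciously slow tasks on another executor (Renu's tip).
val conf = new SparkConf()
  .setAppName("dbscan-exploration")           // placeholder app name
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

def process(df: DataFrame): DataFrame = df    // placeholder for the real job

val input = sqlContext.read.parquet("hdfs:///path/to/input")  // placeholder path

// Muthu's tips: roughly 3x the current partition count, and splitting the
// problem into pieces that are run one at a time, persisted, then unioned.
val repartitioned = input.repartition(600)    // placeholder: ~3x current count
val pieces = repartitioned.randomSplit(Array(0.25, 0.25, 0.25, 0.25))
val results = pieces.map { p =>
  val r = process(p).persist()
  r.count()   // materialize this piece before starting the next
  r
}
val combined = results.reduce((a, b) => a.unionAll(b))  // DataFrame union in Spark 1.x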