Yes, a thread dump plus the log would be helpful for debugging. Thanks
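
A thread dump can be captured with jstack <executor-pid> on the worker host (or kill -3 <pid>, which writes it to the executor's stdout). If shell access to the nodes is awkward, a minimal sketch like the following, built on the standard Thread.getAllStackTraces API, prints the same information from inside the application; the ThreadDump helper itself is hypothetical, not something from this thread:

import scala.collection.JavaConverters._

// Hypothetical helper: print every live thread's stack to stderr so a
// hung stage can be inspected in the executor logs.
object ThreadDump {
  def print(): Unit =
    for ((t, frames) <- Thread.getAllStackTraces.asScala) {
      System.err.println()
      System.err.println(s"${t.getName} (state=${t.getState})")
      frames.foreach(f => System.err.println(s"    at $f"))
    }
}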
> On Jan 25, 2016, at 5:59 AM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>
> Is the thread dump the stack trace you are talking about? If so, I will see if I can capture the few different stages I have seen it in.
>
> Thanks for the help, I was able to do it for 0.1% of my data. I will create the JIRA.
>
> Thanks,
> Isaac
>
> On Jan 25, 2016, at 8:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Opening a JIRA is fine.
>>
>> See if you can capture a stack trace during the hung stage and attach it to the JIRA so that we have more of a clue.
>>
>> Thanks
>>
>> On Jan 25, 2016, at 4:25 AM, Darren Govoni <dar...@ontrenet.com> wrote:
>>
>>> Probably we should open a ticket for this. There's definitely a deadlock situation occurring in Spark under certain conditions.
>>>
>>> The only clue I have is that it always happens on the last stage, and it does seem sensitive to scale. If my job has 300 MB of data I'll see the deadlock, but if I only run 10 MB of it, it will succeed. This suggests a serious fundamental scaling problem.
>>>
>>> Workers have plenty of resources.
>>>
>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>
>>> -------- Original message --------
>>> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
>>> Date: 01/24/2016 2:54 PM (GMT-05:00)
>>> To: Renu Yadav <yren...@gmail.com>
>>> Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar <bablo...@gmail.com>, Ted Yu <yuzhih...@gmail.com>, user@spark.apache.org
>>> Subject: Re: 10hrs of Scheduler Delay
>>>
>>> I am not getting anywhere with any of the suggestions so far. :(
>>>
>>> Trying some more outlets; I will share any solution I find.
>>>
>>> - Isaac
>>>
>>>> On Jan 23, 2016, at 1:48 AM, Renu Yadav <yren...@gmail.com> wrote:
>>>>
>>>> If you turn spark.speculation on, that might help. It worked for me.
>>>>
>>>>> On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni <dar...@ontrenet.com> wrote:
>>>>> Thanks for the tip. I will try it. But this is the kind of thing Spark is supposed to figure out and handle, or at least not get stuck on forever.
>>>>>
>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>
>>>>> -------- Original message --------
>>>>> From: Muthu Jayakumar <bablo...@gmail.com>
>>>>> Date: 01/22/2016 3:50 PM (GMT-05:00)
>>>>> To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" <sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>
>>>>> Cc: user@spark.apache.org
>>>>> Subject: Re: 10hrs of Scheduler Delay
>>>>>
>>>>> Does increasing the number of partitions help? You could try something like 3 times what you currently have.
>>>>> Another trick I used was to partition the problem into multiple dataframes, run them sequentially, persist the results, and then run a union on the results.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>>> On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:
>>>>>> Me too. I had to shrink my dataset to get it to work. For us, at least, Spark seems to have scaling issues.
>>>>>>
>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>
>>>>>> -------- Original message --------
>>>>>> From: "Sanders, Isaac B" <sande...@rose-hulman.edu>
>>>>>> Date: 01/21/2016 11:18 PM (GMT-05:00)
>>>>>> To: Ted Yu <yuzhih...@gmail.com>
>>>>>> Cc: user@spark.apache.org
>>>>>> Subject: Re: 10hrs of Scheduler Delay
>>>>>>
>>>>>> I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am using more resources on this one.
>>>>>>
>>>>>> - Isaac
>>>>>>
>>>>>>> On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>
>>>>>>> You may have seen the following on the GitHub page:
>>>>>>>
>>>>>>> Latest commit 50fdf0e on Feb 22, 2015
>>>>>>>
>>>>>>> That was 11 months ago.
>>>>>>>
>>>>>>> Can you search for a similar algorithm which runs on Spark and is newer?
>>>>>>>
>>>>>>> If nothing is found, consider running the tests that come with the project to determine whether the delay is intrinsic.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>> On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B <sande...@rose-hulman.edu> wrote:
>>>>>>>> That thread seems to be moving; it oscillates between a few different traces… Maybe it is working. It seems odd that it would take that long.
>>>>>>>>
>>>>>>>> This is 3rd-party code, and after looking at some of it, I think it might not be as Spark-y as it could be.
>>>>>>>>
>>>>>>>> I linked it below. I don’t know a lot about Spark, so it might be fine, but I have my suspicions.
>>>>>>>>
>>>>>>>> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>>>>>>>>
>>>>>>>> - Isaac
>>>>>>>>
>>>>>>>>> On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> You may have noticed the following - did this indicate prolonged computation in your code?
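
For completeness, a rough sketch of the two workarounds suggested upthread (Renu's spark.speculation tip and Muthu's repartition / split-persist-union trick), written against the Spark 1.x DataFrame API current at the time; the input path, app name, partition count, split weights, and the process function are all placeholders, not taken from Isaac's job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Re-launch suspiciously slow tasks on another executor (Renu's tip).
val conf = new SparkConf()
  .setAppName("dbscan-exploration")           // placeholder app name
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

def process(df: DataFrame): DataFrame = df    // placeholder for the real job

val input = sqlContext.read.parquet("hdfs:///path/to/input")  // placeholder path

// Muthu's tips: roughly 3x the current partition count, and splitting the
// problem into pieces that are run one at a time, persisted, then unioned.
val repartitioned = input.repartition(600)    // placeholder: ~3x current count
val pieces = repartitioned.randomSplit(Array(0.25, 0.25, 0.25, 0.25))
val results = pieces.map { p =>
  val r = process(p).persist()
  r.count()   // materialize this piece before starting the next
  r
}
val combined = results.reduce((a, b) => a.unionAll(b))  // DataFrame union in Spark 1.x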