Set the ulimit (ulimit -n XX) to some high number in spark-env.sh to avoid hitting the open-files limit. Also, it's possible the CPU utilization goes down between phases because shuffle data is being written out. This is expected behavior, since Spark becomes I/O-bound during this period.
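Concretely, the spark-env.sh change Patrick suggests might look like the sketch below. The value 65536 is illustrative (the thread only says "some high number"), and the soft limit can only be raised as far as the hard limit:

```shell
# Check the current open-file limits before raising them. Spark's shuffle
# can hold open roughly (#map tasks x #reduce tasks) files at once, which
# easily exceeds the common default soft limit of 1024.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft=$soft hard=$hard"

# In conf/spark-env.sh, raise the soft limit before the JVM starts.
# 65536 is an illustrative value, not one from this thread; it must not
# exceed the hard limit printed above:
# ulimit -n 65536
```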
- Patrick

On Wed, Nov 27, 2013 at 2:14 PM, Vijay Gaikwad <[email protected]> wrote:
> The job getting stuck means it halts for some time and doesn't do any
> processing; CPU usage goes to 0%. After some time the processing resumes
> and CPU goes up. This cycle continues as the job progresses until it
> completes.
>
> But today, while I am running some other Spark jobs, it isn't happening.
> The job is running seamlessly without halts on multiple cores, although it
> throws a "too many open files" exception if I increase the number of cores
> beyond 4.
>
> Still, I will try running jstack on the process, especially the one which
> gets stuck.
> Thx
>
> Vijay Gaikwad
> University of Washington MSIM
> [email protected]
> (206) 261-5828
>
> On Nov 27, 2013, at 1:34 PM, Patrick Wendell <[email protected]> wrote:
>
> Vijay - you said the job gets stuck, but you also said it eventually
> completes. What do you mean by stuck? Do you mean that there are
> periods of low CPU utilization?
>
> If you can run jstack during one of those periods and post the output,
> that would be most helpful.
>
> On Wed, Nov 27, 2013 at 1:04 AM, Vijay Gaikwad <[email protected]> wrote:
>
> The server has 100+ GB of memory. Virtual memory for my job is 60 GB and
> reserved is 20-30 GB, so there is plenty of memory to spare even when the
> job is stuck. I am not sure if it is GC, because there is still a lot of
> memory the job could have used. The job's memory consumption remains the
> same after it resumes, and there is no swapping either (I observe all this
> using the top command). However, as the job progresses (with all the halt
> and resume cycles), the memory used slowly increases but never reaches the
> maximum.
>
> When the job gets stuck, CPU drops to 0% and memory is unchanged.
>
> I have observed this behavior with my other Spark scripts too, which run
> on multiple small files. I thought it was because I was using a single
> machine, but I believe that shouldn't be the case.
>
> Does any of you observe such behavior?
> Thx
>
> -Vijay
> University of Washington
>
> On Nov 27, 2013 12:44 AM, "Liu, Raymond" <[email protected]> wrote:
>
> How about memory usage, any GC problem? When you mention getting stuck, do
> you mean 0% or 1200% CPU while making no progress?
>
> Raymond
>
> From: Vijay Gaikwad [mailto:[email protected]]
> Sent: Wednesday, November 27, 2013 2:54 PM
> To: [email protected]
> Subject: Re: local[k] job gets stuck - spark 0.8.0
>
> Hi Patrick,
>
> Sorry, I don't have access to the web UI.
> So I have been running these jobs on larger servers and letting them run.
> I have observed that when I run a job with "local[12]", it runs for some
> time at full throttle at 1200% CPU consumption, but after some time this
> processing goes to 0%.
> After a few seconds it starts processing again and goes back to a high
> percentage of CPU utilization. This cycle repeats until the job is
> completed.
> Ironically, I observed similar behavior with simple "local" jobs.
>
> Is it the nature of the job that is causing this? I am processing a 70 GB
> file and performing simple map and reduce operations, and I have a
> sufficient 100 GB of RAM.
> Any thoughts?
>
> Vijay Gaikwad
> University of Washington MSIM
> [email protected]
> (206) 261-5828
>
> On Nov 25, 2013, at 11:43 AM, Patrick Wendell <[email protected]> wrote:
>
> When it gets stuck, what does it show in the web UI? Also, can you run
> jstack on the process and attach the output... that might explain
> what's going on.
>
> On Mon, Nov 25, 2013 at 11:30 AM, Vijay Gaikwad <[email protected]>
> wrote:
>
> I am using Apache Spark 0.8.0 to process a large data file and perform
> some basic .map and .reduceByKey operations on the RDD.
>
> Since I am using a single machine with multiple processors, I mention
> local[8] in the master URL field while creating the SparkContext:
>
> val sc = new SparkContext("local[8]", "Tower-Aggs", SPARK_HOME)
>
> But whenever I mention multiple processors, the job gets stuck
> (pauses/halts) randomly.
> There is no definite place where it gets stuck; it's
> just random. Sometimes it won't happen at all. I am not sure if it
> continues after that, but it stays stuck for a long time, after which I
> abort the job.
>
> But when I just use local in place of local[8], the job runs seamlessly
> without ever getting stuck:
>
> val sc = new SparkContext("local", "Tower-Aggs", SPARK_HOME)
>
> I am not able to understand where the problem is.
>
> I am using Scala 2.9.3 and sbt to build and run the application.
>
> -
>
> http://stackoverflow.com/questions/20187048/apache-spark-localk-master-url-job-gets-stuck
>
> Thx
> Vijay Gaikwad
> University of Washington MSIM
> [email protected]
> (206) 261-5828
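For reference, Patrick's jstack suggestion could be run as something like the sketch below. The "Tower-Aggs" filter is the app name from the thread; everything else (dump count, sleep interval, file names) is an arbitrary choice. Taking several dumps a few seconds apart during a 0%-CPU phase makes persistently blocked threads easier to spot:

```shell
# Find the PID of the driver JVM by its Spark app name, then capture a
# few thread dumps while the job is in a 0%-CPU phase. Threads that stay
# BLOCKED or WAITING across all dumps are the ones to look at.
PID=$(jps -lm | grep Tower-Aggs | awk '{print $1}')
for i in 1 2 3; do
  jstack "$PID" > "jstack-$i.txt"
  sleep 5
done
```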
