On Fri, Feb 7, 2014 at 7:48 AM, Aaron Davidson <[email protected]> wrote:
> Sorry for the delay. By long-running I just meant running an iterative
> algorithm that slows down over time. We have observed this in the
> spark-perf benchmark: as file system state builds up, the job can slow
> down. Once the job finishes, however, it is cleaned up and should not
> affect subsequent jobs.
>
> I can think of three other possibilities for a slowdown: (1) unclean
> shutdown of Spark (i.e., kill -9), which doesn't allow us to clean up
> our data

By 'shutdown of Spark', do you mean shutting down the Spark app, or the
Spark cluster? How is it possible to gracefully shut down a Spark app?

> (2) buildup of logs in the work/ directory or files in the Spark tmp
> directory, and (3) a bug in Spark (woo!).
>
>
> On Tue, Feb 4, 2014 at 5:58 AM, Aureliano Buendia <[email protected]> wrote:
>
>> On Mon, Feb 3, 2014 at 12:26 AM, Aaron Davidson <[email protected]> wrote:
>>
>>> Are you seeing any exceptions in between running apps? Does restarting
>>> the master/workers actually cause Spark to speed back up again? It's
>>> possible, for instance, that you have run out of disk space, which
>>> should cause exceptions but would not go away by restarting the
>>> master/workers.
>>
>> Not really: no exceptions, and plenty of disk space left. At this point
>> I'm not certain that restarting the Spark master/workers definitely
>> helps. The only thing that does help is bringing up a fresh EC2 cluster,
>> which is not a solution. This could suggest that Spark leaves some state
>> behind that builds up every time the app is executed.
>>
>>> One thing to worry about is long-running jobs or shells.
>>
>> What do you mean by long-running jobs?
>>
>>> Currently, state buildup within a single job in Spark *is* a problem,
>>> as certain state such as shuffle files and RDD metadata is not cleaned
>>> up until the job (or shell) exits. We have hacky ways to reduce this,
>>> and are working on a long-term solution. However, separate, consecutive
>>> jobs should be independent in terms of performance.
>>>
>>> On Sat, Feb 1, 2014 at 8:27 PM, 尹绪森 <[email protected]> wrote:
>>>
>>>> Is your Spark app an iterative one? If so, your app is creating a big
>>>> DAG in every iteration. You should checkpoint it periodically, say one
>>>> checkpoint every 10 iterations.
>>>>
>>>> 2014-02-01 Aureliano Buendia <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've noticed my Spark app (on EC2) gets slower and slower as I
>>>>> repeatedly execute it.
>>>>>
>>>>> With a fresh EC2 cluster, it is snappy and takes about 15 minutes to
>>>>> complete; after running the same app 4 times it gets slower and takes
>>>>> 40 minutes or more.
>>>>>
>>>>> While the cluster gets slower, the monitoring metrics show less and
>>>>> less activity (almost no CPU or I/O).
>>>>>
>>>>> When it gets slow, sometimes the number of running tasks (light blue
>>>>> in the web UI progress bar) is zero, and only the number of completed
>>>>> tasks (dark blue) increments.
>>>>>
>>>>> Is this a known Spark issue?
>>>>>
>>>>> Do I need to restart the Spark master and workers in between running
>>>>> apps?
>>>>
>>>> --
>>>> Best Regards
>>>> -----------------------------------
>>>> Xusen Yin 尹绪森
>>>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>>>> Multimedia
>>>> Beijing University of Posts & Telecommunications
>>>> Intel Labs China
>>>> Homepage: http://yinxusen.github.io/
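On the graceful-shutdown question above: for a standalone driver program, shutting the app down cleanly usually just means letting the driver call SparkContext.stop() (or exit normally) instead of killing the driver JVM with kill -9, so Spark gets a chance to remove its shuffle and temp files. A minimal Scala sketch, not taken from the thread; the app name, master URL, and job body are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object SlowdownTest {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("SlowdownTest")          // placeholder app name
          .setMaster("spark://master:7077")    // placeholder standalone master URL
        val sc = new SparkContext(conf)
        try {
          // ... real job logic goes here (placeholder) ...
          val n = sc.parallelize(1 to 1000).map(_ * 2).count()
          println("count = " + n)
        } finally {
          // Graceful shutdown: sc.stop() lets Spark clean up its shuffle files
          // and temp directories, unlike killing the driver JVM with kill -9.
          sc.stop()
        }
      }
    }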

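And a minimal sketch of the periodic checkpointing Xusen suggests for iterative jobs, assuming sc is a live SparkContext as above; the checkpoint directory, iteration count, and per-iteration update are placeholders:

    // Truncate the growing lineage (DAG) every 10 iterations.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // placeholder HDFS path

    var state = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 100) {
      state = state.map(v => v * 0.9 + 1.0).cache()  // placeholder update step
      if (i % 10 == 0) {
        state.checkpoint()  // marks the RDD; the write happens on the next action
      }
      state.count()         // action forces evaluation (and the checkpoint write)
    }

Checkpointing writes the RDD to stable storage and drops its lineage, which keeps the DAG, and the driver-side metadata that goes with it, from growing without bound across iterations.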