Re: RE: Spark checkpoint problem
I don't think this is a deliberate design. If you want to explicitly checkpoint an upstream RDD, you may need to run an action on that RDD before you run the action on the RDD derived from it.

2015-11-26 13:23 GMT+08:00 wyphao.2007 :

> Spark 1.5.2.
>
> On 2015-11-26 13:19:39, "张志强(旺轩)" wrote:
>
> What's your Spark version?
>
> From: wyphao.2007 [mailto:wyphao.2...@163.com]
> Sent: 2015-11-26 10:04
> To: user
> Cc: d...@spark.apache.org
> Subject: Spark checkpoint problem
>
> I am testing checkpoint to understand how it works. My code is as follows:
>
> scala> val data = sc.parallelize(List("a", "b", "c"))
> data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:15
>
> scala> sc.setCheckpointDir("/tmp/checkpoint")
> 15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be non-local if Spark is running on a cluster: /tmp/checkpoint1
>
> scala> data.checkpoint
>
> scala> val temp = data.map(item => (item, 1))
> temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:17
>
> scala> temp.checkpoint
>
> scala> temp.count
>
> But I found that only the temp RDD is checkpointed in the /tmp/checkpoint directory; the data RDD is not checkpointed! I found the doCheckpoint function in the org.apache.spark.rdd.RDD class:
>
> private[spark] def doCheckpoint(): Unit = {
>   RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
>     if (!doCheckpointCalled) {
>       doCheckpointCalled = true
>       if (checkpointData.isDefined) {
>         checkpointData.get.checkpoint()
>       } else {
>         dependencies.foreach(_.rdd.doCheckpoint())
>       }
>     }
>   }
> }
>
> From the code above, only the last RDD (in my case, temp) will be checkpointed. My question: is this deliberately designed, or is it a bug?
>
> Thank you.

-- 王海华
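For reference, a minimal REPL sketch of the workaround suggested above (hypothetical session; paths as in the question). The key point, which follows from the doCheckpoint code quoted in the thread, is that once an RDD has its own checkpointData the recursion into its dependencies stops, so each RDD you want checkpointed needs an action while it is the final RDD the job runs on:

    scala> val data = sc.parallelize(List("a", "b", "c"))
    scala> sc.setCheckpointDir("/tmp/checkpoint")
    scala> data.checkpoint
    scala> data.count    // action on data itself: doCheckpoint sees data's
                         // own checkpointData and writes it out
    scala> val temp = data.map(item => (item, 1))
    scala> temp.checkpoint
    scala> temp.count    // temp is checkpointed here; data was already done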
Re: when will a cached RDD unpersist its data
If memory cannot hold all of the cached RDD blocks, the BlockManager will evict some older blocks to make room for new RDD blocks. Hope that will be helpful.

2015-06-24 13:22 GMT+08:00 bit1...@163.com :

> I am kind of confused about when a cached RDD will unpersist its data. I
> know we can explicitly unpersist it with RDD.unpersist, but can it be
> unpersisted automatically by the Spark framework?
> Thanks.
>
> --
> bit1...@163.com

-- 王海华
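For reference, a small sketch of the two paths (explicit unpersist vs. automatic eviction), assuming an ordinary SparkContext named sc and a hypothetical input path:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs://master:8000/some/input")
                .persist(StorageLevel.MEMORY_ONLY)
    rdd.count()     // materializes and caches the blocks in the BlockManager

    // Explicit: release the cached blocks deterministically when done.
    rdd.unpersist()

    // Automatic: if unpersist is never called and memory fills up, the
    // BlockManager evicts the least-recently-used blocks on its own;
    // evicted MEMORY_ONLY partitions are recomputed from lineage if needed.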
How to set the DEBUG log level of Spark executors in Standalone deploy mode
Hi,

I want to check the DEBUG logs of Spark executors in Standalone deploy mode, but:

1. Setting log4j.properties in the spark/conf folder on the master node and restarting the cluster did not work.
2. Using spark-submit --properties-file log4j.properties just prints debug logs to the screen, while the executor logs still seem to be at INFO level.

Neither of the approaches above works. So how can I set the log level of Spark executors in Standalone mode to DEBUG?

---Env Info---
Spark 1.1.0, Standalone deploy mode.
Submit shell:
bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans --master spark://master:7077 --executor-memory 600m --properties-file log4j.properties lib/spark-examples-1.1.0-hadoop2.3.0.jar hdfs://master:8000/kmeans/data-Kmeans-5.3g 8 1

Thanks!
Wang Haihua
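For what it's worth, a minimal log4j.properties sketch (standard log4j 1.x syntax). Two hedged caveats: in standalone mode each executor appears to read conf/log4j.properties from the worker node it runs on, so the file likely needs to be present on every worker, not just the master; and --properties-file sets Spark configuration properties (the spark.* keys of spark-defaults.conf), not log4j settings, which may explain item 2:

    # conf/log4j.properties -- place on every worker node
    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n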
WebUI shows poor locality during task scheduling
Hi,

When running an experimental KMeans job, the cached RDD is the original points data. I saw poor locality in the task details on the WebUI: almost half of the tasks read their input over the network instead of from memory. A task with network input takes almost the same time as a task with Hadoop (disk) input, and twice as long as a task with memory input, e.g. Task(Memory): 16s, Task(Network): 9s, Task(Hadoop): 9s.

I see in the executor logs below that fetching a 30 MB RDD block from a remote node takes 5 seconds:

15/03/31 04:08:52 INFO CoarseGrainedExecutorBackend: Got assigned task 58
15/03/31 04:08:52 INFO Executor: Running task 15.0 in stage 1.0 (TID 58)
15/03/31 04:08:52 INFO HadoopRDD: Input split: hdfs://master:8000/kmeans/data-Kmeans-5.3g:2013265920+134217728
15/03/31 04:08:52 INFO BlockManager: Found block rdd_3_15 locally
15/03/31 04:08:58 INFO Executor: Finished task 15.0 in stage 1.0 (TID 58). 1920 bytes result sent to driver
15/03/31 04:08:58 INFO CoarseGrainedExecutorBackend: Got assigned task 60   <- Task 60
15/03/31 04:08:58 INFO Executor: Running task 17.0 in stage 1.0 (TID 60)
15/03/31 04:08:58 INFO HadoopRDD: Input split: hdfs://master:8000/kmeans/data-Kmeans-5.3g:2281701376+134217728
15/03/31 04:09:02 INFO BlockManager: Found block rdd_3_17 remotely
15/03/31 04:09:12 INFO Executor: Finished task 17.0 in stage 1.0 (TID 60). 1920 bytes result sent to driver

So:
1) Does this mean I should cache the RDD with MEMORY_AND_DISK instead of memory only?
2) Should I expand network capacity or tune the scheduling locality parameters?

Any suggestion is welcome.

---Env info---
Cluster: 4 workers, each with 1 core and 2 GB executor memory
Spark version: 1.1.0
Network: 30 MB/s
Submit shell:
bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans --master spark://master:7077 --executor-memory 1g lib/spark-examples-1.1.0-hadoop2.3.0.jar hdfs://master:8000/kmeans/data-Kmeans-7g 8 1

Thanks very much, and please forgive my poor English.

--
Wang Haihua
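For reference, a short sketch of the two remedies raised in questions 1) and 2), assuming the points RDD is the cached KMeans input; the locality-wait value is illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // 2) Give the scheduler longer to wait for a node-local slot before it
    //    falls back to rack-local/any (default 3000 ms in Spark 1.x).
    val conf = new SparkConf()
      .setAppName("JavaKMeans")
      .set("spark.locality.wait", "10000")
    val sc = new SparkContext(conf)

    // 1) MEMORY_AND_DISK: partitions that do not fit in memory spill to the
    //    local disk instead of being dropped and later fetched remotely.
    val points = sc.textFile("hdfs://master:8000/kmeans/data-Kmeans-7g")
                   .persist(StorageLevel.MEMORY_AND_DISK)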
Re: How does Spark honor data locality when allocating computing resources for an application
You seem not to have noticed the configuration variable "spreadOutApps" and its comment:

// As a temporary workaround before better ways of configuring memory, we allow users to set
// a flag that will perform round-robin scheduling across the nodes (spreading out each app
// among all the nodes) instead of trying to consolidate each app onto a small # of nodes.

2015-03-14 10:41 GMT+08:00 bit1...@163.com :

> Hi, sparkers,
> When I read the code about computing-resource allocation for a newly
> submitted application in the Master#schedule method, I got a question
> about data locality:
>
> // Pack each app into as few nodes as possible until we've assigned all its cores
> for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
>   for (app <- waitingApps if app.coresLeft > 0) {
>     if (canUse(app, worker)) {
>       val coresToUse = math.min(worker.coresFree, app.coresLeft)
>       if (coresToUse > 0) {
>         val exec = app.addExecutor(worker, coresToUse)
>         launchExecutor(worker, exec)
>         app.state = ApplicationState.RUNNING
>       }
>     }
>   }
> }
>
> It looks like the resource allocation policy here is that the Master will
> assign as few workers as possible, as long as those few workers have enough
> resources for the application.
> My question is: assume the data the application will process is spread
> across all the worker nodes; is data locality then lost under the above
> policy?
> I am not sure whether I have understood correctly or have missed something.
>
> --
> bit1...@163.com

-- 王海华
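For illustration, a simplified, self-contained sketch (hypothetical types, not the real Master internals) of what the spread-out branch does when spark.deploy.spreadOut is enabled: it hands out one core at a time, round-robin, across all usable workers, so executors land on many nodes and data spread across the cluster is more likely to be local:

    case class WorkerInfo(id: String, coresFree: Int)

    def spreadOut(coresLeft: Int, workers: Seq[WorkerInfo]): Map[String, Int] = {
      val usable   = workers.filter(_.coresFree > 0).toArray
      val assigned = Array.fill(usable.length)(0)
      var toAssign = math.min(coresLeft, usable.map(_.coresFree).sum)
      var pos      = 0
      while (toAssign > 0) {
        // Assign one core if this worker still has free capacity, then
        // move on to the next worker round-robin.
        if (usable(pos).coresFree - assigned(pos) > 0) {
          assigned(pos) += 1
          toAssign -= 1
        }
        pos = (pos + 1) % usable.length
      }
      usable.map(_.id).zip(assigned).toMap
    }

    // spreadOut(4, Seq(WorkerInfo("w1", 4), WorkerInfo("w2", 4)))
    // => Map(w1 -> 2, w2 -> 2), instead of packing all 4 cores onto w1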
Re: Re: I think I am almost lost in the internals of Spark
A good starting point if you read Chinese:
https://github.com/JerryLead/SparkInternals/tree/master/markdown

2015-01-07 10:13 GMT+08:00 bit1...@163.com :

> Thank you, Tobias. I will look into the Spark paper. But it looks like
> the paper has been moved:
> http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
> A "Resource not found" page is returned when I access it.
>
> --
> bit1...@163.com
>
>
> From: Tobias Pfeiffer
> Date: 2015-01-07 09:24
> To: Todd
> CC: user
> Subject: Re: I think I am almost lost in the internals of Spark
> Hi,
>
> On Tue, Jan 6, 2015 at 11:24 PM, Todd wrote:
>
>> I am a bit new to Spark, apart from trying simple things like word
>> count and the examples given in the Spark SQL programming guide.
>> Now I am investigating the internals of Spark, but I think I am almost
>> lost, because I cannot grasp a whole picture of what Spark does when it
>> executes the word count.
>
> I recommend understanding what an RDD is and how it is processed, using
> http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
> and probably also
> http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
> (once the server is back).
> Understanding how an RDD is processed is probably the most helpful way to
> understand the whole of Spark.
>
> Tobias

-- 王海华
Re: how to set the log level of Spark executors on YARN (using yarn-cluster mode)
Thanks for your reply! Sorry, I forgot to mention that the Spark version I'm using is 1.0.2 instead of 1.1.0. Also, the 1.0.2 documentation does not seem to be the same as 1.1.0's:
http://spark.apache.org/docs/1.0.2/running-on-yarn.html

I tried your suggestion (uploading log4j.properties) but it did not work:

1. I set my copy of log4j.properties like:
log4j.rootCategory=DEBUG, console

2. I uploaded it when using the spark-submit script:
./bin/spark-submit --class edu.bjut.spark.SparkPageRank --master yarn-cluster --num-executors 5 --executor-memory 2g --executor-cores 1 /data/hadoopspark/MySparkTest.jar hdfs://master:8000/srcdata/searchengine/* 5 5 hdfs://master:8000/resultdata/searchengine/2014102001/ --files log4j.properties

So please point out my mistake; any suggestion would be welcome. Thanks!

2014-10-16 9:45 GMT+08:00 Marcelo Vanzin :

> Hi Eric,
>
> Check the "Debugging Your Application" section at:
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> Long story short: upload your log4j.properties using the "--files"
> argument of spark-submit.
>
> (Mental note: we could make the log level configurable via a system
> property...)
>
> On Wed, Oct 15, 2014 at 5:58 PM, eric wong wrote:
> > Hi,
> >
> > I want to check the DEBUG log of Spark executors on YARN (using
> > yarn-cluster mode), but
> >
> > 1. yarn daemonlog setlevel DEBUG YarnChild.class
> > 2. setting log4j.properties in the spark/conf folder on the client node
> >
> > Neither of the approaches above works.
> >
> > So how can I set the log level of Spark executors in YARN containers to
> > DEBUG?
> >
> > Thanks!
> >
> > --
> > Wang Haihua
>
> --
> Marcelo

-- 王海华
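One likely culprit, offered as a hedged guess: spark-submit treats everything after the application JAR as arguments to the application itself, so a --files flag placed after /data/hadoopspark/MySparkTest.jar is never seen by spark-submit at all. A sketch of the same command with the flag moved before the JAR (paths and arguments unchanged):

    ./bin/spark-submit \
      --class edu.bjut.spark.SparkPageRank \
      --master yarn-cluster \
      --num-executors 5 \
      --executor-memory 2g \
      --executor-cores 1 \
      --files log4j.properties \
      /data/hadoopspark/MySparkTest.jar \
      hdfs://master:8000/srcdata/searchengine/* 5 5 \
      hdfs://master:8000/resultdata/searchengine/2014102001/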
how to submit multiple jar files when using the spark-submit script?
Hi,

I am using the comma-separated style to submit multiple jar files with the shell command below, but it does not work:

bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans --master yarn-cluster --executor-memory 2g --jars lib/spark-examples-1.0.2-hadoop2.2.0.jar,lib/spark-mllib_2.10-1.0.0.jar hdfs://master:8000/srcdata/kmeans 8 4

Thanks!

--
Wang Haihua
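As a hedged observation: spark-submit expects the JAR containing the main class as the first positional (non-option) argument, while --jars is only for extra dependency JARs (comma-separated, no spaces). In the command above the first positional argument is the HDFS input path, so there is no primary application JAR. A sketch of one way to restructure it, assuming JavaKMeans lives in the examples JAR:

    bin/spark-submit \
      --class org.apache.spark.examples.mllib.JavaKMeans \
      --master yarn-cluster \
      --executor-memory 2g \
      --jars lib/spark-mllib_2.10-1.0.0.jar \
      lib/spark-examples-1.0.2-hadoop2.2.0.jar \
      hdfs://master:8000/srcdata/kmeans 8 4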
how to set the log level of Spark executors on YARN (using yarn-cluster mode)
Hi,

I want to check the DEBUG log of Spark executors on YARN (using yarn-cluster mode), but:

1. yarn daemonlog setlevel DEBUG YarnChild.class
2. setting log4j.properties in the spark/conf folder on the client node

Neither of the approaches above works.

So how can I set the log level of Spark executors in YARN containers to DEBUG?

Thanks!

--
Wang Haihua