The usage of OpenBLAS
Hi, I found out that the instructions for OpenBLAS have been changed by the author of netlib-java in https://github.com/apache/spark/pull/4448, in effect since Spark 1.3.0. In that PR, I asked whether there is still a need to compile OpenBLAS with USE_THREAD=0, and also about Intel MKL. Is that advice still applicable, or is it no longer the case? Thanks, Liming
Issues building 1.4.0 using make-distribution
Hi, I downloaded the source from the Downloads page and ran the make-distribution.sh script:

# ./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

The script has "-x" set at the beginning, so here is the relevant part of the trace:

++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.version -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
++ grep -v INFO
++ tail -n 1
+ VERSION='[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=scala.binary.version -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
++ grep -v INFO
++ tail -n 1
+ SCALA_VERSION='[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=hadoop.version -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
++ grep -v INFO
++ tail -n 1
…
+ TARDIR_NAME='spark-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
+ TARDIR='/tmp/a/spark-1.4.0/spark-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
+ rm -rf '/tmp/a/spark-1.4.0/spark-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
+ cp -r /tmp/a/spark-1.4.0/dist '/tmp/a/spark-1.4.0/spark-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
cp: cannot create directory `/tmp/a/spark-1.4.0/spark-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin': No such file or directory

The mvn help:evaluate output apparently contains [WARNING] lines that the grep -v INFO filter does not remove, so VERSION, SCALA_VERSION and the tarball name end up holding the warning text instead of the actual version strings. The dist directory itself seems complete and does work. Thanks, Liming
Documentation for external shuffle service in 1.4.0
Hi, I can't seem to find any documentation on this feature in 1.4.0. Is there any? Regards, Liming
Re: Not getting event logs >= Spark 1.3.1
Forgot to mention this is on standalone mode. Is my configuration wrong? Thanks, Liming

On 15 Jun, 2015, at 11:26 pm, Tsai Li Ming mailingl...@ltsai.com wrote:

Hi, I have this in my spark-defaults.conf (same for hdfs):

spark.eventLog.enabled true
spark.eventLog.dir file:/tmp/spark-events
spark.history.fs.logDirectory file:/tmp/spark-events

While the app is running, there is a ".inprogress" directory. However, when the job completes, the directory is always empty. I'm submitting the job like this, using either the Pi or word count examples:

$ bin/spark-submit /opt/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/wordcount.py

This used to work in 1.2.1; I didn't test 1.3.0. Regards, Liming
Not getting event logs >= Spark 1.3.1
Hi, I have this in my spark-defaults.conf (same for hdfs):

spark.eventLog.enabled true
spark.eventLog.dir file:/tmp/spark-events
spark.history.fs.logDirectory file:/tmp/spark-events

While the app is running, there is a ".inprogress" directory. However, when the job completes, the directory is always empty. I'm submitting the job like this, using either the Pi or word count examples:

$ bin/spark-submit /opt/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/wordcount.py

This used to work in 1.2.1; I didn't test 1.3.0. Regards, Liming
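For reference, a minimal sketch of the same event-log configuration set programmatically in a small driver program against the 1.x API (the directory is the same placeholder path as above and must exist on the node writing the log). One thing worth noting is that the in-progress log is only finalized when the SparkContext is stopped cleanly:

  import org.apache.spark.{SparkConf, SparkContext}

  // Driver-side equivalent of the spark-defaults.conf entries above (illustrative only).
  val conf = new SparkConf()
    .setAppName("EventLogTest")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "file:/tmp/spark-events")
  val sc = new SparkContext(conf)
  sc.parallelize(1 to 1000).count()
  sc.stop()  // the .inprogress log is renamed/finalized on a clean stop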
Re: Logstash as a source?
I have been using a logstash alternative, fluentd, to ingest the data into HDFS. I had to configure fluentd not to append to existing files, so that Spark Streaming is able to pick up the new logs. -Liming

On 2 Feb, 2015, at 6:05 am, NORD SC jan.algermis...@nordsc.com wrote:

Hi, I plan to have logstash send log events (as key value pairs) to Spark Streaming, using Spark on Cassandra. Being completely fresh to Spark, I have a couple of questions:

- Is that a good idea at all, or would it be better to put e.g. Kafka in between to handle traffic peaks (IOW: how, and how well, would Spark Streaming handle peaks?)
- Is there already a logstash-source implementation for Spark Streaming?
- Assuming there is none yet and assuming it is a good idea: I'd dive into writing it myself - what would the core advice be to avoid beginner traps?

Jan
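To make the file-based approach above concrete, here is a rough Spark Streaming sketch (the directory path and batch interval are invented for illustration). It monitors a directory for newly created files, which is exactly why appending to existing files does not work: textFileStream only sees files that appear after the stream starts.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Hypothetical ingest job: fluentd (or logstash) drops complete new files into
  // the directory; the stream picks each file up once, on creation.
  val conf = new SparkConf().setAppName("LogIngest")
  val ssc = new StreamingContext(conf, Seconds(30))
  val logs = ssc.textFileStream("hdfs:///logs/incoming")
  logs.filter(_.contains("ERROR")).count().print()
  ssc.start()
  ssc.awaitTermination()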
Re: Confused why I'm losing workers/executors when writing a large file to S3
I'm getting the same issue on Spark 1.2.0. Despite having set "spark.core.connection.ack.wait.timeout" in spark-defaults.conf and having verified it in the job UI (port 4040) environment tab, I still get the "no heartbeat in 60 seconds" error.

spark.core.connection.ack.wait.timeout=3600

15/01/22 07:29:36 WARN master.Master: Removing worker-20150121231529-numaq1-4-34948 because we got no heartbeat in 60 seconds

On 14 Nov, 2014, at 3:04 pm, Reynold Xin r...@databricks.com wrote:

Darin, You might want to increase these config options also:

spark.akka.timeout 300
spark.storage.blockManagerSlaveTimeoutMs 30

On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote:

For one of my Spark jobs, my workers/executors are dying and leaving the cluster. On the master, I see something like the following in the log file. I'm surprised to see the '60' seconds in the master log below because I explicitly set it to '600' (or so I thought) in my spark job (see below). This is happening at the end of my job when I'm trying to persist a large RDD (probably around 300+GB) back to S3 (in 256 partitions). My cluster consists of 6 r3.8xlarge machines. The job successfully works when I'm outputting 100GB or 200GB. If you have any thoughts/insights, it would be appreciated. Thanks. Darin.

Here is where I'm setting the 'timeout' in my spark job:

SparkConf conf = new SparkConf()
  .setAppName("SparkSync Application")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
  .set("spark.core.connection.ack.wait.timeout", "600");

On the master, I see the following in the log file:

14/11/13 17:20:39 WARN master.Master: Removing worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no heartbeat in 60 seconds
14/11/13 17:20:39 INFO master.Master: Removing worker worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on ip-10-35-184-232.ec2.internal:51877
14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2

On a worker, I see something like the following in the log file.
14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362) 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark 14/11/13 17:21:34 INFO
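For anyone hitting the same wall, a sketch of how the settings mentioned in this thread could be combined on the driver side (Scala rather than Java, and the values are purely illustrative; whether they help depends on why the heartbeats are missed in the first place). Note also that the master's "no heartbeat in 60 seconds" message is governed by the standalone master's own spark.worker.timeout, which is a separate knob from the ack timeout set below:

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative values only, collected from the suggestions in this thread.
  val conf = new SparkConf()
    .setAppName("SparkSync Application")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
    .set("spark.core.connection.ack.wait.timeout", "600")      // seconds
    .set("spark.akka.timeout", "300")                          // seconds
    .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // milliseconds (assumed value)
  val sc = new SparkContext(conf)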
Understanding stages in WebUI
Hi, I have the classic word count example: file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect() From the Job UI, I can only see 2 stages: 0-collect and 1-map. What happened to the ShuffledRDD from reduceByKey? And are both the flatMap and map operations collapsed into a single stage? 14/11/25 16:02:35 INFO SparkContext: Starting job: collect at console:15 14/11/25 16:02:35 INFO DAGScheduler: Registering RDD 6 (map at console:15) 14/11/25 16:02:35 INFO DAGScheduler: Got job 0 (collect at console:15) with 2 output partitions (allowLocal=false) 14/11/25 16:02:35 INFO DAGScheduler: Final stage: Stage 0(collect at console:15) 14/11/25 16:02:35 INFO DAGScheduler: Parents of final stage: List(Stage 1) 14/11/25 16:02:35 INFO DAGScheduler: Missing parents: List(Stage 1) 14/11/25 16:02:35 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[6] at map at console:15), which has no missing parents 14/11/25 16:02:35 INFO MemoryStore: ensureFreeSpace(3464) called with curMem=163705, maxMem=278302556 14/11/25 16:02:35 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.4 KB, free 265.3 MB) 14/11/25 16:02:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[6] at map at console:15) 14/11/25 16:02:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks 14/11/25 16:02:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1208 bytes) 14/11/25 16:02:35 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1208 bytes) 14/11/25 16:02:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 0) 14/11/25 16:02:35 INFO Executor: Running task 1.0 in stage 1.0 (TID 1) 14/11/25 16:02:35 INFO HadoopRDD: Input split: file:/Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md:0+2405 14/11/25 16:02:35 INFO HadoopRDD: Input split: file:/Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md:2405+2406 14/11/25 16:02:35 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/11/25 16:02:35 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/11/25 16:02:35 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/11/25 16:02:35 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/11/25 16:02:35 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 14/11/25 16:02:36 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 1869 bytes result sent to driver 14/11/25 16:02:36 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1).
1869 bytes result sent to driver 14/11/25 16:02:36 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 536 ms on localhost (1/2) 14/11/25 16:02:36 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 529 ms on localhost (2/2) 14/11/25 16:02:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 14/11/25 16:02:36 INFO DAGScheduler: Stage 1 (map at console:15) finished in 0.562 s 14/11/25 16:02:36 INFO DAGScheduler: looking for newly runnable stages 14/11/25 16:02:36 INFO DAGScheduler: running: Set() 14/11/25 16:02:36 INFO DAGScheduler: waiting: Set(Stage 0) 14/11/25 16:02:36 INFO DAGScheduler: failed: Set() 14/11/25 16:02:36 INFO DAGScheduler: Missing parents for Stage 0: List() 14/11/25 16:02:36 INFO DAGScheduler: Submitting Stage 0 (ShuffledRDD[7] at reduceByKey at console:15), which is now runnable 14/11/25 16:02:36 INFO MemoryStore: ensureFreeSpace(2112) called with curMem=167169, maxMem=278302556 14/11/25 16:02:36 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.1 KB, free 265.2 MB) 14/11/25 16:02:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ShuffledRDD[7] at reduceByKey at console:15) 14/11/25 16:02:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/11/25 16:02:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 948 bytes) 14/11/25 16:02:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 948 bytes) 14/11/25 16:02:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 2) 14/11/25 16:02:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 3) 14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 5 ms 14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0
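As far as I can tell from the log above, the UI names each stage after the call site of the last RDD in that stage, which is why the shuffle-map stage (flatMap and map pipelined together, ending in MappedRDD) shows up as "map" and the result stage (the ShuffledRDD produced by reduceByKey) shows up under the action, "collect". A convenient way to see the same structure without the web UI is toDebugString; a minimal sketch for spark-shell:

  // Same pipeline as above; the README.md path is whatever file you loaded.
  val counts = sc.textFile("README.md")
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Narrow transformations (flatMap, map) are pipelined into one stage;
  // reduceByKey introduces the ShuffledRDD and hence the stage boundary.
  println(counts.toDebugString)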
RDD memory and storage level option
Hi, This is on version 1.1.0. I did a simple test of the MEMORY_AND_DISK storage level:

import org.apache.spark.storage.StorageLevel
var file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
file.count()

The file is 1.5GB and there is only 1 worker. I have requested 1GB of worker memory per node:

ID: app-20141120193912-0002 | Name: Spark shell | Cores: 64 | Memory per Node: 1024.0 MB | Submitted Time: 2014/11/20 19:39:12 | User: root | State: RUNNING | Duration: 6.0 min

After doing a simple count, the job web UI indicates the entire file is saved on disk:

RDD Name: file:///path/to/file.txt | Storage Level: Disk Serialized 1x Replicated | Cached Partitions: 46 | Fraction Cached: 100% | Size in Memory: 0.0 B | Size in Tachyon: 0.0 B | Size on Disk: 1476.5 MB

1. Shouldn't some partitions be saved into memory?

2. If I run with the MEMORY_ONLY option, I can get some partitions into memory, but according to the executor page only 220.6 MB of 530.3 MB is used, so the available memory is not fully used up? Each partition is about 73MB.

RDD Name: file:///path/to/file.txt | Storage Level: Memory Deserialized 1x Replicated | Cached Partitions: 3 | Fraction Cached: 7% | Size in Memory: 220.6 MB | Size in Tachyon: 0.0 B | Size on Disk: 0.0 B

Executor ID: 0 | Address: foo.co:48660 | RDD Blocks: 3 | Memory Used: 220.6 MB / 530.3 MB | Disk Used: 0.0 B | Active Tasks: 0 | Failed Tasks: 0 | Complete Tasks: 46 | Total Tasks: 46 | Task Time: 14.2 m | Input: 1457.4 MB | Shuffle Read: 0.0 B | Shuffle Write: 0.0 B

14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 29833 ms on foo.co (43/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 31502 ms on foo.co (44/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) in 31651 ms on foo.co (45/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 31782 ms on foo.co (46/46)
14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at console:16) finished in 31.818 s
14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/20 19:53:24 INFO SparkContext: Job finished: count at console:16, took 31.926585742 s
res0: Long = 1000

Is this correct?

3. I can't seem to work out the math to derive the 530.3 MB made available in the executor: 1024 MB * memoryFraction (0.6) = 614.4 MB.

Thanks!
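On question 3, a back-of-the-envelope reading of the 1.1.0 defaults (my own reconstruction, so treat it as an assumption): the storage pool is sized from the JVM's reported max heap, which is a bit below the requested -Xmx, and multiplied by both spark.storage.memoryFraction (0.6) and spark.storage.safetyFraction (0.9), not by 0.6 alone:

  // Rough reconstruction of how the 1.1.x block manager sizes its storage pool.
  val maxHeap = Runtime.getRuntime.maxMemory        // roughly 982 MB for -Xmx1024m
  val storagePool = (maxHeap * 0.6 * 0.9).toLong    // roughly 530 MB, matching the executor page
  println(s"storage pool = ${storagePool / (1024 * 1024)} MB")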
Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?
Another observation I had was when reading over the local filesystem with "file://": it was reported as PROCESS_LOCAL, which was confusing. Regards, Liming

On 13 Sep, 2014, at 3:12 am, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Andrew, This email was pretty helpful. I feel like this stuff should be summarized in the docs somewhere, or perhaps in a blog post. Do you know if it is? Nick

On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:

The locality is how close the data is to the code that's processing it. PROCESS_LOCAL means data is in the same JVM as the code that's running, so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same node, or in another executor on the same node, so it is a little slower because the data has to travel across an IPC connection. RACK_LOCAL is even slower -- data is on a different server so it needs to be sent over the network.

Spark switches to lower locality levels when there's no unprocessed data on a node that has idle CPUs. In that situation you have two options: wait until the busy CPUs free up so you can start another task that uses data on that server, or start a new task on a farther away server that needs to bring data from that remote place. What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU.

The main tunable option is how long the scheduler waits before starting to move data rather than code. Those are the spark.locality.* settings here: http://spark.apache.org/docs/latest/configuration.html If you want to prevent this from happening entirely, you can set the values to ridiculously high numbers. The documentation also mentions that 0 has special meaning, so you can try that as well. Good luck! Andrew

On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote:

I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. When these happen things get extremely slow. Does this mean that the executor got terminated and restarted? Is there a way to prevent this from happening (barring the machine actually going down, I'd rather stick with the same process)?
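A sketch of the knob Andrew describes, set from the driver before the context is created (the value is deliberately extreme to illustrate the "never fall back" end of the spectrum, and is written in milliseconds as in the 1.x documentation):

  import org.apache.spark.{SparkConf, SparkContext}

  // spark.locality.wait is how long the scheduler waits for a free slot at the
  // current locality level before dropping down a level; per-level overrides
  // also exist (spark.locality.wait.process / .node / .rack).
  val conf = new SparkConf()
    .setAppName("LocalityDemo")
    .set("spark.locality.wait", "600000")  // 10 minutes, illustrative only
  val sc = new SparkContext(conf)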
Re: Hadoop LR comparison
Thanks. What would be the equivalent Hadoop code behind the 110s vs 0.9s comparison that Spark published?

On 1 Apr, 2014, at 2:44 pm, DB Tsai dbt...@alpinenow.com wrote:

Hi Li-Ming, This binary logistic regression using SGD is in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala We're working on multinomial logistic regression using Newton and L-BFGS optimizers now; it will be released soon. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/

On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming mailingl...@ltsai.com wrote:

Hi, Is the Hadoop code that calculates the logistic regression hyperplane available? I'm looking at the examples page: http://spark.apache.org/examples.html, where there is the 110s vs 0.9s Hadoop vs Spark comparison. Thanks!
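For anyone who wants to run the Spark side of that comparison with the library implementation DB Tsai links to, rather than the hand-rolled example, something along these lines should work against the 1.x MLlib API (the input path is a placeholder, and 100 iterations is an arbitrary choice):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.util.MLUtils

  // Load a LabeledPoint dataset in LIBSVM format and train binary LR with SGD.
  val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/lr_sample.libsvm").cache()
  val model = LogisticRegressionWithSGD.train(data, 100)  // 100 iterations
  println(s"weights: ${model.weights}")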
Re: Configuring shuffle write directory
Hi, Thanks! I found out that I wasn't setting SPARK_JAVA_OPTS correctly. I took a look at the process table and saw that the "org.apache.spark.executor.CoarseGrainedExecutorBackend" process didn't have -Dspark.local.dir set.

On 28 Mar, 2014, at 1:05 pm, Matei Zaharia matei.zaha...@gmail.com wrote:

I see, are you sure that was in spark-env.sh instead of spark-env.sh.template? You need to copy it to just a .sh file. Also make sure the file is executable. Try doing println(sc.getConf.toDebugString) in your driver program and seeing what properties it prints. As far as I can tell, spark.local.dir should *not* be set there, so workers should get it from their spark-env.sh. It's true that if you set spark.local.dir in the driver it would pass that on to the workers for that job. Matei

On Mar 27, 2014, at 9:57 PM, Tsai Li Ming mailingl...@ltsai.com wrote:

Yes, I have tried that by adding it to the Worker. I can see the app-20140328124540-000 directory in the local spark directory of the worker. But the "spark-local" directories are always written to /tmp, since the default spark.local.dir is taken from java.io.tmpdir?

On 28 Mar, 2014, at 12:42 pm, Matei Zaharia matei.zaha...@gmail.com wrote:

Yes, the problem is that the driver program is overriding it. Have you set it manually in the driver? Or how did you try setting it in the workers? You should set it by adding export SPARK_JAVA_OPTS="-Dspark.local.dir=whatever" to conf/spark-env.sh on those workers. Matei

On Mar 27, 2014, at 9:04 PM, Tsai Li Ming mailingl...@ltsai.com wrote:

Can anyone help? How can I configure a different spark.local.dir for each executor?

On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote:

Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always written to /tmp despite it being set when the worker node is started. By specifying spark.local.dir for the driver program, it seems to override the executors? Is there a way to properly define it on the worker node? Thanks!
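Following Matei's suggestion, a quick way to double-check from the driver which settings will actually be shipped to the executors (a small sketch for spark-shell; getOption is used so an unset key does not throw):

  // All properties the driver will pass along to executors for this job;
  // spark.local.dir should be absent here if the workers are meant to supply it.
  println(sc.getConf.toDebugString)
  println(sc.getConf.getOption("spark.local.dir"))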
Re: Configuring shuffle write directory
Can anyone help? How can I configure a different spark.local.dir for each executor?

On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote:

Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always written to /tmp despite it being set when the worker node is started. By specifying spark.local.dir for the driver program, it seems to override the executors? Is there a way to properly define it on the worker node? Thanks!
Setting SPARK_MEM higher than available memory in driver
Hi, My worker nodes have more memory than the host I'm submitting my driver program from, but it seems that SPARK_MEM also sets the -Xmx of the spark-shell?

$ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x7f736e13, 205634994176, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 205634994176 bytes for committing reserved memory.

I want to allocate at least 100GB of memory per executor, but the memory allocated on the executors seems to depend on the -Xmx heap size of the driver? Thanks!
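What I was after, expressed as a driver-side sketch: ask for large executors without inflating the driver's own heap. The property name is from the 0.9/1.x configuration docs; whether SPARK_MEM still overrides it in a given release is exactly the open question above:

  import org.apache.spark.{SparkConf, SparkContext}

  // spark.executor.memory sizes only the executor JVMs; the driver/shell heap
  // stays at whatever -Xmx the submitting host can actually afford.
  val conf = new SparkConf()
    .setMaster("spark://XXX:7077")
    .setAppName("BigExecutors")
    .set("spark.executor.memory", "100g")
  val sc = new SparkContext(conf)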
Re: Kmeans example reduceByKey slow
Hi, This is on a 4-node cluster, each node with 32 cores/256GB RAM. Spark 0.9.0 is deployed in standalone mode. Each worker is configured with 192GB, and the Spark executor memory is also 192GB. This is on the first iteration, with K=50. Here's the code I use: http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example. Thanks!

On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng men...@gmail.com wrote:

Hi Tsai, Could you share more information about the machine you used and the training parameters (runs, k, and iterations)? It can help solve your issues. Thanks! Best, Xiangrui

On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:

Hi, At the reduceByKey stage, it takes a few minutes before the tasks start working. I have -Dspark.default.parallelism=127 (total cores minus 1). CPU/network/IO are idle across all nodes while this is happening, and there is nothing in particular in the master log file. From the spark-shell:

14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms

But it stops there for a significant time before any movement. In the stage detail of the UI, I can see that there are 127 tasks running, but the duration of each is at least a few minutes. I'm working off local storage (not HDFS) and the kmeans data is about 6.5GB (50M rows). Is this normal behaviour? Thanks!
Re: Kmeans example reduceByKey slow
Thanks, let me try with a smaller K. Does the size of the input data matter for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark?

On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote:

K = 50 is certainly a large number for k-means. If there is no particular reason to have 50 clusters, could you try to reduce it to, e.g., 100 or 1000? Also, the example code is not for large-scale problems. You should use the KMeans algorithm in mllib clustering for your problem. -Xiangrui

On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming mailingl...@ltsai.com wrote:

Hi, This is on a 4-node cluster, each node with 32 cores/256GB RAM. Spark 0.9.0 is deployed in standalone mode. Each worker is configured with 192GB, and the Spark executor memory is also 192GB. This is on the first iteration, with K=50. Here's the code I use: http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example. Thanks!

On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng men...@gmail.com wrote:

Hi Tsai, Could you share more information about the machine you used and the training parameters (runs, k, and iterations)? It can help solve your issues. Thanks! Best, Xiangrui

On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:

Hi, At the reduceByKey stage, it takes a few minutes before the tasks start working. I have -Dspark.default.parallelism=127 (total cores minus 1). CPU/network/IO are idle across all nodes while this is happening, and there is nothing in particular in the master log file. From the spark-shell:

14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms

But it stops there for a significant time before any movement. In the stage detail of the UI, I can see that there are 127 tasks running, but the duration of each is at least a few minutes. I'm working off local storage (not HDFS) and the kmeans data is about 6.5GB (50M rows). Is this normal behaviour? Thanks!
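As suggested, here is roughly what the MLlib entry point looks like instead of the example script; a sketch written against the 1.x MLlib API (the file path is made up, and k/iterations are whatever you want to test):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Parse the same space-separated dense input and run MLlib KMeans on it.
  val points = sc.textFile("file:///data/kmeans_50m.txt")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()
  val model = KMeans.train(points, 50, 10)  // k = 50, maxIterations = 10
  println(s"cost: ${model.computeCost(points)}")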
Configuring shuffle write directory
Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always written to /tmp despite it being set when the worker node is started. By specifying spark.local.dir for the driver program, it seems to override the executors? Is there a way to properly define it on the worker node? Thanks!
Spark temp dir (spark.local.dir)
Hi, I'm confused about -Dspark.local.dir and SPARK_WORKER_DIR (--work-dir). What's the difference? I have set -Dspark.local.dir for all my worker nodes, but I'm still seeing directories being created in /tmp while the job is running. I have also tried setting -Dspark.local.dir when I run the application. Thanks!
Re: Spark temp dir (spark.local.dir)
spark.local.dir can and should be set both on the executors and on the driver (if the driver broadcasts variables, the files will be stored in this directory)

Do you mean the worker nodes? I don't think these are the jetty connector files, and the directories are empty:

/tmp/spark-3e330cdc-7540-4313-9f32-9fa109935f17/jars
/tmp/spark-3e330cdc-7540-4313-9f32-9fa109935f17/files

I run the application like this, even with java.io.tmpdir set:

bin/run-example -Dspark.executor.memory=14g -Dspark.local.dir=/mnt/storage1/lm -Djava.io.tmpdir=/mnt/storage1/lm org.apache.spark.examples.SparkLR spark://oct1:7077 10

On 13 Mar, 2014, at 5:33 pm, Guillaume Pitel guillaume.pi...@exensa.com wrote:

Also, I think the jetty connector will create a small file or directory in /tmp regardless of the spark.local.dir. It's very small, about 10KB. Guillaume

I'm not 100% sure, but I think it goes like this: spark.local.dir can and should be set both on the executors and on the driver (if the driver broadcasts variables, the files will be stored in this directory). The SPARK_WORKER_DIR is where the jars and the log output of the executors are placed (default $SPARK_HOME/work/), and it should be cleaned regularly. In $SPARK_HOME/logs are found the logs of the workers and the master. Guillaume

Hi, I'm confused about -Dspark.local.dir and SPARK_WORKER_DIR (--work-dir). What's the difference? I have set -Dspark.local.dir for all my worker nodes, but I'm still seeing directories being created in /tmp while the job is running. I have also tried setting -Dspark.local.dir when I run the application. Thanks!

--
Guillaume PITEL, Président
+33(0)6 25 48 86 80
eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
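To make the override behaviour described above concrete, a small driver-side sketch (directory path invented). Setting the property in the driver is exactly what ships it to the executors for that job, so leaving it unset lets each worker's own SPARK_JAVA_OPTS / spark-env.sh value win:

  import org.apache.spark.{SparkConf, SparkContext}

  // Variant A: the driver says nothing about spark.local.dir, so each executor
  // falls back to the value configured on its own worker node.
  val conf = new SparkConf().setAppName("ScratchDirDemo")

  // Variant B (the accidental override): the driver sets it, and this single
  // path is passed to every executor for the job, shadowing the per-node settings.
  // conf.set("spark.local.dir", "/mnt/storage1/lm")

  val sc = new SparkContext(conf)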