The usage of OpenBLAS

2015-06-26 Thread Tsai Li Ming
Hi,

I found out that the instructions for OpenBLAS were changed by the author of 
netlib-java in https://github.com/apache/spark/pull/4448, starting with Spark 1.3.0.

In that PR, I asked whether there’s still a need to compile OpenBLAS with 
USE_THREAD=0, and also about Intel MKL.

Is that advice still applicable, or is it no longer the case?

Thanks,
Liming





Issues building 1.4.0 using make-distribution

2015-06-17 Thread Tsai Li Ming
Hi,

I downloaded the source from the Downloads page and ran the make-distribution.sh 
script.

# ./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests 
clean package

The script has “-x” set at the beginning, so the trace below shows each command being run.

++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.version 
-Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
++ grep -v INFO
++ tail -n 1
+ VERSION='[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=scala.binary.version 
-Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
++ grep -v INFO
++ tail -n 1
+ SCALA_VERSION='[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=hadoop.version 
-Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
++ grep -v INFO
++ tail -n 1

…

+ TARDIR_NAME='spark-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
+ TARDIR='/tmp/a/spark-1.4.0/spark-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
+ rm -rf '/tmp/a/spark-1.4.0/spark-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
+ cp -r /tmp/a/spark-1.4.0/dist '/tmp/a/spark-1.4.0/spark-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin'
cp: cannot create directory `/tmp/a/spark-1.4.0/spark-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin-bin-[WARNING] See 
http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin': No such file or 
directory


The dist directory itself seems complete and does work; only the final copy into the release directory fails.
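
Looking at the trace, the [WARNING] line emitted by the maven-shade-plugin slips past the "grep -v INFO" filter and ends up as the value of VERSION. A possible local workaround (untested sketch; the invocation roughly mirrors the one in the trace) is to widen the filter when evaluating project.version:

  # drop WARNING lines as well as INFO lines before taking the last line of output
  VERSION=$(build/mvn help:evaluate -Dexpression=project.version \
      -Phadoop-2.6 -Dhadoop.version=2.6.0 2>/dev/null \
    | grep -v "INFO" | grep -v "WARNING" | tail -n 1)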

Thanks,
Liming






Documentation for external shuffle service in 1.4.0

2015-06-17 Thread Tsai Li Ming
Hi,

I can’t seem to find any documentation on this feature in 1.4.0. Is there any?
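
The keys below are, as far as I can tell, the ones that enable the external shuffle service, usually together with dynamic allocation; an untested sketch, not official documentation:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // let executors hand their shuffle files over to an external shuffle service
    .set("spark.shuffle.service.enabled", "true")
    // the service is mainly useful together with dynamic executor allocation
    .set("spark.dynamicAllocation.enabled", "true")

I believe the standalone Worker starts the service itself when the same flag is set in its configuration, but I have not verified this for 1.4.0.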

Regards,
Liming





Re: Not getting event logs >= spark 1.3.1

2015-06-16 Thread Tsai Li Ming
Forgot to mention this is on standalone mode.

Is my configuration wrong?

Thanks,
Liming

On 15 Jun, 2015, at 11:26 pm, Tsai Li Ming mailingl...@ltsai.com wrote:

 Hi,
 
 I have this in my spark-defaults.conf (same for hdfs):
 spark.eventLog.enabled  true
 spark.eventLog.dir  file:/tmp/spark-events
 spark.history.fs.logDirectory   file:/tmp/spark-events
 
 While the app is running, there is a “.inprogress” directory. However when 
 the job completes, the directory is always empty.
 
 I’m submitting the job like this, using either the Pi or word count examples:
 $ bin/spark-submit 
 /opt/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/wordcount.py 
 
 This used to work in 1.2.1; I didn’t test 1.3.0.
 
 
 Regards,
 Liming
 
 
 
 
 
 
 





Not getting event logs >= spark 1.3.1

2015-06-15 Thread Tsai Li Ming
Hi,

I have this in my spark-defaults.conf (same for hdfs):
spark.eventLog.enabled  true
spark.eventLog.dir  file:/tmp/spark-events
spark.history.fs.logDirectory   file:/tmp/spark-events

While the app is running, there is a “.inprogress” directory. However when the 
job completes, the directory is always empty.

I’m submitting the job like this, using either the Pi or word count examples:
$ bin/spark-submit 
/opt/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/wordcount.py 

This used to work in 1.2.1; I didn’t test 1.3.0.
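
One thing that may be worth checking (a guess, not a confirmed fix): whether the submitted application actually picked up these settings, and whether its SparkContext is stopped cleanly, since as far as I understand the “.inprogress” log is only finalized when the context stops. From spark-shell:

  // print the effective event-log settings the running application sees
  println(sc.getConf.getOption("spark.eventLog.enabled"))
  println(sc.getConf.getOption("spark.eventLog.dir"))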


Regards,
Liming









Re: Logstash as a source?

2015-02-01 Thread Tsai Li Ming
I have been using a logstash alternative, fluentd, to ingest the data into HDFS.

I had to configure fluentd not to append to existing files, so that Spark Streaming 
can pick up the new logs.
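
For reference, a minimal sketch of that file-based setup (the path and batch interval are made up; textFileStream only picks up files newly created in the directory, which is why appending to existing files does not work):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("FluentdLogIngest")
  val ssc = new StreamingContext(conf, Seconds(30))
  // every file fluentd rolls into this directory becomes part of the next batch
  val logs = ssc.textFileStream("hdfs:///logs/fluentd/")
  logs.count().print()
  ssc.start()
  ssc.awaitTermination()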

-Liming


On 2 Feb, 2015, at 6:05 am, NORD SC jan.algermis...@nordsc.com wrote:

 Hi,
 
 I plan to have logstash send log events (as key value pairs) to spark 
 streaming using Spark on Cassandra.
 
 Being completely fresh to Spark, I have a couple of questions:
 
 - is that a good idea at all, or would it be better to put e.g. Kafka in 
 between to handle traffic peaks?
  (IOW: how, and how well, would Spark Streaming handle peaks?)
 
 - Is there already a logstash-source implementation for Spark Streaming 
 
 - assuming there is none yet and assuming it is a good idea: I’d dive into 
 writing it myself - what would the core advice be to avoid beginner traps?
 
 Jan
 





Re: Confused why I'm losing workers/executors when writing a large file to S3

2015-01-21 Thread Tsai Li Ming
I’m getting the same issue on Spark 1.2.0. Despite having set 
“spark.core.connection.ack.wait.timeout” in spark-defaults.conf and verified it in 
the job UI (port 4040) environment tab, I still get the “no heartbeat in 60 
seconds” error. 

spark.core.connection.ack.wait.timeout=3600

15/01/22 07:29:36 WARN master.Master: Removing 
worker-20150121231529-numaq1-4-34948 because we got no heartbeat in 60 seconds
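
As far as I can tell, the 60 seconds in that Master log line comes from a different setting, spark.worker.timeout, which is read by the standalone Master rather than by the application; spark.core.connection.ack.wait.timeout only covers connection acks between executors. An untested sketch for the master side:

  # in conf/spark-env.sh on the standalone master; spark.worker.timeout is in seconds
  export SPARK_MASTER_OPTS="-Dspark.worker.timeout=600"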


On 14 Nov, 2014, at 3:04 pm, Reynold Xin r...@databricks.com wrote:

 Darin,
 
 You might want to increase these config options also:
 
 spark.akka.timeout 300
 spark.storage.blockManagerSlaveTimeoutMs 30
 
 On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath ddmcbe...@yahoo.com.invalid 
 wrote:
 For one of my Spark jobs, my workers/executors are dying and leaving the 
 cluster.
 
 On the master, I see something like the following in the log file.  I'm 
 surprised to see the '60' seconds in the master log below because I 
 explicitly set it to '600' (or so I thought) in my spark job (see below).   
 This is happening at the end of my job when I'm trying to persist a large RDD 
 (probably around 300+GB) back to S3 (in 256 partitions).  My cluster consists 
 of 6 r3.8xlarge machines.  The job successfully works when I'm outputting 
 100GB or 200GB.
 
 If  you have any thoughts/insights, it would be appreciated. 
 
 Thanks.
 
 Darin.
 
 Here is where I'm setting the 'timeout' in my spark job.
 
 SparkConf conf = new SparkConf()
   .setAppName("SparkSync Application")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .set("spark.rdd.compress", "true")
   .set("spark.core.connection.ack.wait.timeout", "600");
 On the master, I see the following in the log file.
 
 14/11/13 17:20:39 WARN master.Master: Removing 
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no 
 heartbeat in 60 seconds
 14/11/13 17:20:39 INFO master.Master: Removing worker 
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on 
 ip-10-35-184-232.ec2.internal:51877
 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2
 
 On a worker, I see something like the following in the log file.
 
 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO 

Understanding stages in WebUI

2014-11-25 Thread Tsai Li Ming
Hi,

I have the classic word count example:
 file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()

From the Job UI, I can only see 2 stages: 0-collect and 1-map.

What happened to the ShuffledRDD from reduceByKey? And why are both the flatMap and map 
operations collapsed into a single stage?
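
A quick way to see where the shuffle boundary sits is to print the RDD lineage before collecting (a small sketch; the exact output format varies by version):

  val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  // toDebugString shows a ShuffledRDD sitting on top of the pipelined
  // FlatMappedRDD/MappedRDD chain: narrow transformations like flatMap and map
  // are pipelined into one stage, and reduceByKey starts the next stage
  println(counts.toDebugString)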

14/11/25 16:02:35 INFO SparkContext: Starting job: collect at console:15
14/11/25 16:02:35 INFO DAGScheduler: Registering RDD 6 (map at console:15)
14/11/25 16:02:35 INFO DAGScheduler: Got job 0 (collect at console:15) with 2 
output partitions (allowLocal=false)
14/11/25 16:02:35 INFO DAGScheduler: Final stage: Stage 0(collect at 
console:15)
14/11/25 16:02:35 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/11/25 16:02:35 INFO DAGScheduler: Missing parents: List(Stage 1)
14/11/25 16:02:35 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[6] at map at 
console:15), which has no missing parents
14/11/25 16:02:35 INFO MemoryStore: ensureFreeSpace(3464) called with 
curMem=163705, maxMem=278302556
14/11/25 16:02:35 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 3.4 KB, free 265.3 MB)
14/11/25 16:02:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 
(MappedRDD[6] at map at console:15)
14/11/25 16:02:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/11/25 16:02:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, 
localhost, PROCESS_LOCAL, 1208 bytes)
14/11/25 16:02:35 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, 
localhost, PROCESS_LOCAL, 1208 bytes)
14/11/25 16:02:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
14/11/25 16:02:35 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
14/11/25 16:02:35 INFO HadoopRDD: Input split: 
file:/Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md:0+2405
14/11/25 16:02:35 INFO HadoopRDD: Input split: 
file:/Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md:2405+2406
14/11/25 16:02:35 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
14/11/25 16:02:35 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
14/11/25 16:02:35 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
14/11/25 16:02:35 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
14/11/25 16:02:35 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
14/11/25 16:02:36 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 1869 
bytes result sent to driver
14/11/25 16:02:36 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 1869 
bytes result sent to driver
14/11/25 16:02:36 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) 
in 536 ms on localhost (1/2)
14/11/25 16:02:36 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) 
in 529 ms on localhost (2/2)
14/11/25 16:02:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have 
all completed, from pool 
14/11/25 16:02:36 INFO DAGScheduler: Stage 1 (map at console:15) finished in 
0.562 s
14/11/25 16:02:36 INFO DAGScheduler: looking for newly runnable stages
14/11/25 16:02:36 INFO DAGScheduler: running: Set()
14/11/25 16:02:36 INFO DAGScheduler: waiting: Set(Stage 0)
14/11/25 16:02:36 INFO DAGScheduler: failed: Set()
14/11/25 16:02:36 INFO DAGScheduler: Missing parents for Stage 0: List()
14/11/25 16:02:36 INFO DAGScheduler: Submitting Stage 0 (ShuffledRDD[7] at 
reduceByKey at console:15), which is now runnable
14/11/25 16:02:36 INFO MemoryStore: ensureFreeSpace(2112) called with 
curMem=167169, maxMem=278302556
14/11/25 16:02:36 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 2.1 KB, free 265.2 MB)
14/11/25 16:02:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
(ShuffledRDD[7] at reduceByKey at console:15)
14/11/25 16:02:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/25 16:02:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, 
localhost, PROCESS_LOCAL, 948 bytes)
14/11/25 16:02:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 3, 
localhost, PROCESS_LOCAL, 948 bytes)
14/11/25 16:02:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 2)
14/11/25 16:02:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 3)
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 5 ms
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 

RDD memory and storage level option

2014-11-20 Thread Tsai Li Ming
Hi,

This is on version 1.1.0.

I did a simple test with the MEMORY_AND_DISK storage level.

 var file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
 file.count()

The file is 1.5GB and there is only 1 worker. I have requested 1GB of 
worker memory per node:

  
 ID: app-20141120193912-0002   Name: Spark shell   Cores: 64   Memory per Node: 1024.0 MB
 Submitted Time: 2014/11/20 19:39:12   User: root   State: RUNNING   Duration: 6.0 min

After doing a simple count, the job web UI indicates that the entire file is saved 
on disk:

 RDD Name: file:///path/to/file.txt   Storage Level: Disk Serialized 1x Replicated
 Cached Partitions: 46   Fraction Cached: 100%
 Size in Memory: 0.0 B   Size in Tachyon: 0.0 B   Size on Disk: 1476.5 MB
 
1. Shouldn’t some partitions be saved into memory? 




2. If I run with the MEMORY_ONLY option, some partitions are cached in memory, but 
according to the executor page (220.6 MB / 530.3 MB) there is still space left that 
was not used. Each partition is about 73 MB.

 RDD Name: file:///path/to/file.txt   Storage Level: Memory Deserialized 1x Replicated
 Cached Partitions: 3   Fraction Cached: 7%
 Size in Memory: 220.6 MB   Size in Tachyon: 0.0 B   Size on Disk: 0.0 B

 Executor ID: 0   Address: foo.co:48660   RDD Blocks: 3   Memory Used: 220.6 MB / 530.3 MB
 Disk Used: 0.0 B   Active Tasks: 0   Failed Tasks: 0   Complete Tasks: 46   Total Tasks: 46
 Task Time: 14.2 m   Input: 1457.4 MB   Shuffle Read: 0.0 B   Shuffle Write: 0.0 B

14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on 
foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) 
in 29833 ms on foo.co (43/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) 
in 31502 ms on foo.co (44/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) 
in 31651 ms on foo.co (45/46)
14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) 
in 31782 ms on foo.co (46/46)
14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at console:16) finished 
in 31.818 s
14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
14/11/20 19:53:24 INFO SparkContext: Job finished: count at console:16, took 
31.926585742 s
res0: Long = 1000

Is this correct?



3. I can’t seem to work out the math behind the 530 MB that is made available in 
the executor: 1024 MB * spark.storage.memoryFraction (0.6) = 614.4 MB.
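
My best guess at the arithmetic (assuming the 1.1.0 defaults spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9, and that the JVM reports a usable heap somewhat below -Xmx because one survivor space is excluded):

  // Runtime.getRuntime.maxMemory is roughly 981 MB for a 1024 MB -Xmx
  val storageMb = 981 * 0.6 * 0.9   // ≈ 530 MB, matching the executor page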

Thanks!








Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Tsai Li Ming
Another observation I had: when reading from the local filesystem with “file://”, the 
tasks were stated as PROCESS_LOCAL, which was confusing. 

Regards,
Liming

On 13 Sep, 2014, at 3:12 am, Nicholas Chammas nicholas.cham...@gmail.com 
wrote:

 Andrew,
 
 This email was pretty helpful. I feel like this stuff should be summarized in 
 the docs somewhere, or perhaps in a blog post.
 
 Do you know if it is?
 
 Nick
 
 
 On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 
 The main tunable option is how long the scheduler waits before starting 
 to move data rather than code.  Those are the spark.locality.* settings here: 
 http://spark.apache.org/docs/latest/configuration.html
 
 If you want to prevent this from happening entirely, you can set the values 
 to ridiculously high numbers.  The documentation also mentions that 0 has 
 special meaning, so you can try that as well.
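 
 A small sketch of those knobs as they would look in application code (values made up; in releases of this era they are plain milliseconds):
 
   import org.apache.spark.SparkConf
 
   val conf = new SparkConf()
     // generic delay before the scheduler gives up waiting for a more local slot
     .set("spark.locality.wait", "10000")
     // optional per-level overrides
     .set("spark.locality.wait.process", "10000")
     .set("spark.locality.wait.node", "10000")
     .set("spark.locality.wait.rack", "10000")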
 
 Good luck!
 Andrew
 
 
 On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu 
 wrote:
 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume 
 that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
 
 When these happen things get extremely slow.
 
 Does this mean that the executor got terminated and restarted?
 
 Is there a way to prevent this from happening (barring the machine actually 
 going down, I'd rather stick with the same process)?
 
 



Re: Hadoop LR comparison

2014-04-01 Thread Tsai Li Ming
Thanks.

What would the equivalent Hadoop code be, for the 110s vs 0.9s comparison that 
Spark published?


On 1 Apr, 2014, at 2:44 pm, DB Tsai dbt...@alpinenow.com wrote:

 Hi Li-Ming,
 
 This binary logistic regression using SGD is in 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
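 
 A minimal usage sketch of that class (untested; assumes a 1.x MLlib and the sample LIBSVM file shipped with the Spark distribution):
 
   import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
   import org.apache.spark.mllib.util.MLUtils
 
   // any LIBSVM-format data works; the path below is the bundled sample
   val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()
   val model = LogisticRegressionWithSGD.train(training, 100)   // 100 SGD iterations
   println(model.weights)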
 
 We're working on multinomial logistic regression using Newton and L-BFGS 
 optimizer now. Will be released soon.
 
 
 Sincerely,
 
 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/
 
 
 On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
 Hi,
 
 Is the code available for Hadoop to calculate the Logistic Regression 
 hyperplane?
 
 I’m looking at the Examples:
 http://spark.apache.org/examples.html,
 
 where there is the 110s vs 0.9s in Hadoop vs Spark comparison.
 
 Thanks!
 



Re: Configuring shuffle write directory

2014-03-28 Thread Tsai Li Ming

Hi,

Thanks! I found out that I wasn’t setting SPARK_JAVA_OPTS correctly.

I took a look at the process table and saw that the 
“org.apache.spark.executor.CoarseGrainedExecutorBackend” didn’t have the 
-Dspark.local.dir set.




On 28 Mar, 2014, at 1:05 pm, Matei Zaharia matei.zaha...@gmail.com wrote:

 I see, are you sure that was in spark-env.sh instead of 
 spark-env.sh.template? You need to copy it to just a .sh file. Also make sure 
 the file is executable.
 
 Try doing println(sc.getConf.toDebugString) in your driver program and seeing 
 what properties it prints. As far as I can tell, spark.local.dir should *not* 
 be set there, so workers should get it from their spark-env.sh. It’s true 
 that if you set spark.local.dir in the driver it would pass that on to the 
 workers for that job.
 
 Matei
 
 On Mar 27, 2014, at 9:57 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
 
 Yes, I have tried that by adding it to the Worker. I can see the 
 “app-20140328124540-000” directory in the local spark directory of the worker.
 
 But the “spark-local” directories are always written to /tmp, since the 
 default spark.local.dir is taken from java.io.tmpdir?
 
 
 
 On 28 Mar, 2014, at 12:42 pm, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Yes, the problem is that the driver program is overriding it. Have you set 
 it manually in the driver? Or how did you try setting it in workers? You 
 should set it by adding
 
 export SPARK_JAVA_OPTS="-Dspark.local.dir=whatever"
 
 to conf/spark-env.sh on those workers.
 
 Matei
 
 On Mar 27, 2014, at 9:04 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
 
 Anyone can help?
 
 How can I configure a different spark.local.dir for each executor?
 
 
 On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote:
 
 Hi,
 
 Each of my worker node has its own unique spark.local.dir.
 
 However, when I run spark-shell, the shuffle writes are always written to 
 /tmp despite being set when the worker node is started.
 
 By specifying the spark.local.dir for the driver program, it seems to 
 override the executor? Is there a way to properly define it in the worker 
 node?
 
 Thanks!
 
 
 
 



Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Anyone can help?

How can I configure a different spark.local.dir for each executor?


On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote:

 Hi,
 
 Each of my worker node has its own unique spark.local.dir.
 
 However, when I run spark-shell, the shuffle writes are always written to 
 /tmp despite being set when the worker node is started.
 
 By specifying the spark.local.dir for the driver program, it seems to 
 override the executor? Is there a way to properly define it in the worker 
 node?
 
 Thanks!



Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
Hi,

My worker nodes have more memory than the host from which I’m submitting my driver 
program, but it seems that SPARK_MEM also sets the -Xmx of the spark shell?

$ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell

Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x7f736e13, 205634994176, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 205634994176 bytes for 
committing reserved memory.

I want to allocate at least 100GB of memory per executor. The allocated memory 
on the executor seems to depend on the -Xmx heap size of the driver?
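
A possible workaround (untested, and assuming spark.executor.memory behaves the same way in this release): leave SPARK_MEM alone for the driver and request the executor size through a system property instead:

$ MASTER=spark://XXX:7077 SPARK_JAVA_OPTS="-Dspark.executor.memory=100g" bin/spark-shell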

Thanks!





Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Hi,

This is on a 4-node cluster, each node with 32 cores / 256GB RAM. 

(0.9.0) is deployed in a stand alone mode.

Each worker is configured with 192GB. Spark executor memory is also 192GB. 

This is on the first iteration. K=50. Here’s the code I use:
http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.

Thanks!



On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng men...@gmail.com wrote:

 Hi Tsai,
 
 Could you share more information about the machine you used and the
 training parameters (runs, k, and iterations)? It can help solve your
 issues. Thanks!
 
 Best,
 Xiangrui
 
 On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:
 Hi,
 
 At the reduceByKey stage, it takes a few minutes before the tasks start 
 working.
 
 I have -Dspark.default.parallelism=127, which is the total number of cores minus one.
 
 CPU/Network/IO is idling across all nodes when this is happening.
 
 And there is nothing particular on the master log file. From the spark-shell:
 
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on 
 executor 2: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 
 bytes in 193 ms
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on 
 executor 1: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 
 bytes in 96 ms
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on 
 executor 0: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 
 bytes in 100 ms
 
 But it stops there for some significant time before any movement.
 
 In the stage detail of the UI, I can see that there are 127 tasks running 
 but the duration each is at least a few minutes.
 
 I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB 
 (50M rows).
 
 Is this a normal behaviour?
 
 Thanks!



Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks, let me try with a smaller K.

Does the size of the input data matter for the example? Currently I have 50M 
rows. What is a reasonable size to demonstrate the capability of Spark?





On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote:

 K = 50 is certainly a large number for k-means. If there is no
 particular reason to have 50 clusters, could you try to reduce it
 to, e.g, 100 or 1000? Also, the example code is not for large-scale
 problems. You should use the KMeans algorithm in mllib clustering for
 your problem.
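 
 A minimal sketch of the MLlib route (untested; assumes a 1.x MLlib with the Vectors API and whitespace-separated input):
 
   import org.apache.spark.mllib.clustering.KMeans
   import org.apache.spark.mllib.linalg.Vectors
 
   // one vector per line, values separated by spaces (hypothetical path)
   val data = sc.textFile("file:///path/to/kmeans_data.txt")
     .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
     .cache()
   val model = KMeans.train(data, 50, 10)   // k = 50, maxIterations = 10
   println(model.clusterCenters.length)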
 
 -Xiangrui
 
 On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
 Hi,
 
 This is on a 4 nodes cluster each with 32 cores/256GB Ram.
 
 (0.9.0) is deployed in a stand alone mode.
 
 Each worker is configured with 192GB. Spark executor memory is also 192GB.
 
 This is on the first iteration. K=50. Here's the code I use:
 http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
 
 Thanks!
 
 
 
 On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng men...@gmail.com wrote:
 
 Hi Tsai,
 
 Could you share more information about the machine you used and the
 training parameters (runs, k, and iterations)? It can help solve your
 issues. Thanks!
 
 Best,
 Xiangrui
 
 On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:
 Hi,
 
 At the reduceByKey stage, it takes a few minutes before the tasks start 
 working.
 
 I have -Dspark.default.parallelism=127, which is the total number of cores minus one.
 
 CPU/Network/IO is idling across all nodes when this is happening.
 
 And there is nothing particular on the master log file. From the 
 spark-shell:
 
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on 
 executor 2: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 
 bytes in 193 ms
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on 
 executor 1: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 
 bytes in 96 ms
 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on 
 executor 0: XXX (PROCESS_LOCAL)
 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 
 bytes in 100 ms
 
 But it stops there for some significant time before any movement.
 
 In the stage detail of the UI, I can see that there are 127 tasks running 
 but the duration each is at least a few minutes.
 
 I'm working off local storage (not hdfs) and the kmeans data is about 
 6.5GB (50M rows).
 
 Is this a normal behaviour?
 
 Thanks!
 



Configuring shuffle write directory

2014-03-22 Thread Tsai Li Ming
Hi,

Each of my worker node has its own unique spark.local.dir.

However, when I run spark-shell, the shuffle writes are always written to /tmp 
despite being set when the worker node is started.

By specifying the spark.local.dir for the driver program, it seems to override 
the executor? Is there a way to properly define it in the worker node?

Thanks!

Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
Hi,

I'm confused about -Dspark.local.dir and SPARK_WORKER_DIR (--work-dir).

What's the difference?

I have set -Dspark.local.dir for all my worker nodes but I'm still seeing 
directories being created in /tmp when the job is running.

I have also tried setting -Dspark.local.dir when I run the application.

Thanks!



Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
 spark.local.dir can and should be set both on the executors and on the 
 driver (if the driver broadcast variables, the files will be stored in this 
 directory)
Do you mean the worker nodes?

I don’t think they are the Jetty connector files, and the directories are empty:
/tmp/spark-3e330cdc-7540-4313-9f32-9fa109935f17/jars
/tmp/spark-3e330cdc-7540-4313-9f32-9fa109935f17/files

I run the application like this, even with java.io.tmpdir set:
bin/run-example -Dspark.executor.memory=14g -Dspark.local.dir=/mnt/storage1/lm 
-Djava.io.tmpdir=/mnt/storage1/lm org.apache.spark.examples.SparkLR 
spark://oct1:7077 10




On 13 Mar, 2014, at 5:33 pm, Guillaume Pitel guillaume.pi...@exensa.com wrote:

 Also, I think the jetty connector will create a small file or directory in 
 /tmp regardless of the spark.local.dir 
 
 It's very small, about 10KB
 
 Guillaume
 I'm not 100% sure but I think it goes like this : 
 
 spark.local.dir can and should be set both on the executors and on the 
 driver (if the driver broadcast variables, the files will be stored in this 
 directory)
 
  the SPARK_WORKER_DIR is where the jars and the log output of the executors 
  are placed (default $SPARK_HOME/work/), and it should be cleaned regularly 
 
 In $SPARK_HOME/logs are found the logs of the workers and master
 
 Guillaume
 Hi,
 
 I'm confused about the -Dspark.local.dir and SPARK_WORKER_DIR(--work-dir).
 
 What's the difference?
 
 I have set -Dspark.local.dir for all my worker nodes but I'm still seeing 
 directories being created in /tmp when the job is running.
 
 I have also tried setting -Dspark.local.dir when I run the application.
 
 Thanks!
 
 
 
 -- 
 Guillaume PITEL, Président 
 +33(0)6 25 48 86 80
 
 eXenSa S.A.S. 
 41, rue Périer - 92120 Montrouge - FRANCE 
 Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
 
 
 -- 
 Guillaume PITEL, Président 
 +33(0)6 25 48 86 80
 
 eXenSa S.A.S. 
 41, rue Périer - 92120 Montrouge - FRANCE 
 Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05