questions about shuffle time and parallel degree

2014-06-29 Thread wxhsdp
Hi all, i have two questions about shuffle time and parallel degree. question 1: assume the cluster size is fixed, for example a cluster of 16 nodes on EC2, each node with 2 cores. case 1: a total shuffle of 64GB of data between 32 partitions; case 2: a total shuffle of 128GB of data between
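
As a reference point for what "parallel degree" means in code, here is a minimal sketch (not the poster's actual job; sizes and names are illustrative) of how the number of shuffle partitions is usually chosen, either per shuffle operator or via spark.default.parallelism:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions (explicit import needed in Spark 1.x)

object ShuffleDegreeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-degree-sketch")
      .set("spark.default.parallelism", "32") // default used by shuffles that don't pass numPartitions
    val sc = new SparkContext(conf)

    // some synthetic (key, value) pairs to shuffle
    val pairs = sc.parallelize(0 until (1 << 20)).map(i => (i % 1024, i.toLong))

    val r32 = pairs.reduceByKey(_ + _, 32) // case-1 style: shuffle between 32 partitions
    val r64 = pairs.reduceByKey(_ + _, 64) // case-2 style: same data, 64 reduce partitions

    println(s"32-way partitions: ${r32.partitions.length}, 64-way partitions: ${r64.partitions.length}")
    sc.stop()
  }
}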

Re: spark master UI does not keep detailed application history

2014-06-14 Thread wxhsdp
hi, zhen i met the same problem in ec2, application details can not be accessed. but i can read stdout and stderr. the problem has not been solved yet

executor idle during task schedule

2014-06-04 Thread wxhsdp
Hi, all i've observed that sometimes when the executor finishes one task, it waits about 5 seconds to get another task to work on; during the 5 seconds, the executor does nothing: cpu idle, no disk access, no network transfer. is that normal for spark? thanks!

can not access app details on ec2

2014-05-31 Thread wxhsdp
hi, all i launch a spark cluster on ec2 with spark version v1.0.0-rc3, and everything goes well except that i can not access application details on the web ui. i just click on the application name, but there's no response. has anyone met this before? is this a bug? thanks!

Re: can communication and computation be overlapped in spark?

2014-05-25 Thread wxhsdp
anyone see my thread?

can communication and computation be overlapped in spark?

2014-05-24 Thread wxhsdp
Hi, all. fetch wait time: Time the task spent waiting for remote shuffle blocks. This only includes the time blocking on shuffle input data. For instance if block B is being fetched while the task is still not finished processing block A, it is not considered to be blocking on

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, xiangrui i checked the stderr of the worker node; yes, it says "failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS"... what do you mean by "include breeze-natives or netlib:all"? things i've already done: 1. add breeze and breeze-natives dependencies in the sbt build file
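
For what it's worth, this is roughly what "include breeze-natives or netlib:all" tends to look like in build.sbt; the 0.7 version matches the breeze-natives_2.10-0.7.jar that shows up later in this thread, while the netlib:all line is an assumed alternative rather than something quoted from the list:

// build.sbt sketch: breeze-natives pulls in the com.github.fommil.netlib native
// BLAS loaders, which is where the "failed to load implementation from:
// com.github.fommil.netlib.NativeSystemBLAS" warning comes from when they are missing.
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze"         % "0.7",
  "org.scalanlp" %% "breeze-natives" % "0.7"
)

// assumed alternative: depend on every native backend at once
// libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()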

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, xiangrui you said: "It doesn't work if you put the netlib-native jar inside an assembly jar. Try to mark it provided in the dependencies, and use --jars to include them with spark-submit. -Xiangrui". i'm not using an assembly jar which contains everything, i also mark breeze

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
ok Spark Executor Command: java -cp

breeze DGEMM slow in spark

2014-05-17 Thread wxhsdp
Dear all, i'm testing double-precision matrix multiplication in spark on ec2 m1.large machines. i use the breeze linalg library, and internally it calls a native library (openblas, nehalem, single-threaded). m1.large: model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz cpu MHz :
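
A minimal sketch of the kind of benchmark being described, assuming square matrices of an arbitrary size; breeze dispatches DenseMatrix multiplication to whatever netlib BLAS backend it manages to load (native openblas if present, the Java F2J fallback otherwise):

import breeze.linalg._

object DgemmSketch {
  def main(args: Array[String]): Unit = {
    val n = 2048 // illustrative size, not taken from the post
    val a = DenseMatrix.rand(n, n) // dense double-precision matrices
    val b = DenseMatrix.rand(n, n)

    val start = System.nanoTime()
    val c = a * b // DGEMM happens here
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"$n x $n DGEMM took $elapsedMs ms, c(0,0) = ${c(0, 0)}")
  }
}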

Re: breeze DGEMM slow in spark

2014-05-17 Thread wxhsdp
i think maybe it's related to m1.large, because i also tested on my laptop, and the two cases cost nearly the same amount of time. my laptop: model name : Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz cpu MHz : 2893.549 os: Linux ubuntu 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
-natives.jar 4. i also include the classpath of the above jars, but it does not work :( DB Tsai-2 wrote: Hi Wxhsdp, I also have some difficulties with sc.addJar(). Since we include the breeze library by using Spark 1.0, we don't have the problem you ran into. However, when we add external jars via

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
finally i fixed it. the previous failure was caused by the lack of some jars. i pasted the classpath in local mode to the workers by using show compile:dependencyClasspath and it works!

Re: 0.9 wont start cluster on ec2, SSH connection refused?

2014-05-15 Thread wxhsdp
Hi, mayur i've met the same problem. the instances are on, i can see them from the ec2 console, and connect to them: wxhsdp@ubuntu:~/spark/spark/tags/v1.0.0-rc3/ec2$ ssh -i wxhsdp-us-east.pem root@54.86.181.108 The authenticity of host '54.86.181.108 (54.86.181.108)' can't be established. ECDSA key

os buffer cache does not cache shuffle output file

2014-05-15 Thread wxhsdp
Hi, patrick said: "The intermediate shuffle output gets written to disk, but it often hits the OS buffer cache since it's not explicitly fsync'ed, so in many cases it stays entirely in memory. The behavior of the shuffle is agnostic to whether the base RDD is in cache or on disk." i

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-14 Thread wxhsdp
/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/breeze-natives_2.10-0.7.jar to /tmp/fetchFileTemp7468892065227766972.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-natives_2.10-0.7.jar to class loader

something about pipeline

2014-05-13 Thread wxhsdp
Dear all, the definition of fetch wait time: Time the task spent waiting for remote shuffle blocks. This only includes the time blocking on shuffle input data. For instance if block B is being fetched while the task is still not finished processing block A, it is not considered to

Re: Is there any problem on the spark mailing list?

2014-05-13 Thread wxhsdp
i think so, fewer questions and answers these three days

Re: details about event log

2014-05-07 Thread wxhsdp
any ideas? thanks!

Re: sbt run with spark.ContextCleaner ERROR

2014-05-06 Thread wxhsdp
Hi, TD i tried on v1.0.0-rc3 and still got the error

details about event log

2014-05-05 Thread wxhsdp
Hi, i'm looking at the event log, and i'm a little confused about some metrics. here's the info of one task: Launch Time:1399336904603 Finish Time:1399336906465 Executor Run Time:1781 Shuffle Read Metrics:Shuffle Finish Time:1399336906027, Fetch Wait Time:0 Shuffle Write Metrics:{Shuffle Bytes
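
Taking the timestamps above literally (they are epoch milliseconds), the figures can be cross-checked directly; the subtraction below is just the numbers as given, and calling the leftover 81 ms "other" (scheduling, serialization, etc.) is an assumption:

val launchTime  = 1399336904603L // "Launch Time" from the entry above
val finishTime  = 1399336906465L // "Finish Time"
val executorRun = 1781L          // "Executor Run Time", already in milliseconds

val wallClock = finishTime - launchTime // 1862 ms from launch to finish
val other     = wallClock - executorRun // 81 ms not covered by executor run time
println(s"wall clock = $wallClock ms, executor run = $executorRun ms, other = $other ms")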

Re: sbt run with spark.ContextCleaner ERROR

2014-05-04 Thread wxhsdp
Hi, TD actually, i'm not very clear about my spark version. i checked out from https://github.com/apache/spark/trunk on Apr 30. please tell me where you got the version Spark 1.0 RC3 from. i did not call sparkContext.stop; now i have added it to the end of my code. here's the log: 14/05/04 18:48:21 INFO

NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread wxhsdp
Hi, i'm trying to use the breeze linalg library for matrix operations in my spark code. i already added a dependency on breeze in my build.sbt, and packaged my code successfully. when i run in local mode, sbt run local..., everything is ok, but when i turn to standalone mode, sbt run

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread wxhsdp
Hi, DB, i think it's something related to sbt publishLocal. if i remove the breeze dependency from my sbt file, breeze can not be found: [error] /home/wxhsdp/spark/example/test/src/main/scala/test.scala:5: not found: object breeze [error] import breeze.linalg._ [error]^ here's my sbt file

Re: same partition id means same location?

2014-05-01 Thread wxhsdp
can anyone say something about this?

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread wxhsdp
i fixed it. i made my sbt project depend on spark/trunk/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar and it works

same partition id means same location?

2014-04-30 Thread wxhsdp
Hi, i'm just reviewing advanced spark features. it's about the pagerank example. it said any shuffle operation on two RDDs will take on the partitioner of one of them, if one is set. so first we partition Links with a hashPartitioner, then we join Links and Ranks0. Ranks0 will take
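
A spark-shell style sketch of the pattern being described (the Links/Ranks0 names follow the post, the data is made up): once Links has an explicit partitioner, the join reuses it instead of hash-partitioning both sides again.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._ // pair-RDD functions (explicit import needed in Spark 1.x)

val links = sc.parallelize(Seq(
    "a" -> Seq("b", "c"),
    "b" -> Seq("c"),
    "c" -> Seq("a")))
  .partitionBy(new HashPartitioner(8)) // Links gets an explicit partitioner
  .cache()

val ranks0 = links.mapValues(_ => 1.0) // Ranks0 inherits links' partitioner

val joined = links.join(ranks0) // the shuffle takes on the partitioner that is already set
println(joined.partitioner)     // Some(HashPartitioner) with 8 partitions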

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread wxhsdp
i met the same problem when updating to spark 0.9.1 (svn checkout https://github.com/apache/spark/) Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq; at

Re: what does broadcast_0 stand for

2014-04-28 Thread wxhsdp
thank you for your help, Sourav. i found the broadcast_0 binary file in the /tmp directory. its size is 33.4 kB, not equal to the estimated size of 135.6 KB. i opened it and found its content has no relation to the file i read in. i guess broadcast_0 is a spark config file, is that right?

Re: how to declare tuple return type

2014-04-28 Thread wxhsdp
you need to import org.apache.spark.rdd.RDD to include RDD. http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD here are some examples you can learn from: https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib SK wrote: I am a new user of
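
To make the import concrete, here is a small spark-shell style sketch (the method and its logic are invented for illustration) of a declaration whose return type is a tuple of RDDs, which only compiles once RDD itself is imported:

import org.apache.spark.rdd.RDD // needed so RDD can be named in the signature

def splitByParity(data: RDD[Int]): (RDD[Int], RDD[Int]) = {
  val evens = data.filter(_ % 2 == 0)
  val odds  = data.filter(_ % 2 != 0)
  (evens, odds) // a tuple return value
}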

Re: questions about debugging a spark application

2014-04-28 Thread wxhsdp
thanks for your reply, daniel. what do you mean by "the logs contain everything to reconstruct the same data"? i also spent time looking into the logs, but only got a little out of them. as far as i can see, they log the flow of running the application, but there are no more details about each task; for example, see the

how to get subArray without copy

2014-04-26 Thread wxhsdp
Hi, all i want to do the following operations: (1) each partition does some operations on the partition data in Array format; (2) split the array into subArrays, and combine each subArray with an id; (3) do a shuffle according to the id. here is the pseudo code: /*pseudo code*/ case
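
A spark-shell style sketch of the three steps listed above, under assumptions not in the post (Int elements, fixed-size chunks, ids that repeat across partitions so matching chunks meet after the shuffle); note that Array.grouped copies, so a truly copy-free split would need offsets or views instead:

import org.apache.spark.SparkContext._ // pair-RDD functions (explicit import needed in Spark 1.x)
import org.apache.spark.rdd.RDD

def splitAndShuffle(data: RDD[Int], chunk: Int, numParts: Int): RDD[(Int, Iterable[Array[Int]])] = {
  val tagged = data.mapPartitions { iter =>
    val arr = iter.toArray                  // (1) the whole partition as an Array
    // ... per-partition operations on arr ...
    arr.grouped(chunk)                      // (2) split into subArrays
       .zipWithIndex
       .map { case (sub, id) => (id, sub) } // tag each subArray with an id
  }
  tagged.groupByKey(numParts)               // (3) shuffle according to the id
}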

Re: how to get subArray without copy

2014-04-26 Thread wxhsdp
the only way i can come up with is to use a 2-D Array, if the split has regularity

questions about debugging a spark application

2014-04-26 Thread wxhsdp
Hi, all i have some questions about debugging in spark: 1) when the application finishes, the application UI is shut down and i can not see the details about the app, like shuffle size, duration, stage information... there is not sufficient information in the master UI. do i need to hang
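
One knob worth mentioning here (the config names are from Spark 1.0; the path is a placeholder): turning on event logging lets the details of a finished application be reconstructed after its own UI goes away.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")          // write per-stage/task events while the app runs
  .set("spark.eventLog.dir", "/tmp/spark-events") // placeholder path; must exist and be writable
val sc = new SparkContext(conf)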

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
thank you, i added setJars, but nothing changes: val conf = new SparkConf() .setMaster("spark://127.0.0.1:7077") .setAppName("Simple App") .set("spark.executor.memory", "1g") .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar")) val sc = new SparkContext(conf)

Re: Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i tried, but no effect. Qin Wei wrote: try the complete path. qinwei / From: wxhsdp / Date: 2014-04-24 14:21 / To: user / Subject: Re: how to set spark.executor.memory and heap size / thank you, i added setJars, but nothing changes: val conf = new SparkConf()

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i think maybe it's a problem with reading a local file: val logFile = "/home/wxhsdp/spark/example/standalone/README.md"; val logData = sc.textFile(logFile).cache(). if i replace the above code with val logData = sc.parallelize(Array(1,2,3,4)).cache(), the job completes successfully. can't i read

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
thanks for your reply, adnan, i tried val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md". i think there need to be three slashes after file:, so it's just the same as val logFile = "/home/wxhsdp/spark/example/standalone/README.md". the error remains :(
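
A minimal sketch of the two forms being compared (assuming no HDFS default filesystem is configured, so plain paths resolve to the local filesystem); in standalone mode the path is resolved on whichever worker runs the task, so the file has to exist at the same path on every node:

import org.apache.spark.{SparkConf, SparkContext}

object ReadLocalFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-local-file"))
    val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md" // explicit scheme + absolute path
    // val logFile = "/home/wxhsdp/spark/example/standalone/README.md"     // usually equivalent locally
    val logData = sc.textFile(logFile).cache()
    println(s"lines: ${logData.count()}")
    sc.stop()
  }
}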

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
... On Thu, Apr 24, 2014 at 2:25 PM, wxhsdp <wxhsdp@...> wrote: thanks for your reply, adnan, i tried val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md". i think there need to be three slashes after file:, so it's just the same as val logFile = home/wxhsdp/spark/example

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
-2.10/simple-project_2.10-2.0.jar)) val tr = sc.textFile(logFile).cache tr.take(100).foreach(println) } } This will work. On Thu, Apr 24, 2014 at 3:00 PM, wxhsdp <wxhsdp@...> wrote: hi arpit, in the spark shell i can read the local file properly

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
does anyone know the reason? i've googled a bit, and found some guys had the same problem, but with no replies...

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i noticed that the error occurs at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183) at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378) at

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread wxhsdp
i have a similar question. i'm testing in standalone mode on only one pc. i use ./sbin/start-master.sh to start a master and ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077 to connect to the master. from the web ui, i can see the local worker registered

how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
hi, i'm testing SimpleApp.scala in standalone mode with only one pc, so i have one master and one local worker on the same pc. with a rather small input file size (4.5K), i got the java.lang.OutOfMemoryError: Java heap space error. here are my settings: spark-env.sh: export
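
For orientation, a sketch of where the two memory knobs live (the values are examples, not a recommendation): spark.executor.memory is what each executor asks the worker for, while the heap of the driver JVM that sbt run launches is an ordinary -Xmx setting.

import org.apache.spark.{SparkConf, SparkContext}

// spark-env.sh side (worker memory), shown only as comments:
//   export SPARK_WORKER_MEMORY=2g   # total memory a worker may hand out to executors
// driver side when launched via sbt run:
//   export SBT_OPTS="-Xmx2g"        # heap of the JVM that runs the driver
val conf = new SparkConf()
  .setAppName("simple-app")
  .set("spark.executor.memory", "1g") // per-executor heap requested from the worker
val sc = new SparkContext(conf)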

Re: how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
by the way, the code runs ok in the spark shell

no response in spark web UI

2014-04-22 Thread wxhsdp
Hi, all i used to run my app using sbt run, but now i want to see the job information in the spark web ui. i'm in local mode; i start the spark shell and access the web ui at http://ubuntu.local:4040/stages/, but when i sbt run some application, there is no response in the web ui. how to make

questions about toArray and ClassTag

2014-04-19 Thread wxhsdp
Hi, all i'm quite new to scala. i do some tests in the spark shell: val b = a.mapPartitions{ D => val p = D.toArray; ...; p.toIterator } when a is an RDD of type RDD[Int], b.collect() works, but when i change a to RDD[MyOwnType], b.collect() returns an error: 14/04/20 10:14:46 ERROR
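
On the ClassTag part of the thread title: Iterator.toArray needs a ClassTag for the element type in order to build the Array, which the compiler supplies automatically for Int but which has to be carried as a context bound when the element type is a type parameter. A spark-shell style sketch (MyOwnType and the helper are illustrative):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

case class MyOwnType(x: Int) // stand-in for the poster's own class

def materialisePartitions[T: ClassTag](a: RDD[T]): RDD[T] =
  a.mapPartitions { iter =>
    val p = iter.toArray // ClassTag[T] from the context bound makes this compile
    // ... operate on p ...
    p.toIterator
  }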

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread wxhsdp
thank you so much, davidson. yes, you are right: in both sbt and the spark shell, the result of my code is 28MB, and it's irrelevant to numSlices. yesterday i got the result of 4.2MB in the spark shell because i removed the array initialization out of laziness :) for (i <- 0 until size) { array(i) = i }

storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread wxhsdp
Hi, all in order to understand the memory usage of spark, i do the following test: val size = 1024*1024 val array = new Array[Int](size) for (i <- 0 until size) { array(i) = i } val a = sc.parallelize(array).cache() /*4MB*/ val b = a.mapPartitions{ c => { val d = c.toArray val e = new

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread wxhsdp
thanks for your help, Davidson! i modified it to val a: RDD[Int] = sc.parallelize(array).cache() to keep val a an RDD of Int, but got the same result. another question: the JVM and spark memory are located in different parts of system memory, the spark code is executed in JVM memory, and a malloc operation like val e =

SVD under spark/mllib/linalg

2014-04-11 Thread wxhsdp
Hi, all the code under https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg has changed. the previous matrix classes are all removed, like MatrixEntry, MatrixSVD; instead, breeze matrix definitions appear. do we move to Breeze Linear Algebra when doing linear

Re: Where does println output go?

2014-04-10 Thread wxhsdp
rdd.foreach(p => { print(p) }) "The above closure gets executed on workers, you need to look at the logs of the workers to see the output." but if i'm in local mode, where are the logs of the local driver? there are no /logs and /work dirs in SPARK_HOME, which are set in standalone mode.

Re: Only TraversableOnce?

2014-04-09 Thread wxhsdp
thank you, it works. after my operations over p i return p.toIterator, because mapPartitions needs an iterator return type, is that right? rdd.mapPartitions{ D => { val p = D.toArray; ...; p.toIterator } }

Only TraversableOnce?

2014-04-08 Thread wxhsdp
In my application, data parts inside an RDD partition have relations, so I need to do some operations between them. for example: RDD T1 has several partitions, and each partition has three parts A, B and C. then I transform T1 to T2. after the transform, T2 also has three parts D, E and F, where D = A+B, E =
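
A spark-shell style sketch of that transformation, assuming t1 is an RDD[Int] and that each partition splits evenly into the three parts; only D = A+B is given above, so E and F are left as placeholders:

val t2 = t1.mapPartitions { iter =>
  val all = iter.toArray
  val n = all.length / 3
  val (a, rest) = all.splitAt(n) // part A
  val (b, c) = rest.splitAt(n)   // parts B and C

  val d = a.zip(b).map { case (x, y) => x + y } // D = A + B
  val e = b                                     // placeholder: E's rule is truncated above
  val f = c                                     // placeholder: F's rule is truncated above
  (d ++ e ++ f).toIterator
}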

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
yes, how can i do this conveniently? i can use filter, but there will be so many RDDs and it's not concise

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
8, 2014 at 8:40 AM, wxhsdp wrote: yes, how can i do this conveniently? i can use filter, but there will be so many RDDs and it's not concise