questions about shuffle time and parallel degree

2014-06-29 Thread wxhsdp
Hi all, i have two questions about shuffle time and parallel degree. question 1: assume the cluster size is fixed, for example a cluster of 16 nodes on EC2, each node with 2 cores. case 1: a total shuffle of 64GB of data between 32 partitions; case 2: a total shuffle of 128GB of data between
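
As a reference point for what "parallel degree" means in code, here is a minimal sketch (not the poster's actual job; sizes and names are illustrative) of how the number of shuffle partitions is usually chosen, either per shuffle operator or via spark.default.parallelism:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions (explicit import needed in Spark 1.x)

object ShuffleDegreeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-degree-sketch")
      .set("spark.default.parallelism", "32") // default used by shuffles that don't pass numPartitions
    val sc = new SparkContext(conf)

    // some synthetic (key, value) pairs to shuffle
    val pairs = sc.parallelize(0 until (1 << 20)).map(i => (i % 1024, i.toLong))

    val r32 = pairs.reduceByKey(_ + _, 32) // case-1 style: shuffle between 32 partitions
    val r64 = pairs.reduceByKey(_ + _, 64) // case-2 style: same data, 64 reduce partitions

    println(s"32-way partitions: ${r32.partitions.length}, 64-way partitions: ${r64.partitions.length}")
    sc.stop()
  }
}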

Re: spark master UI does not keep detailed application history

2014-06-14 Thread wxhsdp
hi, zhen i met the same problem in ec2, application details can not be accessed. but i can read stdout and stderr. the problem has not been solved yet

executor idle during task schedule

2014-06-04 Thread wxhsdp
Hi, all i've observed that sometimes when the executor finishes one task, it waits about 5 seconds to get another task to work on; during the 5 seconds, the executor does nothing: cpu idle, no disk access, no network transfer. is that normal for spark? thanks!

can not access app details on ec2

2014-05-31 Thread wxhsdp
hi, all i launch a spark cluster on ec2 with spark version v1.0.0-rc3, and everything goes well except that i can not access application details on the web ui. i just click on the application name, but there's no response. has anyone met this before? is this a bug? thanks!

Re: can communication and computation be overlapped in spark?

2014-05-25 Thread wxhsdp
anyone see my thread?

can communication and computation be overlapped in spark?

2014-05-24 Thread wxhsdp
Hi, all. fetch wait time: Time the task spent waiting for remote shuffle blocks. This only includes the time blocking on shuffle input data. For instance if block B is being fetched while the task is still not finished processing block A, it is not considered to be blocking on

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, xiangrui i checked the stderr of the worker node; yes, it says "failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS"... what do you mean by "include breeze-natives or netlib:all"? things i've already done: 1. add breeze and breeze-natives dependencies in the sbt build file
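
For what it's worth, this is roughly what "include breeze-natives or netlib:all" tends to look like in build.sbt; the 0.7 version matches the breeze-natives_2.10-0.7.jar that shows up later in this thread, while the netlib:all line is an assumed alternative rather than something quoted from the list:

// build.sbt sketch: breeze-natives pulls in the com.github.fommil.netlib native
// BLAS loaders, which is where the "failed to load implementation from:
// com.github.fommil.netlib.NativeSystemBLAS" warning comes from when they are missing.
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze"         % "0.7",
  "org.scalanlp" %% "breeze-natives" % "0.7"
)

// assumed alternative: depend on every native backend at once
// libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()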

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, xiangrui you said: "It doesn't work if you put the netlib-native jar inside an assembly jar. Try to mark it provided in the dependencies, and use --jars to include them with spark-submit. -Xiangrui". i'm not using an assembly jar which contains everything, i also mark breeze

Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
ok Spark Executor Command: java -cp

breeze DGEMM slow in spark

2014-05-17 Thread wxhsdp
Dear all, i'm testing double-precision matrix multiplication in spark on ec2 m1.large machines. i use the breeze linalg library, and internally it calls a native library (openblas, nehalem, single-threaded). m1.large: model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz cpu MHz :
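
A minimal sketch of the kind of benchmark being described, assuming square matrices of an arbitrary size; breeze dispatches DenseMatrix multiplication to whatever netlib BLAS backend it manages to load (native openblas if present, the Java F2J fallback otherwise):

import breeze.linalg._

object DgemmSketch {
  def main(args: Array[String]): Unit = {
    val n = 2048 // illustrative size, not taken from the post
    val a = DenseMatrix.rand(n, n) // dense double-precision matrices
    val b = DenseMatrix.rand(n, n)

    val start = System.nanoTime()
    val c = a * b // DGEMM happens here
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"$n x $n DGEMM took $elapsedMs ms, c(0,0) = ${c(0, 0)}")
  }
}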

Re: breeze DGEMM slow in spark

2014-05-17 Thread wxhsdp
i think maybe it's related to m1.large, because i also tested on my laptop, and the two cases cost nearly the same amount of time. my laptop: model name : Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz cpu MHz : 2893.549 os: Linux ubuntu 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
-natives.jar 4. i also include the classpath of the above jars, but it does not work :( DB Tsai-2 wrote: Hi Wxhsdp, I also have some difficulties with sc.addJar(). Since we include the breeze library by using Spark 1.0, we don't have the problem you ran into. However, when we add external jars via

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
finally i fixed it. the previous failure was caused by the lack of some jars. i pasted the classpath in local mode to the workers by using show compile:dependencyClasspath and it works!

Re: 0.9 wont start cluster on ec2, SSH connection refused?

2014-05-15 Thread wxhsdp
Hi, mayur i've met the same problem. the instances are on, i can see them from the ec2 console, and connect to them: wxhsdp@ubuntu:~/spark/spark/tags/v1.0.0-rc3/ec2$ ssh -i wxhsdp-us-east.pem root@54.86.181.108 The authenticity of host '54.86.181.108 (54.86.181.108)' can't be established. ECDSA key

os buffer cache does not cache shuffle output file

2014-05-15 Thread wxhsdp
Hi, patrick said: "The intermediate shuffle output gets written to disk, but it often hits the OS buffer cache since it's not explicitly fsync'ed, so in many cases it stays entirely in memory. The behavior of the shuffle is agnostic to whether the base RDD is in cache or on disk." i

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-14 Thread wxhsdp
/14 20:36:02 INFO Utils: Fetching http://192.168.0.106:42883/jars/breeze-natives_2.10-0.7.jar to /tmp/fetchFileTemp7468892065227766972.tmp 14/05/14 20:36:02 INFO Executor: Adding file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-natives_2.10-0.7.jar to class loader

something about pipeline

2014-05-13 Thread wxhsdp
Dear all, the definition of fetch wait time: Time the task spent waiting for remote shuffle blocks. This only includes the time blocking on shuffle input data. For instance if block B is being fetched while the task is still not finished processing block A, it is not considered to

Re: Is there any problem on the spark mailing list?

2014-05-13 Thread wxhsdp
i think so, fewer questions and answers these three days

Re: details about event log

2014-05-07 Thread wxhsdp
any ideas? thanks!

Re: sbt run with spark.ContextCleaner ERROR

2014-05-06 Thread wxhsdp
Hi, TD i tried on v1.0.0-rc3 and still got the error

details about event log

2014-05-05 Thread wxhsdp
Hi, i'm looking at the event log, and i'm a little confused about some metrics. here's the info of one task: Launch Time:1399336904603 Finish Time:1399336906465 Executor Run Time:1781 Shuffle Read Metrics:Shuffle Finish Time:1399336906027, Fetch Wait Time:0 Shuffle Write Metrics:{Shuffle Bytes
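
Taking the timestamps above literally (they are epoch milliseconds), the figures can be cross-checked directly; the subtraction below is just the numbers as given, and calling the leftover 81 ms "other" (scheduling, serialization, etc.) is an assumption:

val launchTime  = 1399336904603L // "Launch Time" from the entry above
val finishTime  = 1399336906465L // "Finish Time"
val executorRun = 1781L          // "Executor Run Time", already in milliseconds

val wallClock = finishTime - launchTime // 1862 ms from launch to finish
val other     = wallClock - executorRun // 81 ms not covered by executor run time
println(s"wall clock = $wallClock ms, executor run = $executorRun ms, other = $other ms")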

Re: sbt run with spark.ContextCleaner ERROR

2014-05-04 Thread wxhsdp
Hi, TD actually, i'm not very clear about my spark version. i checked out from https://github.com/apache/spark/trunk on Apr 30. please tell me where you got the version Spark 1.0 RC3 from. i did not call sparkContext.stop; now i have added it to the end of my code. here's the log: 14/05/04 18:48:21 INFO

NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread wxhsdp
Hi, i'm trying to use the breeze linalg library for matrix operations in my spark code. i already added a dependency on breeze in my build.sbt, and packaged my code successfully. when i run in local mode, sbt run local..., everything is ok, but when i turn to standalone mode, sbt run

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread wxhsdp
Hi, DB, i think it's something related to sbt publishLocal. if i remove the breeze dependency from my sbt file, breeze can not be found: [error] /home/wxhsdp/spark/example/test/src/main/scala/test.scala:5: not found: object breeze [error] import breeze.linalg._ [error]^ here's my sbt file

Re: same partition id means same location?

2014-05-01 Thread wxhsdp
can anyone say something about this?

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread wxhsdp
i fixed it. i made my sbt project depend on spark/trunk/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar and it works

same partition id means same location?

2014-04-30 Thread wxhsdp
Hi, i'm just reviewing advanced spark features. it's about the pagerank example. it said any shuffle operation on two RDDs will take on the partitioner of one of them, if one is set. so first we partition Links with a hashPartitioner, then we join Links and Ranks0. Ranks0 will take
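
A spark-shell style sketch of the pattern being described (the Links/Ranks0 names follow the post, the data is made up): once Links has an explicit partitioner, the join reuses it instead of hash-partitioning both sides again.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._ // pair-RDD functions (explicit import needed in Spark 1.x)

val links = sc.parallelize(Seq(
    "a" -> Seq("b", "c"),
    "b" -> Seq("c"),
    "c" -> Seq("a")))
  .partitionBy(new HashPartitioner(8)) // Links gets an explicit partitioner
  .cache()

val ranks0 = links.mapValues(_ => 1.0) // Ranks0 inherits links' partitioner

val joined = links.join(ranks0) // the shuffle takes on the partitioner that is already set
println(joined.partitioner)     // Some(HashPartitioner) with 8 partitions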

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread wxhsdp
i met the same problem when updating to spark 0.9.1 (svn checkout https://github.com/apache/spark/) Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq; at

Re: what does broadcast_0 stand for

2014-04-28 Thread wxhsdp
thank you for your help, Sourav. i found the broadcast_0 binary file in the /tmp directory. its size is 33.4 kB, not equal to the estimated size of 135.6 KB. i opened it and found its content has no relation to the file i read in. i guess broadcast_0 is a spark config file, is that right?

Re: how to declare tuple return type

2014-04-28 Thread wxhsdp
you need to import org.apache.spark.rdd.RDD to include RDD. http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD here are some examples you can learn from: https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib SK wrote: I am a new user of
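
To make the import concrete, here is a small spark-shell style sketch (the method and its logic are invented for illustration) of a declaration whose return type is a tuple of RDDs, which only compiles once RDD itself is imported:

import org.apache.spark.rdd.RDD // needed so RDD can be named in the signature

def splitByParity(data: RDD[Int]): (RDD[Int], RDD[Int]) = {
  val evens = data.filter(_ % 2 == 0)
  val odds  = data.filter(_ % 2 != 0)
  (evens, odds) // a tuple return value
}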

Re: questions about debugging a spark application

2014-04-28 Thread wxhsdp
thanks for your reply, daniel. what do you mean by "the logs contain everything to reconstruct the same data"? i also spent time looking into the logs, but only got a little out of them. as far as i can see, they log the flow of running the application, but there are no more details about each task; for example, see the

how to get subArray without copy

2014-04-26 Thread wxhsdp
Hi, all i want to do the following operations: (1) each partition does some operations on the partition data in Array format; (2) split the array into subArrays, and combine each subArray with an id; (3) do a shuffle according to the id. here is the pseudo code: /*pseudo code*/ case
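
A spark-shell style sketch of the three steps listed above, under assumptions not in the post (Int elements, fixed-size chunks, ids that repeat across partitions so matching chunks meet after the shuffle); note that Array.grouped copies, so a truly copy-free split would need offsets or views instead:

import org.apache.spark.SparkContext._ // pair-RDD functions (explicit import needed in Spark 1.x)
import org.apache.spark.rdd.RDD

def splitAndShuffle(data: RDD[Int], chunk: Int, numParts: Int): RDD[(Int, Iterable[Array[Int]])] = {
  val tagged = data.mapPartitions { iter =>
    val arr = iter.toArray                  // (1) the whole partition as an Array
    // ... per-partition operations on arr ...
    arr.grouped(chunk)                      // (2) split into subArrays
       .zipWithIndex
       .map { case (sub, id) => (id, sub) } // tag each subArray with an id
  }
  tagged.groupByKey(numParts)               // (3) shuffle according to the id
}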

Re: how to get subArray without copy

2014-04-26 Thread wxhsdp
the only way i can come up with is to use a 2-D Array, if the split has regularity

questions about debugging a spark application

2014-04-26 Thread wxhsdp
Hi, all i have some questions about debugging in spark: 1) when the application finishes, the application UI is shut down and i can not see the details about the app, like shuffle size, duration, stage information... there is not sufficient information in the master UI. do i need to hang
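
One knob worth mentioning here (the config names are from Spark 1.0; the path is a placeholder): turning on event logging lets the details of a finished application be reconstructed after its own UI goes away.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")          // write per-stage/task events while the app runs
  .set("spark.eventLog.dir", "/tmp/spark-events") // placeholder path; must exist and be writable
val sc = new SparkContext(conf)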

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
thank you, i added setJars, but nothing changes: val conf = new SparkConf() .setMaster("spark://127.0.0.1:7077") .setAppName("Simple App") .set("spark.executor.memory", "1g") .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar")) val sc = new SparkContext(conf)

Re: Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i tried, but no effect. Qin Wei wrote: try the complete path. qinwei / From: wxhsdp / Date: 2014-04-24 14:21 / To: user / Subject: Re: how to set spark.executor.memory and heap size / thank you, i added setJars, but nothing changes: val conf = new SparkConf()

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i think maybe it's a problem with reading a local file: val logFile = "/home/wxhsdp/spark/example/standalone/README.md"; val logData = sc.textFile(logFile).cache(). if i replace the above code with val logData = sc.parallelize(Array(1,2,3,4)).cache(), the job completes successfully. can't i read

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
thanks for your reply, adnan, i tried val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md". i think there need to be three slashes after file:, so it's just the same as val logFile = "/home/wxhsdp/spark/example/standalone/README.md". the error remains :(
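
A minimal sketch of the two forms being compared (assuming no HDFS default filesystem is configured, so plain paths resolve to the local filesystem); in standalone mode the path is resolved on whichever worker runs the task, so the file has to exist at the same path on every node:

import org.apache.spark.{SparkConf, SparkContext}

object ReadLocalFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-local-file"))
    val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md" // explicit scheme + absolute path
    // val logFile = "/home/wxhsdp/spark/example/standalone/README.md"     // usually equivalent locally
    val logData = sc.textFile(logFile).cache()
    println(s"lines: ${logData.count()}")
    sc.stop()
  }
}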

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
... On Thu, Apr 24, 2014 at 2:25 PM, wxhsdp <wxhsdp@...> wrote: thanks for your reply, adnan, i tried val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md". i think there need to be three slashes after file:, so it's just the same as val logFile = home/wxhsdp/spark/example

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
-2.10/simple-project_2.10-2.0.jar)) val tr = sc.textFile(logFile).cache tr.take(100).foreach(println) } } This will work. On Thu, Apr 24, 2014 at 3:00 PM, wxhsdp <wxhsdp@...> wrote: hi arpit, in the spark shell i can read the local file properly

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
does anyone know the reason? i've googled a bit, and found some guys had the same problem, but with no replies...

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i noticed that the error occurs at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183) at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378) at

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread wxhsdp
i have a similar question. i'm testing in standalone mode on only one pc. i use ./sbin/start-master.sh to start a master and ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077 to connect to the master. from the web ui, i can see the local worker registered

how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
hi, i'm testing SimpleApp.scala in standalone mode with only one pc, so i have one master and one local worker on the same pc. with a rather small input file size (4.5K), i got the java.lang.OutOfMemoryError: Java heap space error. here are my settings: spark-env.sh: export
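
For orientation, a sketch of where the two memory knobs live (the values are examples, not a recommendation): spark.executor.memory is what each executor asks the worker for, while the heap of the driver JVM that sbt run launches is an ordinary -Xmx setting.

import org.apache.spark.{SparkConf, SparkContext}

// spark-env.sh side (worker memory), shown only as comments:
//   export SPARK_WORKER_MEMORY=2g   # total memory a worker may hand out to executors
// driver side when launched via sbt run:
//   export SBT_OPTS="-Xmx2g"        # heap of the JVM that runs the driver
val conf = new SparkConf()
  .setAppName("simple-app")
  .set("spark.executor.memory", "1g") // per-executor heap requested from the worker
val sc = new SparkContext(conf)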

Re: how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
by the way, the code runs ok in the spark shell

no response in spark web UI

2014-04-22 Thread wxhsdp
Hi, all i used to run my app using sbt run, but now i want to see the job information in the spark web ui. i'm in local mode; i start the spark shell and access the web ui at http://ubuntu.local:4040/stages/, but when i sbt run some application, there is no response in the web ui. how to make

questions about toArray and ClassTag

2014-04-19 Thread wxhsdp
Hi, all i'm quite new to scala. i do some tests in the spark shell: val b = a.mapPartitions{ D => val p = D.toArray; ...; p.toIterator } when a is an RDD of type RDD[Int], b.collect() works, but when i change a to RDD[MyOwnType], b.collect() returns an error: 14/04/20 10:14:46 ERROR
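
On the ClassTag part of the thread title: Iterator.toArray needs a ClassTag for the element type in order to build the Array, which the compiler supplies automatically for Int but which has to be carried as a context bound when the element type is a type parameter. A spark-shell style sketch (MyOwnType and the helper are illustrative):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

case class MyOwnType(x: Int) // stand-in for the poster's own class

def materialisePartitions[T: ClassTag](a: RDD[T]): RDD[T] =
  a.mapPartitions { iter =>
    val p = iter.toArray // ClassTag[T] from the context bound makes this compile
    // ... operate on p ...
    p.toIterator
  }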

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread wxhsdp
thank you so much, davidson. yes, you are right: in both sbt and the spark shell, the result of my code is 28MB, and it's irrelevant to numSlices. yesterday i got the result of 4.2MB in the spark shell because i removed the array initialization out of laziness :) for (i <- 0 until size) { array(i) = i }

storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread wxhsdp
Hi, all in order to understand the memory usage of spark, i do the following test: val size = 1024*1024 val array = new Array[Int](size) for (i <- 0 until size) { array(i) = i } val a = sc.parallelize(array).cache() /*4MB*/ val b = a.mapPartitions{ c => { val d = c.toArray val e = new

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread wxhsdp
thanks for your help, Davidson! i modified it to val a: RDD[Int] = sc.parallelize(array).cache() to keep val a an RDD of Int, but got the same result. another question: the JVM and spark memory are located in different parts of system memory, the spark code is executed in JVM memory, and a malloc operation like val e =

SVD under spark/mllib/linalg

2014-04-11 Thread wxhsdp
Hi, all the code under https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg has changed. the previous matrix classes are all removed, like MatrixEntry, MatrixSVD; instead, breeze matrix definitions appear. do we move to Breeze Linear Algebra when doing linear

Re: Where does println output go?

2014-04-10 Thread wxhsdp
rdd.foreach(p => { print(p) }) "The above closure gets executed on workers, you need to look at the logs of the workers to see the output." but if i'm in local mode, where are the logs of the local driver? there are no /logs and /work dirs in SPARK_HOME, which are set in standalone mode.

Re: Only TraversableOnce?

2014-04-09 Thread wxhsdp
thank you, it works. after my operations over p i return p.toIterator, because mapPartitions needs an iterator return type, is that right? rdd.mapPartitions{ D => { val p = D.toArray; ...; p.toIterator } }

Only TraversableOnce?

2014-04-08 Thread wxhsdp
In my application, data parts inside an RDD partition have relations, so I need to do some operations between them. for example: RDD T1 has several partitions, and each partition has three parts A, B and C. then I transform T1 to T2. after the transform, T2 also has three parts D, E and F, where D = A+B, E =
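
A spark-shell style sketch of that transformation, assuming t1 is an RDD[Int] and that each partition splits evenly into the three parts; only D = A+B is given above, so E and F are left as placeholders:

val t2 = t1.mapPartitions { iter =>
  val all = iter.toArray
  val n = all.length / 3
  val (a, rest) = all.splitAt(n) // part A
  val (b, c) = rest.splitAt(n)   // parts B and C

  val d = a.zip(b).map { case (x, y) => x + y } // D = A + B
  val e = b                                     // placeholder: E's rule is truncated above
  val f = c                                     // placeholder: F's rule is truncated above
  (d ++ e ++ f).toIterator
}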

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
yes, how can i do this conveniently? i can use filter, but there will be so many RDDs and it's not concise

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
8, 2014 at 8:40 AM, wxhsdp wrote: yes, how can i do this conveniently? i can use filter, but there will be so many RDDs and it's not concise