Hi, all
I have two questions about shuffle time and degree of parallelism.
Question 1:
Assume the cluster size is fixed, for example a cluster of 16 EC2 nodes,
each node with 2 cores.
Case 1: a total shuffle of 64 GB of data between 32 partitions
Case 2: a total shuffle of 128 GB of data between
Hi, Zhen
I met the same problem on EC2: application details cannot be accessed,
but I can read stdout and stderr. The problem has not been solved yet.
Hi, all
I've observed that sometimes when an executor finishes one task, it
waits about 5 seconds to get another task to work on. During those 5
seconds, the executor does nothing: the CPU is idle, there is no disk
access, and no network transfer. Is that normal for Spark?
Thanks!
Hi, all
I launched a Spark cluster on EC2 with Spark version v1.0.0-rc3.
Everything goes well except that I cannot access application details on
the web UI: I click on the application name, but there is no response.
Has anyone met this before? Is this a bug?
Thanks!
Has anyone seen my thread?
Hi, all
fetch wait time:
* Time the task spent waiting for remote shuffle blocks. This only includes the time
* blocking on shuffle input data. For instance if block B is being fetched while the task is
* still not finished processing block A, it is not considered to be blocking on block B.
Hi, Xiangrui
I checked the stderr of the worker node, and yes, it failed to load the
implementation from:
com.github.fommil.netlib.NativeSystemBLAS...
What do you mean by include breeze-natives or netlib:all?
Things I've already done:
1. added the breeze and breeze-natives dependencies in my sbt build file
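For reference, step 1 looks roughly like this in build.sbt (a minimal
sketch; the 0.7 version matches the breeze-natives_2.10-0.7.jar
mentioned elsewhere in this thread):

  // build.sbt -- breeze plus the native BLAS bindings
  libraryDependencies ++= Seq(
    "org.scalanlp" %% "breeze" % "0.7",
    "org.scalanlp" %% "breeze-natives" % "0.7"
  )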
Hi, Xiangrui
You said: "It doesn't work if you put the netlib-native jar inside an
assembly jar. Try to mark it provided in the dependencies, and use
--jars to include them with spark-submit. -Xiangrui"
I'm not using an assembly jar that contains everything, and I also mark
breeze
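As I understand that suggestion, the submit command would look
something like this (a sketch; the application class and jar names
are borrowed from later in this digest, not from this post):

  ./bin/spark-submit --jars breeze-natives_2.10-0.7.jar --class SimpleApp target/scala-2.10/simple-project_2.10-1.0.jar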
OK.
Spark Executor Command: java -cp
Dear all,
I'm testing double-precision matrix multiplication in Spark on EC2
m1.large machines. I use the breeze linalg library, and internally it
calls a native library (OpenBLAS, Nehalem, single-threaded).
m1.large:
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
cpu MHz :
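For context, the kind of benchmark being described looks roughly like
the sketch below (the matrix size and names are mine, not from the
post):

  import breeze.linalg.DenseMatrix
  val n = 2048
  val a = DenseMatrix.rand(n, n)    // random double-precision matrices
  val b = DenseMatrix.rand(n, n)
  val t0 = System.nanoTime()
  val c = a * b                     // calls native BLAS dgemm when netlib natives are on the classpath
  println(s"n=$n took ${(System.nanoTime() - t0) / 1e9} s, checksum ${c(0, 0)}")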
I think maybe it's related to m1.large, because I also tested on my
laptop, and the two cases cost nearly the same amount of time.
my laptop:
model name : Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz
cpu MHz : 2893.549
os:
Linux ubuntu 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC
-natives.jar
4. I also included the classpath of the above jars
but it does not work :(
DB Tsai-2 wrote
Hi Wxhsdp,
I also have some difficulties with sc.addJar(). Since we include the
breeze library by using Spark 1.0, we don't have the problem you ran
into. However, when we add external jars via
Finally I fixed it. The previous failure was caused by missing jars.
I got the local-mode classpath with sbt's show compile:dependencyClasspath
and passed it to the workers, and it works!
Hi, Mayur
I've met the same problem. The instances are on, I can see them from the
EC2 console, and I can connect to them:
wxhsdp@ubuntu:~/spark/spark/tags/v1.0.0-rc3/ec2$ ssh -i wxhsdp-us-east.pem
root@54.86.181.108
The authenticity of host '54.86.181.108 (54.86.181.108)' can't be
established.
ECDSA key
Hi,
Patrick said: "The intermediate shuffle output gets written to disk, but
it often hits the OS buffer cache since it's not explicitly fsync'ed, so
in many cases it stays entirely in memory. The behavior of the shuffle
is agnostic to whether the base RDD is in cache or on disk."
I
14/05/14 20:36:02 INFO Utils: Fetching
http://192.168.0.106:42883/jars/breeze-natives_2.10-0.7.jar to
/tmp/fetchFileTemp7468892065227766972.tmp
14/05/14 20:36:02 INFO Executor: Adding
file:/home/wxhsdp/spark/spark/tags/v1.0.0-rc3/work/app-20140514203557-/0/./breeze-natives_2.10-0.7.jar
to class loader
Dear all,
definition of fetch wait time:
* Time the task spent waiting for remote shuffle blocks. This only includes the time
* blocking on shuffle input data. For instance if block B is being fetched while the task is
* still not finished processing block A, it is not considered to be blocking on block B.
I think so; there have been fewer questions and answers these last three days.
Any ideas? Thanks!
Hi, TD
I tried on v1.0.0-rc3 and still got the error.
Hi,
I'm looking at the event log, and I'm a little confused about some of
the metrics. Here's the info for one task:
Launch Time: 1399336904603
Finish Time: 1399336906465
Executor Run Time: 1781
Shuffle Read Metrics: {Shuffle Finish Time: 1399336906027, Fetch Wait Time: 0}
Shuffle Write Metrics: {Shuffle Bytes
Hi, TD
Actually, I'm not very clear about my Spark version. I checked it out
from https://github.com/apache/spark/trunk on Apr 30. Please tell me
where you get the version Spark 1.0 RC3 from.
I did not call sparkContext.stop(); now I have added it to the end of my
code. Here's the log:
14/05/04 18:48:21 INFO
Hi,
I'm trying to use the breeze linalg library for matrix operations in my
Spark code. I already added a dependency on breeze in my build.sbt, and
packaged my code successfully.
When I run in local mode, sbt run local..., everything is OK, but when I
turn to standalone mode, sbt run
Hi, DB, I think it's something related to sbt publishLocal.
If I remove the breeze dependency from my sbt file, breeze cannot be
found:
[error] /home/wxhsdp/spark/example/test/src/main/scala/test.scala:5: not
found: object breeze
[error] import breeze.linalg._
[error]        ^
Here's my sbt file
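(The sbt file itself is cut off above; a minimal build.sbt for this
setup might look like the sketch below. The exact versions are
assumptions based on the rest of the thread.)

  name := "test"
  scalaVersion := "2.10.4"
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
    "org.scalanlp" %% "breeze" % "0.7"
  )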
Can anyone say something about this?
I fixed it.
I made my sbt project depend on
spark/trunk/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
and it works.
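In sbt, one way to express that dependency is an unmanaged jar entry in
build.sbt (a sketch; the path is the one quoted above):

  unmanagedJars in Compile += file("spark/trunk/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar")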
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-from-Spark-Java-tp4937p5096.html
Sent from the
Hi,
I'm just reviewing advanced Spark features; it's about the PageRank
example.
It said any shuffle operation on two RDDs will take on the partitioner
of one of them, if one is set. So first we partition the Links by a
HashPartitioner, then we join the Links and Ranks0. Ranks0 will take
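The pattern being described looks roughly like this (a sketch after the
standard Spark PageRank example; the names and partition count are mine):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.0

  val links = sc.textFile("links.txt")
    .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
    .groupByKey()
    .partitionBy(new HashPartitioner(8))   // fix links' partitioner
    .cache()
  val ranks0 = links.mapValues(_ => 1.0)   // inherits links' partitioner
  // both sides now share the same partitioner, so the join does not re-shuffle links
  val joined = links.join(ranks0)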
I met the same problem when updating to Spark 0.9.1
(svn checkout https://github.com/apache/spark/)
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq;
at
Thank you for your help, Sourav.
I found the broadcast_0 binary file in the /tmp directory. Its size is
33.4 kB, not equal to the estimated size of 135.6 KB.
I opened it and found its contents have no relation to the file I read
in. I guess broadcast_0 is a config file about Spark, is that right?
You need to import org.apache.spark.rdd.RDD to include RDD.
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
Here are some examples you can learn from:
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
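For instance (a minimal sketch, not from the original reply):

  import org.apache.spark.rdd.RDD

  // a signature that mentions RDD only compiles with the import above
  def lineLengths(lines: RDD[String]): RDD[Int] = lines.map(_.length)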
SK wrote
I am a new user of
Thanks for your reply, Daniel.
What do you mean by "the logs contain everything to reconstruct the same
data"?
I have also looked into the logs many times, but only got a little out
of them. As far as I can see, they log the flow of running the
application, but there are no further details about each task. For
example, see the
Hi, all
I want to do the following operations:
(1) in each partition, do some operations on the partition data in
Array format
(2) split the array into subArrays, and combine each subArray with an id
(3) do a shuffle according to the id
Here is the pseudo code (see also the sketch after it):
/*pseudo code*/
case
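The pseudo code above is cut off, but a minimal runnable sketch of
steps (1)-(3) could look like this (chunkSize, rdd, and all other names
are mine):

  import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.0

  val chunkSize = 4
  val shuffled = rdd.mapPartitions { iter =>
    val data = iter.toArray                           // (1) partition data as an Array
    data.grouped(chunkSize)                           // (2) split into subArrays
        .zipWithIndex
        .map { case (sub, id) => (id, sub) }          //     pair each subArray with an id
  }.groupByKey()                                      // (3) shuffle according to the id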
The only way I can find is to use a 2-D Array, if the split is regular.
Hi, all
I have some questions about debugging in Spark:
1) When the application finishes, the application UI is shut down, and I
cannot see details about the app, like shuffle size, duration, stage
information... There is not sufficient information in the master UI.
Do I need to hang
Thank you. I added setJars, but nothing changes:
val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
  .set("spark.executor.memory", "1g")
  .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))
val sc = new SparkContext(conf)
I tried, but there was no effect.
Qin Wei wrote
try the complete path
qinwei
From: wxhsdp
Date: 2014-04-24 14:21
To: user
Subject: Re: how to set spark.executor.memory and heap size
thank you, I added setJars, but nothing changes
val conf = new SparkConf()
I think maybe it's a problem with reading a local file:
val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
val logData = sc.textFile(logFile).cache()
If I replace the above code with
val logData = sc.parallelize(Array(1,2,3,4)).cache()
the job completes successfully.
Can't I read
Thanks for your reply, Adnan. I tried
val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
I think there need to be three slashes after file:
It behaves just the same as val logFile =
"/home/wxhsdp/spark/example/standalone/README.md"
The error remains :(
...
On Thu, Apr 24, 2014 at 2:25 PM, wxhsdp <wxhsdp@...> wrote:
Thanks for your reply, Adnan. I tried
val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
I think there need to be three slashes after file:
It behaves just the same as val logFile =
/home/wxhsdp/spark/example
-2.10/simple-project_2.10-2.0.jar))
val tr = sc.textFile(logFile).cache
tr.take(100).foreach(println)
}
}
This will work.
On Thu, Apr 24, 2014 at 3:00 PM, wxhsdp <wxhsdp@...> wrote:
hi Arpit,
in the spark shell, I can read the local file properly
hi arpit,
on spark shell, i can read local file properly
Does anyone know the reason? I've googled a bit and found some people
had the same problem, but with no replies...
I noticed that the error occurs
at
org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
at
org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
at
I have a similar question.
I'm testing in standalone mode on only one PC.
I use ./sbin/start-master.sh to start a master and
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
to connect to the master.
From the web UI, I can see the local worker registered.
Hi,
I'm testing SimpleApp.scala in standalone mode with only one PC, so I
have one master and one local worker on the same PC.
Even with a rather small input file size (4.5K), I get the
java.lang.OutOfMemoryError: Java heap space error.
Here are my settings:
spark-env.sh:
export
By the way, the code runs OK in the spark shell.
Hi, all
I used to run my app using sbt run, but now I want to see the job
information in the Spark web UI.
I'm in local mode; I start the spark shell and access the web UI at
http://ubuntu.local:4040/stages/.
But when I sbt run some application, there is no response in the web UI.
How do I make
Hi, all
I'm quite new to Scala; I did some tests in the spark shell:
val b = a.mapPartitions{ D =>
  val p = D.toArray
  ...
  p.toIterator
}
When a is an RDD of type RDD[Int], b.collect() works, but when I change
a to RDD[MyOwnType], b.collect() returns an error:
14/04/20 10:14:46 ERROR
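For reference, the pattern with a custom element type looks like this
minimal sketch (MyOwnType stands in for whatever class was actually
used; in the shell, the class must be defined before the RDD):

  case class MyOwnType(v: Int)
  val a = sc.parallelize(1 to 100).map(MyOwnType(_))
  val b = a.mapPartitions { d =>
    val p = d.toArray
    // ... operate on p here ...
    p.toIterator
  }
  b.collect()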
Thank you so much, Davidson.
Yes, you are right: in both sbt and the spark shell, the result of my
code is 28MB; it's irrelevant to numSlices. Yesterday I got a result of
4.2MB in the spark shell, because I had removed the array
initialization out of laziness :)
for(i <- 0 until size) {
  array(i) = i
}
Hi, all
In order to understand memory usage in Spark, I did the following test:
val size = 1024*1024
val array = new Array[Int](size)
for(i <- 0 until size) {
  array(i) = i
}
val a = sc.parallelize(array).cache() /*4MB*/
val b = a.mapPartitions{ c => {
  val d = c.toArray
  val e = new
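The snippet above is cut off; a self-contained sketch of the same
experiment might read as follows (the body of the mapPartitions from
val e onward is my guess, not the original):

  val size = 1024 * 1024
  val array = new Array[Int](size)
  for (i <- 0 until size) {
    array(i) = i
  }
  val a = sc.parallelize(array).cache()   // ~4 MB of Int values
  val b = a.mapPartitions { c =>
    val d = c.toArray                     // materialize the partition
    val e = new Array[Int](d.length)      // assumed: a same-sized working buffer
    Array.copy(d, 0, e, 0, d.length)
    e.iterator
  }.cache()
  println(b.count())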
Thanks for your help, Davidson!
I modified
val a: RDD[Int] = sc.parallelize(array).cache()
to keep val a an RDD of Int, but got the same result.
Another question: the JVM and Spark memory are located in different
parts of system memory; the Spark code is executed in JVM memory, so an
allocation like val e =
Hi, all
The code under
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/linalg
has changed. The previous matrix classes have all been removed, like
MatrixEntry and MatrixSVD. Instead, breeze matrix definitions appear.
Are we moving to Breeze Linear Algebra for linear
rdd.foreach(p => {
  print(p)
})
The above closure gets executed on the workers; you need to look at the
logs of the workers to see the output.
But if I'm in local mode, where are the logs of the local driver? There
are no /logs and /work dirs under SPARK_HOME, which are set in
standalone mode.
Thank you, it works.
After my operation over p, I return p.toIterator, because mapPartitions
has an iterator return type. Is that right?
rdd.mapPartitions{ D => { val p = D.toArray; ...; p.toIterator } }
In my application, data parts inside an RDD partition have relations,
so I need to do some operations between them.
For example: RDD T1 has several partitions, and each partition has
three parts A, B and C. Then I transform T1 to T2. After the transform,
T2 also has three parts D, E and F, where D = A+B, E =
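The post is cut off, but the intra-partition combination it describes
could be sketched like this (only D = A + B survives in the quote; the
part boundaries and element type are my assumptions):

  val t2 = t1.mapPartitions { iter =>
    val parts = iter.toArray              // whole partition in memory
    val n = parts.length / 3              // assumed: equal-sized parts A, B, C
    val a = parts.slice(0, n)
    val b = parts.slice(n, 2 * n)
    val d = (a, b).zipped.map(_ + _)      // D = A + B, element-wise
    d.iterator
  }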
Yes; how can I do this conveniently? I can use filter, but there would
be so many RDDs, and it's not concise.