Spark ec2 cluster lost worker
Hi, According to the Spark UI, one worker was lost after a failed job. It is not a "lost executor" error; rather, the UI now shows only 8 workers when I have 9. However, the EC2 console shows the machine as "running" with no failed status checks or alarms. So I am confused about how to reconnect the lost machine in AWS EC2. I met this problem before, and my solution then was to rebuild the cluster from scratch. Rebuilding is now inconvenient, so I am wondering whether there is some way to bring the lost worker back? Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ec2-cluster-lost-worker-tp23482.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Array[T].distinct doesn't work inside RDD
Hi, I have a question about Array[T].distinct on a customized class T. My data is an RDD[(String, Array[T])], where T is a class I wrote myself. There are duplicates in each Array[T] that I want to remove. I overrode the equals() method in T and used

    val dataNoDuplicates = dataDuplicates.map { case (id, arr) => (id, arr.distinct) }

to remove duplicates inside the RDD. However, this doesn't work. I ran a further test:

    val dataNoDuplicates = dataDuplicates.map { case (id, arr) =>
      val uniqArr = arr.distinct
      if (uniqArr.length > 1) println(uniqArr.head == uniqArr.last)
      (id, uniqArr)
    }

and the worker stdout always prints "true", i.e. the first and last elements still compare equal even after distinct. I then tried removing duplicates with Array[T].toSet instead of Array[T].distinct, and that works! Could anybody explain why Array[T].toSet and Array[T].distinct behave differently here, and why Array[T].distinct is not working? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Array-T-distinct-doesn-t-work-inside-RDD-tp22412.html
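A likely culprit (an assumption, since the definition of T is not shown in the post) is an equals() override without a matching hashCode(): both distinct and toSet deduplicate through a hash-based set, so two elements that compare equal but hash differently can both survive. A minimal sketch of the pitfall, with a hypothetical element class:

```scala
// Hypothetical class standing in for T: equals overridden, hashCode
// deliberately left at the default (identity-based) to show the pitfall.
class Item(val id: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: Item => this.id == that.id
    case _          => false
  }
  // Missing piece: override def hashCode(): Int = id
}

object DistinctDemo {
  def main(args: Array[String]): Unit = {
    val arr = Array(new Item(1), new Item(1))
    println(arr(0) == arr(1))        // true: equals says they are duplicates
    println(arr.distinct.length)     // may still be 2: hash-based dedup
                                     // never sees them in the same bucket
  }
}
```

Defining T as a case class gives a structural equals and hashCode for free and usually makes both distinct and toSet behave as expected.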
How to configure SparkUI to use internal ec2 ip
Hi, For security reasons, a gateway server was added between my AWS Spark cluster and my local machine, so I can't connect to the cluster directly. To see the Spark UI and each job's stdout and stderr, I used dynamic port forwarding and configured a SOCKS proxy. Now I can see the Spark UI via the internal EC2 IP; however, when I click through to the application UI (port 4040) or a worker's UI (port 8081), the links still use the public DNS instead of the internal EC2 IP, which my browser now can't reach. Is there a way to configure this? I saw that one could configure LOCAL_ADDRESS_IP in spark-env.sh, but I am not sure whether that would help. Has anyone run into the same issue? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-configure-SparkUI-to-use-internal-ec2-ip-tp22311.html
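One setting worth trying (hedged, as behavior varies across Spark versions) is SPARK_PUBLIC_DNS in conf/spark-env.sh on each node, which controls the hostname the master, worker, and application UIs advertise in their links:

```bash
# conf/spark-env.sh on each node.
# The value below is an example; substitute your own internal EC2 hostname or IP.
export SPARK_PUBLIC_DNS=ip-10-0-0-12.ec2.internal
```

The daemons read this at startup, so the workers and master would need a restart after the change.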
output worker stdout to one place
Hi, I am wondering whether there is some way to direct part of the workers' stdout to one place instead of each worker's own stdout. For example, I have the following code:

    rdd.foreach { line =>
      try {
        // do something
      } catch {
        case e: Exception => println(line)
      }
    }

Every time I want to check what caused an exception, I have to look through one worker after another in the UI, because I don't know which worker handled the bad record. Is there a way for the println output to go to a single place, so that I only need to check one location? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/output-worker-stdout-to-one-place-tp21742.html
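One common pattern (a sketch, not the only option; process() and the paths below are placeholders) is to return the failing lines as data instead of printing them on the workers, then gather them on the driver or write them to one output location:

```scala
// Sketch: wrap each record so failures travel back as data.
val results = rdd.map { line =>
  try Right(process(line))          // process() stands in for "do something"
  catch { case e: Exception => Left(line) }
}

// Keep only the failing lines; RDD.collect(PartialFunction) filters and maps.
val failures = results.collect { case Left(line) => line }

failures.saveAsTextFile("hdfs:///logs/failures")  // one place to look
// or, if the failure set is small: failures.collect().foreach(println)
```

An accumulator on the driver is another option for counting or sampling failures without a second pass.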
How to output to S3 and keep the order
Hi, I am using Spark on AWS and want to write the output to S3. The output is relatively small and I don't want it split into multiple part files, so I use result.repartition(1).saveAsTextFile("s3://..."). However, as long as I use the saveAsTextFile method, the output doesn't keep its original order. If I use BufferedWriter in Java instead, I can only write to the master machine, not directly to S3. Is there a way to write to S3 and keep the order at the same time? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-output-to-S3-and-keep-the-order-tp21246.html
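One thing to check (a hedged suggestion, assuming the RDD's existing partition order is the order to preserve): repartition(1) performs a shuffle, which gives no ordering guarantee, while coalesce with shuffle = false concatenates the existing partitions in partition order, so records keep their relative order in the single output file:

```scala
// Sketch: `result` stands in for the poster's RDD.
// No shuffle, so partition 0's records come first, then partition 1's, etc.
result.coalesce(1, shuffle = false).saveAsTextFile("s3://bucket/output")

// If the order should come from the values themselves, sort first
// (requires an Ordering on the element type), then merge:
result.sortBy(identity).coalesce(1, shuffle = false).saveAsTextFile("s3://bucket/sorted")
```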
Re: org/apache/commons/math3/random/RandomGenerator issue
Hi Lev, I also never managed to solve that problem in the end, and switched to java.util.Random. Thanks~ Anny

On Sat, Nov 8, 2014 at 4:21 AM, lev [via Apache Spark User List] wrote:
> Hi,
> I'm using breeze.stats.distributions.Binomial with Spark 1.1.0 and hitting
> the same error. I tried adding the dependency on math3 with versions 3.1.1,
> 3.2, and 3.3, and it didn't help.
>
> Any ideas what might be the problem?
>
> Thanks,
> Lev.
>
> anny9699 wrote:
>> I use breeze.stats.distributions.Bernoulli in my code, however I met
>> this problem:
>> java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p18415.html
worker_instances vs worker_cores
Hi, I have a question about the SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES settings on an AWS EC2 cluster. The default setting on the cluster is

    SPARK_WORKER_CORES = 8
    SPARK_WORKER_INSTANCES = 1

However, after I changed it to

    SPARK_WORKER_CORES = 8
    SPARK_WORKER_INSTANCES = 8

the speed doesn't seem to change much. Could anyone explain this, or give more detail on worker_cores vs worker_instances? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/worker-instances-vs-worker-cores-tp16855.html
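A plausible explanation (hedged, assuming 8 physical cores per machine): the standalone master schedules SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES task slots per machine, so 8 instances of 8 cores each advertises 64 slots on an 8-core box. A CPU-bound job can't run faster than the physical cores allow, so the oversubscription adds JVM overhead without adding throughput. The two configurations below offer the same slot count per machine:

```bash
# conf/spark-env.sh (sketch; assumes 8 physical cores per machine)
# One worker JVM offering 8 slots:
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_CORES=8

# Same 8 slots as 8 single-core worker JVMs (more processes, more overhead,
# but each worker heap is smaller, which can sometimes help GC):
# export SPARK_WORKER_INSTANCES=8
# export SPARK_WORKER_CORES=1
```

Multiple instances mainly make sense on machines with very large RAM, to keep individual worker heaps to a GC-friendly size.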
Spark output to s3 extremely slow
Hi, I found that writing output back to S3 using rdd.saveAsTextFile() is extremely slow, much slower than reading from S3. Is there a way to make it faster? The RDD has 150 partitions, so I assume parallelism is sufficient. Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
lazy evaluation of RDD transformation
Hi, I see that this type of question has been asked before, but I am still a little confused about it in practice. Suppose I apply a series of RDD transformations before an RDD action; which of the following ways is faster?

Way 1:

    val data = sc.textFile(...)
    val data1 = data.map(x => f1(x))
    val data2 = data1.map(x1 => f2(x1))
    println(data2.count())

Way 2:

    val data = sc.textFile(...)
    val data2 = data.map(x => f2(f1(x)))
    println(data2.count())

Since Spark doesn't materialize RDD transformations, I assume the two ways are equivalent? I ask because the memory of my cluster is very limited and I don't want to cache an RDD too early. Is it true that if I cache an RDD early and it takes up space, I need to unpersist it before caching another one in order to save memory? Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/lazy-evaluation-of-RDD-transformation-tp15811.html
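The two forms should indeed do the same work: consecutive narrow transformations like map are pipelined within one stage, so each partition is read once and both functions are applied element by element in the same task. A small sketch with illustrative functions (f1/f2 below are made up, not from the original post):

```scala
// Hypothetical f1/f2 to make the pipelining concrete.
def f1(s: String): String = s.trim
def f2(s: String): Int = s.length

val data = sc.textFile("hdfs:///input")

// Way 1: two chained maps. Still a single pipelined stage; the intermediate
// RDD is never materialized unless .cache()/.persist() is called on it.
val way1 = data.map(f1).map(f2)

// Way 2: one fused map. Same stage boundary, same amount of work per element.
val way2 = data.map(x => f2(f1(x)))
```

On the caching question: unpersist() frees the cached partitions explicitly, and Spark also evicts cached blocks in LRU order under memory pressure, so an early cache is wasteful but not permanently pinned.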
Re: org/apache/commons/math3/random/RandomGenerator issue
Thanks Ted, this is working now! Previously I had added a different commons-math3 jar to my classpath, and that one didn't work. The one pulled in by Maven seems to work. Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p15760.html
Re: org/apache/commons/math3/random/RandomGenerator issue
Hi Ted, I tried including

    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math3</artifactId>
      <version>3.3</version>
    </dependency>

in my pom file and adding this jar to my classpath. However, the error still appears:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
        at breeze.stats.distributions.Bernoulli$.$lessinit$greater$default$2(Bernoulli.scala:30)

So it seems this jar needed to be included when the breeze package itself was built, and simply adding it as a dependency doesn't help. Thanks a lot! Anny

On Sat, Oct 4, 2014 at 2:32 PM, Ted Yu [via Apache Spark User List] wrote:
> The breeze jar doesn't contain the RandomGenerator class.
>
> Have you tried commons-math3 3.1.1 in your pom.xml?
>
> Let us know if you still encounter problems.
>
> Cheers

>> On Sat, Oct 4, 2014 at 2:11 PM, 陈韵竹 wrote:
>> Hi Ted, I did include breeze explicitly in my pom.xml:
>>
>>     <dependency>
>>       <groupId>org.scalanlp</groupId>
>>       <artifactId>breeze_${scala.binary.version}</artifactId>
>>       <version>0.9</version>
>>     </dependency>
>>
>> But this error message still appears. Thanks!

>>> On Sat, Oct 4, 2014 at 2:03 PM, Ted Yu wrote:
>>> See the last comment in that thread from Xiangrui:
>>> bq. include breeze in the dependency set of your project. Do not rely
>>> on transitive dependencies
>>> Cheers

>>>> On Sat, Oct 4, 2014 at 1:48 PM, 陈韵竹 wrote:
>>>> Hi Ted, so according to previous posts, the problem should be solved by
>>>> changing the spark-1.1.0 core pom file? Thanks!

>>>>> On Sat, Oct 4, 2014 at 1:06 PM, Ted Yu wrote:
>>>>> Cycling bits:
>>>>> http://search-hadoop.com/m/JW1q5UX9S1/breeze+spark&subj=Build+error+when+using+spark+with+breeze

>>>>>> On Sat, Oct 4, 2014 at 12:59 PM, anny9699 wrote:
>>>>>> Hi, I use breeze.stats.distributions.Bernoulli in my code, however I met
>>>>>> this problem:
>>>>>> java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
>>>>>> I read the posts about this problem before, and when I added
>>>>>> commons-math3 3.3 (runtime scope) to my pom.xml, more serious issues
>>>>>> appeared. The breeze dependency is already in my pom.xml. I am using
>>>>>> Spark-1.1.0; I didn't seem to meet this issue with Spark-1.0.0. Does
>>>>>> anyone have suggestions?
>>>>>> Thanks a lot! Anny

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p15755.html
org/apache/commons/math3/random/RandomGenerator issue
Hi, I use breeze.stats.distributions.Bernoulli in my code, however I met this problem:

    java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator

I read the posts about this problem before, and when I added

    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math3</artifactId>
      <version>3.3</version>
      <scope>runtime</scope>
    </dependency>

to my pom.xml, more serious issues appeared. The breeze dependency is already in my pom.xml. I am using Spark-1.1.0; I didn't seem to meet this issue when I was using Spark-1.0.0. Does anyone have some suggestions? Thanks a lot! Anny -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748.html
array size limit vs partition number
Hi, Sorry, I am not very familiar with Java. I found that if I set the RDD partition number higher, I get the error "java.lang.OutOfMemoryError: Requested array size exceeds VM limit"; however, if I set the partition number lower, the error is gone. My AWS EC2 cluster has 72 cores, so I first set the partition number to 150 and met the above problem. Then I set the partition number to 100 and the error was gone. Could anybody explain this? Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/array-size-limit-vs-partition-number-tp15695.html
Re: still "GC overhead limit exceeded" after increasing heap space
Hi Liquan, I have 8 workers, each with 15.7 GB of memory. What you said makes sense, but if I don't increase the heap space, it keeps telling me "GC overhead limit exceeded". Thanks! Anny

On Wed, Oct 1, 2014 at 1:41 PM, Liquan Pei [via Apache Spark User List] wrote:
> Hi,
>
> How many nodes are in your cluster? It seems to me that 64g does not help if
> each of your nodes doesn't have that much memory.
>
> Liquan
>
> anny9699 wrote:
>> Hi,
>>
>> After reading some previous posts about this issue, I have increased the
>> Java heap space to "-Xms64g -Xmx64g", but still met the
>> "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does
>> anyone have other suggestions?
>>
>> I am reading about 200 GB of data and my total memory is 120 GB, so I use
>> "MEMORY_AND_DISK_SER" and Kryo serialization.
>>
>> Thanks a lot!
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/still-GC-overhead-limit-exceeded-after-increasing-heap-space-tp15540p15543.html
still "GC overhead limit exceeded" after increasing heap space
Hi, After reading some previous posts about this issue, I increased the Java heap space to "-Xms64g -Xmx64g", but still met the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does anyone have other suggestions? I am reading about 200 GB of data and my total memory is 120 GB, so I use "MEMORY_AND_DISK_SER" and Kryo serialization. Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/still-GC-overhead-limit-exceeded-after-increasing-heap-space-tp15540.html
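One thing to double-check (hedged, since the post doesn't say where -Xms64g/-Xmx64g was applied): in a standalone cluster the driver's heap flags don't govern the executors, where the GC pressure from a cached 200 GB dataset actually occurs. Executor heap and serialization are set per application, roughly like this:

```scala
// Sketch of application-side settings; the values are examples only and must
// fit within each machine's physical memory.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("gc-tuning-sketch")
  .set("spark.executor.memory", "12g") // executor heap, not the driver's -Xmx
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Serialized, spillable storage keeps far fewer live objects on the heap
// than raw MEMORY_ONLY caching, which eases GC overhead.
val data = sc.textFile("hdfs:///big-input").persist(StorageLevel.MEMORY_AND_DISK_SER)
```

With 15.7 GB machines, asking any single JVM for a 64 GB heap would either fail to start or swap heavily, so per-executor settings in this range are the practical lever.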
memory vs data_size
Hi, Is there any guidance on how much total memory is needed for a given data size to achieve reasonably good speed? I have around 200 GB of data, and the current total memory across my 8 machines is around 120 GB. Is that too small for data this big? Even reading the data in and doing simple initial processing seems to take forever. Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/memory-vs-data-size-tp15443.html
about partition number
Hi, I read the past posts about partition number, but am still a little confused about the partitioning strategy. I have a cluster with 8 workers and 2 cores per worker. Is it true that the optimal partition number should be 2-4 times the total core count, or should it be approximately equal to the total core count? Or is it the task number that really determines the speed, rather than the partition number? Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-partition-number-tp15362.html
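For reference, the task count per stage equals the partition count, so the two questions are really one. The common guidance (from the Spark tuning docs; the exact multiplier is workload-dependent) is 2-4 tasks per core, which for this cluster might look like:

```scala
// Sketch: sizing partitions to roughly 2-4 tasks per core.
val totalCores = 8 * 2              // 8 workers * 2 cores each = 16
val numPartitions = totalCores * 3  // somewhere in the 32-64 range

// Set the partition count at read time...
val data = sc.textFile("hdfs:///input", numPartitions)

// ...or adjust an existing RDD (repartition shuffles; coalesce only merges):
val adjusted = data.repartition(numPartitions)
```

More partitions than cores lets fast tasks backfill while slow ones finish, which is why a multiple of the core count usually beats an exact match.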
Re: sc.textFile can't recognize '\004'
Thanks a lot Sean! It works for me now~~ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sc-textFile-can-t-recognize-004-tp8059p8071.html
sc.textFile can't recognize '\004'
Hi, I need to parse a file that is separated by a series of separators. I used SparkContext.textFile and met two problems:

1) One of the separators is '\004', which is recognized by Python, R, and Hive; however, Spark seems unable to recognize it and shows a symbol that looks like '?'. The symbol is not actually a question mark, and I don't know how to split on it.

2) Some of the separators consist of several characters, like "} =>". If I use str.split(Array('}', '=', '>')), it separates the string but leaves many empty strings in the middle. Is there a good way to split by a String instead of an Array of Chars?

Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sc-textFile-can-t-recognize-004-tp8059.html
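Both problems have plain-Scala answers (a sketch; the '?' is most likely just how the console renders the control character, the byte itself survives textFile):

```scala
import java.util.regex.Pattern

val line = "a\u0004b\u0004c"

// 1) '\004' is the EOT control character; in Scala source it is written
//    '\u0004', and StringOps.split accepts it as a Char separator.
val fields = line.split('\u0004')                       // Array("a", "b", "c")

// 2) String#split also takes a regex *string*, so a multi-character
//    separator works directly; Pattern.quote escapes metacharacters
//    such as '}' so "} =>" is matched literally.
val parts = "x} =>y} =>z".split(Pattern.quote("} =>"))  // Array("x", "y", "z")
```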
Re: Do all classes involving RDD operation need to be registered?
Thanks so much Sonal! Things are much clearer to me now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439p3472.html
Re: Do all classes involving RDD operation need to be registered?
Thanks a lot Ognen! It's not a fancy class that I wrote, and I now realize it neither extends Serializable nor is registered with Kryo, which is why it is not working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439p3446.html
Do all classes involving RDD operation need to be registered?
Hi, I am sorry if this has been asked before. I found that if I wrap some methods in a class with constructor parameters, Spark throws a "Task not serializable" exception; however, if they are wrapped in an object or a case class without parameters, it works fine. Is it true that all classes involved in RDD operations should be registered so that SparkContext can recognize them? Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439.html
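A sketch of why this happens, with hypothetical classes (not the poster's actual code; `rdd` below is assumed to be an RDD[Double]):

```scala
// Calling a method on an instance captures that instance in the task closure,
// so the whole instance must be serializable to ship to the workers:
class Scaler(factor: Double) extends Serializable { // remove `extends` and
  def scale(x: Double): Double = x * factor         // "Task not serializable"
}                                                   // comes back
val scaler = new Scaler(2.0)
val scaled = rdd.map(scaler.scale)

// An object's method is reached through a static reference, so nothing is
// captured and nothing needs to be shipped:
object Scale {
  def apply(x: Double): Double = x * 2.0
}
val scaled2 = rdd.map(Scale.apply)
```

On the registration question: Kryo registration only affects how data records are serialized for shuffles and caching; task closures are shipped with Java serialization, so a class referenced in a closure needs java.io.Serializable regardless of Kryo settings.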
Re: java.lang.NullPointerException met when computing new RDD or use .count
Hi Andrew, thanks for the reply. However, I did almost the same thing in another closure:

    val simi = dataByRow.map { point =>
      val corrs = dataByRow.map(x => arrCorr(point._2, x._2))
      (point._1, corrs)
    }

Here dataByRow is of format RDD[(Int, Array[Double])], and arrCorr is a function I wrote to compute the correlation between two Scala arrays. And it worked. So I am a little confused why it worked here and not in other places. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NullPointerException-met-when-computing-new-RDD-or-use-count-tp2766p2779.html
java.lang.NullPointerException met when computing new RDD or use .count
Hi, I met this exception when computing a new RDD from an existing RDD, or when using .count on some RDDs. The following is the situation:

    val DD1 = D.map(d => {
      (d._1, D.map(x => math.sqrt(x._2 * d._2)).toArray)
    })

D is of format RDD[(Int, Double)], and the error message is:

    org.apache.spark.SparkException: Job aborted: Task 14.0:8 failed more than 0 times; aborting job java.lang.NullPointerException
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)

I also met this kind of problem when using .count() on some RDDs. Thanks a lot! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NullPointerException-met-when-computing-new-RDD-or-use-count-tp2766.html
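A likely cause (hedged, since only a fragment is shown): the closure passed to D.map references D itself, and RDD transformations cannot be nested inside other RDD transformations; the inner RDD reference reaches the workers in an invalid state and typically surfaces as a NullPointerException. If D is small enough to fit on the driver, a common workaround is to collect it and broadcast the local copy:

```scala
// Sketch of the nested-RDD fix; D stands in for the poster's RDD[(Int, Double)].
val local = D.collect()           // pull the (assumed small) RDD to the driver
val bcast = sc.broadcast(local)   // ship one read-only copy to each executor

val DD1 = D.map { d =>
  // bcast.value is a plain Array[(Int, Double)], so mapping over it inside
  // the closure is an ordinary local operation, not a nested RDD job.
  (d._1, bcast.value.map(x => math.sqrt(x._2 * d._2)))
}
```

If D is too large to collect, a cartesian or join between the two RDDs is the scalable alternative.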