Spark ec2 cluster lost worker

2015-06-24 Thread anny9699
Hi,

According to the Spark UI, one worker is lost after a failed job. It is not
a "lost executor" error; rather, the UI now shows only 8 workers (I have 9).
However, the EC2 console shows the machine as "running" with no check alarms,
so I am confused about how to reconnect the lost machine in AWS EC2.

I met this problem before, and my solution then was to rebuild the cluster
from scratch. However, rebuilding the cluster is a little hard now, so I am
wondering if there is some way to get the lost machine back?

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ec2-cluster-lost-worker-tp23482.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Array[T].distinct doesn't work inside RDD

2015-04-07 Thread anny9699
Hi, 

I have a question about Array[T].distinct on a customized class T. My data is
an RDD[(String, Array[T])] in which T is a class that I wrote myself. There are
some duplicates in each Array[T] that I want to remove, so I overrode the
equals() method in T and used

val dataNoDuplicates = dataDuplicates.map { case (id, arr) => (id, arr.distinct) }

to remove duplicates inside the RDD. However, this doesn't work, as I found by
doing some further tests using

val dataNoDuplicates = dataDuplicates.map { case (id, arr) =>
  val uniqArr = arr.distinct
  if (uniqArr.length > 1) println(uniqArr.head == uniqArr.last)
  (id, uniqArr)
}

From the worker stdout I could see that it always prints "true". I then tried
removing duplicates with Array[T].toSet instead of Array[T].distinct, and that
works!

Could anybody explain why Array[T].toSet and Array[T].distinct behave
differently here? And why is Array[T].distinct not working?
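
For reference, here is a minimal, made-up sketch of the kind of class I mean
(Record is not my real T). If the cause is that equals() is overridden without
hashCode() (I am not sure that is what happened in my case), then defining both
consistently should make distinct behave, since distinct deduplicates through a
hash set:

// Made-up sketch: distinct collects elements into a mutable HashSet
// internally, so equals() and hashCode() must agree.
class Record(val id: Int, val tag: String) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case r: Record => r.id == id && r.tag == tag
    case _         => false
  }
  override def hashCode: Int = (id, tag).hashCode  // consistent with equals
}

val arr = Array(new Record(1, "a"), new Record(1, "a"), new Record(2, "b"))
println(arr.distinct.length)  // 2 once equals and hashCode agree

Using a case class instead would provide equals and hashCode for free.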

Thanks a lot!
Anny




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Array-T-distinct-doesn-t-work-inside-RDD-tp22412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




How to configure SparkUI to use internal ec2 ip

2015-03-30 Thread anny9699
Hi,

For security reasons, we added a server between my AWS Spark cluster and my
local machine, so I can't connect to the cluster directly. To see the SparkUI
and the workers' stdout and stderr, I used dynamic forwarding and configured a
SOCKS proxy. Now I can see the SparkUI via the internal EC2 IP; however, when I
click through to the application UI (port 4040) or a worker's UI (port 8081),
the links still use the public DNS instead of the internal EC2 IP, which the
browser now can't resolve.

Is there a way to configure this? I saw that one could set SPARK_LOCAL_IP (or
perhaps SPARK_PUBLIC_DNS) in spark-env.sh, but I am not sure whether that would
help. Has anyone experienced the same issue?

Thanks a lot!
Anny




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-configure-SparkUI-to-use-internal-ec2-ip-tp22311.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




output worker stdout to one place

2015-02-20 Thread anny9699
Hi,

I am wondering if there is some way to direct part of the worker stdout to one
place, instead of leaving it scattered across each worker's own stdout. For
example, I have the following code:

rdd.foreach { line =>
  try {
    // do something with line
  } catch {
    case e: Exception => println(line)
  }
}

Every time I want to check what is causing the exception, I have to check one
worker after another in the UI, because I don't know which worker handled the
failing record. Is there a way for the println output to go to one place
instead of the separate worker stdouts, so that I only need to check one place?
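
For concreteness, the closest thing I can think of is to keep the failing lines
and bring them back to the driver instead of printing on the workers. A rough
sketch (rdd and doSomething are placeholders for my real code, and it assumes
the failing lines are few enough to collect):

import scala.util.{Failure, Try}

// Wrap the work in Try, keep only the failures, and print them in one place
// on the driver. rdd and doSomething are placeholders.
val failures = rdd
  .map(line => (line, Try(doSomething(line))))
  .collect { case (line, Failure(e)) => (line, e.getMessage) }  // transformation

failures.collect().foreach { case (line, msg) =>                // action: to driver
  println(s"failed line: $line ($msg)")
}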

Thanks a lot!
Anny



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/output-worker-stdout-to-one-place-tp21742.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




How to output to S3 and keep the order

2015-01-19 Thread anny9699
Hi,

I am using Spark on AWS and want to write the output to S3. It is a relatively
small file and I don't want it written as multiple part files, so I use

result.repartition(1).saveAsTextFile("s3://...")

However, as long as I use the saveAsTextFile method, the output doesn't keep
the original order. If I instead use a BufferedWriter in Java to write the
output, I can only write to the master machine rather than to S3 directly. Is
there a way to write to S3 and at the same time keep the order?
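
For reference, a rough sketch of what I mean by keeping the order (it assumes
the order to keep is result's iteration order just before the save, and the
bucket path is a placeholder): tag each record with its position, sort by that
tag into a single partition, then save.

// Capture the current order explicitly, then sort into one partition right
// before saving so the single output part comes out in that order.
val indexed = result.zipWithIndex()                   // (record, position)
indexed
  .sortBy(_._2, ascending = true, numPartitions = 1)  // one sorted partition
  .map(_._1)
  .saveAsTextFile("s3://my-bucket/output")            // placeholder path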

Thanks a lot!
Anny



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-output-to-S3-and-keep-the-order-tp21246.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-08 Thread anny9699
Hi Lev,

I also finally couldn't solve that problem and switched to java.util.Random.
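
For reference, the workaround amounts to drawing the Bernoulli samples directly
from java.util.Random instead of breeze (a rough sketch; p, the seed, and data
are made-up names):

// Bernoulli(p) draw without breeze, so no commons-math3 dependency is needed.
val p = 0.3
val rng = new java.util.Random(42L)
val draw: Boolean = rng.nextDouble() < p

// Inside an RDD, one Random per partition avoids sharing a single instance:
val flags = data.mapPartitionsWithIndex { (i, iter) =>
  val r = new java.util.Random(42L + i)
  iter.map(x => (x, r.nextDouble() < p))
}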

Thanks~
Anny

On Sat, Nov 8, 2014 at 4:21 AM, lev [via Apache Spark User List] <
ml-node+s1001560n18406...@n3.nabble.com> wrote:

> Hi,
> I'm using breeze.stats.distributions.Binomial with spark 1.1.0 and having
> the same error.
> I tried to add the dependency to math3 with versions 3.11, 3.2, 3.3 and it
> didn't help.
>
> Any ideas what might be the problem?
>
> Thanks,
> Lev.
>
> anny9699 wrote
> I use the breeze.stats.distributions.Bernoulli in my code, however met
> this problem
> java.lang.NoClassDefFoundError:
> org/apache/commons/math3/random/RandomGenerator
>
>
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p18415.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

worker_instances vs worker_cores

2014-10-20 Thread anny9699
Hi,

I have a question about the SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES
settings in an AWS EC2 cluster. The default setting in the cluster is

SPARK_WORKER_CORES = 8
SPARK_WORKER_INSTANCES = 1

However, after I changed it to

SPARK_WORKER_CORES = 8
SPARK_WORKER_INSTANCES = 8

the speed doesn't seem to change very much. Could anyone explain this, or give
more details about worker_cores vs worker_instances?

Thanks a lot!
Anny




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/worker-instances-vs-worker-cores-tp16855.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark output to s3 extremely slow

2014-10-14 Thread anny9699
Hi,

I found that writing output back to S3 using rdd.saveAsTextFile() is extremely
slow, much slower than reading from S3. Is there a way to make it faster? The
RDD has 150 partitions, so I assume parallelism is not the issue.

Thanks a lot!
Anny



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




lazy evaluation of RDD transformation

2014-10-06 Thread anny9699
Hi,

I see that this type of question has been asked before, but I am still a
little confused about it in practice. For example, there are two ways I could
express a series of RDD transformations before I perform an RDD action; which
way is faster?

Way 1:
val data = sc.textFile(...)
val data1 = data.map(x => f1(x))
val data2 = data1.map(x1 => f2(x1))
println(data2.count())

Way 2:
val data = sc.textFile(...)
val data2 = data.map(x => f2(f1(x)))
println(data2.count())

Since Spark doesn't materialize RDD transformations, I assume the two ways are
equivalent?

I ask this because the memory of my cluster is very limited and I don't want
to cache an RDD at a very early stage. Is it true that if I cache an RDD early
and it takes up space, I need to unpersist it before I cache another one, in
order to save memory?
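
To make the second question concrete, this is the kind of pattern I mean
(parse and the paths are placeholders):

// Cache an RDD while it is being reused, then explicitly free it before
// caching the next one. parse and the paths are placeholders.
val dataA = sc.textFile("pathA").map(parse).cache()
dataA.count()        // first action materializes the cached blocks
dataA.count()        // second action reuses them

dataA.unpersist()    // release the memory before caching something else
val dataB = sc.textFile("pathB").map(parse).cache()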

Thanks a lot!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/lazy-evaluation-of-RDD-transformation-tp15811.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Thanks Ted, this is working now!

Previously I added another commons-math3 jar to my classpath and that one
didn't work. The one included by Maven seems to work.

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p15760.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Hi Ted,

I tried including

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.3</version>
</dependency>

in my pom file and adding this jar to my classpath. However, the error still
appears:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
        at breeze.stats.distributions.Bernoulli$.$lessinit$greater$default$2(Bernoulli.scala:30)

So it seems this jar should have been included in breeze when they built the
package, and simply adding it as a dependency doesn't help.

Thanks a lot!
Anny

On Sat, Oct 4, 2014 at 2:32 PM, Ted Yu [via Apache Spark User List] <
ml-node+s1001560n1575...@n3.nabble.com> wrote:

> breeze jar doesn't contain RandomGenerator class.
>
> Have you tried with commons-math3 3.1.1 in your pom.xml ?
>
> Let us know if you still encounter problems.
>
> Cheers
>
> On Sat, Oct 4, 2014 at 2:11 PM, 陈韵竹 <[hidden email]
> <http://user/SendEmail.jtp?type=node&node=15754&i=0>> wrote:
>
>> Hi Ted,
>>
>> I did include explicitly breeze in my pom.xml
>>
>> <dependency>
>>   <groupId>org.scalanlp</groupId>
>>   <artifactId>breeze_${scala.binary.version}</artifactId>
>>   <version>0.9</version>
>> </dependency>
>>
>> But this error message still appears.
>>
>>
>> Thanks!
>>
>> On Sat, Oct 4, 2014 at 2:03 PM, Ted Yu <[hidden email]
>> <http://user/SendEmail.jtp?type=node&node=15754&i=1>> wrote:
>>
>>> See the last comment in that thread from Xiangrui:
>>>
>>> bq. include breeze in the dependency set of your project. Do not rely
>>> on transitive dependencies
>>>
>>> Cheers
>>>
>>> On Sat, Oct 4, 2014 at 1:48 PM, 陈韵竹 <[hidden email]
>>> <http://user/SendEmail.jtp?type=node&node=15754&i=2>> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> So according to previous posts, the problem should be solved by
>>>> changing the spark-1.1.0 core pom file?
>>>>
>>>> Thanks!
>>>>
>>>> On Sat, Oct 4, 2014 at 1:06 PM, Ted Yu <[hidden email]
>>>> <http://user/SendEmail.jtp?type=node&node=15754&i=3>> wrote:
>>>>
>>>>> Cycling bits:
>>>>> http://search-hadoop.com/m/JW1q5UX9S1/breeze+spark&subj=Build+error+when+using+spark+with+breeze
>>>>>
>>>>> On Sat, Oct 4, 2014 at 12:59 PM, anny9699 <[hidden email]
>>>>> <http://user/SendEmail.jtp?type=node&node=15754&i=4>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I use the breeze.stats.distributions.Bernoulli in my code, however
>>>>>> met this
>>>>>> problem
>>>>>> java.lang.NoClassDefFoundError:
>>>>>> org/apache/commons/math3/random/RandomGenerator
>>>>>>
>>>>>> I read the posts about this problem before, and if I added
>>>>>> <dependency>
>>>>>>   <groupId>org.apache.commons</groupId>
>>>>>>   <artifactId>commons-math3</artifactId>
>>>>>>   <version>3.3</version>
>>>>>>   <scope>runtime</scope>
>>>>>> </dependency>
>>>>>> to my pom.xml, more serious issues will appear. The breeze dependency
>>>>>> is
>>>>>> already in my pom.xml. I am using Spark-1.1.0. Seems I didn't meet
>>>>>> this
>>>>>> issue when I was using Spark-1.0.0. Does anyone have some suggestions?
>>>>>>
>>>>>> Thanks a lot!
>>>>>> Anny
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p15755.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Hi,

I use breeze.stats.distributions.Bernoulli in my code, but I met this problem:
java.lang.NoClassDefFoundError:
org/apache/commons/math3/random/RandomGenerator

I read the posts about this problem before, and if I added 

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.3</version>
  <scope>runtime</scope>
</dependency>

to my pom.xml, more serious issues appeared. The breeze dependency is already
in my pom.xml. I am using Spark 1.1.0; it seems I didn't hit this issue when I
was using Spark 1.0.0. Does anyone have any suggestions?

Thanks a lot!
Anny





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




array size limit vs partition number

2014-10-03 Thread anny9699
Hi,

Sorry, I am not very familiar with Java. I found that if I set the RDD
partition number higher, I get the error message "java.lang.OutOfMemoryError:
Requested array size exceeds VM limit"; however, if I set the partition number
lower, the error goes away.

My AWS EC2 cluster has 72 cores, so I first set the partition number to 150
and hit the above problem. Then I set the partition number to 100 and the
error was gone.

Could anybody explain?

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/array-size-limit-vs-partition-number-tp15695.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
Hi Liquan,

I have 8 workers, each with 15.7GB memory.

What you said makes sense, but if I don't increase heap space, it keeps
telling me "GC overhead limit exceeded".

Thanks!
Anny

On Wed, Oct 1, 2014 at 1:41 PM, Liquan Pei [via Apache Spark User List] <
ml-node+s1001560n1554...@n3.nabble.com> wrote:

> Hi
>
> How many nodes in your cluster? It seems to me 64g does not help if each
> of your node doesn't have that many memory.
>
> Liquan
>
> On Wed, Oct 1, 2014 at 1:37 PM, anny9699 <[hidden email]
> <http://user/SendEmail.jtp?type=node&node=15541&i=0>> wrote:
>
>> Hi,
>>
>> After reading some previous posts about this issue, I have increased the
>> java heap space to "-Xms64g -Xmx64g", but still met the
>> "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does
>> anyone
>> have other suggestions?
>>
>> I am reading a data of 200 GB and my total memory is 120 GB, so I use
>> "MEMORY_AND_DISK_SER" and kryo serialization.
>>
>> Thanks a lot!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/still-GC-overhead-limit-exceeded-after-increasing-heap-space-tp15540.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
>
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/still-GC-overhead-limit-exceeded-after-increasing-heap-space-tp15540p15543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
Hi,

After reading some previous posts about this issue, I have increased the
java heap space to "-Xms64g -Xmx64g", but still met the
"java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does anyone
have other suggestions?

I am reading a dataset of about 200 GB and my total memory is 120 GB, so I use
MEMORY_AND_DISK_SER and Kryo serialization.

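For reference, the setup I describe above looks roughly like this (the app
name and the input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Kryo for data serialization plus a storage level that keeps serialized
// blocks in memory and spills them to disk when they don't fit.
val conf = new SparkConf()
  .setAppName("gc-overhead-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val data = sc.textFile("s3://my-bucket/input")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
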
Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/still-GC-overhead-limit-exceeded-after-increasing-heap-space-tp15540.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




memory vs data_size

2014-09-30 Thread anny9699
Hi,

Is there any guidance on how much total memory is needed, for data of a
certain size, to achieve reasonably good speed?

I have a dataset of around 200 GB, and the current total memory across my 8
machines is around 120 GB. Is that too small for data this big? Even reading
the data in and doing some simple initial processing seems to take forever.

Thanks a lot!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/memory-vs-data-size-tp15443.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




about partition number

2014-09-29 Thread anny9699
Hi,

I read the past posts about partition number, but I am still a little confused
about partitioning strategy.

I have a cluster with 8 workers and 2 cores per worker. Is it true that the
optimal partition number should be 2-4 * total_coreNumber, or should it be
approximately equal to total_coreNumber? Or is it the task number that really
determines the speed, rather than the partition number?
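
To make the numbers concrete, this is how I read the 2-4 tasks-per-core rule
of thumb for my setup (the input path is a placeholder):

// 8 workers * 2 cores = 16 cores total; pick a partition count in the 2-4x range.
val totalCores = 8 * 2
val numPartitions = totalCores * 3

val data = sc.textFile("s3://my-bucket/input", numPartitions)
// or, for an RDD that already exists:
val repartitioned = data.repartition(numPartitions)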

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/about-partition-number-tp15362.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: sc.textFile can't recognize '\004'

2014-06-21 Thread anny9699
Thanks a lot Sean! It works for me now~~



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sc-textFile-can-t-recognize-004-tp8059p8071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


sc.textFile can't recognize '\004'

2014-06-20 Thread anny9699
Hi,

I need to parse a file whose fields are delimited by a series of separators. I
used SparkContext.textFile and I met two problems:

1) One of the separators is '\004', which is recognized by Python, R, and
Hive; however, Spark doesn't seem to recognize it and returns a symbol that
looks like '?'. This symbol is not actually a question mark, and I don't know
how to parse it.

2) Some of the separators are composed of several chars, like "} =>". If I use
str.split(Array('}', '=', '>')), it splits the string but leaves many white
spaces in the middle. Is there a good way to split by a String instead of by
an Array of Chars? (See the sketch below.)
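
For reference, a sketch of the kind of splitting I am after (the sample
strings are made up): splitting on the whole "} =>" string, and splitting on
the '\004' character explicitly.

import java.util.regex.Pattern

// String.split takes a regex, so quoting the whole "} =>" separator splits on
// it as one unit; '\u0004' is the "\004" control character and can be used as
// a plain Char separator. The sample strings are made up.
val line = "a} =>b} =>c"
val parts: Array[String] = line.split(Pattern.quote("} =>"))  // Array(a, b, c)

val raw = "x\u0004y\u0004z"
val fields: Array[String] = raw.split('\u0004')               // Array(x, y, z)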

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sc-textFile-can-t-recognize-004-tp8059.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread anny9699
Thanks so much Sonal! I am much clearer now.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439p3472.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Thanks a lot Ognen!

It's not a fancy class that I wrote, and I now realize I neither extended
Serializable nor registered it with Kryo, and that's why it is not working.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439p3446.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Hi,

I am sorry if this has been asked before. I found that if I wrap some methods
in a class with parameters, Spark throws a "Task not serializable" exception;
however, if they are wrapped in an object or a case class without parameters,
it works fine. Is it true that all classes involved in RDD operations should
be registered so that SparkContext can recognize them?
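
To make the question concrete, this is the kind of class I mean (a made-up
sketch; Scaler and rdd are placeholders). If I understand the earlier posts
correctly, such a class needs to extend Serializable to be shipped inside the
closure, and Kryo registration is about efficient data serialization rather
than a replacement for that:

// A parameterized helper class whose method is used inside an RDD closure.
class Scaler(factor: Double) extends Serializable {
  def scale(x: Double): Double = x * factor
}

val scaler = new Scaler(2.0)
val scaled = rdd.map(x => scaler.scale(x))  // rdd: RDD[Double], a placeholder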

Thanks a lot!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Do-all-classes-involving-RDD-operation-need-to-be-registered-tp3439.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread anny9699
Hi Andrew,

Thanks for the reply. However I did almost the same thing in another
closure:

val simi = dataByRow.map(point => {
  val corrs = dataByRow.map(x => arrCorr(point._2, x._2))
  (point._1, corrs)
})

Here dataByRow is of format RDD[(Int, Array[Double])] and arrCorr is a
function that I wrote to compute the correlation between two Scala arrays, and
it worked. So I am a little confused about why it worked here and not in other
places.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NullPointerException-met-when-computing-new-RDD-or-use-count-tp2766p2779.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread anny9699
Hi,

I met this exception when computing a new RDD from an existing RDD, or when
using .count on some RDDs. The following is the situation:

val DD1 = D.map(d => {
  (d._1, D.map(x => math.sqrt(x._2 * d._2)).toArray)
})

D is in the format RDD[(Int, Double)] and the error message is:

org.apache.spark.SparkException: Job aborted: Task 14.0:8 failed more than 0 times; aborting job java.lang.NullPointerException
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)

I also met this kind of problem when using .count() on some RDDs. 

Thanks a lot!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NullPointerException-met-when-computing-new-RDD-or-use-count-tp2766.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.