Re: Spark Processing Large Data Stuck

2014-06-21 Thread Peng Cheng
The JVM will quit after spending most of its time (about 95%) on GC, but you
usually have to wait a long time before that happens, particularly if your
job is already at massive scale.

Since it is hard to run profiling online, it may be easier for debugging if
you create a lot of partitions (so you can watch the progress bar) and post
the last log entries from before it froze.
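
A minimal sketch of that, assuming a generic RDD called data and a placeholder partition count:

// More partitions mean more tasks, so the progress bar / web UI shows finer-grained
// progress and the stage that hangs is easier to pin down in the logs.
val repartitioned = data.repartition(400)   // placeholder count
repartitioned.count()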



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075p8086.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Powered by Spark addition

2014-06-21 Thread Sonal Goyal
Thanks a lot Matei. 

Sent from my iPad

> On Jun 22, 2014, at 5:20 AM, Matei Zaharia  wrote:
> 
> Alright, added you — sorry for the delay.
> 
> Matei
> 
>> On Jun 12, 2014, at 10:29 PM, Sonal Goyal  wrote:
>> 
>> Hi,
>> 
>> Can we get added too? Here are the details:
>> 
>> Name: Nube Technologies
>> URL: www.nubetech.co
>> Description: Nube provides solutions for data curation at scale, helping with
>> customer targeting, accurate inventory, and efficient analysis.
>> 
>> Thanks!
>> 
>> Best Regards,
>> Sonal
>> Nube Technologies 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On Thu, Jun 12, 2014 at 11:33 PM, Derek Mansen  
>>> wrote:
>>> Awesome, thank you!
>>> 
>>> 
 On Wed, Jun 11, 2014 at 6:53 PM, Matei Zaharia  
 wrote:
 Alright, added you.
 
 Matei
 
> On Jun 11, 2014, at 1:28 PM, Derek Mansen  wrote:
> 
> Hello, I was wondering if we could add our organization to the "Powered 
> by Spark" page. The information is:
> 
> Name: Vistar Media
> URL: www.vistarmedia.com
> Description: Location technology company enabling brands to reach 
> on-the-go consumers.
> 
> Let me know if you need anything else.
> 
> Thanks!
> Derek Mansen
> 


Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Hi Sean,

OK, I'm about 90% sure about the cause of this problem: it's just another
classic dependency conflict:
My project -> Selenium -> apache.httpcomponents:httpcore 4.3.1 (has
ContentType)
Spark -> Spark SQL Hive -> Hive -> Thrift -> apache.httpcomponents:httpcore
4.1.3 (has no ContentType)

I generated an uber jar that excludes Spark/Shark (marked as 'provided') and
indeed includes the latest httpcore 4.3. But by default spark-submit loads
its own uber jar first and then the application's, so unfortunately my
dependency was shadowed. I hope the class loading order can be changed (which
is very unlikely unless someone submits a JIRA), but in the worst case I can
resort to the dumb way: manually renaming packages with the maven-shade plugin.

That will be the plan for tomorrow. However, I'm wondering if there is a
'clean' solution, like a plugin that automagically keeps packages of
different versions side by side, or detects conflicts and renames them to
aliases?
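
For reference, a hedged sketch of the manual package-renaming approach with the maven-shade plugin; the shaded package prefix below is a made-up placeholder:

<!-- Inside the maven-shade-plugin <configuration>: rename org.apache.http in the
     application's uber jar so it can no longer clash with Spark's copy. -->
<relocations>
  <relocation>
    <pattern>org.apache.http</pattern>
    <shadedPattern>myproject.shaded.org.apache.http</shadedPattern>
  </relocation>
</relocations>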



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8083.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Powered by Spark addition

2014-06-21 Thread Matei Zaharia
Alright, added you — sorry for the delay.

Matei

On Jun 12, 2014, at 10:29 PM, Sonal Goyal  wrote:

> Hi,
> 
> Can we get added too? Here are the details:
> 
> Name: Nube Technologies
> URL: www.nubetech.co
> Description: Nube provides solutions for data curation at scale, helping with
> customer targeting, accurate inventory, and efficient analysis.
> 
> Thanks!
> 
> Best Regards,
> Sonal
> Nube Technologies 
> 
> 
> 
> 
> 
> 
> On Thu, Jun 12, 2014 at 11:33 PM, Derek Mansen  wrote:
> Awesome, thank you!
> 
> 
> On Wed, Jun 11, 2014 at 6:53 PM, Matei Zaharia  
> wrote:
> Alright, added you.
> 
> Matei
> 
> On Jun 11, 2014, at 1:28 PM, Derek Mansen  wrote:
> 
>> Hello, I was wondering if we could add our organization to the "Powered by 
>> Spark" page. The information is:
>> 
>> Name: Vistar Media
>> URL: www.vistarmedia.com
>> Description: Location technology company enabling brands to reach on-the-go 
>> consumers.
>> 
>> Let me know if you need anything else.
>> 
>> Thanks!
>> Derek Mansen
> 
> 
> 



Re: Using Spark

2014-06-21 Thread Matei Zaharia
Alright, added you.

On Jun 20, 2014, at 2:52 PM, Ricky Thomas  wrote:

> Hi, 
> 
> Would like to add ourselves to the user list if possible please?
> 
> Company: truedash
> url: truedash.io
> 
> Automatic pulling of all your data into Spark for enterprise visualisation,
> predictive analytics and data exploration at a low cost. 
> 
> Currently in development with a few clients.
> 
> Thanks
> 



Re: Spark Processing Large Data Stuck

2014-06-21 Thread yxzhao
Thanks Krishna,
I use a small cluster and each compute node has 16GB of RAM and 8 2.66GHz
CPU cores.









On Sat, Jun 21, 2014 at 3:16 PM, Krishna Sankar [via Apache Spark User
List]  wrote:

> Hi,
>
>- I have seen similar behavior before. As far as I can tell, the root
>cause is the out of memory error - verified this by monitoring the memory.
>   - I had a 30 GB file and was running on a single machine with 16GB.
>   So I knew it would fail.
>   - But instead of raising an exception, some part of the system
>   keeps on churning.
>- My suggestion is to follow the memory settings for the JVM (try
>bigger settings), make sure the settings are propagated to all the workers
>and finally monitor the memory while the job is running.
>- Another vector is to split the file, try with progressively
>increasing size.
>- I also see symptoms of failed connections. While I can't positively
>say that it is a problem, check your topology & network connectivity.
>- Out of curiosity, what kind of machines are you running ? Bare metal
>? EC2 ? How much memory ? 64 bit OS ?
>   - I assume these are big machines and so the resources themselves
>   might not be a problem.
>
> Cheers
> 
>
>
> On Sat, Jun 21, 2014 at 12:55 PM, yxzhao <[hidden email]
> > wrote:
>
>> I run the pagerank example processing a large data set, 5GB in size,
>> using 48
>> machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
>> attached log shows. It was stuck there for more than 10 hours and then I
>> killed it at last. But I did not find any information explaining why it
>> was
>> stuck. Any suggestions? Thanks.
>>
>> Spark_OK_48_pagerank.log
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n8075/Spark_OK_48_pagerank.log
>> >
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075p8080.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
I also found that any buggy application submitted in --deploy-mode cluster
will crash the worker (turn its status to 'DEAD'). This shouldn't really
happen, otherwise nobody would use this mode. It is still unclear whether all
workers crash or only the one running the driver does (as I only have one
worker).



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8079.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Latest update:
I found the cause of the NoClassDefFoundError: I wasn't using spark-submit;
instead I tried to run the Spark application directly, with SparkConf set in
the code (this is handy for local debugging). However, the old problem
remains: even though my maven-shade plugin doesn't give any warning about
duplicates, it still gives me the same error:

14/06/21 16:43:59 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[Executor task launch worker-2,5,main]
java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.entity.ContentType.parse(ContentType.java:229)
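
For comparison, a minimal sketch of keeping the in-code configuration restricted to local debugging, so cluster runs still go through spark-submit; the driver object and the --local-debug flag are made up for this sketch:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver: hard-code a master only for local debugging and let
// spark-submit supply --master and --jars for cluster runs, so the uber jar
// is distributed by spark-submit rather than by the code.
object MyDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyDriver")
    if (args.contains("--local-debug")) {   // made-up flag
      conf.setMaster("local[8]")
    }
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).count())   // placeholder workload
    } finally {
      sc.stop()
    }
  }
}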



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8078.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Processing Large Data Stuck

2014-06-21 Thread Krishna Sankar
Hi,

   - I have seen similar behavior before. As far as I can tell, the root
   cause is the out of memory error - verified this by monitoring the memory.
  - I had a 30 GB file and was running on a single machine with 16GB.
  So I knew it would fail.
  - But instead of raising an exception, some part of the system keeps
  on churning.
   - My suggestion is to revisit the memory settings for the JVM (try bigger
   settings), make sure the settings are propagated to all the workers, and
   finally monitor the memory while the job is running.
   - Another approach is to split the file and try progressively larger sizes.
   - I also see symptoms of failed connections. While I can't positively
   say that it is a problem, check your topology & network connectivity.
   - Out of curiosity, what kind of machines are you running? Bare metal?
   EC2? How much memory? 64-bit OS?
  - I assume these are big machines, and so the resources themselves
   might not be a problem.

Cheers
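
For illustration, a minimal sketch of setting executor memory and parallelism explicitly before creating the context; the values are placeholders, not recommendations, and should be tuned while watching the web UI:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder settings; tune to the actual cluster and data size.
val conf = new SparkConf()
  .setAppName("pagerank-job")               // placeholder application name
  .set("spark.executor.memory", "12g")      // per-executor JVM heap
  .set("spark.default.parallelism", "384")  // smaller, more numerous partitions
val sc = new SparkContext(conf)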



On Sat, Jun 21, 2014 at 12:55 PM, yxzhao  wrote:

> I run the pagerank example processing a large data set, 5GB in size, using
> 48
> machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
> attached log shows. It was stuck there for more than 10 hours and then I
> killed it at last. But I did not find any information explaining why it was
> stuck. Any suggestions? Thanks.
>
> Spark_OK_48_pagerank.log
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n8075/Spark_OK_48_pagerank.log
> >
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Indeed I see a lot of duplicate package warnings in the maven-shade assembly
output, so I tried to eliminate them:

First I set the scope of the dependency on apache-spark to 'provided', as
suggested on this page:
http://spark.apache.org/docs/latest/submitting-applications.html

But the Spark master gave me a blunt dependency-not-found error:
Exception in thread "main" java.lang.NoClassDefFoundError:
scala/collection/Seq
at ... [my main object]

Then I reverted it back to 'compile' to see if things got better, after
which I again saw duplicate packages and then random errors (like
NoSuchFieldError, IllegalStateException, etc.).

Is setting scope = 'provided' mandatory for deployment? I merely remove this
line when debugging locally.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8076.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Spark Processing Large Data Stuck

2014-06-21 Thread yxzhao
I ran the PageRank example processing a large data set, 5 GB in size, using 48
machines. The job got stuck at the time point 14/05/20 21:32:17, as the
attached log shows. It was stuck there for more than 10 hours and then I
finally killed it. But I did not find any information explaining why it was
stuck. Any suggestions? Thanks.

Spark_OK_48_pagerank.log

  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Processing-Large-Data-Stuck-tp8075.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Performance problems on SQL JOIN

2014-06-21 Thread Michael Armbrust
It's probably because our LEFT JOIN performance isn't super great ATM, since
we'll use a nested-loop join. Sorry! We are aware of the problem and there is
a JIRA to let us do this with a HashJoin instead. If you are feeling brave
you might try pulling in the related PR.

https://issues.apache.org/jira/browse/SPARK-2212
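
In the meantime, one possible workaround, sketched here using the rooms2/rooms3 RDDs from the question quoted below, is to express the left join with the RDD API keyed on (hotelId, toDate), which uses a hash-based join rather than a nested loop:

// Sketch only: left outer join via the RDD API instead of the SQL LEFT JOIN.
// Assumes rooms2 and rooms3 are RDD[BookingInfo] as defined in the question below.
val left  = rooms2.map(b => ((b.hotelId, b.toDate), b))
val right = rooms3.map(b => ((b.hotelId, b.toDate), b))
val joined = left.leftOuterJoin(right)   // (key, (BookingInfo, Option[BookingInfo]))
joined.count()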


On Fri, Jun 20, 2014 at 8:16 AM, mathias 
wrote:

> Hi there,
>
> We're trying out Spark and are experiencing some performance issues using
> Spark SQL.
> Anyone who can tell us if our results are normal?
>
> We are using the Amazon EC2 scripts to create a cluster with 3
> workers/executors (m1.large).
> Tried both spark 1.0.0 as well as the git master; the Scala as well as the
> Python shells.
>
> Running the following code takes about 5 minutes, which seems a long time
> for this query.
>
> val file = sc.textFile("s3n:// ...  .csv");
> val data = file.map(x => x.split('|')); // 300k rows
>
> case class BookingInfo(num_rooms: String, hotelId: String, toDate: String,
> ...);
> val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1),
> ... , x(9))); // 50k rows
> val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1),
> ... , x(9))); // 30k rows
>
> rooms2.registerAsTable("rooms2");
> cacheTable("rooms2");
> rooms3.registerAsTable("rooms3");
> cacheTable("rooms3");
>
> sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId =
> rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count();
>
>
> Are we doing something wrong here?
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problems-on-SQL-JOIN-tp8001.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Thanks a lot! Let me check my maven shade plugin config and see if there is a
fix



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8073.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: zip in pyspark truncates RDD to number of processors

2014-06-21 Thread Kan Zhang
I couldn't reproduce your issue locally, but I suspect it has something to
do with partitioning. zip() works partition by partition, and it assumes the
two RDDs have the same number of partitions and the same number of elements
in each partition. By default, map() doesn't preserve partitioning. Try
setting preservesPartitioning to True and see if the problem persists.
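
The same per-partition contract holds in the Scala API; a small sketch of it (not a reproduction of the PySpark issue):

val a = sc.parallelize(1 to 6, 3)   // 3 partitions, 2 elements each
val b = a.map(_ * 10)               // map keeps the partition layout
a.zip(b).count()                    // 6: partitions line up element for element

val c = sc.parallelize(1 to 6, 2)   // different number of partitions
// a.zip(c) fails, because zip pairs elements partition by partition and
// requires the same number of partitions and elements per partition.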


On Sat, Jun 21, 2014 at 9:37 AM, madeleine 
wrote:

> Consider the following simple zip:
>
> n = 6
> a = sc.parallelize(range(n))
> b = sc.parallelize(range(n)).map(lambda j: j)
> c = a.zip(b)
> print a.count(), b.count(), c.count()
>
> >> 6 6 4
>
> by varying n, I find that c.count() is always min(n,4), where 4 happens to
> be the number of threads on my computer. by calling c.collect(), I see the
> RDD has simply been truncated to the first 4 entries. weirdly, this doesn't
> happen without calling map on b.
>
> Any ideas?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/zip-in-pyspark-truncates-RDD-to-number-of-processors-tp8069.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: sc.textFile can't recognize '\004'

2014-06-21 Thread anny9699
Thanks a lot Sean! It works for me now~~



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sc-textFile-can-t-recognize-004-tp8059p8071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


zip in pyspark truncates RDD to number of processors

2014-06-21 Thread madeleine
Consider the following simple zip:

n = 6
a = sc.parallelize(range(n))
b = sc.parallelize(range(n)).map(lambda j: j) 
c = a.zip(b)
print a.count(), b.count(), c.count()

>> 6 6 4

By varying n, I find that c.count() is always min(n, 4), where 4 happens to
be the number of threads on my computer. By calling c.collect(), I see the
RDD has simply been truncated to the first 4 entries. Weirdly, this doesn't
happen without calling map on b.

Any ideas?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/zip-in-pyspark-truncates-RDD-to-number-of-processors-tp8069.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Set the number/memory of workers under mesos

2014-06-21 Thread Mayur Rustagi
You can do that afterwards as well; it changes application-wide settings for
subsequent tasks.
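
For reference, a rough sketch of what that looks like when recreating the context inside the shell; the master URL and the values are placeholders:

// Inside spark-shell: stop the default context and build one with explicit limits.
sc.stop()
val conf = new org.apache.spark.SparkConf()
  .setMaster("mesos://your-mesos-master:5050")  // placeholder master URL
  .setAppName("shell-with-limits")              // placeholder name
  .set("spark.cores.max", "8")                  // total cores to claim
  .set("spark.executor.memory", "4g")           // memory per executor
val sc2 = new org.apache.spark.SparkContext(conf)
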
On 20 Jun 2014 17:05, "Shuo Xiang"  wrote:

> Hi Mayur,
>   Are you referring to overriding the default sc in sparkshell? Is there
> any way to do that before running the shell?
>
>
> On Fri, Jun 20, 2014 at 1:40 PM, Mayur Rustagi 
> wrote:
>
>> You should be able to configure in spark context in Spark shell.
>> spark.cores.max & memory.
>> Regards
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Fri, Jun 20, 2014 at 4:30 PM, Shuo Xiang 
>> wrote:
>>
>>> Hi, just wondering if anybody knows how to set up the number of workers
>>> (and the amount of memory) in Mesos while launching spark-shell? I was
>>> trying to edit conf/spark-env.sh and it looks like the environment
>>> variables are for YARN or standalone. Thanks!
>>>
>>>
>>>
>>>
>>
>


Re: How to terminate job from the task code?

2014-06-21 Thread Mayur Rustagi
You can terminate a job group from the SparkContext; you'll have to get the
SparkContext across to your task.
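
For reference, a rough sketch of driver-side cancellation with job groups; the group id is a placeholder, and getting the error signal back to the driver is a separate problem:

// Driver side: tag subsequent jobs with a group id so they can be cancelled as a unit.
sc.setJobGroup("my-job-group", "cancellable job")   // placeholder id and description

// ... launch the job, e.g. rdd.sparkContext.runJob(rdd, something.process _),
// possibly from another thread ...

// From any driver thread (for example when a task reports an unrecoverable error
// through a side channel), cancel everything tagged with the group:
sc.cancelJobGroup("my-job-group")
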
On 21 Jun 2014 01:09, "Piotr Kołaczkowski"  wrote:

> If the task detects unrecoverable error, i.e. an error that we can't
> expect to fix by retrying nor moving the task to another node, how to stop
> the job / prevent Spark from retrying it?
>
> def process(taskContext: TaskContext, data: Iterator[T]) {
>...
>
>if (unrecoverableError) {
>   ??? // terminate the job immediately
>}
>...
>  }
>
> Somewhere else:
> rdd.sparkContext.runJob(rdd, something.process _)
>
>
> Thanks,
> Piotr
>
>
> --
> Piotr Kolaczkowski, Lead Software Engineer
> pkola...@datastax.com
>
> http://www.datastax.com/
> 777 Mariners Island Blvd., Suite 510
> San Mateo, CA 94404
>


Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Sean Owen
This inevitably means the run-time classpath includes a different copy
of the same library/class as something in your uber jar, and the
different version is taking precedence. Here it's Apache
HttpComponents. Where exactly it's coming from is specific to your
deployment, but that's the issue.
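
One quick way to see which copy wins at run time is to print where the conflicting class was actually loaded from; a temporary debugging snippet for the driver or a task:

// Prints the jar the conflicting class was loaded from.
val location = classOf[org.apache.http.entity.ContentType]
  .getProtectionDomain.getCodeSource.getLocation
println("ContentType loaded from: " + location)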

On Sat, Jun 21, 2014 at 9:30 AM, Peng Cheng  wrote:
> I have a Spark application that runs perfectly in local mode with 8 threads,
> but when deployed on a single-node cluster. It gives the following error:
>
> ROR TaskSchedulerImpl: Lost executor 0 on 192.168.42.202: Uncaught exception
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> 14/06/21 04:18:53 ERROR TaskSetManager: Task 2.0:0 failed 3 times; aborting
> job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 2.0:0 failed 3 times, most recent failure: Exception
> failure in TID 7 on host 192.168.42.202: java.lang.NoSuchFieldError:
> INSTANCE
> org.apache.http.entity.ContentType.parse(ContentType.java:229)
> ...
>
> This is weird as this error is supposed to be caught by compiler but not jvm
> (unless Spark has changed the content of a class internally, which is
> impossible because the class is in the uber-jar but not closure). Also, I
> can confirm that the class that contains INSTANCE as a property is in the
> uber jar, so there is really no reason for Spark to throw it.
>
> Here is another independent question: I've also encounter several errors
> that only appears in cluster mode, they are hard to fix because I cannot
> debug them. Is there a local cluster simulation mode that can throw all
> errors yet allows me to debug?
>
> Yours Peng
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How do you run your spark app?

2014-06-21 Thread Gerard Maas
Hi Michael,

+1 on the deployment stack. (almost) Same thing here.
One question: are you deploying the JobServer on Mesos? Through Marathon?
I've been working on solving some of the port assignment issues on Mesos,
but I'm not there yet. Did you guys solve that?

-kr, Gerard.





On Thu, Jun 19, 2014 at 11:53 PM, Michael Cutler  wrote:

> When you start seriously using Spark in production there are basically two
> things everyone eventually needs:
>
>1. Scheduled Jobs - recurring hourly/daily/weekly jobs.
>2. Always-On Jobs - that require monitoring, restarting etc.
>
> There are lots of ways to implement these requirements, everything from
> crontab through to workflow managers like Oozie.
>
> We opted for the following stack:
>
>- Apache Mesos  (mesosphere.io distribution)
>
>
>- Marathon  - init/control
>system for starting, stopping, and maintaining always-on applications.
>
>
>- Chronos  - general-purpose
>scheduler for Mesos, supports job dependency graphs.
>
>
>- ** Spark Job Server  -
>primarily for its ability to reuse shared contexts with multiple jobs
>
> The majority of our jobs are periodic (batch) jobs run through
> spark-sumit, and we have several always-on Spark Streaming jobs (also run
> through spark-submit).
>
> We always use "client mode" with spark-submit because the Mesos cluster
> has direct connectivity to the Spark cluster and it means all the Spark
> stdout/stderr is externalised into Mesos logs which helps diagnosing
> problems.
>
> I thoroughly recommend you explore using Mesos/Marathon/Chronos to run
> Spark and manage your Jobs, the Mesosphere tutorials are awesome and you
> can be up and running in literally minutes.  The Web UI's to both make it
> easy to get started without talking to REST API's etc.
>
> Best,
>
> Michael
>
>
>
>
> On 19 June 2014 19:44, Evan R. Sparks  wrote:
>
>> I use SBT, create an assembly, and then add the assembly jars when I
>> create my spark context. The main executor I run with something like "java
>> -cp ... MyDriver".
>>
>> That said - as of spark 1.0 the preferred way to run spark applications
>> is via spark-submit -
>> http://spark.apache.org/docs/latest/submitting-applications.html
>>
>>
>> On Thu, Jun 19, 2014 at 11:36 AM, ldmtwo  wrote:
>>
>>> I want to ask this, not because I can't read endless documentation and
>>> several tutorials, but because there seem to be many ways of doing things
>>> and I keep having issues. How do you run your Spark app?
>>>
>>> I had it working when I was only using yarn+hadoop1 (Cloudera), then I had
>>> to get Spark and Shark working and ended up upgrading everything and dropped
>>> CDH support. Anyway, this is what I used, with master=yarn-client and
>>> app_jar being Scala code compiled with Maven.
>>>
>>> java -cp $CLASSPATH -Dspark.jars=$APP_JAR -Dspark.master=$MASTER
>>> $CLASSNAME
>>> $ARGS
>>>
>>> Do you use this? or something else? I could never figure out this method.
>>> SPARK_HOME/bin/spark jar APP_JAR ARGS
>>>
>>> For example:
>>> bin/spark-class jar
>>>
>>> /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>>> pi 10 10
>>>
>>> Do you use SBT or Maven to compile? or something else?
>>>
>>>
>>> ** It seems that I can't get subscribed to the mailing list; I tried both
>>> my work and personal email addresses.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-run-your-spark-app-tp7935.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>


Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
I have a Spark application that runs perfectly in local mode with 8 threads,
but when deployed on a single-node cluster, it gives the following error:

ROR TaskSchedulerImpl: Lost executor 0 on 192.168.42.202: Uncaught exception
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
14/06/21 04:18:53 ERROR TaskSetManager: Task 2.0:0 failed 3 times; aborting
job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 2.0:0 failed 3 times, most recent failure: Exception
failure in TID 7 on host 192.168.42.202: java.lang.NoSuchFieldError:
INSTANCE
org.apache.http.entity.ContentType.parse(ContentType.java:229)
...

This is weird, as this error is supposed to be caught by the compiler, not the
JVM (unless Spark has changed the content of a class internally, which is
impossible because the class is in the uber jar, not in a closure). Also, I
can confirm that the class that contains INSTANCE as a property is in the
uber jar, so there is really no reason for Spark to throw it.

Here is another, independent question: I've also encountered several errors
that only appear in cluster mode; they are hard to fix because I cannot
debug them. Is there a local cluster simulation mode that throws all the
errors yet still allows me to debug?
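
On the last question, one option worth trying is the local-cluster master URL, which launches separate executor JVMs on a single machine; it is mainly used by Spark's own test suite and may not be documented for your version, so treat this sketch as an assumption:

// Assumed syntax: 2 workers, 1 core each, 512 MB each. Unlike local[N], classes are
// shipped to separate executor JVMs, so classpath and serialization problems tend to
// surface the way they would on a real cluster.
val conf = new org.apache.spark.SparkConf()
  .setMaster("local-cluster[2,1,512]")
  .setAppName("local-cluster-debug")
val sc = new org.apache.spark.SparkContext(conf)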

Yours Peng



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Repeated Broadcasts

2014-06-21 Thread Daedalus
Has anyone used this sort of construct? (Read: bump)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Repeated-Broadcasts-tp7977p8063.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.