Standalone cluster on Windows
Hi, I wanted to set up a standalone cluster on a Windows machine, but unfortunately the spark-master.cmd file is not available. Can someone suggest how to proceed, or is the spark-master.cmd file missing from spark-1.0.0? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Standalone-cluster-on-Windows-tp9135.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark Driver
Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote: Hi all, I have one master and two slave node, I did not set any ip for spark driver because I thought it uses its default ( localhost). In my etc/hosts I got the following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2 127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do something in hosts? or should I set a ip to spark-local-ip? I got the following in my stderr: Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: spark Driver
Can you also paste a little bit more stacktrace? Thanks Best Regards On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote: I have the following in spark-env.sh SPARK_MASTER_IP=master SPARK_MASTER_port=7077 Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com On Wednesday, July 9, 2014 2:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote: Hi all, I have one master and two slave node, I did not set any ip for spark driver because I thought it uses its default ( localhost). In my etc/hosts I got the following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2 127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do something in hosts? or should I set a ip to spark-local-ip? I got the following in my stderr: Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: Spark Installation
Hi Srikrishna, the reason for this issue is that you uploaded the assembly jar to HDFS twice. Pasting your command would allow a better diagnosis. 田毅 === 橘云 Platform Product Line, Big Data Product Department, AsiaInfo-Linkage Technologies (China), Inc. Mobile: 13910177261 Tel: 010-82166322 Fax: 010-82166617 QQ: 20057509 MSN: yi.t...@hotmail.com Address: AsiaInfo-Linkage Building, East Area, Yard 10, Dongbeiwang West Road, Haidian District, Beijing === On Jul 9, 2014, at 3:03 AM, Srikrishna S srikrishna...@gmail.com wrote: Hi All, I tried the make distribution script and it worked well. I was able to compile the spark binary on our CDH5 cluster. Once I compiled Spark, I copied over the binaries in the dist folder to all the other machines in the cluster. However, I run into an issue while submitting a job in yarn-client mode. I get an error message that says the following: Resource file:/opt/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar changed on src filesystem (expected 1404845211000, was 1404845404000) My end goal is to submit a job (that uses MLLib) in our Yarn cluster. Any thoughts anyone? Regards, Krishna On Tue, Jul 8, 2014 at 9:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Srikrishna, The binaries are built with something like mvn package -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 -Dyarn.version=2.3.0-cdh5.0.1 -Sandy On Tue, Jul 8, 2014 at 3:14 AM, 田毅 tia...@asiainfo.com wrote: try this command: make-distribution.sh --hadoop 2.3.0-cdh5.0.0 --with-yarn --with-hive 田毅 === 橘云 Platform Product Line, Big Data Product Department, AsiaInfo-Linkage Technologies (China), Inc. Mobile: 13910177261 Tel: 010-82166322 Fax: 010-82166617 QQ: 20057509 MSN: yi.t...@hotmail.com Address: AsiaInfo-Linkage Building, East Area, Yard 10, Dongbeiwang West Road, Haidian District, Beijing === On Jul 8, 2014, at 11:53 AM, Krishna Sankar ksanka...@gmail.com wrote: Couldn't find any reference to CDH in pom.xml - profiles or the hadoop.version. Am also wondering how the CDH-compatible artifact was compiled. Cheers k/ On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S srikrishna...@gmail.com wrote: Hi All, Does anyone know what the command line arguments to mvn are to generate the pre-built binary for Spark on Hadoop 2-CDH5? I would like to pull in a recent bug fix in spark-master and rebuild the binaries in the exact same way that was used for the one provided on the website. I have tried the following: mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 And it doesn't quite work. Any thoughts anyone?
Re: spark Driver
This is exactly what I got Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. I have posted in stackoverflow and I revived this answer http://stackoverflow.com/questions/24571922/apache-spark-stderr-and-stdout/24594576#24594576 I am not sure whether I need to set a ip address to driver ? do I need a separate machine for driver ? Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com On Wednesday, July 9, 2014 2:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you also paste a little bit more stacktrace? Thanks Best Regards On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote: I have the following in spark-env.sh SPARK_MASTER_IP=master SPARK_MASTER_port=7077 Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com On Wednesday, July 9, 2014 2:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote: Hi all, I have one master and two slave node, I did not set any ip for spark driver because I thought it uses its default ( localhost). In my etc/hosts I got the following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2 127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do something in hosts? or should I set a ip to spark-local-ip? I got the following in my stderr: Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
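For reference, a minimal sketch of pinning the master URL and the driver's advertised address from the application itself via SparkConf, rather than relying on defaults; the host names and port here are assumptions mirroring the /etc/hosts entries in this thread, not verified values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical host names matching the thread's /etc/hosts entries.
    val conf = new SparkConf()
      .setAppName("driver-host-example")
      .setMaster("spark://master:7077")       // must match the address the master binds to
      .set("spark.driver.host", "master")     // address the executors use to call back to the driver
    val sc = new SparkContext(conf)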
Need advice to create an objectfile of set of images from Spark
Hi all, I need to run a spark job that needs a set of images as input. I need something that loads these images as an RDD, but I just don't know how to do that. Do some of you have any idea? Cheers, Jao
Re: Requirements for Spark cluster
You can use the spark-ec2/bdutil scripts to set it up on the AWS/GCE cloud quickly. If you want to set it up on your own then these are the things that you will need to do: 1. Make sure you have java (7) installed on all machines. 2. Install and configure spark (add all slave nodes in conf/slaves file) 3. Rsync spark directory to all slaves and start it. If you are already having hadoop, then you might need to get that version of spark which is compiled with that hadoop version (or you can compile it with option SPARK_HADOOP_VERSION) Thanks Best Regards On Wed, Jul 9, 2014 at 6:54 AM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a cluster. What needs to be installed on the master? What needs to be installed on the workers? If the cluster already has Hadoop or YARN or Cloudera, does it still need an install of Spark?
How to clear the list of Completed Appliations in Spark web UI?
Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?
Re: Spark Streaming using File Stream in Java
Try this out:

    JavaStreamingContext ssc = new JavaStreamingContext(...);
    JavaDStream<String> lines = ssc.textFileStream("whatever");
    JavaDStream<String> words = lines.flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
      });
    JavaPairDStream<String, Integer> ones = words.map(
      new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
      });
    JavaPairDStream<String, Integer> counts = ones.reduceByKey(
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 + i2; }
      });

Actually modified from https://spark.apache.org/docs/0.9.1/java-programming-guide.html#example Thanks Best Regards On Wed, Jul 9, 2014 at 6:03 AM, Aravind aravindb...@gmail.com wrote: Hi all, I am trying to run the NetworkWordCount.java file in the streaming examples. The example shows how to read from a network socket. But my use case is that I have a local log file which is a stream and continuously updated (say /Users/.../Desktop/mylog.log). I would like to write the same NetworkWordCount.java using this filestream jssc.fileStream(dataDirectory); Question: 1. How do I write a mapreduce function for the above to measure wordcounts (in java, not scala)? 2. Also, does the streaming application stop if the file is not updating, or does it continuously poll for file updates? I am a new user of Apache Spark Streaming. Kindly help me as I am totally stuck. Thanks in advance. Regards Aravind -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-File-Stream-in-Java-tp9115.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: error when spark access hdfs with Kerberos enable
Hi Cheney, I haven't heard of anybody deploying non-secure YARN on top of secure HDFS. It's conceivable that you might be able to get work, but my guess is that you'd run into some issues. Also, without authentication on in YARN, you could be leaving your HDFS tokens exposed, which others could steal and get to your data. -Sandy On Tue, Jul 8, 2014 at 7:28 PM, Cheney Sun sun.che...@gmail.com wrote: Hi Sandy, We are also going to grep data from a security enabled (with kerberos) HDFS in our Spark application. Per you answer, we have to switch Spark on YARN to achieve this. We plan to deploy a different Hadoop cluster(with YARN) only to run Spark. Is it necessary to deploy YARN with security enabled? Or is it possible to access data within a security HDFS from no-security enabled Spark on YARN? On Wed, Jul 9, 2014 at 4:19 AM, Sandy Ryza sandy.r...@cloudera.com wrote: That's correct. Only Spark on YARN supports Kerberos. -Sandy On Tue, Jul 8, 2014 at 12:04 PM, Marcelo Vanzin van...@cloudera.com wrote: Someone might be able to correct me if I'm wrong, but I don't believe standalone mode supports kerberos. You'd have to use Yarn for that. On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote: Hi all, I encounter a strange issue when using spark 1.0 to access hdfs with Kerberos I just have one spark test node for spark and HADOOP_CONF_DIR is set to the location containing the hdfs configuration files(hdfs-site.xml and core-site.xml) When I use spark-shell with local mode, the access to hdfs is successfully . However, If I use spark-shell which connects to the stand alone cluster (I configured the spark as standalone cluster mode with only one node). The access to the hdfs fails with the following error: “Can't get Master Kerberos principal for use as renewer” Anyone have any ideas on this ? Thanks a lot. Regards, Xiaowei -- Marcelo
Re: Re: Pig 0.13, Spark, Spork
Hi Bertrand, We've updated the document http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0 This is our working Github repo https://github.com/sigmoidanalytics/spork/tree/spork-0.9 Feel free to open issues over here https://github.com/sigmoidanalytics/spork/issues Thanks Best Regards On Tue, Jul 8, 2014 at 2:33 PM, Bertrand Dechoux decho...@gmail.com wrote: @Mayur : I won't fight with the semantic of a fork but at the moment, no Spork does take the standard Pig as dependency. On that, we should agree. As for my use of Pig, I have no limitation. I am however interested to see the rise of a 'no-sql high level non programming language' for Spark. @Zhang : Could you elaborate your reference about Twitter? Bertrand Dechoux On Tue, Jul 8, 2014 at 4:04 AM, 张包峰 pelickzh...@qq.com wrote: Hi guys, previously I checked out the old spork and updated it to Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1, see github project of mine https://github.com/pelick/flare-spork It it also highly experimental, and just directly mapping pig physical operations to spark RDD transformations/actions. It works for simple requests. :) I am also interested on the progress of spork, is it undergoing in Twitter in an un open-source way? -- Thanks Zhang Baofeng Blog http://blog.csdn.net/pelick | Github https://github.com/pelick | Weibo http://weibo.com/pelickzhang | LinkedIn http://www.linkedin.com/pub/zhang-baofeng/70/609/84 -- 原始邮件 -- *发件人:* Mayur Rustagi;mayur.rust...@gmail.com; *发送时间:* 2014年7月7日(星期一) 晚上11:55 *收件人:* user@spark.apache.orguser@spark.apache.org; *主题:* Re: Pig 0.13, Spark, Spork That version is old :). We are not forking pig but cleanly separating out pig execution engine. Let me know if you are willing to give it a go. Also would love to know what features of pig you are using ? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux decho...@gmail.com wrote: I saw a wiki page from your company but with an old version of Spark. http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 I have no reason to use it yet but I am interested in the state of the initiative. What's your point of view (personal and/or professional) about the Pig 0.13 release? Is the pluggable execution engine flexible enough in order to avoid having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D As a (for now) external observer, I am glad to see competition in that space. It can only be good for the community in the end. Bertrand Dechoux On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi, We have fixed many major issues around Spork deploying it with some customers. Would be happy to provide a working version to you to try out. We are looking for more folks to try it out submit bugs. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux decho...@gmail.com wrote: Hi, I was wondering what was the state of the Pig+Spark initiative now that the execution engine of Pig is pluggable? Granted, it was done in order to use Tez but could it be used by Spark? I know about a 'theoretical' project called Spork but I don't know any stable and maintained version of it. Regards Bertrand Dechoux
Re: How to clear the list of Completed Appliations in Spark web UI?
There isn't currently a way to do this, but it will start dropping older applications once more than 200 are stored. On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote: Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?
how to host the drive node
I have one master and two slave nodes, I did not set any ip for spark driver. My question is should I set a ip for spark driver and can I host the driver inside the cluster in master node? if so, how to host it? will it be hosted automatically in that node we submit the application by spark-submit? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-host-the-drive-node-tp9149.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to host spark driver
I have one master and two slave nodes, I did not set any ip for spark driver. My question is should I set a ip for spark driver and can I host the driver inside the cluster in master node? if so, how to host it? will it be hosted automatically in that node we submit the application by spark-submit? Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: Purpose of spark-submit?
It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
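As a rough illustration of the "configure it yourself" alternative discussed in this thread, here is a sketch of setting the same kinds of values directly on SparkConf instead of passing them to spark-submit; the specific keys, values, and master URL are only examples, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    // Values that spark-submit would normally inject, hard-coded here for illustration.
    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://master:7077")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)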
Re: Why doesn't the driver node do any work?
I have one master and two slave nodes, I did not set any ip for spark driver. My question is should I set a ip for spark driver and can I host the driver inside the cluster in master node? if so, how to host it? will it be hosted automatically in that node we submit the application by spark-submit? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-doesn-t-the-driver-node-do-any-work-tp3909p9153.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Comparative study
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the Hadoop ecosystem. Impala really shines when your (It was not built to replace Hive. It's purpose-built to make interactive use with a BI tool feasible -- single-digit-second queries on huge data sets. It's very memory hungry. Hive's architecture choices and legacy code have been throughput-oriented and can't really get below minutes at scale, but it remains the right choice when you are in fact doing ETL!)
Re: Which is the best way to get a connection to an external database per task in Spark Streaming?
Hi Jerry, it's all clear to me now, I will try with something like Apache DBCP for the connection pool Thanks a lot for your help! 2014-07-09 3:08 GMT+02:00 Shao, Saisai saisai.s...@intel.com: Yes, that would be the Java equivalence to use static class member, but you should carefully program to prevent resource leakage. A good choice is to use third-party DB connection library which supports connection pool, that will alleviate your programming efforts. Thanks Jerry *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com] *Sent:* Tuesday, July 08, 2014 6:54 PM *To:* user@spark.apache.org *Subject:* Re: Which is the best way to get a connection to an external database per task in Spark Streaming? Hi Jerry, thanks for your answer. I'm using Spark Streaming for Java, and I only have rudimentary knowledge about Scala, how could I recreate in Java the lazy creation of a singleton object that you propose for Scala? Maybe a static class member in Java for the connection would be the solution? Thanks again for your help, Best Regards, Juan 2014-07-08 11:44 GMT+02:00 Shao, Saisai saisai.s...@intel.com: I think you can maintain a connection pool or keep the connection as a long-lived object in executor side (like lazily creating a singleton object in object { } in Scala), so your task can get this connection each time executing a task, not creating a new one, that would be good for your scenario, since create a connection is quite expensive for each task. Thanks Jerry *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com] *Sent:* Tuesday, July 08, 2014 5:19 PM *To:* Tobias Pfeiffer *Cc:* user@spark.apache.org *Subject:* Re: Which is the best way to get a connection to an external database per task in Spark Streaming? Hi Tobias, thanks for your help. I understand that with that code we obtain a database connection per partition, but I also suspect that with that code a new database connection is created per each execution of the function used as argument for mapPartitions(). That would be very inefficient because a new object and a new database connection would be created for each batch of the DStream. But my knowledge about the lifecycle of Functions in Spark Streaming is very limited, so maybe I'm wrong, what do you think? Greetings, Juan 2014-07-08 3:30 GMT+02:00 Tobias Pfeiffer t...@preferred.jp: Juan, I am doing something similar, just not insert into SQL database, but issue some RPC call. I think mapPartitions() may be helpful to you. You could do something like dstream.mapPartitions(iter = { val db = new DbConnection() // maybe only do the above if !iter.isEmpty iter.map(item = { db.call(...) // do some cleanup if !iter.hasNext here item }) }).count() // force output Keep in mind though that the whole idea about RDDs is that operations are idempotent and in theory could be run on multiple hosts (to take the result from the fastest server) or multiple times (to deal with failures/timeouts) etc., which is maybe something you want to deal with in your SQL. Tobias On Tue, Jul 8, 2014 at 3:40 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi list, I'm writing a Spark Streaming program that reads from a kafka topic, performs some transformations on the data, and then inserts each record in a database with foreachRDD. I was wondering which is the best way to handle the connection to the database so each worker, or even each task, uses a different connection to the database, and then database inserts/updates would be performed in parallel. 
- I understand that using a final variable in the driver code is not a good idea because then the communication with the database would be performed in the driver code, which leads to a bottleneck, according to http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/ - I think creating a new connection in the call() method of the Function passed to foreachRDD is also a bad idea, because then I wouldn't be reusing the connection to the database for each batch RDD in the DStream - I'm not sure that a broadcast variable with the connection handler is a good idea in case the target database is distributed, because if the same handler is used for all the nodes of the Spark cluster then than could have a negative effect in the data locality of the connection to the database. - From http://apache-spark-user-list.1001560.n3.nabble.com/Database-connection-per-worker-td1280.html I understand that by using an static variable and referencing it in the call() method of the Function passed to foreachRDD we get a different connection per Spark worker, I guess it's because there is a different JVM per worker. But then all the tasks in the same worker would share the same database handler object, am I right? - Another idea is
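A rough sketch of the lazily-initialized, per-JVM connection holder discussed above (the Scala equivalent of a Java static member), combined with foreachPartition so each partition reuses one connection; the JDBC URL, credentials, table, and SQL are hypothetical placeholders, and a pooling library such as DBCP could back the holder instead of a single connection:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical holder: the connection is created at most once per executor JVM,
    // lazily on first use, and reused by subsequent tasks on that executor.
    object ConnectionHolder {
      lazy val conn = java.sql.DriverManager.getConnection(
        "jdbc:mysql://db-host/mydb", "user", "pass")   // placeholder URL and credentials
    }

    def save(dstream: DStream[String]): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val stmt = ConnectionHolder.conn.createStatement()
          records.foreach { r =>
            stmt.executeUpdate("INSERT INTO events VALUES ('" + r + "')") // placeholder SQL
          }
          stmt.close()
        }
      }
    }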
Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?
Thanks for your input, Koert and DB. Rebuilding with 9.x didn't seem to work. For now we've downgraded dropwizard to 0.6.2 which uses a compatible version of jetty. Not optimal, but it works for now. On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai dbt...@dbtsai.com wrote: We're doing similar thing to lunch spark job in tomcat, and I opened a JIRA for this. There are couple technical discussions there. https://issues.apache.org/jira/browse/SPARK-2100 In this end, we realized that spark uses jetty not only for Spark WebUI, but also for distributing the jars and tasks, so it really hard to remove the web dependency in Spark. In the end, we lunch our spark job in yarn-cluster mode, and in the runtime, the only dependency in our web application is spark-yarn which doesn't contain any spark web stuff. PS, upgrading the spark jetty 8.x to 9.x in spark may not be straightforward by just changing the version in spark build script. Jetty 9.x required Java 7 since the servlet api (servlet 3.1) requires Java 7. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Jul 8, 2014 at 8:43 AM, Koert Kuipers ko...@tresata.com wrote: do you control your cluster and spark deployment? if so, you can try to rebuild with jetty 9.x On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Digging a bit more I see that there is yet another jetty instance that is causing the problem, namely the BroadcastManager has one. I guess this one isn't very wise to disable... It might very well be that the WebUI is a problem as well, but I guess the code doesn't get far enough. Any ideas on how to solve this? Spark seems to use jetty 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source of the problem. Any ideas? On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Hi! I am building a web frontend for a Spark app, allowing users to input sql/hql and get results back. When starting a SparkContext from within my server code (using jetty/dropwizard) I get the error java.lang.NoSuchMethodError: org.eclipse.jetty.server.AbstractConnector: method init()V not found when Spark tries to fire up its own jetty server. This does not happen when running the same code without my web server. This is probably fixable somehow(?) but I'd like to disable the webUI as I don't need it, and ideally I would like to access that information programatically instead, allowing me to embed it in my own web application. Is this possible? -- Best regards, Martin Gammelsæter -- Mvh. Martin Gammelsæter 92209139 -- Mvh. Martin Gammelsæter 92209139
Re: issues with ./bin/spark-shell for standalone mode
Hey Mikhail, I think (hope?) the -em and -dm options were never in an official Spark release. They were just in the master branch at some point. Did you use these during a previous Spark release or were you just on master? - Patrick On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov streb...@gmail.com wrote: Thanks Andrew, ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30 --executor-memory 20g --driver-memory 10g works well, just wanted to make sure that I'm not missing anything -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Initial job has not accepted any resources means many things
It seems like the Initial job has not accepted any resources; message shows up for a wide variety of different errors (for example the obvious one where you've requested more memory than is available), but also, for example, in the case where the worker nodes do not have the appropriate code on their class path. Debugging from this error is very hard as errors do not show up in the logs on the workers. Is this a known issue? I'm having issues with getting the code to the workers without using addJar (my code is a fairly static application, and I'd like to avoid using addJar every time the app starts up, and instead manually add the jar to the classpath of every worker), but I can't seem to find out how. -- Best regards, Martin Gammelsæter
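One way to get application code onto the workers without calling addJar on every start is to list the jar when the context is built; a minimal sketch, assuming the jar path and master URL below stand in for the real ones:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("static-app")
      .setMaster("spark://master:7077")                  // assumed master URL
      .setJars(Seq("/opt/myapp/myapp-assembly.jar"))     // hypothetical path to the application jar
    val sc = new SparkContext(conf)
    // The listed jar is shipped once per application, so tasks find the classes on the worker classpath.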
RE: Kryo is slower, and the size saving is minimal
Thank you for your response. Maybe that applies to my case. In my test case, the types of almost all of the data are either primitive types, joda DateTime, or String. But I'm somewhat disappointed with the speed. At least it should not be slower than the default Java serializer... -Original Message- From: wxhsdp [mailto:wxh...@gmail.com] Sent: Wednesday, July 09, 2014 5:47 PM To: u...@spark.incubator.apache.org Subject: Re: Kryo is slower, and the size saving is minimal I'm not familiar with Kryo and my opinion may not be right. In my case, Kryo only saves about 5% of the original size when dealing with primitive types such as Arrays. I'm not sure whether that is the common case. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-is-slower-and-the-size-saving-is-minimal-tp9131p9160.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
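For what it's worth, Kryo usually helps more once the concrete classes are registered; a sketch of a registrator for the kinds of types mentioned above, where the registrator class name and the registered types are assumptions, not the poster's actual setup:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical registrator; register the classes that dominate the data.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[org.joda.time.DateTime])
        kryo.register(classOf[Array[Double]])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")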
Re: TaskContext stageId = 0
Oh well, never mind. The problem is that ResultTask's stageId is immutable and is used to construct the Task superclass. Anyway, my solution now is to use this.id for the rddId and to gather all rddIds using a spark listener on stage completed to clean up for any activity registered for those rdds. I could use TaskContext's hook but I'd have to add some more messaging so I can clear state that may live on a different executor than the one my partition is on, but since I don't know that the executor will succeed, this is not safe. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TaskContext-stageId-0-tp9152p9162.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to run a job on all workers?
Is it possible to run a job that assigns work to every worker in the system? My bootleg right now is to have a spark listener hear whenever a block manager is added and to increase a split count by 1. It runs a spark job with that split count and hopes that it will at least run on the newest worker. There's some weirdness with block managers being removed sometimes allowing my count to go negative, so I just keep monotonically increasing my split count. Anyone have a way that doesn't suck? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-a-job-on-all-workers-tp9163.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Error using MLlib-NaiveBayes : Matrices are not aligned
I am using Naive Bayes in MLlib . Below I have printed log of *model.theta*. after training on train data. You can check that it contains 9 features for 2 class classification. print numpy.log(model.theta) [[ 0.31618962 0.16636852 0.07200358 0.05411449 0.08542039 0.17620751 0.03711986 0.07110912 0.02146691] [ 0.26598639 0.23809524 0.06258503 0.01904762 0.05714286 0.18231293 0.08911565 0.03061224 0.05510204]] I am giving the same no. of features for prediction but I am getting error: *matrices not alligned.* *ERROR:* Traceback (most recent call last): File .\naive_bayes_analyser.py, line 192, in module prediction = model.predict(array([float(pos_count),float(neg_count),float(need_count),float(goal_count),float(try_co unt),float(means_count),float(persist_count),float(complete_count),float(fail_count)])) File F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py, line 101, in predict return numpy.argmax(self.pi + dot(x, self.theta)) ValueError: matrices are not aligned Can someone suggest me the possible error/mistake? Thanks in advance -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Filtering data during the read
Hi all, I wondered if you could help me to clarify the following situation: in the classic example

    val file = spark.textFile("hdfs://...")
    val errors = file.filter(line => line.contains("ERROR"))

As I understand it, the data is read into memory first, and after that the filtering is applied. Is there any way to apply the filtering during the read step, and not put all objects into memory? Thank you, Konstantin Kudryavtsev
Re: Filtering data during the read
Hi, Spark does that out of the box for you :) It compresses down the execution steps as much as possible. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi all, I wondered if you could help me to clarify the next situation: in the classic example val file = spark.textFile(hdfs://...) val errors = file.filter(line = line.contains(ERROR)) As I understand, the data is read in memory in first, and after that filtering is applying. Is it any way to apply filtering during the read step? and don't put all objects into memory? Thank you, Konstantin Kudryavtsev
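To make that concrete, a small sketch: both textFile and filter are lazy transformations, and nothing is read until an action runs, so the unfiltered lines are streamed through the filter rather than all being held in memory (the path and pattern are just examples):

    val file = sc.textFile("hdfs://...")                        // lazy: nothing is read yet
    val errors = file.filter(line => line.contains("ERROR"))    // lazy: only records the transformation
    val count = errors.count()                                   // action: lines are read and filtered
                                                                 // partition by partition here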
Docker Scripts
Hi, Regarding the docker scripts, I know I can change the base image easily, but is there any specific reason why the base image is hadoop_1.2.1? Why is this preferred to Hadoop 2 (HDP2, CDH5) distributions? Now that Amazon supports Docker, could this replace the ec2 scripts? Kind regards Dimitri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Docker-Scripts-tp9167.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
FW: memory question
Hi, Does anyone know if it is possible to call the MetadataCleaner on demand? i.e. rather than set spark.cleaner.ttl and have this run periodically, I'd like to run it on demand. The problem with periodic cleaning is that it can remove RDDs that we still require (some calcs are short, others very long). We're using Spark 0.9.0 with the Cloudera distribution. I have a simple test calculation in a loop as follows:

    val test = new TestCalc(sparkContext)
    for (i <- 1 to 10) {
      val (x) = test.evaluate(rdd)
    }

Where TestCalc is defined as:

    class TestCalc(sparkContext: SparkContext) extends Serializable {
      def aplus(a: Double, b: Double): Double = a + b
      def evaluate(rdd: RDD[Double]) = {
        /* do some dummy calc. */
        val x = rdd.groupBy(x => x / 2.0)
        val y = x.fold((0.0, Seq[Double]()))((a, b) => (aplus(a._1, b._1), Seq()))
        val z = y._1
        /* try with/without this... */
        val e: SparkEnv = SparkEnv.getThreadLocal
        e.blockManager.master.removeRdd(x.id, true) // still see memory consumption go up...
        (z)
      }
    }

What I can see on the cluster is that the memory usage on the node executing this continually climbs. I'd expect it to level off and not jump up over 1G... I thought that putting in the 'removeRdd' line might help, but it doesn't seem to make a difference. Regards, Mike
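A hedged aside on the on-demand part of the question: explicitly cached RDDs can be dropped on demand with unpersist, without enabling the periodic TTL cleaner; a small sketch under that assumption (it only covers blocks that were cached in the first place):

    val x = rdd.groupBy(v => v / 2.0).cache()   // cache only if the result is reused
    // ... use x in the computation ...
    x.unpersist()                               // drop the cached blocks on demand, no spark.cleaner.ttl needed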
Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
Hello, While trying to run the example below I am getting errors. I have built Spark using the following command:

    $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

Running the example using spark-shell:

    $ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

    scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    case class Person(name: String, age: Int)
    val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")
    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Error:

    java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
    at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-java-lang-NoClassDefFoundError-Could-not-initialize-class-line10-read-tp9170.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Purpose of spark-submit?
not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
Re: Purpose of spark-submit?
Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: Need advice to create an objectfile of set of images from Spark
The idea is to run a job that uses images as input, so that each worker will process a subset of the images. On Wed, Jul 9, 2014 at 2:30 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: RDD can only keep objects. How do you plan to encode these images so that they can be loaded? Keeping the whole image as a single object in one RDD would perhaps not be super optimized. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 12:17 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I need to run a spark job that needs a set of images as input. I need something that loads these images as an RDD, but I just don't know how to do that. Do some of you have any idea? Cheers, Jao
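A minimal sketch of one way to do this without any built-in binary-file input: parallelize the list of image paths and have each worker read its own subset; the shared directory, partition count, and downstream processing are placeholders, and the files must be reachable from every worker:

    import java.nio.file.{Files, Paths}

    // Hypothetical directory visible to every worker (shared filesystem or identical local copies).
    val imagePaths = new java.io.File("/shared/images").listFiles.map(_.getAbsolutePath).toSeq

    // Each worker reads and decodes only the image files in its own partitions.
    val images = sc.parallelize(imagePaths, 32)
      .map(path => (path, Files.readAllBytes(Paths.get(path))))   // (file name, raw image bytes)

    // Placeholder processing: compute the size of each image.
    val sizes = images.map { case (path, bytes) => (path, bytes.length) }.collect()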
Re: Spark job tracker.
    var sem = 0
    sc.addSparkListener(new SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart) {
        sem += 1
      }
    })

sc = spark context Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 4:34 AM, abhiguruvayya sharath.abhis...@gmail.com wrote: Hello Mayur, How can I implement these methods mentioned below? Do you have any clue on this? Please let me know. public void onJobStart(SparkListenerJobStart arg0) { } @Override public void onStageCompleted(SparkListenerStageCompleted arg0) { } @Override public void onStageSubmitted(SparkListenerStageSubmitted arg0) { } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p9104.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
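Along the same lines, a sketch of a listener that fills in the three callbacks asked about, written in Scala; what is done inside each callback (the println calls) is purely illustrative:

    import org.apache.spark.scheduler._

    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart) {
        println("job started: " + jobStart.jobId)
      }
      override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted) {
        println("stage submitted: " + stageSubmitted.stageInfo.name)
      }
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
        println("stage completed: " + stageCompleted.stageInfo.name)
      }
    })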
Cassandra driver Spark question
Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call myDStream.foreachRDD so that I can collect each RDD in the driver app runtime, and inside I do something like this:

    myDStream.foreachRDD(rdd => {
      var someCol = Seq[MyType]()
      foreach(kv => {
        someCol :+ rdd._2 // I only want the RDD value and not the key
      }
      val collectionRDD = sc.parallelize(someCol) // THIS IS WHY IT FAILS TRYING TO RUN THE WORKER
      collectionRDD.saveToCassandra(...)
    }

I get the NotSerializableException while trying to run the Node (also tried someCol as a shared variable). I believe this happens because the myDStream doesn't exist yet when the code is pushed to the Node, so the parallelize doesn't have any structure to relate to it. Inside this foreachRDD I should only do RDD calls which are only related to other RDDs. I guess this was just a desperate attempt. So I have a question: Using the Cassandra Spark driver - can we only write to Cassandra from an RDD? In my case I only want to write once all the computation is finished, in a single batch on the driver app. Thanks in advance. Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
how to convert JavaDStreamString to JavaRDDString
Hi Team, Could you please help me to resolve the below query? My use case is: I'm using JavaStreamingContext to read text files from a Hadoop HDFS directory: JavaDStream<String> lines_2 = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/"); How do I convert the JavaDStream<String> result to JavaRDD<String>, if we can convert? Then I can use the collect() method on the JavaRDD and process my text file. I'm not able to find a collect method on JavaRDD<String>. Thank you very much in advance. Regards, Rajesh
Re: Cassandra driver Spark question
Is MyType serializable? Everything inside the foreachRDD closure has to be serializable. 2014-07-09 14:24 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com: Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call a myDStream.foreachRDD so that I can collect each RDD in the driver app runtime and inside I do something like this: myDStream.foreachRDD(rdd = { var someCol = Seq[MyType]() foreach(kv ={ someCol :+ rdd._2 //I only want the RDD value and not the key } val collectionRDD = sc.parallelize(someCol) //THIS IS WHY IT FAILS TRYING TO RUN THE WORKER collectionRDD.saveToCassandra(...) } I get the NotSerializableException while trying to run the Node (also tried someCol as shared variable). I believe this happens because the myDStream doesn't exist yet when the code is pushed to the Node so the parallelize doens't have any structure to relate to it. Inside this foreachRDD I should only do RDD calls which are only related to other RDDs. I guess this was just a desperate attempt So I have a question Using the Cassandra Spark driver - Can we only write to Cassandra from an RDD? In my case I only want to write once all the computation is finished in a single batch on the driver app. tnks in advance. Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
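If the goal is simply to persist each batch's values, a hedged sketch with the spark-cassandra-connector; the keyspace, table, and element type are assumptions, and the RDD elements (case class or tuple) must be serializable and map onto the table's columns:

    import com.datastax.spark.connector._

    // Keep only the values and write each batch from inside foreachRDD; the write itself
    // goes through the RDD, which is the connector's supported write path.
    myDStream.map(_._2).foreachRDD { rdd =>
      rdd.saveToCassandra("my_keyspace", "my_table")   // hypothetical keyspace and table
    }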
Re: Purpose of spark-submit?
+1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: controlling the time in spark-streaming
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala Regards, Laeeq On Friday, May 23, 2014 10:33 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Well its hard to use text data as time of input. But if you are adament here's what you would do. Have a Dstream object which works in on a folder using filestream/textstream Then have another process (spark streaming or cron) read through the files you receive push them into the folder in order of time. Mostly your data would be produced at t, you would get it at t + say 5 sec, you can push it in get processed at t + say 10 sec. Then you can timeshift your calculations. If you are okay with broad enough time frame you should be fine. Another way is to use queue processing. QueueJavaRDDInteger rddQueue = new LinkedListJavaRDDInteger(); Create a Dstream to consume from Queue of RDD, keep looking into the folder of files create rdd's from them at a min level push them into thee queue. This would cause you to go through your data atleast twice just provide order guarantees , processing time is still going to vary with how quickly you can process your RDD. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, May 22, 2014 at 9:08 PM, Ian Holsman i...@holsman.com.au wrote: Hi. I'm writing a pilot project, and plan on using spark's streaming app for it. To start with I have a dump of some access logs with their own timestamps, and am using the textFileStream and some old files to test it with. One of the issues I've come across is simulating the windows. I would like use the timestamp from the access logs as the 'system time' instead of the real clock time. I googled a bit and found the 'manual' clock which appears to be used for testing the job scheduler.. but I'm not sure what my next steps should be. I'm guessing I'll need to do something like 1. use the textFileStream to create a 'DStream' 2. have some kind of DStream that runs on top of that that creates the RDDs based on the timestamps Instead of the system time 3. the rest of my mappers. Is this correct? or do I need to create my own 'textFileStream' to initially create the RDDs and modify the system clock inside of it. I'm not too concerned about out-of-order messages, going backwards in time, or being 100% in sync across workers.. as this is more for 'development'/prototyping. Are there better ways of achieving this? I would assume that controlling the windows RDD buckets would be a common use case. TIA Ian -- Ian Holsman i...@holsman.com.au PH: + 61-3-9028 8133 / +1-(425) 998-7083
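A small sketch of that queue-based approach in Scala, condensed from the linked example; the batch contents, batch interval, and timing loop are placeholders for whatever simulates the application's own time:

    import scala.collection.mutable.SynchronizedQueue
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(sc, Seconds(1))
    val rddQueue = new SynchronizedQueue[RDD[Int]]()

    // The stream consumes one queued RDD per batch interval.
    val inputStream = ssc.queueStream(rddQueue)
    inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

    ssc.start()
    // Push batches in whatever order and pacing simulates the log's own timestamps.
    for (i <- 1 to 5) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
      Thread.sleep(1000)
    }
    ssc.stop()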
RDD Cleanup
Hi, I am using spark 1.0.0 with the Ooyala Job Server, for a low-latency query system. Basically a long-running context is created, which makes it possible to run multiple jobs under the same context and hence share data. It was working fine in 0.9.1. However, in the spark 1.0 release, the RDDs created and cached by Job-1 get cleaned up by the BlockManager (I can see log statements saying cleaning up RDD), and so the cached RDDs are not available for Job-2, even though both Job-1 and Job-2 are running under the same spark context. I tried using the spark.cleaner.referenceTracking = false setting, however this causes the issue that unpersisted RDDs are not cleaned up properly and keep occupying Spark's memory. Has anybody faced an issue like this before? If so, any advice would be greatly appreciated. Also, is there any way to mark an RDD as being in use under a context, even though the job using it has finished (so subsequent jobs can use that RDD)? Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
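One workaround sometimes used for this behavior is to keep a strong, driver-side reference to any RDD that later jobs should reuse, so the reference-tracking cleaner does not consider it garbage between jobs; a sketch, where the registry object is an assumption rather than a Job Server feature:

    import org.apache.spark.rdd.RDD

    // Hypothetical driver-side registry that outlives individual jobs in the shared context.
    object SharedRdds {
      private val rdds = scala.collection.mutable.Map[String, RDD[_]]()
      def put(name: String, rdd: RDD[_]): Unit = synchronized { rdds(name) = rdd }
      def get(name: String): Option[RDD[_]] = synchronized { rdds.get(name) }
    }

    // Job 1: cache and register.
    val parsed = sc.textFile("hdfs://.../events").map(_.split("\t")).cache()
    SharedRdds.put("events", parsed)

    // Job 2 (same context, later): look up the still-referenced, still-cached RDD.
    val events = SharedRdds.get("events").get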
Re: how to convert JavaDStream<String> to JavaRDD<String>
Hi, First use foreachRDD and then use collect, as in DStream.foreachRDD(rdd => { rdd.collect.foreach(...) }). Also it's better to use Scala. Less verbose. Regards, Laeeq On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Could you please help me to resolve the below query. My use case is: I'm using JavaStreamingContext to read text files from a Hadoop HDFS directory: JavaDStream<String> lines_2 = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/"); How do I convert the JavaDStream<String> result to JavaRDD<String>, if we can convert? I can use the collect() method on a JavaRDD and process my text file. I'm not able to find the collect method on JavaRDD<String>. Thank you very much in advance. Regards, Rajesh
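A minimal Scala sketch of the foreachRDD pattern suggested above (the DStream name and the per-line processing are placeholders):

  import org.apache.spark.rdd.RDD

  // assumes an existing DStream[String] called `lines`, e.g. from ssc.textFileStream(...)
  lines.foreachRDD { rdd: RDD[String] =>
    // collect() pulls each micro-batch back to the driver; only reasonable for small batches
    rdd.collect().foreach(line => println(line))
  }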
Re: window analysis with Spark and Spark streaming
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala Regards, Laeeq, PhD candidate, KTH, Stockholm. On Sunday, July 6, 2014 10:20 AM, alessandro finamore alessandro.finam...@polito.it wrote: On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List] [hidden email] wrote: Key idea is to simulate your app time as you enter data. So you can connect Spark Streaming to a queue and insert data in it spaced by time. Easier said than done :). I see. I'll try to implement this solution as well so that I can compare it with my current Spark implementation. I'm interested in seeing if this is faster... as I assume it should be :) What are the parallelism issues you are hitting with your static approach? In my current Spark implementation, whenever I need to get the aggregated stats over the window, I'm re-mapping all the current bins to have the same key so that they can be reduced altogether. This means that data needs to be shipped to a single reducer. As a result, adding nodes/cores to the application does not really affect the total time :( On Friday, July 4, 2014, alessandro finamore [hidden email] wrote: Thanks for the replies. What is not completely clear to me is how time is managed. I can create a DStream from a file. But if I set the window property, that will be bound to the application time, right? If I got it right, with a receiver I can control the way DStreams are created. But how can I then apply the windowing already shipped with the framework if this is bound to the application time? I would like to define a window of N files, but the window() function requires a duration as input... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/window-analysis-with-Spark-and-Spark-streaming-tp8806p8824.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Sent from Gmail Mobile -- -- Alessandro Finamore, PhD Politecnico di Torino -- Office: +39 0115644127 Mobile: +39 3280251485 SkypeId: alessandro.finamore --- View this message in context: Re: window analysis with Spark and Spark streaming Sent from the Apache Spark User List mailing list archive at Nabble.com.
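To make the single-reducer bottleneck described above concrete, the re-keying pattern looks roughly like this (a toy sketch; the bin names and counts are placeholders standing in for the real window stats):

  import org.apache.spark.SparkContext._

  // assumes an existing SparkContext `sc`
  val bins = sc.parallelize(Seq(("bin-1", 10L), ("bin-2", 7L), ("bin-3", 3L)))
  val windowTotal = bins
    .map { case (_, count) => ("window", count) }   // collapse everything onto one key...
    .reduceByKey(_ + _)                             // ...so the final merge is funneled through a single reducer
  windowTotal.collect().foreach(println)            // ("window", 20)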
Re: Purpose of spark-submit?
+1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: Cassandra driver Spark question
Hi Luis, Yes it's actually an output of the previous RDD. Have you ever used the Cassandra Spark Driver on the driver app? I believe these limitations go around that - it's designed to save RDDs from the nodes. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: RDD Cleanup
did you explicitly cache the rdd? we cache rdds and share them between jobs just fine within one context in spark 1.0.x. but we do not use the ooyala job server... On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote: Hi, I using spark 1.0.0 , using Ooyala Job Server, for a low latency query system. Basically a long running context is created, which enables to run multiple jobs under the same context, and hence sharing of the data. It was working fine in 0.9.1. However in spark 1.0 release, the RDD's created and cached by a Job-1 gets cleaned up by BlockManager (can see log statements saying cleaning up RDD) and so the cached RDD's are not available for Job-2, though Both Job-1 and Job-2 are running under same spark context. I tried using the spark.cleaner.referenceTracking = false setting, how-ever this causes the issue that unpersisted RDD's are not cleaned up properly, and occupying the Spark's memory.. Had anybody faced issue like this before? If so, any advice would be greatly appreicated. Also is there any way, to mark an RDD as being used under a context, event though the job using that had been finished (so subsequent jobs can use that RDD). Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Cassandra driver Spark question
Yes, I'm using it to count concurrent users from a Kafka stream of events without problems. I'm currently testing it in local mode, but any serialization problem would have already appeared, so I don't expect any serialization issues when I deploy to my cluster. 2014-07-09 15:39 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com: Hi Luis, Yes it's actually an output of the previous RDD. Have you ever used the Cassandra Spark Driver on the driver app? I believe these limitations go around that - it's designed to save RDDs from the nodes. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark Streaming and Storm
Xichen_tju, I recently evaluated Storm for a period of months (using 2U servers with 2.4GHz CPUs and 24GB RAM, 3 servers in total) and was not able to achieve a realistic scale for my business domain needs. Storm is really only a framework, which allows you to put in code to do whatever it is you need for a distributed system…so it’s completely flexible and distributable, but it comes at a price. In Storm, one of the biggest performance hits came down to how the “acks” work within the tuple trees. You can have the framework ack messages between spouts and/or bolts by default, but in the end you most likely want to manage acks yourself, depending on how much reliability your system will need (to replay messages…). All this means is that if you don’t have massive amounts of data that you need to process within a few seconds (which I do), then Storm may work well for you, but your performance will diminish as you add in more and more business rules (unless of course you add in more servers for processing). If you need to ingest at least 1GBps+, then you may want to reevaluate, since your server scale may not mesh with your overall processing needs. I recently started using Spark Streaming with Kafka and have been quite impressed at the performance level that’s being achieved. I particularly like the fact that Spark isn’t just a framework; it provides you with simple tools and API convenience methods. Some of those features are reduceByKey (map/reduce), sliding and aggregate sub time windows, etc. Also, in my environment, I believe it’s going to be a great fit since we use Hadoop already and Spark should fit into that environment well. You should look into both Storm and Spark Streaming, but in the end it just depends on your needs. If you're not looking for streaming aspects, then Spark on Hadoop is a great option, since Spark will cache the dataset in memory for all queries, which will be much faster than running Hive/Pig on top of Hadoop. But I’m assuming you need some sort of streaming system for data flow; if it doesn’t need to be real-time or near real-time, you may want to simply look at Hadoop, which you could always use Spark on top of for real-time queries. Hope this helps… Dan On Jul 8, 2014, at 7:25 PM, Shao, Saisai saisai.s...@intel.com wrote: You may get the performance comparison results from the Spark Streaming paper and meetup ppt, just google it. Actually performance comparison is case by case and relies on your workload design, hardware and software configurations. There is no actual winner across all scenarios. Thanks Jerry From: xichen_tju@126 [mailto:xichen_...@126.com] Sent: Wednesday, July 09, 2014 9:17 AM To: user@spark.apache.org Subject: Spark Streaming and Storm hi all I am a newbie to Spark Streaming, and used Storm before. Have you tested the performance of both of them, and which one is better? xichen_tju@126
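A minimal Scala sketch of the Kafka + Spark Streaming pattern described above (the ZooKeeper quorum, consumer group, topic name, and window sizes are placeholder assumptions, not values from this thread):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("KafkaEventCounts")
  val ssc = new StreamingContext(conf, Seconds(2))

  // receiver-based Kafka stream: (zkQuorum, consumer group, topic -> receiver threads)
  val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", Map("events" -> 1))

  val counts = messages
    .map { case (_, value) => (value, 1L) }                   // key each message payload
    .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))    // 60s sliding window, evaluated every 10s

  counts.print()
  ssc.start()
  ssc.awaitTermination()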
Re: RDD Cleanup
Hi, Yes, I am caching the RDDs by calling the cache method. May I ask how you are sharing RDDs across jobs in the same context? By the RDD name? I tried printing the RDDs of the Spark context, and when referenceTracking is enabled, I get an empty list after the cleanup. Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: RDD Cleanup
we simply hold on to the reference to the rdd after it has been cached. so we have a single Map[String, RDD[X]] for cached RDDs for the application On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote: Hi, Yes . I am caching the RDD's by calling cache method.. May i ask, how you are sharing RDD's across jobs in same context? By the RDD name. I tried printing the RDD's of the Spark context, and when the referenceTracking is enabled, i get empty list after the clean up. Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
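A small sketch of the pattern Koert describes (the map key type and helper names are placeholders), holding strong references to cached RDDs so later jobs in the same context can reuse them:

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD

  // one registry per application/SparkContext; jobs look up cached RDDs by name
  val cachedRdds = mutable.Map.empty[String, RDD[_]]

  // after a job builds an RDD, cache it and keep the reference so it is not garbage collected
  def register(name: String, rdd: RDD[_]): Unit = cachedRdds(name) = rdd.cache()

  // subsequent jobs retrieve the already-cached RDD by name
  def lookup(name: String): Option[RDD[_]] = cachedRdds.get(name)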
Spark on Yarn: Connecting to Existing Instance
I am trying to get my head around using Spark on Yarn from the perspective of a cluster. I can start a Spark Shell with no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have set up Spark clusters in standalone mode, and then in the examples they connect to this cluster in standalone mode. This is often done using the spark:// string for connection. Cool. But what I don't understand is: how do I set up a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is an instance I can connect a spark-shell to, and have the instance stay up. I'd like to be able to run other things on that instance etc. Is that possible with Yarn? I know there may be long-running job challenges with Yarn, but I am just testing, and I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Re: Purpose of spark-submit?
One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? 
Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
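For concreteness, the embedded/programmatic setup being contrasted with spark-submit in this thread is roughly the following sketch (the master URL, jar path, and input path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("EmbeddedApp")
    .setMaster("spark://master-host:7077")        // standalone master; YARN is where this gets harder
    .setJars(Seq("/path/to/app-assembly.jar"))    // ship the application jar to the executors
    .set("spark.executor.memory", "2g")
  val sc = new SparkContext(conf)
  println(sc.textFile("hdfs:///data/input").count())
  sc.stop()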
Re: Purpose of spark-submit?
Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. 
Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: Spark on Yarn: Connecting to Existing Instance
The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
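As a small illustration of the point about $HADOOP_CONF_DIR (a sketch only; the property shown is the standard YARN resource manager address key), YarnConfiguration reads yarn-site.xml from the classpath, which is how the submitter finds the resource manager:

  import org.apache.hadoop.yarn.conf.YarnConfiguration

  val yarnConf = new YarnConfiguration()                 // loads *-site.xml files found on the classpath
  println(yarnConf.get(YarnConfiguration.RM_ADDRESS))    // prints the resource manager address if the config was picked up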
Mechanics of passing functions to Spark?
Greetings, The documentation at http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark says: 'Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method.' First, could someone clarify what is meant by 'sending the object' here? How is the object sent to (presumably) the nodes of the cluster? Is it a one-time operation per node? Why does this sound like (according to the doc) a less preferred option compared to a singleton's function? Would the nodes not require the singleton object anyway? Some clarification would really help. Regards, Seref ps: the expression 'the object that contains that class' sounds a bit unusual; is the intended meaning 'the object that is the instance of that class'?
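For what it's worth, a minimal Scala sketch of the two cases the documentation contrasts (the classes, the input path, and the SparkContext `sc` are hypothetical):

  object MyFunctions {                       // singleton object: only the function itself is referenced
    def clean(s: String): String = s.trim
  }

  class MyHelper extends Serializable {      // class instance: the whole instance travels with the closure
    val suffix = "!"
    def decorate(s: String): String = s + suffix
  }

  val rdd = sc.textFile("hdfs:///some/input")   // assumes an existing SparkContext `sc`
  rdd.map(MyFunctions.clean)                    // passing a singleton object's method
  val helper = new MyHelper
  rdd.map(helper.decorate)                      // `helper` (the containing object) is serialized into the
                                                // closure and shipped with the tasks to the executors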
Re: Purpose of spark-submit?
sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... 
which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH
Re: Spark on Yarn: Connecting to Existing Instance
To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Error with Stream Kafka Kryo
Hi My setup is to use localMode standalone, Sprak 1.0.0 release version, scala 2.10.4 I made a job that receive serialized object from Kafka broker. The objects are serialized using kryo. The code : val sparkConf = new SparkConf().setMaster(local[4]).setAppName(SparkTest) .set(spark.serializer, org.apache.spark.serializer.JavaSerializer) .set(spark.kryo.registrator, com.inneractive.fortknox.kafka.EventDetailRegistrator) val ssc = new StreamingContext(sparkConf, Seconds(20)) ssc.checkpoint(checkpoint) val topicMap = topic.split(,).map((_,partitions)).toMap // Create a Stream using my Decoder EventKryoEncoder val events = KafkaUtils.createStream[String, EventDetails, StringDecoder, EventKryoEncoder] (ssc, kafkaMapParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2) val data = events.map(e = (e.getPublisherId, 1L)) val counter = data.reduceByKey(_ + _) counter.print() ssc.start() ssc.awaitTermination() When I run this I get java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) ~[na:1.7.0_60] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) ~[na:1.7.0_60] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_60] at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) ~[na:1.7.0_60] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) ~[na:1.7.0_60] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) ~[na:1.7.0_60] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) ~[na:1.7.0_60] at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ~[spark-core_2.10-1.0.0.jar:1.0.0] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na] at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.scheduler.Task.run(Task.scala:51) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) ~[spark-core_2.10-1.0.0.jar:1.0.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60] I've check that my decoder is working I can trace that the deserialization is OK thus sprark must get ready to use object My setup work if I use JSON and not Kryo serialized object Thanks for help because I don't what to do next -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-with-Stream-Kafka-Kryo-tp9200.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Purpose of spark-submit?
Sandy, I experienced the similar behavior as Koert just mentioned. I don't understand why there is a difference between using spark-submit and programmatic execution. Maybe there is something else we need to add to the spark conf/spark context in order to launch spark jobs programmatically that are not needed before? On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote: sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? 
for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going
Apache Spark, Hadoop 2.2.0 without Yarn Integration
Hello, I am currently learning Apache Spark and I want to see how it integrates with an existing Hadoop cluster. My current Hadoop configuration is version 2.2.0 without Yarn. I have built Apache Spark (v1.0.0) following the instructions in the README file, only setting SPARK_HADOOP_VERSION=1.2.1. Also, I export HADOOP_CONF_DIR to point to the Hadoop configuration directory. My use-case is the Linear Least Squares MLlib example of Apache Spark (link: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression). The only difference in the code is that the input text file is an HDFS file. However, I get a Runtime Exception: Error in configuring object. So my questions are the following: Does Spark work with a Hadoop distribution without Yarn? If yes, am I doing it right? If no, can I build Spark with SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false? Thank you, Nick
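For reference, a minimal sketch of that MLlib example reading from HDFS (the HDFS path and data layout are placeholder assumptions; the API calls follow the linked guide):

  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
  import org.apache.spark.mllib.linalg.Vectors

  // assumes an existing SparkContext `sc` and a file of "label,feat1 feat2 ..." lines
  val data = sc.textFile("hdfs://namenode:9000/user/nick/lpsa.data")
  val parsed = data.map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }
  val numIterations = 100
  val model = LinearRegressionWithSGD.train(parsed, numIterations)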
Re: Purpose of spark-submit?
Koert, Yeah I had the same problems trying to do programmatic submission of spark jobs to my Yarn cluster. I was ultimately able to resolve it by reviewing the classpath and debugging through all the different things that the Spark Yarn client (Client.scala) did for submitting to Yarn (like env setup, local resources, etc), and I compared it to what spark-submit had done. I have to admit though that it was far from trivial to get it working out of the box, and perhaps some work could be done in that regards. In my case, it had boiled down to the launch environment not having the HADOOP_CONF_DIR set, which prevented the app master from registering itself with the Resource Manager. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 9:25 AM, Jerry Lam chiling...@gmail.com wrote: Sandy, I experienced the similar behavior as Koert just mentioned. I don't understand why there is a difference between using spark-submit and programmatic execution. Maybe there is something else we need to add to the spark conf/spark context in order to launch spark jobs programmatically that are not needed before? On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote: sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. 
On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using
Re: Purpose of spark-submit?
I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: issues with ./bin/spark-shell for standalone mode
Hi Patrick, I used 1.0 branch, but it was not an official release, I just git pulled whatever was there and compiled. Thanks, Mikhail -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Terminal freeze during SVM
By latest branch do you mean Apache Spark 1.0.0? And what do you mean by master? Because I am using v1.0.0 - Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.
According to me, there is a bug in the MLlib Naive Bayes implementation in Spark 0.9.1. Whom should I report this to, or with whom should I discuss it? I can discuss this over a call as well. My Skype ID: rahul.bhijwani Phone no: +91-9945197359 Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Re: Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Make a JIRA with enough detail to reproduce the error ideally: https://issues.apache.org/jira/browse/SPARK and then even more ideally open a PR with a fix: https://github.com/apache/spark On Wed, Jul 9, 2014 at 5:57 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: According to me there is BUG in MLlib Naive Bayes implementation in spark 0.9.1. Whom should I report this to or with whom should I discuss? I can discuss this over call as well. My Skype ID : rahul.bhijwani Phone no: +91-9945197359 Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Re: Terminal freeze during SVM
It means pulling the code from the latest development branch of the git repository. On Jul 9, 2014 9:45 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by master? Because I am using v 1.0.0 - Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Compilation error in Spark 1.0.0
Hi everyone, I am new to Spark and I'm having problems making my code compile. I have the feeling I might be misunderstanding the functions, so I would be very glad to get some insight into what could be wrong. The problematic code is the following: JavaRDD<Body> bodies = lines.map(l -> {Body b = new Body(); b.parse(l);} ); JavaPairRDD<Partition, Iterable<Body>> partitions = bodies.mapToPair(b -> b.computePartitions(maxDistance)).groupByKey(); Partition and Body are defined inside the driver class. Body contains the following definition: protected Iterable<Tuple2<Partition, Body>> computePartitions(int maxDistance) The idea is to reproduce the following schema: The first map results in: body1, body2, ... The mapToPair should output several of these: (partition_i, body1), (partition_i, body2) ... which are gathered by key as follows: (partition_i, (body1, body_n)), (partition_i', (body2, body_n')) ... Thanks in advance. Regards, Silvina
Re: Comparative study
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote: On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the hadoop ecosystem. Impala really shines when your (It was not built to replace Hive. It's purpose-built to make interactive use with a BI tool feasible -- single-digit second queries on huge data sets. It's very memory hungry. Hive's architecture choices and legacy code have been throughput-oriented, and can't really get below minutes at scale, but, remains a right choice when you are in fact doing ETL!)
Re: Spark on Yarn: Connecting to Existing Instance
Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Re: Spark on Yarn: Connecting to Existing Instance
So basically, I have Spark on Yarn running (spark shell) how do I connect to it with another tool I am trying to test using the spark://IP:7077 URL it's expecting? If that won't work with spark shell, or yarn-client mode, how do I setup Spark on Yarn to be able to handle that? Thanks! On Wed, Jul 9, 2014 at 12:41 PM, John Omernik j...@omernik.com wrote: Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. 
Thanks!
Re: Spark on Yarn: Connecting to Existing Instance
Spark doesn't currently offer you anything special to do this. I.e. if you want to write a Spark application that fires off jobs on behalf of remote processes, you would need to implement the communication between those remote processes and your Spark application code yourself. On Wed, Jul 9, 2014 at 10:41 AM, John Omernik j...@omernik.com wrote: Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
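Since Spark itself does not provide this, one way to approximate the long-lived service John is after is an ordinary Spark application whose driver keeps the SparkContext alive and accepts work over a channel you define yourself. A very rough sketch, with all names made up for illustration and error handling omitted:

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object LongLivedDriver {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("long-lived-driver"))
    val server = new ServerSocket(9999)   // listen for simple text requests
    while (true) {
      val socket = server.accept()
      // each request is a single line containing an input path
      val path = Source.fromInputStream(socket.getInputStream).getLines().next()
      val count = sc.textFile(path).count()   // run a job on the shared context
      val out = new PrintWriter(socket.getOutputStream, true)
      out.println(path + " has " + count + " lines")
      socket.close()
    }
  }
}

Submitted once (for example with spark-submit in a YARN mode), this keeps a single application and SparkContext running, and any process that can reach the driver's port can ask it to run jobs; that request/response plumbing is exactly the part Sandy says you have to build yourself.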
Re: Execution stalls in LogisticRegressionWithSGD
We have the maven-enforcer-plugin defined in the pom. I don't know why it didn't work for you. Could you try rebuilding with maven2 and confirm that there is no error message? If that is the case, please create a JIRA for it. Thanks! -Xiangrui On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Xiangrui, Thanks for all the help in resolving this issue. The cause turned out to be the build environment rather than runtime configuration. The build process had picked up maven2 while building Spark. Using binaries that were rebuilt using m3, the entire processing went through fine. While I'm aware that the build instruction page specifies m3 as the minimum requirement, declaratively preventing accidental m2 usage (e.g. through something like the maven enforcer plugin?) might help other developers avoid such issues. -Bharath On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng men...@gmail.com wrote: It seems to me to be a setup issue. I just tested news20.binary (1355191 features) on a 2-node EC2 cluster and it worked well. I added one line to conf/spark-env.sh: export SPARK_JAVA_OPTS="-Dspark.akka.frameSize=20" and launched spark-shell with --driver-memory 20g. Could you re-try with an EC2 setup? If it still doesn't work, please attach all your code and logs. Best, Xiangrui On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Xiangrui, 1) Yes, I used the same build (compiled locally from source) on the host that has (master, slave1) and on the second host with slave2. 2) The execution was successful when run in local mode with a reduced number of partitions. Does this imply issues communicating/coordinating across processes (i.e. driver, master and workers)? Thanks, Bharath On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, 1) Did you sync the spark jar and conf to the worker nodes after the build? 2) Since the dataset is not large, could you try local mode first using `spark-submit --driver-memory 12g --master local[*]`? 3) Try to use a smaller number of partitions, say 5. If the problem is still there, please attach the full master/worker log files. Best, Xiangrui On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Xiangrui, Leaving the frameSize unspecified led to an error message (and failure) stating that the task size (~11M) was larger than the frame size. I hence set it to an arbitrarily large value (I realize 500 was unrealistic and unnecessary in this case). I've now set the size to 20M and repeated the runs. The earlier runs were on an uncached RDD. Caching the RDD (and setting spark.storage.memoryFraction=0.5) resulted in a marginal speed-up of execution, but the end result remained the same.
The cached RDD size is as follows:

RDD Name | Storage Level | Cached Partitions | Fraction Cached | Size in Memory | Size in Tachyon | Size on Disk
1084 | Memory Deserialized 1x Replicated | 80 | 100% | 165.9 MB | 0.0 B | 0.0 B

The corresponding master logs were:

14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/1 because it is EXITED
14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/2 on worker worker-20140630124441-slave1-40182
14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/0 because it is EXITED
14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/3 on worker worker-20140630102913-slave2-44735
14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/2 because it is EXITED
14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/4 on worker worker-20140630124441-slave1-40182
14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/3 because it is EXITED
14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/5 on worker worker-20140630102913-slave2-44735
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
14/07/04 06:29:39 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260] was not delivered. [39] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
14/07/04 06:29:39 INFO Master:
Re: Spark on Yarn: Connecting to Existing Instance
So how do I do the long-lived server continually satisfying requests in the Cloudera application? I am very confused by that at this point. On Wed, Jul 9, 2014 at 12:49 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark doesn't currently offer you anything special to do this. I.e. if you want to write a Spark application that fires off jobs on behalf of remote processes, you would need to implement the communication between those remote processes and your Spark application code yourself. On Wed, Jul 9, 2014 at 10:41 AM, John Omernik j...@omernik.com wrote: Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. 
Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
At first glance that looks like an error with the class shipping in the spark shell (i.e. the lines that you type into the spark shell are compiled into classes and then shipped to the executors, where they run). Are you able to run other spark examples with closures in the same shell? Michael On Wed, Jul 9, 2014 at 4:28 AM, gil...@gmail.com gil...@gmail.com wrote: Hello, While trying to run the example below I am getting errors. I have built Spark using the following command: $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

--- Running the example using spark-shell ---

$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

--- error ---

java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

-- View this message in context: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$ http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-java-lang-NoClassDefFoundError-Could-not-initialize-class-line10-read-tp9170.html Sent from the Apache Spark User List mailing list archive
http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Spark Streaming - two questions about the streamingcontext
I am using Spark Streaming and have the following two questions: 1. If more than one output operation is put in the same StreamingContext (basically, I mean I put all the output operations in the same class), are they processed one by one in the order they appear in the class, or are they actually processed in parallel? 2. If one DStream batch takes longer than the interval time, does a new batch wait in the queue until the previous one is fully processed? Is there any parallelism that may process the two batches at the same time? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
How should I add a jar?
I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark Streaming - two questions about the streamingcontext
1. Multiple output operations are processed in the order they are defined. That is because, by default, only one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter, spark.streaming.concurrentJobs, which is by default set to 1. 2. Yes, the output operations (and the Spark jobs that are involved with them) get queued up. TD On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang yanfang...@gmail.com wrote: I am using the Spark Streaming and have the following two questions: 1. If more than one output operations are put in the same StreamingContext (basically, I mean, I put all the output operations in the same class), are they processed one by one as the order they appear in the class? Or they are actually processes parallely? 2. If one DStream takes longer than the interval time, does a new DStream wait in the queue until the previous DStream is fully processed? Is there any parallelism that may process the two DStream at the same time? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
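To make the above concrete, here is a minimal sketch; the batch interval, the socket source, and the host/port are made up for illustration. The two foreachRDD calls are two output operations and will run one after the other for each batch unless spark.streaming.concurrentJobs is raised:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("two-output-ops")
  .set("spark.streaming.concurrentJobs", "2")   // undocumented; defaults to 1

val ssc = new StreamingContext(conf, Seconds(60))
val lines = ssc.socketTextStream("localhost", 9999)

// Output operation 1: runs first for each batch
lines.count().foreachRDD(rdd => println("lines in batch: " + rdd.first()))
// Output operation 2: runs after operation 1 unless concurrentJobs > 1
lines.map(_.length).foreachRDD(rdd => println("total chars: " + rdd.fold(0)(_ + _)))

ssc.start()
ssc.awaitTermination()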
Re: Spark Streaming - two questions about the streamingcontext
Great. Thank you! Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Wed, Jul 9, 2014 at 11:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote: 1. Multiple output operations are processed in the order they are defined. That is because by default each one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter spark.streaming.concurrentJobs which is by default set to 1. 2. Yes, the output operations (and the spark jobs that are involved with them) gets queued up. TD On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang yanfang...@gmail.com wrote: I am using the Spark Streaming and have the following two questions: 1. If more than one output operations are put in the same StreamingContext (basically, I mean, I put all the output operations in the same class), are they processed one by one as the order they appear in the class? Or they are actually processes parallely? 2. If one DStream takes longer than the interval time, does a new DStream wait in the queue until the previous DStream is fully processed? Is there any parallelism that may process the two DStream at the same time? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
Spark streaming - tasks and stages continue to be generated when using reduce by key
Hi Folks: I am working on an application which uses Spark Streaming (version 1.1.0 snapshot on a standalone cluster) to process text files and save counters in Cassandra based on fields in each row. I am testing the application in two modes:

* Process each row and save the counter in Cassandra. In this scenario, after the text file has been consumed, there are no tasks/stages seen in the Spark UI.

* If instead I use reduce by key before saving to Cassandra, the Spark UI shows continuous generation of tasks/stages even after processing of the file has been completed.

I believe this is because the reduce by key requires merging of data from different partitions. But I was wondering if anyone has any insights/pointers for understanding this difference in behavior and how to avoid generating tasks/stages when there is no data (new file) available. Thanks Mans
RE: How should I add a jar?
Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar Date: Wed, 9 Jul 2014 11:44:27 -0700 From: nicholas.cham...@gmail.com To: u...@spark.incubator.apache.org Subject: How should I add a jar? I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j. I’m confused over where and how to add this jar in. SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc. What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick View this message in context: How should I add a jar? Sent from the Apache Spark User List mailing list archive at Nabble.com.
Some question about SQL and streaming
Hi guys, I'm a new user to spark. I would like to know is there an example of how to user spark SQL and spark streaming together? My use case is I want to do some SQL on the input stream from kafka. Thanks! Best, Siyuan
Re: How should I add a jar?
Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote: Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar -- Date: Wed, 9 Jul 2014 11:44:27 -0700 From: nicholas.cham...@gmail.com To: u...@spark.incubator.apache.org Subject: How should I add a jar? I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick -- View this message in context: How should I add a jar? http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: Compilation error in Spark 1.0.0
Right, the compile error is a casting issue telling me I cannot assign a JavaPairRDD<Partition, Body> to a JavaPairRDD<Object, Object>. It happens in the mapToPair() method. On 9 July 2014 19:52, Sean Owen so...@cloudera.com wrote: You forgot the compile error! On Wed, Jul 9, 2014 at 6:14 PM, Silvina Caíno Lores silvi.ca...@gmail.com wrote: Hi everyone, I am new to Spark and I'm having problems making my code compile. I have the feeling I might be misunderstanding the functions, so I would be very glad to get some insight into what could be wrong. The problematic code is the following:

JavaRDD<Body> bodies = lines.map(l -> { Body b = new Body(); b.parse(l); });
JavaPairRDD<Partition, Iterable<Body>> partitions = bodies.mapToPair(b -> b.computePartitions(maxDistance)).groupByKey();

Partition and Body are defined inside the driver class. Body contains the following definition:

protected Iterable<Tuple2<Partition, Body>> computePartitions(int maxDistance)

The idea is to reproduce the following schema: The first map results in: *body1, body2, ...* The mapToPair should output several of these: *(partition_i, body1), (partition_i, body2)...* Which are gathered by key as follows: *(partition_i, (body1, body_n)), (partition_i', (body2, body_n'))...* Thanks in advance. Regards, Silvina
SPARK_CLASSPATH Warning
Hello, I have installed Apache Spark v1.0.0 in a machine with a proprietary Hadoop Distribution installed (v2.2.0 without yarn). Due to the fact that the Hadoop Distribution that I am using, uses a list of jars , I do the following changes to the conf/spark-env.sh #!/usr/bin/env bash export HADOOP_CONF_DIR=/path-to-hadoop-conf/hadoop-conf export SPARK_LOCAL_IP=impl41 export SPARK_CLASSPATH=/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/* ... Also, to make sure that I have everything working I execute the Spark shell as follows: [biadmin@impl41 spark]$ ./bin/spark-shell --jars /path-to-proprietary-hadoop-lib/lib/*.jar 14/07/09 13:37:28 INFO spark.SecurityManager: Changing view acls to: biadmin 14/07/09 13:37:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(biadmin) 14/07/09 13:37:28 INFO spark.HttpServer: Starting HTTP Server 14/07/09 13:37:29 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:29 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:44292 Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.0.0 /_/ Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0) Type in expressions to have them evaluated. Type :help for more information. 14/07/09 13:37:36 WARN spark.SparkConf: SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*'). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with --driver-class-path to augment the driver classpath - spark.executor.extraClassPath to augment the executor classpath 14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as a work-around. 14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as a work-around. 14/07/09 13:37:36 INFO spark.SecurityManager: Changing view acls to: biadmin 14/07/09 13:37:36 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(biadmin) 14/07/09 13:37:37 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/07/09 13:37:37 INFO Remoting: Starting remoting 14/07/09 13:37:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@impl41:46081] 14/07/09 13:37:37 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@impl41:46081] 14/07/09 13:37:37 INFO spark.SparkEnv: Registering MapOutputTracker 14/07/09 13:37:37 INFO spark.SparkEnv: Registering BlockManagerMaster 14/07/09 13:37:37 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140709133737-798b 14/07/09 13:37:37 INFO storage.MemoryStore: MemoryStore started with capacity 307.2 MB. 
14/07/09 13:37:38 INFO network.ConnectionManager: Bound socket to port 16685 with id = ConnectionManagerId(impl41,16685) 14/07/09 13:37:38 INFO storage.BlockManagerMaster: Trying to register BlockManager 14/07/09 13:37:38 INFO storage.BlockManagerInfo: Registering block manager impl41:16685 with 307.2 MB RAM 14/07/09 13:37:38 INFO storage.BlockManagerMaster: Registered BlockManager 14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server 14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:38 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:21938 14/07/09 13:37:38 INFO broadcast.HttpBroadcast: Broadcast server started at http://impl41:21938 14/07/09 13:37:38 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-91e8e040-f2ca-43dd-b574-805033f476c7 14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server 14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:38 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52678 14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:38 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 14/07/09 13:37:38 INFO ui.SparkUI: Started SparkUI at http://impl41:4040 14/07/09 13:37:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/09 13:37:39 INFO spark.SparkContext: Added JAR file:/opt/ibm/biginsights/IHC/lib/adaptive-mr.jar at http://impl41:52678/jars/adaptive-mr.jar with timestamp 1404938259526 14/07/09 13:37:39 INFO executor.Executor: Using REPL class URI: http://impl41:44292 14/07/09 13:37:39 INFO repl.SparkILoop: Created spark context.. Spark context available as sc. scala So, my question is the following: Am I including my libraries correctly? Why do I get the message that the SPARK_CLASSPATH method is deprecated? Also, when I execute the following example: scala val file = sc.textFile(hdfs://lpsa.dat) 14/07/09 13:41:43 WARN util.SizeEstimator: Failed to
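The warning itself points at the replacement mechanism. Assuming Spark 1.0 semantics, roughly the same effect can be had without SPARK_CLASSPATH by passing the driver classpath on the command line and putting the executor classpath in conf/spark-defaults.conf (the paths below are the same placeholders used above):

./bin/spark-shell --driver-class-path "/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*"

# conf/spark-defaults.conf
spark.executor.extraClassPath /path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*

Note that --jars ships specific jar files to the executors and adds them to the application classpath, whereas the extraClassPath entries above are read as local paths on each node.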
Understanding how to install in HDP
Hi everybody, We have a Hortonworks cluster with many nodes, and we want to test a deployment of Spark. What's the recommended path to follow? I mean, we can compile the sources on the Name Node, but I don't really understand how to pass the executable jar and configuration to the rest of the nodes. Thanks!!! Abel
Re: Use Spark Streaming to update result whenever data come
Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration
Krishna, Ok, thank you. I just wanted to make sure that this can be done. Cheers, Nick On Wed, Jul 9, 2014 at 3:30 PM, Krishna Sankar ksanka...@gmail.com wrote: Nick, AFAIK, you can compile with yarn=true and still run spark in stand alone cluster mode. Cheers k/ On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis kat...@cs.pitt.edu wrote: Hello, I am currently learning Apache Spark and I want to see how it integrates with an existing Hadoop Cluster. My current Hadoop configuration is version 2.2.0 without Yarn. I have build Apache Spark (v1.0.0) following the instructions in the README file. Only setting the SPARK_HADOOP_VERSION=1.2.1. Also, I export the HADOOP_CONF_DIR to point to the configuration directory of Hadoop configuration. My use-case is the Linear Least Regression MLlib example of Apache Spark (link: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression). The only difference in the code is that I give the text file to be an HDFS file. However, I get a Runtime Exception: Error in configuring object. So my question is the following: Does Spark work with a Hadoop distribution without Yarn? If yes, am I doing it right? If no, can I build Spark with SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false? Thank you, Nick
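For what it's worth, the build commands being discussed would look roughly like this (mirroring the sbt invocation style used elsewhere in these threads; the exact flags for a given Spark version should be checked against its README):

$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt clean assembly

An assembly built this way can still be used against a standalone Spark cluster; the YARN code path is only exercised when the master is set to a YARN mode, which matches Krishna's point above.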
Re: Requirements for Spark cluster
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers k/ On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a cluster. What needs to be installed on the master? What needs to be installed on the workers? If the cluster already has Hadoop or YARN or Cloudera, does it still need an install of Spark?
Number of executors change during job running
Hi all, I have a Spark streaming job running on YARN. It consumes data from Kafka and groups the data by a certain field. The data size is 480k lines per minute, where the batch size is 1 minute. For some batches, the program sometimes takes more than 3 minutes to finish the groupBy operation, which seems slow to me. I allocated 300 workers and specified 300 as the partition number for groupBy. When I checked the slow stage *combineByKey at ShuffledDStream.scala:42*, there are sometimes only 2 executors allocated for this stage. However, during other batches, the executors can be several hundred for the same stage, which means the number of executors for the same operations changes. Does anyone know how Spark allocates the number of executors for different stages and how to increase the efficiency of the tasks? Thanks! Bill
CoarseGrainedExecutorBackend: Driver Disassociated
Hi, This time instead of manually starting the worker node using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT, I used the start-slaves script on the master node. I also enabled -v (verbose flag) in ssh. Here is the output that I see. The log file for the worker node was not created. I will switch back to the manual process for starting the cluster.

bash-4.1$ ./start-slaves.sh
172.16.48.44: OpenSSH_5.3p1, OpenSSL 1.0.0-fips 29 Mar 2010
172.16.48.44: debug1: Reading configuration data /users/userid/.ssh/config
172.16.48.44: debug1: Reading configuration data /etc/ssh/ssh_config
172.16.48.44: debug1: Applying options for *
172.16.48.44: debug1: Connecting to 172.16.48.44 [172.16.48.44] port 22.
172.16.48.44: debug1: Connection established.
172.16.48.44: debug1: identity file /users/p529444/.ssh/identity type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/identity-cert type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_rsa type 1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_rsa-cert type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_dsa type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_dsa-cert type -1
172.16.48.44: debug1: Remote protocol version 2.0, remote software version OpenSSH_5.2p1_q17.gM-hpn13v6
172.16.48.44: debug1: match: OpenSSH_5.2p1_q17.gM-hpn13v6 pat OpenSSH*
172.16.48.44: debug1: Enabling compatibility mode for protocol 2.0
172.16.48.44: debug1: Local version string SSH-2.0-OpenSSH_5.3
172.16.48.44: debug1: SSH2_MSG_KEXINIT sent
172.16.48.44: debug1: SSH2_MSG_KEXINIT received
172.16.48.44: debug1: kex: server-client aes128-ctr hmac-md5 none
172.16.48.44: debug1: kex: client-server aes128-ctr hmac-md5 none
172.16.48.44: debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
172.16.48.44: debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
172.16.48.44: debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
172.16.48.44: debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
172.16.48.44: debug1: Host '172.16.48.44' is known and matches the RSA host key.
172.16.48.44: debug1: Found key in /users/p529444/.ssh/known_hosts:6
172.16.48.44: debug1: ssh_rsa_verify: signature correct
172.16.48.44: debug1: SSH2_MSG_NEWKEYS sent
172.16.48.44: debug1: expecting SSH2_MSG_NEWKEYS
172.16.48.44: debug1: SSH2_MSG_NEWKEYS received
172.16.48.44: debug1: SSH2_MSG_SERVICE_REQUEST sent
172.16.48.44: debug1: SSH2_MSG_SERVICE_ACCEPT received
172.16.48.44:
172.16.48.44: This is a private computer system. Access to and use requires
172.16.48.44: explicit current authorization and is limited to business use.
172.16.48.44: All users express consent to monitoring by system personnel to
172.16.48.44: detect improper use of or access to the system, system personnel
172.16.48.44: may provide evidence of such conduct to law enforcement
172.16.48.44: officials and/or company management.
172.16.48.44:
172.16.48.44: UAM R2 account support: http://ussweb.crdc.kp.org/UAM/
172.16.48.44:
172.16.48.44: For password resets, please call the Helpdesk 888-457-4872
172.16.48.44:
172.16.48.44: debug1: Authentications that can continue: gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive
172.16.48.44: debug1: Next authentication method: gssapi-keyex
172.16.48.44: debug1: No valid Key exchange context
172.16.48.44: debug1: Next authentication method: gssapi-with-mic
172.16.48.44: debug1: Unspecified GSS failure. Minor code may provide more information
172.16.48.44: Cannot find KDC for requested realm
172.16.48.44:
172.16.48.44: debug1: Unspecified GSS failure. Minor code may provide more information
172.16.48.44: Cannot find KDC for requested realm
172.16.48.44:
172.16.48.44: debug1: Unspecified GSS failure. Minor code may provide more information
172.16.48.44:
172.16.48.44:
172.16.48.44: debug1: Authentications that can continue: gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive
172.16.48.44: debug1: Next authentication method: publickey
172.16.48.44: debug1: Trying private key: /users/userid/.ssh/identity
172.16.48.44: debug1: Offering public key: /users/userid/.ssh/id_rsa
172.16.48.44: debug1: Server accepts key: pkalg ssh-rsa blen 277
172.16.48.44: debug1: read PEM private key done: type RSA
172.16.48.44: debug1: Authentication succeeded (publickey).
172.16.48.44: debug1: channel 0: new [client-session]
172.16.48.44: debug1: Requesting no-more-sessions@openssh.com
172.16.48.44: debug1: Entering interactive session.
172.16.48.44: debug1: Sending environment.
172.16.48.44: debug1: Sending env LANG = en_US.UTF-8
172.16.48.44: debug1: Sending command: cd /apps/software/spark-1.0.0-bin-hadoop1/sbin/.. ; /apps/software/spark-1.0.0-bin-hadoop1/sbin/start-slave.sh 1
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
The Hortonworks Tech Preview of Spark is for Spark on YARN. It does not require Spark to be installed on all nodes manually. When you submit the Spark assembly jar it will have all its dependencies. YARN will instantiate Spark App Master Containers based on this jar. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p9246.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
spark1.0 principal component analysis
Hi, Can anyone please shed more light on the PCA implementation in Spark? The documentation is a bit lacking, as I am not sure I understand the output. According to the docs, the output is a local matrix whose columns are the principal components, with the columns sorted in descending order of covariance. This is a bit confusing for me, as I need to compute other statistics, like the standard deviation of the principal components. How do I match the principal components to the actual features, since there is some sorting? How about eigenvectors and eigenvalues? Any help shedding light on the output, how to use it further, and the PCA implementation in Spark in general is appreciated. Thank you in earnest -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
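For anyone trying to piece the 1.0 MLlib API together, a small sketch of how these calls fit; the input path and the choice of k = 3 are made up for illustration. The principal components come back as a local matrix with one row per original feature (the rows are not reordered, only the component columns are sorted), so row i of that matrix tells you how feature i contributes to each component:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// each input line: space-separated feature values
val rows = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val mat = new RowMatrix(rows)

// columns of pc are the top 3 principal components, ordered by decreasing variance explained
val pc = mat.computePrincipalComponents(3)

// project the data onto the components and read off per-component statistics;
// the standard deviation of each component is the square root of its variance
val projected = mat.multiply(pc)
val summary = projected.computeColumnSummaryStatistics()
println(summary.variance)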
Re: How should I add a jar?
Public service announcement: If you're trying to do some stream processing on Twitter data, you'll need version 3.0.6 of twitter4j http://twitter4j.org/archive/. That should work with the Spark Streaming 1.0.0 Twitter library. The latest version of twitter4j, 4.0.2, appears to have breaking changes in its API for us. Nick On Wed, Jul 9, 2014 at 3:34 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote: Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar -- Date: Wed, 9 Jul 2014 11:44:27 -0700 From: nicholas.cham...@gmail.com To: u...@spark.incubator.apache.org Subject: How should I add a jar? I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick -- View this message in context: How should I add a jar? http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
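Putting the two threads together, a spark-shell session for Twitter streaming might be started along these lines; the jar file names are illustrative (they depend on how you obtained the spark-streaming-twitter module and twitter4j 3.0.6), and the OAuth values are placeholders:

./bin/spark-shell --jars spark-streaming-twitter_2.10-1.0.0.jar,twitter4j-core-3.0.6.jar,twitter4j-stream-3.0.6.jar

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

System.setProperty("twitter4j.oauth.consumerKey", "<consumer-key>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumer-secret>")
System.setProperty("twitter4j.oauth.accessToken", "<access-token>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<access-token-secret>")

val ssc = new StreamingContext(sc, Seconds(10))
val tweets = TwitterUtils.createStream(ssc, None)   // None = read OAuth from system properties
tweets.map(_.getText).print()
ssc.start()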
Re: Use Spark Streaming to update result whenever data come
Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Re: Use Spark Streaming to update result whenever data come
Hi Tobias, Now I did the re-partition and ran the program again. I find a bottleneck of the whole program. In the streaming, there is a stage marked as *combineByKey at ShuffledDStream.scala:42 *in spark UI. This stage is repeatedly executed. However, during some batches, the number of executors allocated to this step is only 2 although I used 300 workers and specified the partition number as 300. In this case, the program is very slow although the data that are processed are not big. Do you know how to solve this issue? Thanks! On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Re: Some question about SQL and streaming
Siyuan, I do it like this:

// get data from Kafka
val ssc = new StreamingContext(...)
val kvPairs = KafkaUtils.createStream(...)

// we need to wrap the data in a case class for registerAsTable() to succeed
val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
val result = lines.transform((rdd, time) => {
  // execute statement
  rdd.registerAsTable("data")
  sqlc.sql(query)
})

Don't know if it is the best way, but it works. Tobias On Thu, Jul 10, 2014 at 4:21 AM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys, I'm a new user to spark. I would like to know is there an example of how to user spark SQL and spark streaming together? My use case is I want to do some SQL on the input stream from kafka. Thanks! Best, Siyuan
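A more self-contained version of the same idea, written as a standalone application rather than shell snippets; the Kafka connection details, the StringWrapper case class, and the query are all made up for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// wrapping each line in a case class gives the RDD a schema for registerAsTable()
case class StringWrapper(line: String)

object StreamingSqlExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("streaming-sql-example")
    val ssc = new StreamingContext(conf, Seconds(60))
    val sqlc = new SQLContext(ssc.sparkContext)
    import sqlc._

    // zookeeper quorum, consumer group, and topic map are placeholders
    val kvPairs = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    val lines = kvPairs.map(_._2).map(s => StringWrapper(s))

    val result = lines.transform { rdd =>
      rdd.registerAsTable("data")
      sqlc.sql("SELECT line FROM data WHERE line LIKE '%error%'")
    }
    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}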
Re: Use Spark Streaming to update result whenever data come
Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until know, I have used Spark only with embarassingly parallel operations such as map or filter. I hope someone else might provide more insight here. Tobias On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Now I did the re-partition and ran the program again. I find a bottleneck of the whole program. In the streaming, there is a stage marked as *combineByKey at ShuffledDStream.scala:42 *in spark UI. This stage is repeatedly executed. However, during some batches, the number of executors allocated to this step is only 2 although I used 300 workers and specified the partition number as 300. In this case, the program is very slow although the data that are processed are not big. Do you know how to solve this issue? Thanks! On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Spark Streaming - What does Spark Streaming checkpoint?
Hi guys, I am a little confused by the checkpointing in Spark Streaming. It checkpoints the intermediate data for the stateful operations, for sure. Does it also checkpoint the information of the StreamingContext? It seems we can recreate the SC from the checkpoint in a driver node failure scenario. When I looked at the checkpoint directory, I did not find much of a clue. Any help? Thank you very much. Best, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
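For context (general Spark Streaming behavior rather than a reply from this thread): the checkpoint directory holds both metadata checkpoints (the configuration, the DStream graph, and incomplete batches, which is what allows the StreamingContext to be rebuilt after a driver failure) and data checkpoints for stateful operations. A rough sketch of how that is typically used, with the HDFS path as a placeholder:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// when the application first starts: enable checkpointing of metadata and state
val ssc = new StreamingContext(sc, Seconds(60))
ssc.checkpoint("hdfs:///checkpoints/my-app")
// ...define stateful DStream operations and output operations here...
ssc.start()

// after a driver failure, a new driver can rebuild the context from the checkpoint
val recovered = new StreamingContext("hdfs:///checkpoints/my-app")
recovered.start()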
Restarting a Streaming Context
So I do this from the Spark shell:

// set things up
// snipped
ssc.start()
// let things happen for a few minutes
ssc.stop(stopSparkContext = false, stopGracefully = true)

Then I want to restart the Streaming Context:

ssc.start() // still in the shell; Spark Context is still alive

Which yields:

org.apache.spark.SparkException: StreamingContext has already been stopped

How come? Is there any way in the interactive shell to restart a Streaming Context once it is stopped? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Restarting-a-Streaming-Context-tp9256.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
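One detail that matters here (a general property of StreamingContext rather than something stated in a reply to this thread): a StreamingContext that has been stopped cannot be started again. Since the SparkContext was kept alive with stopSparkContext = false, the usual pattern is to build a fresh StreamingContext on top of it and re-declare the streams, roughly:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// sc is the SparkContext that survived ssc.stop(stopSparkContext = false, ...)
val ssc2 = new StreamingContext(sc, Seconds(10))   // batch interval is illustrative
// ...re-create the DStreams and output operations on ssc2 here...
ssc2.start()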
Re: Purpose of spark-submit?
I don't see why using SparkSubmit.scala as your entry point would be any different, because all that does is invoke the main class of Client.scala (e.g. for Yarn) after setting up all the class paths and configuration options. (Though I haven't tried this myself) 2014-07-09 9:40 GMT-07:00 Ron Gonzalez zlgonza...@yahoo.com: I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? 
Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
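For completeness, the programmatic path mentioned above (plain SparkConf/SparkContext on a standalone cluster) looks roughly like this; the master URL, jar name, and memory setting are placeholders. spark-submit layers the YARN/Mesos submission logic, classpath assembly, and dynamic configuration on top of this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")          // standalone master URL
  .setAppName("my-app")
  .setJars(Seq("target/my-app-assembly.jar"))     // shipped to the executors
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)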