Standalone cluster on Windows

2014-07-09 Thread Chitturi Padma
Hi, 

I wanted to set up a standalone cluster on a Windows machine. But unfortunately,
the spark-master.cmd file is not available. Can someone suggest how to proceed,
or does spark-1.0.0 not include the spark-master.cmd file?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Standalone-cluster-on-Windows-tp9135.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark Driver

2014-07-09 Thread Akhil Das
Can you try setting SPARK_MASTER_IP in the spark-env.sh file?

Thanks
Best Regards


On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote:


  Hi all,
 I have one master and two slave nodes. I did not set any IP for the Spark
 driver because I thought it uses its default (localhost). In my etc/hosts
 I have the following: 192.168.0.1 master, 192.168.0.2 slave, 192.168.03
 slave2, 127.0.0.0 local host and 127.0.1.1 virtualbox. Should I do
 something in hosts? Or should I set an IP for spark-local-ip? I got the
 following in my stderr:

 Spark
 Executor Command: java -cp ::
 /usr/local/spark-1.0.0/conf:
 /usr/local/spark-1.0.0
 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf
  
 -XX:MaxPermSize=128m -Xms512M -Xmx512M 
 org.apache.spark.executor.CoarseGrainedExecutorBackend
  akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1
  akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002
 


 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend:
 Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] -
 [akka.tcp://spark@master:54477] disassociated! Shutting down.





 Best Regards

 ...

 Amin Mohebbi

 PhD candidate in Software Engineering
  at university of Malaysia

 H/P : +60 18 2040 017



 E-Mail : tp025...@ex.apiit.edu.my

   amin_...@me.com



Re: spark Driver

2014-07-09 Thread Akhil Das
Can you also paste a little bit more stacktrace?

Thanks
Best Regards


On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote:

 I have the following in spark-env.sh

 SPARK_MASTER_IP=master
 SPARK_MASTER_port=7077


 Best Regards

 ...

 Amin Mohebbi

 PhD candidate in Software Engineering
  at university of Malaysia

 H/P : +60 18 2040 017



 E-Mail : tp025...@ex.apiit.edu.my

   amin_...@me.com


   On Wednesday, July 9, 2014 2:32 PM, Akhil Das 
 ak...@sigmoidanalytics.com wrote:


 Can you try setting SPARK_MASTER_IP in the spark-env.sh file?

 Thanks
 Best Regards


 On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote:


  Hi all,
 I have one master and two slave node, I did not set any ip for spark
 driver because I thought it uses its default ( localhost). In my etc/hosts
 I got the following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03
 slave2 127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do
 something in hosts? or should I set a ip to spark-local-ip? I got the
 following in my stderr:

 Spark
 Executor Command: java -cp ::
 /usr/local/spark-1.0.0/conf:
 /usr/local/spark-1.0.0
 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf
  
 -XX:MaxPermSize=128m -Xms512M -Xmx512M 
 org.apache.spark.executor.CoarseGrainedExecutorBackend
  akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1
  akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002
 


 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend:
 Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] -
 [akka.tcp://spark@master:54477] disassociated! Shutting down.





 Best Regards

 ...

 Amin Mohebbi

 PhD candidate in Software Engineering
  at university of Malaysia

 H/P : +60 18 2040 017



 E-Mail : tp025...@ex.apiit.edu.my

   amin_...@me.com







Re: Spark Installation

2014-07-09 Thread 田毅
Hi Srikrishna,

The reason for this issue is that the assembly jar was uploaded to HDFS twice.

Pasting your command here would help with the diagnosis.



田毅
===
Orange Cloud Platform Product Line
Big Data Products Department
AsiaInfo-Linkage Technologies (China), Inc.
Mobile: 13910177261
Tel: 010-82166322
Fax: 010-82166617
QQ: 20057509
MSN: yi.t...@hotmail.com
Address: AsiaInfo-Linkage Building, East Zone, No. 10 Dongbeiwang West Road, Haidian District, Beijing


===

On Jul 9, 2014, at 3:03 AM, Srikrishna S srikrishna...@gmail.com wrote:

 Hi All,
 
 
 I tried the make distribution script and it worked well. I was able to
 compile the spark binary on our CDH5 cluster. Once I compiled Spark, I
 copied over the binaries in the dist folder to all the other machines
 in the cluster.
 
  However, I run into an issue while submitting a job in yarn-client mode. I
  get an error message that says the following:
 Resource 
 file:/opt/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar
 changed on src filesystem (expected 1404845211000, was 1404845404000)
 
 My end goal is to submit a job (that uses MLLib) in our Yarn cluster.
 
 Any thoughts anyone?
 
 Regards,
 Krishna
 
 
 
 On Tue, Jul 8, 2014 at 9:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 Hi Srikrishna,
 
 The binaries are built with something like
 mvn package -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 
 -Dyarn.version=2.3.0-cdh5.0.1
 
 -Sandy
 
 
 On Tue, Jul 8, 2014 at 3:14 AM, 田毅 tia...@asiainfo.com wrote:
 
 try this command:
 
 make-distribution.sh --hadoop 2.3.0-cdh5.0.0 --with-yarn --with-hive
 
 
 
 
  田毅
  ===
  Orange Cloud Platform Product Line
  Big Data Products Department
  AsiaInfo-Linkage Technologies (China), Inc.
  Mobile: 13910177261
  Tel: 010-82166322
  Fax: 010-82166617
  QQ: 20057509
  MSN: yi.t...@hotmail.com
  Address: AsiaInfo-Linkage Building, East Zone, No. 10 Dongbeiwang West Road, Haidian District, Beijing
 
 
 ===
 
  On Jul 8, 2014, at 11:53 AM, Krishna Sankar ksanka...@gmail.com wrote:
 
  Couldn't find any reference to CDH in pom.xml - profiles or the
  hadoop.version. I'm also wondering how the CDH-compatible artifact was
  compiled.
 Cheers
 k/
 
 
 On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S srikrishna...@gmail.com 
 wrote:
 
 Hi All,
 
  Does anyone know what the command line arguments to mvn are to generate
  the pre-built binary for Spark on Hadoop 2 / CDH5?
 
 I would like to pull in a recent bug fix in spark-master and rebuild the 
 binaries in the exact same way that was used for that provided on the 
 website.
 
 I have tried the following:
 
 mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1
 
 And it doesn't quite work.
 
 Any thoughts anyone?
 
 
 
 
 



Re: spark Driver

2014-07-09 Thread amin mohebbi
This is exactly what I got 
Spark 
Executor Command: java -cp ::
/usr/local/spark-1.0.0/conf:
/usr/local/spark-1.0.0
/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf
 
-XX:MaxPermSize=128m -Xms512M -Xmx512M 
org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1
 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002
 14/07/04 17:50:14 ERROR 
CoarseGrainedExecutorBackend: 
Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - 
[akka.tcp://spark@master:54477] disassociated! Shutting down.
 I have posted this on Stack Overflow and I received this answer:
http://stackoverflow.com/questions/24571922/apache-spark-stderr-and-stdout/24594576#24594576

I am not sure whether I need to set an IP address for the driver. Do I need a
separate machine for the driver?

Best Regards 

... 

Amin Mohebbi 

PhD candidate in Software Engineering  
 at university of Malaysia   

H/P : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my 

  amin_...@me.com


On Wednesday, July 9, 2014 2:39 PM, Akhil Das ak...@sigmoidanalytics.com 
wrote:
 


Can you also paste a little bit more stacktrace?


Thanks
Best Regards


On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote:

I have the following in spark-env.sh 


SPARK_MASTER_IP=master

SPARK_MASTER_port=7077

 

Best Regards 

... 

Amin Mohebbi 

PhD candidate in Software Engineering  
 at university of Malaysia   

H/P : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my 

  amin_...@me.com



On Wednesday, July 9, 2014 2:32 PM, Akhil Das ak...@sigmoidanalytics.com 
wrote:
 


Can you try setting SPARK_MASTER_IP in the spark-env.sh file?


Thanks
Best Regards


On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote:



 Hi all, 
I have one master and two slave node, I did not set any ip for spark driver 
because I thought it uses its default ( localhost). In my etc/hosts I got the 
following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2 
127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do something in 
hosts? or should I set a ip to spark-local-ip? I got the following in my 
stderr: 


Spark 
Executor Command: java -cp ::
/usr/local/spark-1.0.0/conf:
/usr/local/spark-1.0.0
/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf
 
-XX:MaxPermSize=128m -Xms512M -Xmx512M 
org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1
 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002
 14/07/04 17:50:14 ERROR 
CoarseGrainedExecutorBackend: 
Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - 
[akka.tcp://spark@master:54477] disassociated! Shutting down.
 


 

Best Regards 

... 

Amin Mohebbi 

PhD candidate in Software Engineering  
 at university of Malaysia   

H/P : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my 

  amin_...@me.com




Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
Hi all,

I need to run a Spark job that needs a set of images as input. I need
something that loads these images as an RDD, but I just don't know how to do
that. Do some of you have any idea?

Cheers,


Jao


Re: Requirements for Spark cluster

2014-07-09 Thread Akhil Das
You can use the spark-ec2/bdutil scripts to set it up on the AWS/GCE cloud
quickly.

If you want to set it up on your own then these are the things that you
will need to do:

1. Make sure you have java (7) installed on all machines.
2. Install and configure spark (add all slave nodes in conf/slaves file)
3. Rsync spark directory to all slaves and start it.

If you already have Hadoop, then you might need to get the version of Spark
that is compiled against that Hadoop version (or you can compile it yourself
with the SPARK_HADOOP_VERSION option).
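
Once the cluster is running, an application only needs to point its SparkContext
at the standalone master; a minimal Scala sketch (hostname and app name are
placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                    // shows up in the master web UI
  .setMaster("spark://master-host:7077")  // the standalone master started above
val sc = new SparkContext(conf)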

Thanks
Best Regards


On Wed, Jul 9, 2014 at 6:54 AM, Robert James srobertja...@gmail.com wrote:

 I have a Spark app which runs well on local master.  I'm now ready to
 put it on a cluster.  What needs to be installed on the master? What
 needs to be installed on the workers?

 If the cluster already has Hadoop or YARN or Cloudera, does it still
 need an install of Spark?



How to clear the list of Completed Applications in the Spark web UI?

2014-07-09 Thread Haopu Wang
Besides restarting the Master, is there any other way to clear the
Completed Applications in Master web UI?


Re: Spark Streaming using File Stream in Java

2014-07-09 Thread Akhil Das
Try this out:

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;

JavaStreamingContext ssc = new JavaStreamingContext(...);
// textFileStream watches a directory and picks up new files as they appear in it
JavaDStream<String> lines = ssc.textFileStream("dataDirectory");
JavaDStream<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  });

JavaPairDStream<String, Integer> ones = words.map(   // use mapToPair(...) on Spark 1.0+
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  });

JavaPairDStream<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

Actually modified from
https://spark.apache.org/docs/0.9.1/java-programming-guide.html#example

Thanks
Best Regards


On Wed, Jul 9, 2014 at 6:03 AM, Aravind aravindb...@gmail.com wrote:

 Hi all,

 I am trying to run the NetworkWordCount.java file in the streaming
 examples.
  The example shows how to read from a network socket. But my use case is
  that I have a local log file which is a stream and is continuously updated
  (say /Users/.../Desktop/mylog.log).

 I would like to write the same NetworkWordCount.java using this filestream

 jssc.fileStream(dataDirectory);

 Question:
 1. How do I write a mapreduce function for the above to measure wordcounts
 (in java, not scala)?

 2. Also does the streaming application stop if the file is not updating or
 does it continuously poll for the file updates?

 I am a new user of Apache Spark Streaming. Kindly help me as I am totally
 stuck

 Thanks in advance.

 Regards
 Aravind



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-File-Stream-in-Java-tp9115.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: error when Spark accesses HDFS with Kerberos enabled

2014-07-09 Thread Sandy Ryza
Hi Cheney,

I haven't heard of anybody deploying non-secure YARN on top of secure HDFS.
 It's conceivable that you might be able to get it to work, but my guess is that
you'd run into some issues.  Also, without authentication on in YARN, you
could be leaving your HDFS tokens exposed, which others could steal and use to get
to your data.

-Sandy


On Tue, Jul 8, 2014 at 7:28 PM, Cheney Sun sun.che...@gmail.com wrote:

 Hi Sandy,

  We are also going to read data from a security-enabled (Kerberos) HDFS in
  our Spark application. Per your answer, we have to switch to Spark on
  YARN to achieve this.
  We plan to deploy a different Hadoop cluster (with YARN) only to run Spark.
  Is it necessary to deploy YARN with security enabled? Or is it possible to
  access data in a secure HDFS from a non-security-enabled Spark on YARN?


 On Wed, Jul 9, 2014 at 4:19 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 That's correct.  Only Spark on YARN supports Kerberos.

 -Sandy


 On Tue, Jul 8, 2014 at 12:04 PM, Marcelo Vanzin van...@cloudera.com
 wrote:

 Someone might be able to correct me if I'm wrong, but I don't believe
 standalone mode supports kerberos. You'd have to use Yarn for that.

 On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote:
  Hi all,
 
 
 
  I encounter a strange issue when using spark 1.0 to access hdfs with
  Kerberos
 
  I just have one spark test node for spark and HADOOP_CONF_DIR is set
 to the
  location containing the hdfs configuration files(hdfs-site.xml and
  core-site.xml)
 
   When I use spark-shell in local mode, the access to HDFS is successful.
  
   However, if I use spark-shell connected to the standalone cluster (I
   configured Spark in standalone cluster mode with only one node),
   the access to HDFS fails with the following error: “Can't get Master
   Kerberos principal for use as renewer”
 
 
 
  Anyone have any ideas on this ?
 
  Thanks a lot.
 
 
 
  Regards,
  Xiaowei
 
 



 --
 Marcelo






Re: Re: Pig 0.13, Spark, Spork

2014-07-09 Thread Akhil Das
Hi Bertrand,

We've updated the document
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0

This is our working Github repo
https://github.com/sigmoidanalytics/spork/tree/spork-0.9

Feel free to open issues over here
https://github.com/sigmoidanalytics/spork/issues

Thanks
Best Regards


On Tue, Jul 8, 2014 at 2:33 PM, Bertrand Dechoux decho...@gmail.com wrote:

  @Mayur: I won't fight over the semantics of a fork, but at the moment no
  Spork takes the standard Pig as a dependency. On that, we should agree.

 As for my use of Pig, I have no limitation. I am however interested to see
 the rise of a 'no-sql high level non programming language' for Spark.

 @Zhang : Could you elaborate your reference about Twitter?


 Bertrand Dechoux


 On Tue, Jul 8, 2014 at 4:04 AM, 张包峰 pelickzh...@qq.com wrote:

 Hi guys, previously I checked out the old Spork and updated it to
 Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1; see my GitHub project
 https://github.com/pelick/flare-spork

 It is also highly experimental, and just directly maps Pig physical
 operations to Spark RDD transformations/actions. It works for simple
 requests. :)

 I am also interested in the progress of Spork. Is it being developed at
 Twitter in a non-open-source way?

 --
 Thanks
 Zhang Baofeng
 Blog http://blog.csdn.net/pelick | Github https://github.com/pelick
 | Weibo http://weibo.com/pelickzhang | LinkedIn
 http://www.linkedin.com/pub/zhang-baofeng/70/609/84




  -- Original Message --
  *From:* Mayur Rustagi; mayur.rust...@gmail.com
  *Sent:* Monday, July 7, 2014, 11:55 PM
  *To:* user@spark.apache.org user@spark.apache.org
  *Subject:* Re: Pig 0.13, Spark, Spork

 That version is old :).
 We are not forking pig but cleanly separating out pig execution engine.
 Let me know if you are willing to give it a go.

 Also would love to know what features of pig you are using ?

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux decho...@gmail.com
 wrote:

 I saw a wiki page from your company but with an old version of Spark.

 http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1

 I have no reason to use it yet but I am interested in the state of the
 initiative.
 What's your point of view (personal and/or professional) about the Pig
 0.13 release?
 Is the pluggable execution engine flexible enough in order to avoid
 having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D

 As a (for now) external observer, I am glad to see competition in that
 space. It can only be good for the community in the end.

 Bertrand Dechoux


 On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 Hi,
  We have fixed many major issues around Spork and are deploying it with some
  customers. Would be happy to provide a working version to you to try out.
  We are looking for more folks to try it out and submit bugs.

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux decho...@gmail.com
 wrote:

 Hi,

 I was wondering what was the state of the Pig+Spark initiative now
 that the execution engine of Pig is pluggable? Granted, it was done in
 order to use Tez but could it be used by Spark? I know about a
 'theoretical' project called Spork but I don't know any stable and
 maintained version of it.

 Regards

 Bertrand Dechoux








Re: How to clear the list of Completed Applications in the Spark web UI?

2014-07-09 Thread Patrick Wendell
There isn't currently a way to do this, but it will start dropping
older applications once more than 200 are stored.

On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote:
 Besides restarting the Master, is there any other way to clear the
 Completed Applications in Master web UI?


how to host the driver node

2014-07-09 Thread aminn_524
I have one master and two slave nodes; I did not set any IP for the Spark driver.
My question is: should I set an IP for the Spark driver, and can I host the driver
inside the cluster on the master node? If so, how do I host it? Will it be hosted
automatically on the node from which we submit the application with spark-submit?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-host-the-drive-node-tp9149.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to host spark driver

2014-07-09 Thread amin mohebbi


 I have one master and two slave nodes; I did not set any IP for the Spark driver.
My question is: should I set an IP for the Spark driver, and can I host the driver
inside the cluster on the master node? If so, how do I host it? Will it be hosted
automatically on the node from which we submit the application with spark-submit?

Best Regards 

... 

Amin Mohebbi 

PhD candidate in Software Engineering  
 at university of Malaysia   

H/P : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my 

  amin_...@me.com

Re: Purpose of spark-submit?

2014-07-09 Thread Patrick Wendell
It fulfills a few different functions. The main one is giving users a
way to inject Spark as a runtime dependency separately from their
program and make sure they get exactly the right version of Spark. So
a user can bundle an application and then use spark-submit to send it
to different types of clusters (or using different versions of Spark).

It also unifies the way you bundle and submit an app for Yarn, Mesos,
etc... this was something that became very fragmented over time before
this was added.

Another feature is allowing users to set configuration values
dynamically rather than compile them inside of their program. That's
the one you mention here. You can choose to use this feature or not.
If you know your configs are not going to change, then you don't need
to set them with spark-submit.
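
To illustrate the last point, a minimal sketch of hard-coding those values in
the program instead of passing them to spark-submit (master URL and memory
setting are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master:7077")      // instead of --master on spark-submit
  .set("spark.executor.memory", "2g")    // instead of --executor-memory 2g
val sc = new SparkContext(conf)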


On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote:
 What is the purpose of spark-submit? Does it do anything outside of
 the standard val conf = new SparkConf ... val sc = new SparkContext
 ... ?


Re: Why doesn't the driver node do any work?

2014-07-09 Thread aminn_524
I have one master and two slave nodes; I did not set any IP for the Spark driver.
My question is: should I set an IP for the Spark driver, and can I host the driver
inside the cluster on the master node? If so, how do I host it? Will it be hosted
automatically on the node from which we submit the application with spark-submit?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-doesn-t-the-driver-node-do-any-work-tp3909p9153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Comparative study

2014-07-09 Thread Sean Owen
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:

  Impala is *not* built on map/reduce, though it was built to replace Hive,
 which is map/reduce based.  It has its own distributed query engine, though
 it does load data from HDFS, and is part of the hadoop ecosystem.  Impala
 really shines when your


(It was not built to replace Hive. It's purpose-built to make interactive
use with a BI tool feasible -- single-digit-second queries on huge data
sets. It's very memory hungry. Hive's architecture choices and legacy code
have been throughput-oriented and can't really get below minutes at scale,
but it remains the right choice when you are in fact doing ETL!)


Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-09 Thread Juan Rodríguez Hortalá
Hi Jerry, it's all clear to me now, I will try with something like Apache
DBCP for the connection pool

Thanks a lot for your help!
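
For reference, a minimal Scala sketch of the per-executor lazy singleton
pattern discussed below; ConnectionPool and its methods are placeholders, not
a specific library:

object ConnectionPool {
  // initialized at most once per executor JVM, on first use
  lazy val pool = createPool()      // e.g. wire up DBCP here
  private def createPool() = ???    // placeholder
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = ConnectionPool.pool  // reused by every task on this executor
    records.foreach { r =>
      // conn.insert(r)             // placeholder write
    }
  }
}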


2014-07-09 3:08 GMT+02:00 Shao, Saisai saisai.s...@intel.com:

  Yes, that would be the Java equivalence to use static class member, but
 you should carefully program to prevent resource leakage. A good choice is
 to use third-party DB connection library which supports connection pool,
 that will alleviate your programming efforts.



 Thanks

 Jerry



 *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com]
 *Sent:* Tuesday, July 08, 2014 6:54 PM
 *To:* user@spark.apache.org

 *Subject:* Re: Which is the best way to get a connection to an external
 database per task in Spark Streaming?



 Hi Jerry, thanks for your answer. I'm using Spark Streaming for Java, and
 I only have rudimentary knowledge about Scala, how could I recreate in Java
 the lazy creation of a singleton object that you propose for Scala? Maybe a
 static class member in Java for the connection would be the solution?

 Thanks again for your help,

 Best Regards,

 Juan



 2014-07-08 11:44 GMT+02:00 Shao, Saisai saisai.s...@intel.com:

 I think you can maintain a connection pool or keep the connection as a
 long-lived object on the executor side (like lazily creating a singleton object
 in an object { } in Scala), so your task can get this connection each time a
 task executes instead of creating a new one. That would be good for your
 scenario, since creating a connection is quite expensive for each task.



 Thanks

 Jerry



 *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com]
 *Sent:* Tuesday, July 08, 2014 5:19 PM
 *To:* Tobias Pfeiffer
 *Cc:* user@spark.apache.org
 *Subject:* Re: Which is the best way to get a connection to an external
 database per task in Spark Streaming?



 Hi Tobias, thanks for your help. I understand that with that code we
 obtain a database connection per partition, but I also suspect that with
 that code a new database connection is created per each execution of the
 function used as argument for mapPartitions(). That would be very
 inefficient because a new object and a new database connection would be
 created for each batch of the DStream. But my knowledge about the lifecycle
 of Functions in Spark Streaming is very limited, so maybe I'm wrong, what
 do you think?

 Greetings,

 Juan



 2014-07-08 3:30 GMT+02:00 Tobias Pfeiffer t...@preferred.jp:

 Juan,



 I am doing something similar, just not insert into SQL database, but
 issue some RPC call. I think mapPartitions() may be helpful to you. You
 could do something like



 dstream.mapPartitions(iter => {
   val db = new DbConnection()
   // maybe only do the above if !iter.isEmpty
   iter.map(item => {
     db.call(...)
     // do some cleanup if !iter.hasNext here
     item
   })
 }).count() // force output



 Keep in mind though that the whole idea about RDDs is that operations are
 idempotent and in theory could be run on multiple hosts (to take the result
 from the fastest server) or multiple times (to deal with failures/timeouts)
 etc., which is maybe something you want to deal with in your SQL.



 Tobias





 On Tue, Jul 8, 2014 at 3:40 AM, Juan Rodríguez Hortalá 
 juan.rodriguez.hort...@gmail.com wrote:

 Hi list,

 I'm writing a Spark Streaming program that reads from a kafka topic,
 performs some transformations on the data, and then inserts each record in
 a database with foreachRDD. I was wondering which is the best way to handle
 the connection to the database so each worker, or even each task, uses a
 different connection to the database, and then database inserts/updates
 would be performed in parallel.
 - I understand that using a final variable in the driver code is not a
 good idea because then the communication with the database would be
 performed in the driver code, which leads to a bottleneck, according to
 http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/
 - I think creating a new connection in the call() method of the Function
 passed to foreachRDD is also a bad idea, because then I wouldn't be reusing
 the connection to the database for each batch RDD in the DStream
 - I'm not sure that a broadcast variable with the connection handler is a
 good idea in case the target database is distributed, because if the same
 handler is used for all the nodes of the Spark cluster then that could have
 a negative effect in the data locality of the connection to the database.
 - From
 http://apache-spark-user-list.1001560.n3.nabble.com/Database-connection-per-worker-td1280.html
 I understand that by using an static variable and referencing it in the
 call() method of the Function passed to foreachRDD we get a different
 connection per Spark worker, I guess it's because there is a different JVM
 per worker. But then all the tasks in the same worker would share the same
 database handler object, am I right?
 - Another idea is 

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-09 Thread Martin Gammelsæter
Thanks for your input, Koert and DB. Rebuilding with 9.x didn't seem
to work. For now we've downgraded dropwizard to 0.6.2 which uses a
compatible version of jetty. Not optimal, but it works for now.

On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai dbt...@dbtsai.com wrote:
 We're doing a similar thing to launch a Spark job in Tomcat, and I opened a
 JIRA for this. There are a couple of technical discussions there.

 https://issues.apache.org/jira/browse/SPARK-2100

 In the process, we realized that Spark uses Jetty not only for the Spark
 WebUI, but also for distributing the jars and tasks, so it's really hard
 to remove the web dependency in Spark.

 In the end, we launch our Spark job in yarn-cluster mode, and at
 runtime the only dependency in our web application is spark-yarn,
 which doesn't contain any Spark web stuff.

 PS, upgrading the Spark Jetty 8.x to 9.x may not be
 straightforward by just changing the version in the Spark build script.
 Jetty 9.x requires Java 7, since the servlet API (servlet 3.1) requires
 Java 7.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Tue, Jul 8, 2014 at 8:43 AM, Koert Kuipers ko...@tresata.com wrote:
 do you control your cluster and spark deployment? if so, you can try to
 rebuild with jetty 9.x


 On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter
 martingammelsae...@gmail.com wrote:

 Digging a bit more I see that there is yet another jetty instance that
 is causing the problem, namely the BroadcastManager has one. I guess
 this one isn't very wise to disable... It might very well be that the
 WebUI is a problem as well, but I guess the code doesn't get far
 enough. Any ideas on how to solve this? Spark seems to use jetty
 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source
 of the problem. Any ideas?

 On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter
 martingammelsae...@gmail.com wrote:
  Hi!
 
  I am building a web frontend for a Spark app, allowing users to input
  sql/hql and get results back. When starting a SparkContext from within
  my server code (using jetty/dropwizard) I get the error
 
  java.lang.NoSuchMethodError:
  org.eclipse.jetty.server.AbstractConnector: method init()V not found
 
  when Spark tries to fire up its own jetty server. This does not happen
  when running the same code without my web server. This is probably
  fixable somehow(?) but I'd like to disable the webUI as I don't need
  it, and ideally I would like to access that information
  programatically instead, allowing me to embed it in my own web
  application.
 
  Is this possible?
 
  --
  Best regards,
  Martin Gammelsæter



 --
 Mvh.
 Martin Gammelsæter
 92209139





-- 
Mvh.
Martin Gammelsæter
92209139


Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Patrick Wendell
Hey Mikhail,

I think (hope?) the -em and -dm options were never in an official
Spark release. They were just in the master branch at some point. Did
you use these during a previous Spark release or were you just on
master?

- Patrick

On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov streb...@gmail.com wrote:
 Thanks Andrew,
  ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30
 --executor-memory 20g --driver-memory 10g
 works well, just wanted to make sure that I'm not missing anything



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Initial job has not accepted any resources means many things

2014-07-09 Thread Martin Gammelsæter
It seems like the "Initial job has not accepted any resources" warning shows
up for a wide variety of different errors (for example the obvious one
where you've requested more memory than is available), but also, for
example, in the case where the worker nodes do not have the
appropriate code on their classpath. Debugging from this error is
very hard, as errors do not show up in the logs on the workers. Is
this a known issue? I'm having trouble getting the code to the
workers without using addJar (my code is a fairly static application,
and I'd like to avoid calling addJar every time the app starts up, and
instead manually add the jar to the classpath of every worker), but I
can't seem to find out how.
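
For what it's worth, one way to ship a fixed application jar without calling
addJar by hand on every run is to list it on the SparkConf up front; a minimal
sketch with a placeholder path:

val conf = new SparkConf()
  .setAppName("MyApp")
  .setJars(Seq("/path/to/my-app.jar"))  // shipped to the workers when the context starts
val sc = new SparkContext(conf)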

-- 
Best regards,
Martin Gammelsæter


RE: Kryo is slower, and the size saving is minimal

2014-07-09 Thread innowireless TaeYun Kim
Thank you for your response.

Maybe that applies to my case.
In my test case, the types of almost all of the data are either primitive
types, Joda DateTime, or String.
But I'm somewhat disappointed with the speed.
At least it should not be slower than the Java default serializer...
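
In case it helps, a minimal sketch of how Kryo is typically enabled and classes
registered for this; the registrator name and the registered types are
assumptions, not taken from my actual code:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // registering concrete classes avoids writing full class names per object,
    // which is usually where Kryo's size/speed advantage comes from
    kryo.register(classOf[org.joda.time.DateTime])
    kryo.register(classOf[Array[Double]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")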

-Original Message-
From: wxhsdp [mailto:wxh...@gmail.com] 
Sent: Wednesday, July 09, 2014 5:47 PM
To: u...@spark.incubator.apache.org
Subject: Re: Kryo is slower, and the size saving is minimal

I'm not familiar with Kryo and my opinion may not be right. In my case,
Kryo only saves about 5% of the original size when dealing with primitive
types such as arrays. I'm not sure whether that is the common case.



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-is-slower-and-the-s
ize-saving-is-minimal-tp9131p9160.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: TaskContext stageId = 0

2014-07-09 Thread silvermast
Oh well, never mind. The problem is that ResultTask's stageId is immutable
and is used to construct the Task superclass. Anyway, my solution now is to
use this.id for the rddId and to gather all rddIds using a spark listener on
stage completed to clean up for any activity registered for those rdds. I
could use TaskContext's hook but I'd have to add some more messaging so I
can clear state that may live on a different executor than the one my
partition is on, but since I don't know that the executor will succeed, this
is not safe.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/TaskContext-stageId-0-tp9152p9162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to run a job on all workers?

2014-07-09 Thread silvermast
Is it possible to run a job that assigns work to every worker in the system?
My bootleg right now is to have a spark listener hear whenever a block
manager is added and to increase a split count by 1. It runs a spark job
with that split count and hopes that it will at least run on the newest
worker. There's some weirdness with block managers being removed sometimes
allowing my count to go negative, so I just keep monotonically increasing my
split count. Anyone have a way that doesn't suck?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-a-job-on-all-workers-tp9163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Error using MLlib-NaiveBayes : Matrices are not aligned

2014-07-09 Thread Rahul Bhojwani
I am using Naive Bayes in MLlib.
Below I have printed the log of *model.theta* after training on the training data.
You can check that it contains 9 features for 2-class classification.

print numpy.log(model.theta)

[[ 0.31618962  0.16636852  0.07200358  0.05411449  0.08542039  0.17620751
   0.03711986  0.07110912  0.02146691]
 [ 0.26598639  0.23809524  0.06258503  0.01904762  0.05714286  0.18231293
   0.08911565  0.03061224  0.05510204]]




I am giving the same number of features for prediction, but I am getting the
error: *matrices are not aligned*.

*ERROR:*

Traceback (most recent call last):
  File .\naive_bayes_analyser.py, line 192, in module
prediction =
model.predict(array([float(pos_count),float(neg_count),float(need_count),float(goal_count),float(try_co
unt),float(means_count),float(persist_count),float(complete_count),float(fail_count)]))
  File F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py,
line 101, in predict
return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned



Can someone suggest me the possible error/mistake?
Thanks in advance
-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Filtering data during the read

2014-07-09 Thread Konstantin Kudryavtsev
Hi all,

I wondered if you could help me to clarify the next situation:
in the classic example

val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))

As I understand it, the data is read into memory first, and after that the
filtering is applied. Is there any way to apply the filtering during the read
step, and not put all objects into memory?

Thank you,
Konstantin Kudryavtsev


Re: Filtering data during the read

2014-07-09 Thread Mayur Rustagi
Hi,
Spark does that out of the box for you :)
Transformations like textFile and filter are lazy and get pipelined into a
single pass, so lines are filtered as they are read rather than the whole
file being materialized in memory first.
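
A minimal sketch of what that looks like (spark here is the SparkContext, and
nothing is read until the action runs):

val file = spark.textFile("hdfs://...")            // lazy: no data read yet
val errors = file.filter(_.contains("ERROR"))      // still lazy
errors.count()                                     // lines are read and filtered in one pass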
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev 
kudryavtsev.konstan...@gmail.com wrote:

 Hi all,

 I wondered if you could help me to clarify the next situation:
 in the classic example

 val file = spark.textFile("hdfs://...")
 val errors = file.filter(line => line.contains("ERROR"))

 As I understand, the data is read in memory in first, and after that
 filtering is applying. Is it any way to apply filtering during the read
 step? and don't put all objects into memory?

 Thank you,
 Konstantin Kudryavtsev



Docker Scripts

2014-07-09 Thread dmpour23
Hi,

Regarding the Docker scripts, I know I can change the base image easily, but is
there any specific reason why the base image is hadoop_1.2.1? Why is it
preferred to Hadoop 2 (HDP2, CDH5) distributions?

Now that Amazon supports Docker, could this replace the ec2 scripts?

Kind regards
Dimitri





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Docker-Scripts-tp9167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


FW: memory question

2014-07-09 Thread michael.lewis
Hi,

Does anyone know if it is possible to call the MetadataCleaner on demand? I.e.
rather than setting spark.cleaner.ttl and having this run periodically, I'd like to
run it on demand. The problem with periodic cleaning is that it can remove RDDs
that we still require (some calcs are short, others very long).

We're using Spark 0.9.0 with Cloudera distribution.

I have a simple test calculation in a loop as follows:

val test = new TestCalc(sparkContext)
for (i <- 1 to 10) {
  val (x) = test.evaluate(rdd)
}

Where TestCalc is defined as:
class TestCalc(sparkContext: SparkContext) extends Serializable  {

  def aplus(a: Double, b:Double) :Double = a+b;

  def evaluate(rdd : RDD[Double]) = {
 /* do some dummy calc. */
  val x = rdd.groupBy(x => x / 2.0)
  val y = x.fold((0.0, Seq[Double]()))((a, b) => (aplus(a._1, b._1), Seq()))
  val z = y._1
/* try with/without this... */
  val e :SparkEnv = SparkEnv.getThreadLocal
  e.blockManager.master.removeRdd(x.id,true) // still see memory 
consumption go up...
(z)
  }
}

What I can see on the cluster is that the memory usage on the node executing this
continually climbs. I'd expect it to level off and not jump up over 1G...
I thought that putting in the 'removeRdd' line might help, but it doesn't seem
to make a difference.


Regards,
Mike




Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread gil...@gmail.com
Hello,

While trying to run the example below I am getting errors. I have built Spark
using the following command:

$ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

Running the example using spark-shell:

$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Error:

java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
    at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-java-lang-NoClassDefFoundError-Could-not-initialize-class-line10-read-tp9170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
Not sure I understand why unifying how you submit an app for different
platforms, and dynamic configuration, cannot be part of SparkConf and
SparkContext?

For the classpath, a simple script similar to hadoop classpath that shows what
needs to be added should be sufficient.

On Spark standalone I can launch a program just fine with just SparkConf
and SparkContext. Not on YARN, so the spark-submit script must be doing a
few things extra there that I am missing... which makes things more difficult,
because I am not sure it's realistic to expect every application that needs
to run something on Spark to be launched using spark-submit.
 On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:

 It fulfills a few different functions. The main one is giving users a
 way to inject Spark as a runtime dependency separately from their
 program and make sure they get exactly the right version of Spark. So
 a user can bundle an application and then use spark-submit to send it
 to different types of clusters (or using different versions of Spark).

 It also unifies the way you bundle and submit an app for Yarn, Mesos,
 etc... this was something that became very fragmented over time before
 this was added.

 Another feature is allowing users to set configuration values
 dynamically rather than compile them inside of their program. That's
 the one you mention here. You can choose to use this feature or not.
 If you know your configs are not going to change, then you don't need
 to set them with spark-submit.


 On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
 wrote:
  What is the purpose of spark-submit? Does it do anything outside of
  the standard val conf = new SparkConf ... val sc = new SparkContext
  ... ?



Re: Purpose of spark-submit?

2014-07-09 Thread Surendranauth Hiraman
Are there any gaps beyond convenience and code/config separation in using
spark-submit versus SparkConf/SparkContext if you are willing to set your
own config?

If there are any gaps, +1 on having parity within SparkConf/SparkContext
where possible. In my use case, we launch our jobs programmatically. In
theory, we could shell out to spark-submit but it's not the best option for
us.

So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
on the complexities of other modes, though.

-Suren



On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote:

 not sure I understand why unifying how you submit app for different
 platforms and dynamic configuration cannot be part of SparkConf and
 SparkContext?

 for classpath a simple script similar to hadoop classpath that shows
 what needs to be added should be sufficient.

 on spark standalone I can launch a program just fine with just SparkConf
 and SparkContext. not on yarn, so the spark-launch script must be doing a
 few things extra there I am missing... which makes things more difficult
 because I am not sure its realistic to expect every application that needs
 to run something on spark to be launched using spark-submit.
  On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:

 It fulfills a few different functions. The main one is giving users a
 way to inject Spark as a runtime dependency separately from their
 program and make sure they get exactly the right version of Spark. So
 a user can bundle an application and then use spark-submit to send it
 to different types of clusters (or using different versions of Spark).

 It also unifies the way you bundle and submit an app for Yarn, Mesos,
 etc... this was something that became very fragmented over time before
 this was added.

 Another feature is allowing users to set configuration values
 dynamically rather than compile them inside of their program. That's
 the one you mention here. You can choose to use this feature or not.
 If you know your configs are not going to change, then you don't need
 to set them with spark-submit.


 On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
 wrote:
  What is the purpose of spark-submit? Does it do anything outside of
  the standard val conf = new SparkConf ... val sc = new SparkContext
  ... ?




-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
The idea is to run a job that uses images as input so that each worker will
process a subset of the images.
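
A minimal sketch of one way to do that, assuming the image paths are known up
front and readable from every worker (e.g. a shared/NFS mount; reading from
HDFS would need the Hadoop FileSystem API instead):

import java.nio.file.{Files, Paths}

// placeholder list of image paths
val imagePaths = Seq("/data/images/a.png", "/data/images/b.png")

// RDD of (path, raw image bytes); each worker loads only its own partitions
val images = sc.parallelize(imagePaths)
  .map(path => (path, Files.readAllBytes(Paths.get(path))))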


On Wed, Jul 9, 2014 at 2:30 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

  RDDs can only hold objects. How do you plan to encode these images so that
  they can be loaded? Keeping each whole image as a single object in one RDD
  would perhaps not be super optimized.
 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Jul 9, 2014 at 12:17 PM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi all,

 I need to run a spark job that need a set of images as input. I need
 something that load these images as RDD but I just don't know how to do
 that. Do some of you have any idea ?

 Cheers,


 Jao





Re: Spark job tracker.

2014-07-09 Thread Mayur Rustagi
var sem = 0   // must be a var, since it is incremented below
sc.addSparkListener(new SparkListener {
  override def onTaskStart(taskStart: SparkListenerTaskStart) {
    sem += 1
  }
})
sc = spark context



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Wed, Jul 9, 2014 at 4:34 AM, abhiguruvayya sharath.abhis...@gmail.com
wrote:

 Hello Mayur,

  How can I implement the methods mentioned below? Do you have any clue on
  this? Please let me know.


 public void onJobStart(SparkListenerJobStart arg0) {
 }

 @Override
 public void onStageCompleted(SparkListenerStageCompleted arg0) {
 }

 @Override
 public void onStageSubmitted(SparkListenerStageSubmitted arg0) {

 }



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p9104.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi all,

I am currently trying to save to Cassandra after some Spark Streaming
computation.

I call a myDStream.foreachRDD so that I can collect each RDD in the driver
app runtime and inside I do something like this:
myDStream.foreachRDD(rdd => {

var someCol = Seq[MyType]()

foreach(kv => {
  someCol :+ rdd._2 // I only want the RDD value and not the key
 }
val collectionRDD = sc.parallelize(someCol) // THIS IS WHY IT FAILS TRYING TO
RUN THE WORKER
collectionRDD.saveToCassandra(...)
}

I get the NotSerializableException while trying to run the Node (also tried
someCol as shared variable).
I believe this happens because the myDStream doesn't exist yet when the code
is pushed to the Node so the parallelize doesn't have any structure to
relate to it. Inside this foreachRDD I should only do RDD calls which are
only related to other RDDs. I guess this was just a desperate attempt

So I have a question
Using the Cassandra Spark driver - Can we only write to Cassandra from an
RDD? In my case I only want to write once all the computation is finished in
a single batch on the driver app.

tnks in advance.

Rod











--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


how to convert JavaDStream<String> to JavaRDD<String>

2014-07-09 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please help me to resolve below query.

My use case is :

I'm using JavaStreamingContext to read text files from a Hadoop HDFS
directory:

JavaDStream<String> lines_2 =
ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/");

How can I convert the JavaDStream<String> result to a JavaRDD<String>, if it can
be converted? Then I can use the collect() method on the JavaRDD and process my text file.

I'm not able to find a collect method on JavaDStream<String>.

Thank you very much in advance.

Regards,
Rajesh


Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Is MyType serializable? Everything inside the foreachRDD closure has to be
serializable.
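
For reference, a minimal sketch of the usual pattern with the DataStax
spark-cassandra-connector; the keyspace/table names are placeholders, and it
assumes MyType is a serializable case class whose fields match the table:

import com.datastax.spark.connector._

myDStream.foreachRDD { rdd =>
  // write straight from the distributed RDD; no need to collect to the driver
  // and re-parallelize first
  rdd.map(_._2).saveToCassandra("my_keyspace", "my_table")
}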


2014-07-09 14:24 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com:

 Hi all,

 I am currently trying to save to Cassandra after some Spark Streaming
 computation.

 I call a myDStream.foreachRDD so that I can collect each RDD in the driver
 app runtime and inside I do something like this:
 myDStream.foreachRDD(rdd = {

 var someCol = Seq[MyType]()

 foreach(kv ={
   someCol :+ rdd._2 //I only want the RDD value and not the key
  }
 val collectionRDD = sc.parallelize(someCol) //THIS IS WHY IT FAILS TRYING
 TO
 RUN THE WORKER
 collectionRDD.saveToCassandra(...)
 }

 I get the NotSerializableException while trying to run the Node (also tried
 someCol as shared variable).
 I believe this happens because the myDStream doesn't exist yet when the
 code
 is pushed to the Node so the parallelize doens't have any structure to
 relate to it. Inside this foreachRDD I should only do RDD calls which are
 only related to other RDDs. I guess this was just a desperate attempt

 So I have a question
 Using the Cassandra Spark driver - Can we only write to Cassandra from an
 RDD? In my case I only want to write once all the computation is finished
 in
 a single batch on the driver app.

 tnks in advance.

 Rod











 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Purpose of spark-submit?

2014-07-09 Thread Robert James
+1 to be able to do anything via SparkConf/SparkContext.  Our app
worked fine in Spark 0.9, but, after several days of wrestling with
uber jars and spark-submit, and so far failing to get Spark 1.0
working, we'd like to go back to doing it ourselves with SparkConf.

As the previous poster said, a few scripts should be able to give us
the classpath and any other params we need, and be a lot more
transparent and debuggable.

On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
 Are there any gaps beyond convenience and code/config separation in using
 spark-submit versus SparkConf/SparkContext if you are willing to set your
 own config?

 If there are any gaps, +1 on having parity within SparkConf/SparkContext
 where possible. In my use case, we launch our jobs programmatically. In
 theory, we could shell out to spark-submit but it's not the best option for
 us.

 So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
 on the complexities of other modes, though.

 -Suren



 On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote:

 not sure I understand why unifying how you submit app for different
 platforms and dynamic configuration cannot be part of SparkConf and
 SparkContext?

 for classpath a simple script similar to hadoop classpath that shows
 what needs to be added should be sufficient.

 on spark standalone I can launch a program just fine with just SparkConf
 and SparkContext. not on yarn, so the spark-launch script must be doing a
 few things extra there I am missing... which makes things more difficult
 because I am not sure its realistic to expect every application that
 needs
 to run something on spark to be launched using spark-submit.
  On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:

 It fulfills a few different functions. The main one is giving users a
 way to inject Spark as a runtime dependency separately from their
 program and make sure they get exactly the right version of Spark. So
 a user can bundle an application and then use spark-submit to send it
 to different types of clusters (or using different versions of Spark).

 It also unifies the way you bundle and submit an app for Yarn, Mesos,
 etc... this was something that became very fragmented over time before
 this was added.

 Another feature is allowing users to set configuration values
 dynamically rather than compile them inside of their program. That's
 the one you mention here. You can choose to use this feature or not.
 If you know your configs are not going to change, then you don't need
 to set them with spark-submit.


 On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
 wrote:
  What is the purpose of spark-submit? Does it do anything outside of
  the standard val conf = new SparkConf ... val sc = new SparkContext
  ... ?




 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@velos.io
 W: www.velos.io



Re: controlling the time in spark-streaming

2014-07-09 Thread Laeeq Ahmed
Hi,

For QueueRDD, have a look here. 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala

  

Regards,
Laeeq



On Friday, May 23, 2014 10:33 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
 


Well, it's hard to use the text data as the time of input.
But if you are adamant, here's what you would do.
Have a DStream object which works on a folder using fileStream/textFileStream.
Then have another process (Spark Streaming or cron) read through the files you
receive and push them into the folder in order of time. Typically your data would be
produced at t, you would get it at t + say 5 sec, and you can push it in and get it
processed at t + say 10 sec. Then you can time-shift your calculations. If you
are okay with a broad enough time frame you should be fine.


Another way is to use queue processing.
QueueJavaRDDInteger rddQueue = new LinkedListJavaRDDInteger();
Create a Dstream to consume from Queue of RDD, keep looking into the folder of 
files  create rdd's from them at a min level  push them into thee queue. This 
would cause you to go through your data atleast twice  just provide order 
guarantees , processing time is still going to vary with how quickly you can 
process your RDD.
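
A minimal Scala sketch of that queue-based approach (the file paths and one-second pacing below are just placeholders):

import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueTimeShift {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "QueueTimeShift")
    val ssc = new StreamingContext(sc, Seconds(1))

    // The DStream consumes whatever RDDs are pushed into this queue, so
    // "application time" follows the order in which the files are pushed.
    val rddQueue = new mutable.Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue)
    lines.count().print()

    ssc.start()
    // Push one RDD per time-ordered input file (placeholder paths).
    for (path <- Seq("logs/t0.log", "logs/t1.log", "logs/t2.log")) {
      rddQueue.synchronized { rddQueue += sc.textFile(path) }
      Thread.sleep(1000)
    }
    ssc.awaitTermination()
  }
}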

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi



On Thu, May 22, 2014 at 9:08 PM, Ian Holsman i...@holsman.com.au wrote:

Hi.


I'm writing a pilot project, and plan on using spark's streaming app for it.


To start with I have a dump of some access logs with their own timestamps, and 
am using the textFileStream and some old files to test it with.


One of the issues I've come across is simulating the windows. I would like to use 
the timestamp from the access logs as the 'system time' instead of the real 
clock time.


I googled a bit and found the 'manual' clock which appears to be used for 
testing the job scheduler.. but I'm not sure what my next steps should be.


I'm guessing I'll need to do something like


1. use the textFileStream to create a 'DStream'
2. have some kind of DStream that runs on top of that and creates the RDDs 
based on the timestamps instead of the system time
3. the rest of my mappers.


Is this correct? Or do I need to create my own 'textFileStream' to initially 
create the RDDs and modify the system clock inside of it?


I'm not too concerned about out-of-order messages, going backwards in time, or 
being 100% in sync across workers.. as this is more for 
'development'/prototyping.


Are there better ways of achieving this? I would assume that controlling the 
window's RDD buckets would be a common use case.


TIA
Ian

-- 

Ian Holsman
i...@holsman.com.au 
PH: + 61-3-9028 8133 / +1-(425) 998-7083

RDD Cleanup

2014-07-09 Thread premdass
Hi,

I am using Spark 1.0.0 with the Ooyala Job Server, for a low-latency query
system. Basically a long-running context is created, which makes it possible to run
multiple jobs under the same context, and hence share the data.

It was working fine in 0.9.1. However, in the Spark 1.0 release, the RDDs
created and cached by Job-1 get cleaned up by the BlockManager (I can see log
statements saying "cleaning up RDD"), and so the cached RDDs are not available
for Job-2, though both Job-1 and Job-2 are running under the same Spark context.

I tried using the spark.cleaner.referenceTracking = false setting; however,
this causes the issue that unpersisted RDDs are not cleaned up properly
and occupy Spark's memory.


Has anybody faced an issue like this before? If so, any advice would be greatly
appreciated.


Also, is there any way to mark an RDD as being in use under a context, even
though the job using it has finished (so subsequent jobs can use that
RDD)?


Thanks,
Prem



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: how to convert JavaDStream<String> to JavaRDD<String>

2014-07-09 Thread Laeeq Ahmed
Hi,

First use foreachRDD and then use collect, as:


dstream.foreachRDD(rdd => {
   rdd.collect().foreach(...)
})

Also, it's better to use Scala. Less verbose. 
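
For completeness, a small self-contained Scala sketch of that pattern (reusing the HDFS path from the question; collect() pulls the batch to the driver, so it only makes sense for small batches):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CollectEachBatch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CollectEachBatch")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/")
    lines.foreachRDD { rdd =>
      // Each batch interval produces one RDD; process it like any other RDD.
      rdd.collect().foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}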


Regards,
Laeeq



On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:
 


Hi Team,

Could you please help me to resolve below query.

My use case is :

I'm using JavaStreamingContext to read text files from a Hadoop HDFS directory:

JavaDStream<String> lines_2 = 
ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/");

How can I convert the JavaDStream<String> result to JavaRDD<String>, if it can be 
converted? Then I can use the collect() method on the JavaRDD and process my 
text file.

I'm not able to find a collect method on JavaRDD<String>.

Thank you very much in advance.

Regards,
Rajesh

Re: window analysis with Spark and Spark streaming

2014-07-09 Thread Laeeq Ahmed
Hi,

For QueueRDD, have a look here.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala

Regards,
Laeeq,
PhD candidatte,
KTH, Stockholm.

 


On Sunday, July 6, 2014 10:20 AM, alessandro finamore 
alessandro.finam...@polito.it wrote:
 


On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List] 
[hidden email] wrote: 
 Key idea is to simulate your app time as you enter data . So you can connect 
 spark streaming to a queue and insert data in it spaced by time. Easier said 
 than done :). 

I see. 
I'll also try to implement this solution so that I can compare it with 
my current Spark implementation. 
I'm interested in seeing if this is faster...as I assume it should be :) 

 What are the parallelism issues you are hitting with your 
 static approach. 

In my current spark implementation, whenever I need to get the 
aggregated stats over the window, I'm re-mapping all the current bins 
to have the same key so that they can be reduced altogether. 
This means that the data needs to be shipped to a single reducer. 
As a result, adding nodes/cores to the application does not really 
affect the total time :( 


 
 
 On Friday, July 4, 2014, alessandro finamore [hidden email] wrote: 
 
 Thanks for the replies 
 
 What is not completely clear to me is how time is managed. 
 I can create a DStream from file. 
 But if I set the window property that will be bounded to the application 
 time, right? 
 
 If I got it right, with a receiver I can control the way DStream are 
 created. 
 But, how can apply then the windowing already shipped with the framework 
 if 
 this is bounded to the application time? 
 I would like to do define a window of N files but the window() function 
 requires a duration as input... 
 
 
 
 
 -- 
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/window-analysis-with-Spark-and-Spark-streaming-tp8806p8824.html
 
 Sent from the Apache Spark User List mailing list archive at Nabble.com. 
 
 
 
 -- 
 Sent from Gmail Mobile 
 
 
  


-- 
-- 
Alessandro Finamore, PhD 
Politecnico di Torino 
-- 
Office:    +39 0115644127 
Mobile:   +39 3280251485 
SkypeId: alessandro.finamore 
--- 


 View this message in context: Re: window analysis with Spark and Spark 
streaming

Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
+1 as well for being able to submit jobs programmatically without using a
shell script.

We also experience issues when submitting jobs programmatically without using
spark-submit. In fact, even in the Hadoop world, I rarely used "hadoop jar"
to submit jobs from a shell.



On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote:

 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in using
  spark-submit versus SparkConf/SparkContext if you are willing to set your
  own config?
 
  If there are any gaps, +1 on having parity within SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best option
 for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not
 knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be doing
 a
  few things extra there I am missing... which makes things more difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:
 
  It fulfills a few different functions. The main one is giving users a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark. So
  a user can bundle an application and then use spark-submit to send it
  to different types of clusters (or using different versions of Spark).
 
  It also unifies the way you bundle and submit an app for Yarn, Mesos,
  etc... this was something that became very fragmented over time before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
  wrote:
   What is the purpose of spark-submit? Does it do anything outside of
   the standard val conf = new SparkConf ... val sc = new SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH AVENUE, 11TH FLOOR
  NEW YORK, NY 10001
  O: (917) 525-2466 ext. 105
  F: 646.349.4063
  E: suren.hiraman@v suren.hira...@sociocast.comelos.io
  W: www.velos.io
 



Re: Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi Luis,

Yes, it's actually an output of the previous RDD. 
Have you ever used the Cassandra Spark driver in the driver app? I believe
these limitations revolve around that - it's designed to save RDDs from the
nodes.

tnks,
Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
did you explicitly cache the rdd? we cache rdds and share them between jobs
just fine within one context in spark 1.0.x. but we do not use the ooyala
job server...


On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote:

 Hi,

 I using spark 1.0.0  , using Ooyala Job Server, for a low latency query
 system. Basically a long running context is created, which enables to run
 multiple jobs under the same context, and hence sharing of the data.

 It was working fine in 0.9.1. However in spark 1.0 release, the RDD's
 created and cached by a Job-1 gets cleaned up by BlockManager (can see log
 statements saying cleaning up RDD) and so the cached RDD's are not
 available
 for Job-2, though Both Job-1 and Job-2 are running under same spark
 context.

 I tried using the spark.cleaner.referenceTracking = false setting, how-ever
 this causes the issue that unpersisted RDD's are not cleaned up properly,
 and occupying the Spark's memory..


 Had anybody faced issue like this before? If so, any advice would be
 greatly
 appreicated.


 Also is there any way, to mark an RDD as being used under a context, event
 though the job using that had been finished (so subsequent jobs can use
 that
 RDD).


 Thanks,
 Prem



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Yes, I'm using it to count concurrent users from a Kafka stream of events
without problems. I'm currently testing it in local mode, but any
serialization problem would have already appeared, so I don't expect any
serialization issues when I deploy it to my cluster.


2014-07-09 15:39 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com:

 Hi Luis,

 Yes it's actually an ouput of the previous RDD.
 Have you ever used the Cassandra Spark Driver on the driver app? I believe
 these limitations go around that - it's designed to save RDDs from the
 nodes.

 tnks,
 Rod



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark Streaming and Storm

2014-07-09 Thread Dan H.
Xichen_tju,

I recently evaluated Storm for a period of months (using 2U servers, 2.4GHz CPU, 
24GB RAM, with 3 servers) and was not able to achieve a realistic scale for my 
business domain needs.  Storm is really only a framework, which allows you to 
put in code to do whatever it is you need for a distributed system…so it's 
completely flexible and distributable, but it comes at a price.  In Storm, one 
of the biggest performance hits came down to how the "acks" work within 
the tuple trees.  You can have the framework ack messages between spouts and/or 
bolts by default, but in the end you most likely want to manage acks yourself, 
depending on how much reliability your system will need (to replay 
messages…).  All this means is that if you don't have massive amounts of data 
that you need to process within a few seconds (which I do), then Storm may work 
well for you, but your performance will diminish as you add in more and more 
business rules (unless of course you add in more servers for processing).  If 
you need to ingest at least 1GBps+, then you may want to reevaluate, since 
your server scale may not mesh with your overall processing needs.

I recently started using Spark Streaming with Kafka and have been quite 
impressed at the performance level that's being achieved.  I particularly like 
the fact that Spark isn't just a framework; it provides you with simple 
tools and API convenience methods.  Some of those features are reduceByKey 
(map/reduce), sliding and aggregate sub-windows over time, etc.  Also, in my 
environment, I believe it's going to be a great fit since we use Hadoop already 
and Spark should fit into that environment well.
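
As a rough Scala sketch of those convenience methods on a Kafka-backed stream (assuming the spark-streaming-kafka dependency; the ZooKeeper address, group, topic and key extraction below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

object WindowedCounts {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]").setAppName("WindowedCounts")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint")

    // (ZooKeeper quorum, consumer group, topic -> number of receiver threads)
    val events = KafkaUtils.createStream(ssc, "localhost:2181", "demo-group", Map("events" -> 2))

    // Count events per key over a sliding 60-second window, updated every 5 seconds.
    val counts = events
      .map { case (_, line) => (line.split(",")(0), 1L) }
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}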

You should look into both Storm and Spark Streaming, but in the end it just 
depends on your needs.  If you're not looking for streaming aspects, then Spark on 
Hadoop is a great option, since Spark will cache the dataset in memory for all 
queries, which will be much faster than running Hive/Pig on top of Hadoop.  But 
I'm assuming you need some sort of streaming system for data flow; if it 
doesn't need to be real-time or near real-time, you may want to simply look at 
Hadoop, which you could always use Spark on top of for real-time queries.

Hope this helps…

Dan

 
On Jul 8, 2014, at 7:25 PM, Shao, Saisai saisai.s...@intel.com wrote:

 You may get the performance comparison results from Spark Streaming paper and 
 meetup ppt, just google it.
 Actually performance comparison is case by case and relies on your work load 
 design, hardware and software configurations. There is no actual winner for 
 the whole scenarios.
  
 Thanks
 Jerry
  
 From: xichen_tju@126 [mailto:xichen_...@126.com] 
 Sent: Wednesday, July 09, 2014 9:17 AM
 To: user@spark.apache.org
 Subject: Spark Streaming and Storm
  
 hi all
 I am a newbie to Spark Streaming, and used Strom before.Have u test the 
 performance both of them and which one is better?
  
 xichen_tju@126



Re: RDD Cleanup

2014-07-09 Thread premdass
Hi,

Yes. I am caching the RDDs by calling the cache method.


May I ask how you are sharing RDDs across jobs in the same context? By the RDD
name? I tried printing the RDDs of the Spark context, and when
referenceTracking is enabled, I get an empty list after the cleanup.

Thanks,
Prem




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
we simply hold on to the reference to the rdd after it has been cached. so
we have a single Map[String, RDD[X]] for cached RDDs for the application
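
for illustration, a stripped-down sketch of that pattern (not our actual code, just the shape of it):

import scala.collection.concurrent.TrieMap
import org.apache.spark.rdd.RDD

object RddRegistry {
  // Cached RDDs stay alive as long as we hold a strong reference to them here,
  // so the reference-tracking cleaner will not remove their blocks.
  private val cached = TrieMap.empty[String, RDD[_]]

  def register[T](name: String, rdd: RDD[T]): RDD[T] = {
    rdd.cache()
    cached.put(name, rdd)
    rdd
  }

  def lookup[T](name: String): Option[RDD[T]] =
    cached.get(name).map(_.asInstanceOf[RDD[T]])
}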


On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote:

 Hi,

 Yes . I am  caching the RDD's by calling cache method..


 May i ask, how you are sharing RDD's across jobs in same context? By the
 RDD
 name. I tried printing the RDD's of the Spark context, and when the
 referenceTracking is enabled, i get empty list after the clean up.

 Thanks,
 Prem




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
I am trying to get my head around using Spark on Yarn from the perspective of
a cluster. I can start a Spark Shell in Yarn with no issues. Works easily.  This
is done in yarn-client mode and it all works well.

In multiple examples, I see instances where people have set up Spark
clusters in Stand Alone mode, and then in the examples they connect to
this cluster in Stand Alone mode. This is often done using the
spark:// string for the connection.  Cool.
But what I don't understand is how do I set up a Yarn instance that I can
connect to? I.e. I tried running the Spark Shell in yarn-cluster mode and it
gave me an error, telling me to use yarn-client.  I see information on
using spark-class or spark-submit.  But what I'd really like is an instance
I can connect a spark-shell to, and have the instance stay up. I'd like to
be able to run other things on that instance, etc. Is that possible with Yarn?
I know there may be long-running job challenges with Yarn, but I am just
testing; I am just curious if I am looking at something completely bonkers
here, or just missing something simple.

Thanks!


Re: Purpose of spark-submit?

2014-07-09 Thread Andrei
Yet another +1. For me it's a question of embedding. With
SparkConf/SparkContext I can easily create larger projects with Spark as a
separate service (just like MySQL and JDBC, for example). With spark-submit
I'm bound to Spark as a main framework that defines what my application
should look like. In my humble opinion, using Spark as an embeddable library
rather than as the main framework and runtime is much easier.




On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:

 +1 as well for being able to submit jobs programmatically without using
 shell script.

 we also experience issues of submitting jobs programmatically without
 using spark-submit. In fact, even in the Hadoop World, I rarely used
 hadoop jar to submit jobs in shell.



 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com
 wrote:

 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in
 using
  spark-submit versus SparkConf/SparkContext if you are willing to set
 your
  own config?
 
  If there are any gaps, +1 on having parity within SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best option
 for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not
 knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com
 wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just
 SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be
 doing a
  few things extra there I am missing... which makes things more
 difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:
 
  It fulfills a few different functions. The main one is giving users a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark. So
  a user can bundle an application and then use spark-submit to send it
  to different types of clusters (or using different versions of Spark).
 
  It also unifies the way you bundle and submit an app for Yarn, Mesos,
  etc... this was something that became very fragmented over time before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
 
  wrote:
   What is the purpose of spark-submit? Does it do anything outside of
   the standard val conf = new SparkConf ... val sc = new SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH AVENUE, 11TH FLOOR
  NEW YORK, NY 10001
  O: (917) 525-2466 ext. 105
  F: 646.349.4063
  E: suren.hiraman@v suren.hira...@sociocast.comelos.io
  W: www.velos.io
 





Re: Purpose of spark-submit?

2014-07-09 Thread Sandy Ryza
Spark still supports the ability to submit jobs programmatically without
shell scripts.

Koert,
The main reason that the unification can't be a part of SparkContext is
that YARN and standalone support deploy modes where the driver runs in a
managed process on the cluster.  In this case, the SparkContext is created
on a remote node well after the application is launched.


On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote:

 One another +1. For me it's a question of embedding. With
 SparkConf/SparkContext I can easily create larger projects with Spark as a
 separate service (just like MySQL and JDBC, for example). With spark-submit
 I'm bound to Spark as a main framework that defines how my application
 should look like. In my humble opinion, using Spark as embeddable library
 rather than main framework and runtime is much easier.




 On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:

 +1 as well for being able to submit jobs programmatically without using
 shell script.

 we also experience issues of submitting jobs programmatically without
 using spark-submit. In fact, even in the Hadoop World, I rarely used
 hadoop jar to submit jobs in shell.



 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com
 wrote:

 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in
 using
  spark-submit versus SparkConf/SparkContext if you are willing to set
 your
  own config?
 
  If there are any gaps, +1 on having parity within
 SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best
 option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not
 knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com
 wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just
 SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be
 doing a
  few things extra there I am missing... which makes things more
 difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  It fulfills a few different functions. The main one is giving users a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark. So
  a user can bundle an application and then use spark-submit to send it
  to different types of clusters (or using different versions of
 Spark).
 
  It also unifies the way you bundle and submit an app for Yarn, Mesos,
  etc... this was something that became very fragmented over time
 before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
 srobertja...@gmail.com
  wrote:
   What is the purpose of spark-submit? Does it do anything outside of
   the standard val conf = new SparkConf ... val sc = new SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH AVENUE, 11TH FLOOR
  NEW YORK, NY 10001
  O: (917) 525-2466 ext. 105
  F: 646.349.4063
  E: suren.hiraman@v suren.hira...@sociocast.comelos.io
  W: www.velos.io
 






Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Ron Gonzalez
The idea behind YARN is that you can run different application types like 
MapReduce, Storm and Spark.

I would recommend that you build your Spark jobs in the main method without 
specifying how you deploy them. Then you can use spark-submit to tell Spark how 
you want to deploy, using yarn-cluster as the master. The key point here is 
that once you have YARN set up, the Spark client connects to it using the 
$HADOOP_CONF_DIR that contains the resource manager address. In particular, 
this needs to be accessible from the classpath of the submitter, since it 
implicitly uses this when it instantiates a YarnConfiguration instance. If you 
want more details, read org.apache.spark.deploy.yarn.Client.scala.
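
As a quick illustration of that last point (just a sketch, assuming the Hadoop YARN API is on the classpath), you can check which resource manager address would be resolved by the submitting process:

import org.apache.hadoop.yarn.conf.YarnConfiguration

object CheckYarnConf {
  def main(args: Array[String]) {
    // YarnConfiguration loads yarn-site.xml from the classpath, which is why
    // HADOOP_CONF_DIR must be visible to the process doing the submission.
    val conf = new YarnConfiguration()
    // Prints the default 0.0.0.0:8032 if yarn-site.xml was not found.
    println(conf.get(YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS))
  }
}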

You should be able to download a standalone YARN cluster from any of the Hadoop 
providers like Cloudera or Hortonworks. Once you have that, the spark 
programming guide describes what I mention above in sufficient detail for you 
to proceed.

Thanks,
Ron

Sent from my iPad

 On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote:
 
 I am trying to get my head around using Spark on Yarn from a perspective of a 
 cluster. I can start a Spark Shell no issues in Yarn. Works easily.  This is 
 done in yarn-client mode and it all works well. 
 
 In multiple examples, I see instances where people have setup Spark Clusters 
 in Stand Alone mode, and then in the examples they connect to this cluster 
 in Stand Alone mode. This is done often times using the spark:// string for 
 connection.  Cool. s
 But what I don't understand is how do I setup a Yarn instance that I can 
 connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it 
 gave me an error, telling me to use yarn-client.  I see information on using 
 spark-class or spark-submit.  But what I'd really like is a instance I can 
 connect a spark-shell too, and have the instance stay up. I'd like to be able 
 run other things on that instance etc. Is that possible with Yarn? I know 
 there may be long running job challenges with Yarn, but I am just testing, I 
 am just curious if I am looking at something completely bonkers here, or just 
 missing something simple. 
 
 Thanks!
 
 


Mechanics of passing functions to Spark?

2014-07-09 Thread Seref Arikan
Greetings,
The documentation at
http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark
says:
"Note that while it is also possible to pass a reference to a method in a
class instance (as opposed to a singleton object), this requires sending
the object that contains that class along with the method."

First, could someone clarify what is meant by "sending the object" here? How
is the object sent to (presumably) the nodes of the cluster? Is it a one-time
operation per node? Why does this sound like (according to the doc) a less
preferred option compared to a singleton's function? Would not the nodes
require the singleton object anyway?

Some clarification would really help.

Regards
Seref

ps: the expression "the object that contains that class" sounds a bit
unusual; is the intended meaning "the object that is the instance of that
class"?
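
For concreteness, a small Scala sketch of the two cases being contrasted (the class and object names here are made up):

import org.apache.spark.rdd.RDD

object Formatter {                      // singleton object
  def format(s: String): String = s.toUpperCase
}

class Processor(val prefix: String) extends Serializable {
  def addPrefix(s: String): String = prefix + s

  // rdd.map(addPrefix) is effectively rdd.map(x => this.addPrefix(x)), so the
  // whole Processor instance is serialized and shipped with the tasks.
  def run(rdd: RDD[String]): RDD[String] = rdd.map(addPrefix)
}

object SingletonStyle {
  // Only a reference to the Formatter singleton is needed on the executors.
  def run(rdd: RDD[String]): RDD[String] = rdd.map(Formatter.format)
}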


Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
sandy, that makes sense. however i had trouble doing programmatic execution
on yarn in client mode as well. the application-master in yarn came up but
then bombed because it was looking for jars that don't exist (it was looking
in the original file paths on the driver side, which are not available on
the yarn node). my guess is that spark-submit is changing some settings
(perhaps preparing the distributed cache and modifying settings
accordingly), which makes it harder to run things programmatically. i could
be wrong however. i gave up debugging and resorted to using spark-submit
for now.



On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Spark still supports the ability to submit jobs programmatically without
 shell scripts.

 Koert,
 The main reason that the unification can't be a part of SparkContext is
 that YARN and standalone support deploy modes where the driver runs in a
 managed process on the cluster.  In this case, the SparkContext is created
 on a remote node well after the application is launched.


 On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote:

 One another +1. For me it's a question of embedding. With
 SparkConf/SparkContext I can easily create larger projects with Spark as a
 separate service (just like MySQL and JDBC, for example). With spark-submit
 I'm bound to Spark as a main framework that defines how my application
 should look like. In my humble opinion, using Spark as embeddable library
 rather than main framework and runtime is much easier.




 On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:

 +1 as well for being able to submit jobs programmatically without using
 shell script.

 we also experience issues of submitting jobs programmatically without
 using spark-submit. In fact, even in the Hadoop World, I rarely used
 hadoop jar to submit jobs in shell.



 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com
 wrote:

 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in
 using
  spark-submit versus SparkConf/SparkContext if you are willing to set
 your
  own config?
 
  If there are any gaps, +1 on having parity within
 SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically.
 In
  theory, we could shell out to spark-submit but it's not the best
 option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not
 knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com
 wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that
 shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just
 SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be
 doing a
  few things extra there I am missing... which makes things more
 difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  It fulfills a few different functions. The main one is giving users
 a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark.
 So
  a user can bundle an application and then use spark-submit to send
 it
  to different types of clusters (or using different versions of
 Spark).
 
  It also unifies the way you bundle and submit an app for Yarn,
 Mesos,
  etc... this was something that became very fragmented over time
 before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't
 need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
 srobertja...@gmail.com
  wrote:
   What is the purpose of spark-submit? Does it do anything outside
 of
   the standard val conf = new SparkConf ... val sc = new
 SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH 

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
To add to Ron's answer, this post explains what it means to run Spark
against a YARN cluster, the difference between yarn-client and yarn-cluster
mode, and the reason spark-shell only works in yarn-client mode.
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

-Sandy


On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote:

 The idea behind YARN is that you can run different application types like
 MapReduce, Storm and Spark.

 I would recommend that you build your spark jobs in the main method
 without specifying how you deploy it. Then you can use spark-submit to tell
 Spark how you would want to deploy to it using yarn-cluster as the master.
 The key point here is that once you have YARN setup, the spark client
 connects to it using the $HADOOP_CONF_DIR that contains the resource
 manager address. In particular, this needs to be accessible from the
 classpath of the submitter since it implicitly uses this when it
 instantiates a YarnConfiguration instance. If you want more details, read
 org.apache.spark.deploy.yarn.Client.scala.

 You should be able to download a standalone YARN cluster from any of the
 Hadoop providers like Cloudera or Hortonworks. Once you have that, the
 spark programming guide describes what I mention above in sufficient detail
 for you to proceed.

 Thanks,
 Ron

 Sent from my iPad

  On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote:
 
  I am trying to get my head around using Spark on Yarn from a perspective
 of a cluster. I can start a Spark Shell no issues in Yarn. Works easily.
  This is done in yarn-client mode and it all works well.
 
  In multiple examples, I see instances where people have setup Spark
 Clusters in Stand Alone mode, and then in the examples they connect to
 this cluster in Stand Alone mode. This is done often times using the
 spark:// string for connection.  Cool. s
  But what I don't understand is how do I setup a Yarn instance that I can
 connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it
 gave me an error, telling me to use yarn-client.  I see information on
 using spark-class or spark-submit.  But what I'd really like is a instance
 I can connect a spark-shell too, and have the instance stay up. I'd like to
 be able run other things on that instance etc. Is that possible with Yarn?
 I know there may be long running job challenges with Yarn, but I am just
 testing, I am just curious if I am looking at something completely bonkers
 here, or just missing something simple.
 
  Thanks!
 
 



Error with Stream Kafka Kryo

2014-07-09 Thread richiesgr
Hi

My setup uses local standalone mode, the Spark 1.0.0 release version, Scala
2.10.4.

I made a job that receives serialized objects from a Kafka broker. The objects
are serialized using Kryo.
The code:

val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkTest")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .set("spark.kryo.registrator", "com.inneractive.fortknox.kafka.EventDetailRegistrator")

val ssc = new StreamingContext(sparkConf, Seconds(20))
ssc.checkpoint("checkpoint")


val topicMap = topic.split(",").map((_, partitions)).toMap

// Create a Stream using my Decoder EventKryoEncoder
val events = KafkaUtils.createStream[String, EventDetails, StringDecoder, EventKryoEncoder](ssc, kafkaMapParams,
  topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

val data = events.map(e => (e.getPublisherId, 1L))
val counter = data.reduceByKey(_ + _)
counter.print()

ssc.start()
ssc.awaitTermination()

When I run this I get
java.lang.IllegalStateException: unread block data
at
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
~[na:1.7.0_60]
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
~[na:1.7.0_60]
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
~[na:1.7.0_60]
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
~[scala-library-2.10.4.jar:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
~[scala-library-2.10.4.jar:na]
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.scheduler.Task.run(Task.scala:51)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_60]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_60]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]

I've checked that my decoder is working; I can trace that the deserialization
is OK, so Spark must be getting a ready-to-use object.

My setup works if I use JSON instead of Kryo-serialized objects.
Thanks for any help, because I don't know what to do next.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-with-Stream-Kafka-Kryo-tp9200.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
Sandy, I experienced similar behavior to what Koert just mentioned. I don't
understand why there is a difference between using spark-submit and
programmatic execution. Maybe there is something else we need to add to the
Spark conf/Spark context in order to launch Spark jobs programmatically
that was not needed before?



On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote:

 sandy, that makes sense. however i had trouble doing programmatic
 execution on yarn in client mode as well. the application-master in yarn
 came up but then bombed because it was looking for jars that dont exist (it
 was looking in the original file paths on the driver side, which are not
 available on the yarn node). my guess is that spark-submit is changing some
 settings (perhaps preparing the distributed cache and modifying settings
 accordingly), which makes it harder to run things programmatically. i could
 be wrong however. i gave up debugging and resorted to using spark-submit
 for now.



 On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Spark still supports the ability to submit jobs programmatically without
 shell scripts.

 Koert,
 The main reason that the unification can't be a part of SparkContext is
 that YARN and standalone support deploy modes where the driver runs in a
 managed process on the cluster.  In this case, the SparkContext is created
 on a remote node well after the application is launched.


 On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote:

 One another +1. For me it's a question of embedding. With
 SparkConf/SparkContext I can easily create larger projects with Spark as a
 separate service (just like MySQL and JDBC, for example). With spark-submit
 I'm bound to Spark as a main framework that defines how my application
 should look like. In my humble opinion, using Spark as embeddable library
 rather than main framework and runtime is much easier.




 On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:

 +1 as well for being able to submit jobs programmatically without using
 shell script.

 we also experience issues of submitting jobs programmatically without
 using spark-submit. In fact, even in the Hadoop World, I rarely used
 hadoop jar to submit jobs in shell.



 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com
 wrote:

 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in
 using
  spark-submit versus SparkConf/SparkContext if you are willing to set
 your
  own config?
 
  If there are any gaps, +1 on having parity within
 SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically.
 In
  theory, we could shell out to spark-submit but it's not the best
 option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not
 knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com
 wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that
 shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just
 SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be
 doing a
  few things extra there I am missing... which makes things more
 difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  It fulfills a few different functions. The main one is giving
 users a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark.
 So
  a user can bundle an application and then use spark-submit to send
 it
  to different types of clusters (or using different versions of
 Spark).
 
  It also unifies the way you bundle and submit an app for Yarn,
 Mesos,
  etc... this was something that became very fragmented over time
 before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program.
 That's
  the one you mention here. You can choose to use this feature or
 not.
  If you know your configs are not going 

Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Hello,

I am currently learning Apache Spark and I want to see how it integrates
with an existing Hadoop Cluster.

My current Hadoop configuration is version 2.2.0 without Yarn. I have built
Apache Spark (v1.0.0) following the instructions in the README file, only
setting SPARK_HADOOP_VERSION=1.2.1. Also, I export HADOOP_CONF_DIR
to point to the Hadoop configuration directory.

My use case is the linear least squares regression MLlib example of Apache Spark
(link:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression).
The only difference in the code is that the text file I load is an HDFS
file.

However, I get a Runtime Exception: Error in configuring object.

So my question is the following:

Does Spark work with a Hadoop distribution without Yarn?
If yes, am I doing it right? If no, can I build Spark with
SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false?

Thank you,
Nick


Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
Koert,
Yeah, I had the same problems trying to do programmatic submission of Spark jobs 
to my Yarn cluster. I was ultimately able to resolve it by reviewing the 
classpath and debugging through all the different things that the Spark Yarn 
client (Client.scala) did for submitting to Yarn (like env setup, local 
resources, etc.), and comparing that to what spark-submit had done.
I have to admit, though, that it was far from trivial to get it working out of 
the box, and perhaps some work could be done in that regard. In my case, it 
boiled down to the launch environment not having HADOOP_CONF_DIR set, 
which prevented the app master from registering itself with the Resource 
Manager.

Thanks,
Ron

Sent from my iPad

 On Jul 9, 2014, at 9:25 AM, Jerry Lam chiling...@gmail.com wrote:
 
 Sandy, I experienced the similar behavior as Koert just mentioned. I don't 
 understand why there is a difference between using spark-submit and 
 programmatic execution. Maybe there is something else we need to add to the 
 spark conf/spark context in order to launch spark jobs programmatically that 
 are not needed before?
 
 
 
 On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote:
 sandy, that makes sense. however i had trouble doing programmatic execution 
 on yarn in client mode as well. the application-master in yarn came up but 
 then bombed because it was looking for jars that dont exist (it was looking 
 in the original file paths on the driver side, which are not available on 
 the yarn node). my guess is that spark-submit is changing some settings 
 (perhaps preparing the distributed cache and modifying settings 
 accordingly), which makes it harder to run things programmatically. i could 
 be wrong however. i gave up debugging and resorted to using spark-submit for 
 now.
 
 
 
 On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Spark still supports the ability to submit jobs programmatically without 
 shell scripts.
 
 Koert,
 The main reason that the unification can't be a part of SparkContext is 
 that YARN and standalone support deploy modes where the driver runs in a 
 managed process on the cluster.  In this case, the SparkContext is created 
 on a remote node well after the application is launched.
 
 
 On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote:
 One another +1. For me it's a question of embedding. With 
 SparkConf/SparkContext I can easily create larger projects with Spark as a 
 separate service (just like MySQL and JDBC, for example). With 
 spark-submit I'm bound to Spark as a main framework that defines how my 
 application should look like. In my humble opinion, using Spark as 
 embeddable library rather than main framework and runtime is much easier. 
 
 
 
 
 On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:
 +1 as well for being able to submit jobs programmatically without using 
 shell script.
 
 we also experience issues of submitting jobs programmatically without 
 using spark-submit. In fact, even in the Hadoop World, I rarely used 
 hadoop jar to submit jobs in shell. 
 
 
 
 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com 
 wrote:
 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.
 
 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.
 
 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in 
  using
  spark-submit versus SparkConf/SparkContext if you are willing to set 
  your
  own config?
 
  If there are any gaps, +1 on having parity within 
  SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best 
  option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not 
  knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com 
  wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just 
  SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be 
  doing a
  few things extra there I am missing... which makes things more 
  difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using 

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
I am able to use Client.scala or LauncherExecutor.scala as my programmatic 
entry point for Yarn.

Thanks,
Ron

Sent from my iPad

 On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote:
 
 +1 as well for being able to submit jobs programmatically without using shell 
 script.
 
 we also experience issues of submitting jobs programmatically without using 
 spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar 
 to submit jobs in shell. 
 
 
 
 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote:
 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.
 
 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.
 
 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in using
  spark-submit versus SparkConf/SparkContext if you are willing to set your
  own config?
 
  If there are any gaps, +1 on having parity within SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be doing a
  few things extra there I am missing... which makes things more difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:
 
  It fulfills a few different functions. The main one is giving users a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark. So
  a user can bundle an application and then use spark-submit to send it
  to different types of clusters (or using different versions of Spark).
 
  It also unifies the way you bundle and submit an app for Yarn, Mesos,
  etc... this was something that became very fragmented over time before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
  wrote:
   What is the purpose of spark-submit? Does it do anything outside of
   the standard val conf = new SparkConf ... val sc = new SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH AVENUE, 11TH FLOOR
  NEW YORK, NY 10001
  O: (917) 525-2466 ext. 105
  F: 646.349.4063
  E: suren.hiraman@velos.io
  W: www.velos.io
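
For the standalone case Koert and Robert describe above, a minimal sketch of what a
self-contained launch without spark-submit can look like. The master URL, jar path,
app name and memory setting below are placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}

object StandaloneLauncher {
  def main(args: Array[String]) {
    // Everything spark-submit would normally inject is set explicitly here.
    val conf = new SparkConf()
      .setAppName("my-app")                          // placeholder application name
      .setMaster("spark://master-host:7077")         // standalone master URL (placeholder)
      .set("spark.executor.memory", "2g")            // any dynamic config you would otherwise pass at submit time
      .setJars(Seq("/path/to/my-app-assembly.jar"))  // ship the application jar to the executors
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}

As the thread notes, this covers standalone mode; on YARN the launcher does extra
classpath and cluster setup work that is not captured by SparkConf alone.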
 
 


Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Mikhail Strebkov
Hi Patrick,

I used 1.0 branch, but it was not an official release, I just git pulled
whatever was there and compiled.

Thanks,
Mikhail



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Terminal freeze during SVM

2014-07-09 Thread AlexanderRiggers
By latest branch, do you mean Apache Spark 1.0.0? And what do you mean by
master? Because I am using v1.0.0. - Alex



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.

2014-07-09 Thread Rahul Bhojwani
I believe there is a bug in the MLlib Naive Bayes implementation in Spark
0.9.1.
Whom should I report this to or with whom should I discuss? I can discuss
this over call as well.

My Skype ID : rahul.bhijwani
Phone no:  +91-9945197359

Thanks,
-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Re: Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.

2014-07-09 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Make a JIRA with enough detail to reproduce the error ideally:
https://issues.apache.org/jira/browse/SPARK

and then even more ideally open a PR with a fix:
https://github.com/apache/spark


On Wed, Jul 9, 2014 at 5:57 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:

 According to me there is BUG in MLlib Naive Bayes implementation in spark
 0.9.1.
 Whom should I report this to or with whom should I discuss? I can discuss
 this over call as well.

 My Skype ID : rahul.bhijwani
 Phone no:  +91-9945197359

 Thanks,
 --
 Rahul K Bhojwani
 3rd Year B.Tech
 Computer Science and Engineering
 National Institute of Technology, Karnataka



Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from the latest development branch (master) of the git
repository.
On Jul 9, 2014 9:45 AM, AlexanderRiggers alexander.rigg...@gmail.com
wrote:

 By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by
 master? Because I am using v 1.0.0 - Alex



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Hi everyone,

I am new to Spark and I'm having problems making my code compile. I have
the feeling I might be misunderstanding the functions so I would be very
glad to get some insight in what could be wrong.

The problematic code is the following:

JavaRDD<Body> bodies = lines.map(l -> {Body b = new Body(); b.parse(l);} );

JavaPairRDD<Partition, Iterable<Body>> partitions =
bodies.mapToPair(b ->
b.computePartitions(maxDistance)).groupByKey();

Partition and Body are defined inside the driver class. Body contains the
following definition:

protected Iterable<Tuple2<Partition, Body>> computePartitions (int
maxDistance)

The idea is to reproduce the following schema:

The first map results in: *body1, body2, ... *
The mapToPair should output several of these:* (partition_i, body1),
(partition_i, body2)...*
Which are gathered by key as follows: *(partition_i, (body1, body_n),
(partition_i', (body2, body_n') ...*

Thanks in advance.
Regards,
Silvina


Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point.  Shows how personal use cases color how we interpret products.


On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote:

 On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:

  Impala is *not* built on map/reduce, though it was built to replace
 Hive, which is map/reduce based.  It has its own distributed query engine,
 though it does load data from HDFS, and is part of the hadoop ecosystem.
  Impala really shines when your


 (It was not built to replace Hive. It's purpose-built to make interactive
 use with a BI tool feasible -- single-digit second queries on huge data
 sets. It's very memory hungry. Hive's architecture choices and legacy code
 have been throughput-oriented, and can't really get below minutes at scale,
 but it remains a right choice when you are in fact doing ETL!)



Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
Thank you for the link.  In that link the following is written:

For those familiar with the Spark API, an application corresponds to an
instance of the SparkContext class. An application can be used for a single
batch job, an interactive session with multiple jobs spaced apart, or a
long-lived server continually satisfying requests

So, if I wanted to use a long-lived server continually satisfying
requests and then start a shell that connected to that context, how would
I do that in Yarn? That's the problem I am having right now, I just want
there to be that long lived service that I can utilize.

Thanks!


On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 To add to Ron's answer, this post explains what it means to run Spark
 against a YARN cluster, the difference between yarn-client and yarn-cluster
 mode, and the reason spark-shell only works in yarn-client mode.

 http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

 -Sandy


 On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote:

 The idea behind YARN is that you can run different application types like
 MapReduce, Storm and Spark.

 I would recommend that you build your spark jobs in the main method
 without specifying how you deploy it. Then you can use spark-submit to tell
 Spark how you would want to deploy to it using yarn-cluster as the master.
 The key point here is that once you have YARN setup, the spark client
 connects to it using the $HADOOP_CONF_DIR that contains the resource
 manager address. In particular, this needs to be accessible from the
 classpath of the submitter since it implicitly uses this when it
 instantiates a YarnConfiguration instance. If you want more details, read
 org.apache.spark.deploy.yarn.Client.scala.

 You should be able to download a standalone YARN cluster from any of the
 Hadoop providers like Cloudera or Hortonworks. Once you have that, the
 spark programming guide describes what I mention above in sufficient detail
 for you to proceed.

 Thanks,
 Ron

 Sent from my iPad

  On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote:
 
  I am trying to get my head around using Spark on Yarn from a
 perspective of a cluster. I can start a Spark Shell no issues in Yarn.
 Works easily.  This is done in yarn-client mode and it all works well.
 
  In multiple examples, I see instances where people have setup Spark
 Clusters in Stand Alone mode, and then in the examples they connect to
 this cluster in Stand Alone mode. This is done often times using the
 spark:// string for connection.  Cool. s
  But what I don't understand is how do I setup a Yarn instance that I
 can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and
 it gave me an error, telling me to use yarn-client.  I see information on
 using spark-class or spark-submit.  But what I'd really like is a instance
 I can connect a spark-shell too, and have the instance stay up. I'd like to
 be able run other things on that instance etc. Is that possible with Yarn?
 I know there may be long running job challenges with Yarn, but I am just
 testing, I am just curious if I am looking at something completely bonkers
 here, or just missing something simple.
 
  Thanks!
 
 





Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So basically, I have Spark on Yarn running (spark shell). How do I connect
to it with another tool I am trying to test, using the spark://IP:7077 URL
it's expecting? If that won't work with spark shell, or yarn-client mode,
how do I set up Spark on Yarn to be able to handle that?

Thanks!




On Wed, Jul 9, 2014 at 12:41 PM, John Omernik j...@omernik.com wrote:

 Thank you for the link.  In that link the following is written:

 For those familiar with the Spark API, an application corresponds to an
 instance of the SparkContext class. An application can be used for a
 single batch job, an interactive session with multiple jobs spaced apart,
 or a long-lived server continually satisfying requests

 So, if I wanted to use a long-lived server continually satisfying
 requests and then start a shell that connected to that context, how would
 I do that in Yarn? That's the problem I am having right now, I just want
 there to be that long lived service that I can utilize.

 Thanks!


 On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 To add to Ron's answer, this post explains what it means to run Spark
 against a YARN cluster, the difference between yarn-client and yarn-cluster
 mode, and the reason spark-shell only works in yarn-client mode.

 http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

 -Sandy


 On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com
 wrote:

 The idea behind YARN is that you can run different application types
 like MapReduce, Storm and Spark.

 I would recommend that you build your spark jobs in the main method
 without specifying how you deploy it. Then you can use spark-submit to tell
 Spark how you would want to deploy to it using yarn-cluster as the master.
 The key point here is that once you have YARN setup, the spark client
 connects to it using the $HADOOP_CONF_DIR that contains the resource
 manager address. In particular, this needs to be accessible from the
 classpath of the submitter since it implicitly uses this when it
 instantiates a YarnConfiguration instance. If you want more details, read
 org.apache.spark.deploy.yarn.Client.scala.

 You should be able to download a standalone YARN cluster from any of the
 Hadoop providers like Cloudera or Hortonworks. Once you have that, the
 spark programming guide describes what I mention above in sufficient detail
 for you to proceed.

 Thanks,
 Ron

 Sent from my iPad

  On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote:
 
  I am trying to get my head around using Spark on Yarn from a
 perspective of a cluster. I can start a Spark Shell no issues in Yarn.
 Works easily.  This is done in yarn-client mode and it all works well.
 
  In multiple examples, I see instances where people have setup Spark
 Clusters in Stand Alone mode, and then in the examples they connect to
 this cluster in Stand Alone mode. This is done often times using the
 spark:// string for connection.  Cool. s
  But what I don't understand is how do I setup a Yarn instance that I
 can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and
 it gave me an error, telling me to use yarn-client.  I see information on
 using spark-class or spark-submit.  But what I'd really like is a instance
 I can connect a spark-shell too, and have the instance stay up. I'd like to
 be able run other things on that instance etc. Is that possible with Yarn?
 I know there may be long running job challenges with Yarn, but I am just
 testing, I am just curious if I am looking at something completely bonkers
 here, or just missing something simple.
 
  Thanks!
 
 






Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
Spark doesn't currently offer you anything special to do this.  I.e. if you
want to write a Spark application that fires off jobs on behalf of remote
processes, you would need to implement the communication between those
remote processes and your Spark application code yourself.
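
A bare-bones illustration of that do-it-yourself approach: one driver process holds a
long-lived SparkContext and answers requests from remote callers over a plain socket.
The port, protocol and data here are made up for illustration; Spark itself provides
no such server out of the box:

import java.io.PrintWriter
import java.net.ServerSocket
import org.apache.spark.{SparkConf, SparkContext}

object LongLivedDriver {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("long-lived-driver"))
    val data = sc.parallelize(1 to 1000000).cache()   // shared, cached state reused across requests

    val server = new ServerSocket(9099)               // hypothetical request port
    while (true) {
      val socket = server.accept()                    // each request triggers a Spark job on the live context
      val out = new PrintWriter(socket.getOutputStream, true)
      out.println(data.filter(_ % 2 == 0).count())
      socket.close()
    }
  }
}

Submitted once (for example in yarn-client mode), this keeps the application and its
executors alive between requests.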


On Wed, Jul 9, 2014 at 10:41 AM, John Omernik j...@omernik.com wrote:

 Thank you for the link.  In that link the following is written:

 For those familiar with the Spark API, an application corresponds to an
 instance of the SparkContext class. An application can be used for a
 single batch job, an interactive session with multiple jobs spaced apart,
 or a long-lived server continually satisfying requests

 So, if I wanted to use a long-lived server continually satisfying
 requests and then start a shell that connected to that context, how would
 I do that in Yarn? That's the problem I am having right now, I just want
 there to be that long lived service that I can utilize.

 Thanks!


 On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 To add to Ron's answer, this post explains what it means to run Spark
 against a YARN cluster, the difference between yarn-client and yarn-cluster
 mode, and the reason spark-shell only works in yarn-client mode.

 http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

 -Sandy


 On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com
 wrote:

 The idea behind YARN is that you can run different application types
 like MapReduce, Storm and Spark.

 I would recommend that you build your spark jobs in the main method
 without specifying how you deploy it. Then you can use spark-submit to tell
 Spark how you would want to deploy to it using yarn-cluster as the master.
 The key point here is that once you have YARN setup, the spark client
 connects to it using the $HADOOP_CONF_DIR that contains the resource
 manager address. In particular, this needs to be accessible from the
 classpath of the submitter since it implicitly uses this when it
 instantiates a YarnConfiguration instance. If you want more details, read
 org.apache.spark.deploy.yarn.Client.scala.

 You should be able to download a standalone YARN cluster from any of the
 Hadoop providers like Cloudera or Hortonworks. Once you have that, the
 spark programming guide describes what I mention above in sufficient detail
 for you to proceed.

 Thanks,
 Ron

 Sent from my iPad

  On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote:
 
  I am trying to get my head around using Spark on Yarn from a
 perspective of a cluster. I can start a Spark Shell no issues in Yarn.
 Works easily.  This is done in yarn-client mode and it all works well.
 
  In multiple examples, I see instances where people have setup Spark
 Clusters in Stand Alone mode, and then in the examples they connect to
 this cluster in Stand Alone mode. This is done often times using the
 spark:// string for connection.  Cool. s
  But what I don't understand is how do I setup a Yarn instance that I
 can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and
 it gave me an error, telling me to use yarn-client.  I see information on
 using spark-class or spark-submit.  But what I'd really like is a instance
 I can connect a spark-shell too, and have the instance stay up. I'd like to
 be able run other things on that instance etc. Is that possible with Yarn?
 I know there may be long running job challenges with Yarn, but I am just
 testing, I am just curious if I am looking at something completely bonkers
 here, or just missing something simple.
 
  Thanks!
 
 






Re: Execution stalls in LogisticRegressionWithSGD

2014-07-09 Thread Xiangrui Meng
We have maven-enforcer-plugin defined in the pom. I don't know why it
didn't work for you. Could you try rebuild with maven2 and confirm
that there is no error message? If that is the case, please create a
JIRA for it. Thanks! -Xiangrui

On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:
 Xiangrui,

 Thanks for all the help in resolving this issue. The cause turned out to
 be the build environment rather than runtime configuration. The build process
 had picked up maven2 while building spark. Using binaries that were rebuilt
 using m3, the entire processing went through fine. While I'm aware that the
 build instruction page specifies m3 as the min requirement, declaratively
 preventing accidental m2 usage (e.g. through something like the maven
 enforcer plugin?) might help other developers avoid such issues.

 -Bharath



 On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng men...@gmail.com wrote:

 It seems to me a setup issue. I just tested news20.binary (1355191
 features) on a 2-node EC2 cluster and it worked well. I added one line
 to conf/spark-env.sh:

 export SPARK_JAVA_OPTS="-Dspark.akka.frameSize=20"

 and launched spark-shell with --driver-memory 20g. Could you re-try
 with an EC2 setup? If it still doesn't work, please attach all your
 code and logs.

 Best,
 Xiangrui

 On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar reachb...@gmail.com
 wrote:
  Hi Xiangrui,
 
  1) Yes, I used the same build (compiled locally from source) to the host
  that has (master, slave1) and the second host with slave2.
 
  2) The execution was successful when run in local mode with reduced
  number
  of partitions. Does this imply issues communicating/coordinating across
  processes (i.e. driver, master and workers)?
 
  Thanks,
  Bharath
 
 
 
  On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng men...@gmail.com wrote:
 
  Hi Bharath,
 
  1) Did you sync the spark jar and conf to the worker nodes after build?
  2) Since the dataset is not large, could you try local mode first
  using `spark-submit --driver-memory 12g --master local[*]`?
  3) Try to use less number of partitions, say 5.
 
  If the problem is still there, please attach the full master/worker log
  files.
 
  Best,
  Xiangrui
 
  On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar
  reachb...@gmail.com
  wrote:
   Xiangrui,
  
   Leaving the frameSize unspecified led to an error message (and
   failure)
   stating that the task size (~11M) was larger. I hence set it to an
   arbitrarily large value (I realize 500 was unrealistic & unnecessary
   in
   this case). I've now set the size to 20M and repeated the runs. The
   earlier
   runs were on an uncached RDD. Caching the RDD (and setting
   spark.storage.memoryFraction=0.5) resulted in marginal speed up of
   execution, but the end result remained the same. The cached RDD size
   is
   as
   follows:
  
    RDD Name: 1084
    Storage Level: Memory Deserialized 1x Replicated
    Cached Partitions: 80
    Fraction Cached: 100%
    Size in Memory: 165.9 MB
    Size in Tachyon: 0.0 B
    Size on Disk: 0.0 B
  
  
  
   The corresponding master logs were:
  
   14/07/04 06:29:34 INFO Master: Removing executor
   app-20140704062238-0033/1
   because it is EXITED
   14/07/04 06:29:34 INFO Master: Launching executor
   app-20140704062238-0033/2
   on worker worker-20140630124441-slave1-40182
   14/07/04 06:29:34 INFO Master: Removing executor
   app-20140704062238-0033/0
   because it is EXITED
   14/07/04 06:29:34 INFO Master: Launching executor
   app-20140704062238-0033/3
   on worker worker-20140630102913-slave2-44735
   14/07/04 06:29:37 INFO Master: Removing executor
   app-20140704062238-0033/2
   because it is EXITED
   14/07/04 06:29:37 INFO Master: Launching executor
   app-20140704062238-0033/4
   on worker worker-20140630124441-slave1-40182
   14/07/04 06:29:37 INFO Master: Removing executor
   app-20140704062238-0033/3
   because it is EXITED
   14/07/04 06:29:37 INFO Master: Launching executor
   app-20140704062238-0033/5
   on worker worker-20140630102913-slave2-44735
   14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
   disassociated, removing it.
   14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
   14/07/04 06:29:39 INFO LocalActorRef: Message
   [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
   from
   Actor[akka://sparkMaster/deadLetters] to
  
  
   Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
   was not delivered. [39] dead letters encountered. This logging can be
   turned
   off or adjusted with configuration settings 'akka.log-dead-letters'
   and
   'akka.log-dead-letters-during-shutdown'.
   14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
   disassociated, removing it.
   14/07/04 06:29:39 INFO Master: 

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So how do I do the long-lived server continually satisfying requests in
the Cloudera application? I am very confused by that at this point.


On Wed, Jul 9, 2014 at 12:49 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Spark doesn't currently offer you anything special to do this.  I.e. if
 you want to write a Spark application that fires off jobs on behalf of
 remote processes, you would need to implement the communication between
 those remote processes and your Spark application code yourself.


 On Wed, Jul 9, 2014 at 10:41 AM, John Omernik j...@omernik.com wrote:

 Thank you for the link.  In that link the following is written:

 For those familiar with the Spark API, an application corresponds to an
 instance of the SparkContext class. An application can be used for a
 single batch job, an interactive session with multiple jobs spaced apart,
 or a long-lived server continually satisfying requests

 So, if I wanted to use a long-lived server continually satisfying
 requests and then start a shell that connected to that context, how would
 I do that in Yarn? That's the problem I am having right now, I just want
 there to be that long lived service that I can utilize.

 Thanks!


 On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 To add to Ron's answer, this post explains what it means to run Spark
 against a YARN cluster, the difference between yarn-client and yarn-cluster
 mode, and the reason spark-shell only works in yarn-client mode.

 http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

 -Sandy


 On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com
 wrote:

 The idea behind YARN is that you can run different application types
 like MapReduce, Storm and Spark.

 I would recommend that you build your spark jobs in the main method
 without specifying how you deploy it. Then you can use spark-submit to tell
 Spark how you would want to deploy to it using yarn-cluster as the master.
 The key point here is that once you have YARN setup, the spark client
 connects to it using the $HADOOP_CONF_DIR that contains the resource
 manager address. In particular, this needs to be accessible from the
 classpath of the submitter since it implicitly uses this when it
 instantiates a YarnConfiguration instance. If you want more details, read
 org.apache.spark.deploy.yarn.Client.scala.

 You should be able to download a standalone YARN cluster from any of
 the Hadoop providers like Cloudera or Hortonworks. Once you have that, the
 spark programming guide describes what I mention above in sufficient detail
 for you to proceed.

 Thanks,
 Ron

 Sent from my iPad

  On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote:
 
  I am trying to get my head around using Spark on Yarn from a
 perspective of a cluster. I can start a Spark Shell no issues in Yarn.
 Works easily.  This is done in yarn-client mode and it all works well.
 
  In multiple examples, I see instances where people have setup Spark
 Clusters in Stand Alone mode, and then in the examples they connect to
 this cluster in Stand Alone mode. This is done often times using the
 spark:// string for connection.  Cool. s
  But what I don't understand is how do I setup a Yarn instance that I
 can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and
 it gave me an error, telling me to use yarn-client.  I see information on
 using spark-class or spark-submit.  But what I'd really like is a instance
 I can connect a spark-shell too, and have the instance stay up. I'd like to
 be able run other things on that instance etc. Is that possible with Yarn?
 I know there may be long running job challenges with Yarn, but I am just
 testing, I am just curious if I am looking at something completely bonkers
 here, or just missing something simple.
 
  Thanks!
 
 







Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread Michael Armbrust
At first glance that looks like an error with the class shipping in the
spark shell.  (i.e. the line that you type into the spark shell are
compiled into classes and then shipped to the executors where they run).
 Are you able to run other spark examples with closures in the same shell?

Michael


On Wed, Jul 9, 2014 at 4:28 AM, gil...@gmail.com gil...@gmail.com wrote:

 Hello, while trying to run the example below I am getting errors. I have
 built Spark using the following command:

 $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

 Running the example using spark-shell:

 $ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

 scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class Person(name: String, age: Int)
 val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
 people.registerAsTable("people")
 val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

 Error:
 java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
 at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) at
 $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) at
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at
 scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at
 scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at
 org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
 at
 org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
 at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) at
 org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
 org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112) at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) at
 org.apache.spark.scheduler.Task.run(Task.scala:51) at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 --
 View this message in context: Spark SQL - java.lang.NoClassDefFoundError:
 Could not initialize class $line10.$read$
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-java-lang-NoClassDefFoundError-Could-not-initialize-class-line10-read-tp9170.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
I am using the Spark Streaming and have the following two questions:

1. If more than one output operations are put in the same StreamingContext
(basically, I mean, I put all the output operations in the same class), are
they processed one by one as the order they appear in the class? Or they
are actually processes parallely?

2. If one DStream takes longer than the interval time, does a new DStream
wait in the queue until the previous DStream is fully processed? Is there
any parallelism that may process the two DStream at the same time?

Thank you.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


How should I add a jar?

2014-07-09 Thread Nick Chammas
I’m just starting to use the Scala version of Spark’s shell, and I’d like
to add in a jar I believe I need to access Twitter data live, twitter4j
http://twitter4j.org/en/index.html. I’m confused over where and how to
add this jar in.

SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions two
environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has
an addJar method and a jars property, the latter of which does not have an
associated doc
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
.

What’s the difference between all these jar-related things, and what do I
need to do to add this Twitter jar in correctly?

Nick
​




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Tathagata Das
1. Multiple output operations are processed in the order they are defined.
That is because by default only one output operation is processed at a
time. This *can* be parallelized using an undocumented config parameter,
spark.streaming.concurrentJobs, which is by default set to 1.

2. Yes, the output operations (and the spark jobs that are involved with
them) gets queued up.

TD


On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang yanfang...@gmail.com wrote:

 I am using the Spark Streaming and have the following two questions:

 1. If more than one output operations are put in the same StreamingContext
 (basically, I mean, I put all the output operations in the same class), are
 they processed one by one as the order they appear in the class? Or they
 are actually processes parallely?

 2. If one DStream takes longer than the interval time, does a new DStream
 wait in the queue until the previous DStream is fully processed? Is there
 any parallelism that may process the two DStream at the same time?

 Thank you.

 Cheers,

 Fang, Yan
 yanfang...@gmail.com
 +1 (206) 849-4108
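
A small sketch of the setup TD describes above, with two output operations on one
StreamingContext and the (undocumented, use-with-care) concurrency setting. The socket
source, port and one-minute batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("two-output-ops")
  .set("spark.streaming.concurrentJobs", "2")   // default is 1: output operations run one at a time

val ssc = new StreamingContext(conf, Seconds(60))
val lines = ssc.socketTextStream("localhost", 9999)

// Two output operations on the same StreamingContext; with the default setting
// they are executed in the order they are defined, batch after batch.
lines.count().print()
lines.map(_.length).print()

ssc.start()
ssc.awaitTermination()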



Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
Great. Thank you!

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Wed, Jul 9, 2014 at 11:45 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 1. Multiple output operations are processed in the order they are defined.
 That is because by default each one output operation is processed at a
 time. This *can* be parallelized using an undocumented config parameter
 spark.streaming.concurrentJobs which is by default set to 1.

 2. Yes, the output operations (and the spark jobs that are involved with
 them) gets queued up.

 TD


 On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang yanfang...@gmail.com wrote:

 I am using the Spark Streaming and have the following two questions:

 1. If more than one output operations are put in the same
 StreamingContext (basically, I mean, I put all the output operations in the
 same class), are they processed one by one as the order they appear in the
 class? Or they are actually processes parallely?

 2. If one DStream takes longer than the interval time, does a new DStream
 wait in the queue until the previous DStream is fully processed? Is there
 any parallelism that may process the two DStream at the same time?

 Thank you.

 Cheers,

 Fang, Yan
 yanfang...@gmail.com
 +1 (206) 849-4108





Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-09 Thread M Singh
Hi Folks:


I am working on an application which uses spark streaming (version 1.1.0 
snapshot on a standalone cluster) to process text file and save counters in 
cassandra based on fields in each row.  I am testing the application in two 
modes:  

* Process each row and save the counter in cassandra. In this scenario,
after the text file has been consumed, there are no tasks/stages seen in the
spark UI.

* If instead I use reduce by key before saving to cassandra, the spark
UI shows continuous generation of tasks/stages even after processing of the file
has completed.

I believe this is because the reduce by key requires merging of data from 
different partitions.  But I was wondering if anyone has any insights/pointers 
for understanding this difference in behavior and how to avoid generating 
tasks/stages when there is no data (new file) available.

Thanks

Mans

RE: How should I add a jar?

2014-07-09 Thread Sameer Tilak
Hi Nicholas,
I am using Spark 1.0 and I use this method to specify the additional jars. 
First jar is the dependency and the second one is my application. Hope this 
will work for you. 
 ./spark-shell --jars 
/apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar


Date: Wed, 9 Jul 2014 11:44:27 -0700
From: nicholas.cham...@gmail.com
To: u...@spark.incubator.apache.org
Subject: How should I add a jar?

I’m just starting to use the Scala version of Spark’s shell, and I’d like to 
add in a jar I believe I need to access Twitter data live, twitter4j. I’m 
confused over where and how to add this jar in.


SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. 
SparkContext also has an addJar method and a jars property, the latter of which 
does not have an associated doc.


What’s the difference between all these jar-related things, and what do I need 
to do to add this Twitter jar in correctly?
Nick
​







View this message in context: How should I add a jar?

Sent from the Apache Spark User List mailing list archive at Nabble.com.
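
For completeness, the programmatic relatives of the --jars flag shown above are
SparkConf.setJars and SparkContext.addJar. These distribute jars to the executors for
tasks; making the classes visible in the shell or driver itself is what --jars (or
ADD_JARS for the shell) takes care of. A sketch with placeholder paths:

import org.apache.spark.{SparkConf, SparkContext}

// In a compiled application (paths are placeholders):
val conf = new SparkConf()
  .setAppName("twitter-test")
  .setJars(Seq("/path/to/twitter4j-core-3.0.6.jar"))   // shipped to executors when the context starts

val sc = new SparkContext(conf)

// ...or after the context already exists:
sc.addJar("/path/to/twitter4j-stream-3.0.6.jar")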
  

Some question about SQL and streaming

2014-07-09 Thread hsy...@gmail.com
Hi guys,

I'm a new user to spark. I would like to know is there an example of how to
user spark SQL and spark streaming together? My use case is I want to do
some SQL on the input stream from kafka.
Thanks!

Best,
Siyuan
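
There is no special glue needed: inside foreachRDD each batch is an ordinary RDD, so
the same Spark 1.0 SQL calls used elsewhere in this digest (registerAsTable / sql)
apply. A rough sketch, with a socket source standing in for the Kafka stream and a
made-up Event schema; it assumes an existing SparkContext named sc:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Event(user: String, action: String)

val sqlContext = new SQLContext(sc)
import sqlContext._

val ssc = new StreamingContext(sc, Seconds(60))
val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for the Kafka input

lines.foreachRDD { rdd =>
  val events = rdd.map(_.split(",")).map(p => Event(p(0), p(1)))
  events.registerAsTable("events")            // implicit RDD[Event] -> SchemaRDD conversion
  sql("SELECT user FROM events WHERE action = 'click'").collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()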


Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Awww ye. That worked! Thank you Sameer.

Is this documented somewhere? I feel there's a slight doc deficiency
here.

Nick


On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote:

 Hi Nicholas,

 I am using Spark 1.0 and I use this method to specify the additional jars.
 First jar is the dependency and the second one is my application. Hope this
 will work for you.

  ./spark-shell --jars
 /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar



 --
 Date: Wed, 9 Jul 2014 11:44:27 -0700
 From: nicholas.cham...@gmail.com
 To: u...@spark.incubator.apache.org
 Subject: How should I add a jar?


 I’m just starting to use the Scala version of Spark’s shell, and I’d like
 to add in a jar I believe I need to access Twitter data live, twitter4j
 http://twitter4j.org/en/index.html. I’m confused over where and how to
 add this jar in.

 SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions
 two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext
 also has an addJar method and a jars property, the latter of which does
 not have an associated doc
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
 .

 What’s the difference between all these jar-related things, and what do I
 need to do to add this Twitter jar in correctly?

 Nick
  ​

 --
 View this message in context: How should I add a jar?
 http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Right, the compile error is a casting issue telling me I cannot assign
a JavaPairRDD<Partition, Body> to a JavaPairRDD<Object, Object>. It happens in the mapToPair()
method.




On 9 July 2014 19:52, Sean Owen so...@cloudera.com wrote:

 You forgot the compile error!


 On Wed, Jul 9, 2014 at 6:14 PM, Silvina Caíno Lores silvi.ca...@gmail.com
  wrote:

 Hi everyone,

  I am new to Spark and I'm having problems to make my code compile. I
 have the feeling I might be misunderstanding the functions so I would be
 very glad to get some insight in what could be wrong.

 The problematic code is the following:

 JavaRDD<Body> bodies = lines.map(l -> {Body b = new Body(); b.parse(l);}
 );

 JavaPairRDD<Partition, Iterable<Body>> partitions =
 bodies.mapToPair(b ->
 b.computePartitions(maxDistance)).groupByKey();

  Partition and Body are defined inside the driver class. Body contains
 the following definition:

 protected Iterable<Tuple2<Partition, Body>> computePartitions (int
 maxDistance)

 The idea is to reproduce the following schema:

 The first map results in: *body1, body2, ... *
 The mapToPair should output several of these:* (partition_i, body1),
 (partition_i, body2)...*
 Which are gathered by key as follows: *(partition_i, (body1,
 body_n), (partition_i', (body2, body_n') ...*

 Thanks in advance.
 Regards,
 Silvina





SPARK_CLASSPATH Warning

2014-07-09 Thread Nick R. Katsipoulakis
Hello,

I have installed Apache Spark v1.0.0 on a machine with a proprietary Hadoop
distribution installed (v2.2.0 without YARN). Because the Hadoop distribution
I am using depends on a list of jars, I made the following changes to
conf/spark-env.sh:

#!/usr/bin/env bash

export HADOOP_CONF_DIR=/path-to-hadoop-conf/hadoop-conf
export SPARK_LOCAL_IP=impl41
export
SPARK_CLASSPATH=/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*
...

Also, to make sure that I have everything working I execute the Spark shell
as follows:

[biadmin@impl41 spark]$ ./bin/spark-shell --jars
/path-to-proprietary-hadoop-lib/lib/*.jar

14/07/09 13:37:28 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:28 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(biadmin)
14/07/09 13:37:28 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:29 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:29 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:44292
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
14/07/09 13:37:36 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to
'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

14/07/09 13:37:36 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
'/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*'
as a work-around.
14/07/09 13:37:36 WARN spark.SparkConf: Setting
'spark.driver.extraClassPath' to
'/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*'
as a work-around.
14/07/09 13:37:36 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:36 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(biadmin)
14/07/09 13:37:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/07/09 13:37:37 INFO Remoting: Starting remoting
14/07/09 13:37:37 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@impl41:46081]
14/07/09 13:37:37 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@impl41:46081]
14/07/09 13:37:37 INFO spark.SparkEnv: Registering MapOutputTracker
14/07/09 13:37:37 INFO spark.SparkEnv: Registering BlockManagerMaster
14/07/09 13:37:37 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140709133737-798b
14/07/09 13:37:37 INFO storage.MemoryStore: MemoryStore started with
capacity 307.2 MB.
14/07/09 13:37:38 INFO network.ConnectionManager: Bound socket to port
16685 with id = ConnectionManagerId(impl41,16685)
14/07/09 13:37:38 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/07/09 13:37:38 INFO storage.BlockManagerInfo: Registering block manager
impl41:16685 with 307.2 MB RAM
14/07/09 13:37:38 INFO storage.BlockManagerMaster: Registered BlockManager
14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:21938
14/07/09 13:37:38 INFO broadcast.HttpBroadcast: Broadcast server started at
http://impl41:21938
14/07/09 13:37:38 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-91e8e040-f2ca-43dd-b574-805033f476c7
14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:52678
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/07/09 13:37:38 INFO ui.SparkUI: Started SparkUI at http://impl41:4040
14/07/09 13:37:39 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/09 13:37:39 INFO spark.SparkContext: Added JAR
file:/opt/ibm/biginsights/IHC/lib/adaptive-mr.jar at
http://impl41:52678/jars/adaptive-mr.jar with timestamp 1404938259526
14/07/09 13:37:39 INFO executor.Executor: Using REPL class URI:
http://impl41:44292
14/07/09 13:37:39 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.

scala

So, my question is the following:

Am I including my libraries correctly? Why do I get the message that the
SPARK_CLASSPATH method is deprecated?

Also, when I execute the following example:

scala> val file = sc.textFile("hdfs://lpsa.dat")
14/07/09 13:41:43 WARN util.SizeEstimator: Failed to 
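
For reference, a sketch of the non-deprecated route the warning above points to,
moving the extra classpath into the configuration instead of SPARK_CLASSPATH (the
placeholder paths are the same ones used in the log):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("classpath-example")
  // the same values the warning sets as a work-around for SPARK_CLASSPATH
  .set("spark.executor.extraClassPath", "/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*")
  .set("spark.driver.extraClassPath", "/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*")

val sc = new SparkContext(conf)

When launching through spark-submit, the --driver-class-path flag mentioned in the
warning covers the driver-side part of this.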

Understanding how to install in HDP

2014-07-09 Thread Abel Coronado Iruegas
Hi everybody

We have a Hortonworks cluster with many nodes and we want to test a deployment
of Spark. What's the recommended path to follow?

I mean, we can compile the sources on the NameNode, but I don't really
understand how to pass the executable jar and configuration to the rest of
the nodes.

Thanks!!

Abel


Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias,

I was using Spark 0.9 before and the master I used was yarn-standalone. In
Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not
sure whether it is the reason why more machines do not provide better
scalability. What is the difference between these two modes in terms of
efficiency? Thanks!


On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know which
 cluster you use, but with Mesos you could check client logs in the web
 interface.) You might want to try something like repartition(N) or
 repartition(N*2) (with N the number of your nodes) after you receive your
 data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried to add more nodes from 300 to
 400. It seems the running time did not get improved.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem of using Spark Streaming to accept input data and
 update a result.

 The input of the data is from Kafka and the output is to report a map
 which is updated by historical data in every minute. My current method is
 to set batch size as 1 minute and use foreachRDD to update this map and
 output the map at the end of the foreachRDD function. However, the current
 issue is the processing cannot be finished within one minute.

 I am thinking of updating the map whenever the new data come instead of
 doing the update when the whoe RDD comes. Is there any idea on how to
 achieve this in a better running time? Thanks!

 Bill







Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Krishna,

Ok, thank you. I just wanted to make sure that this can be done.

Cheers,
Nick


On Wed, Jul 9, 2014 at 3:30 PM, Krishna Sankar ksanka...@gmail.com wrote:

 Nick,
AFAIK, you can compile with yarn=true and still run spark in stand
 alone cluster mode.
 Cheers
 k/


 On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis kat...@cs.pitt.edu
 wrote:

 Hello,

 I am currently learning Apache Spark and I want to see how it integrates
 with an existing Hadoop Cluster.

 My current Hadoop configuration is version 2.2.0 without Yarn. I have
 build Apache Spark (v1.0.0) following the instructions in the README file.
 Only setting the SPARK_HADOOP_VERSION=1.2.1. Also, I export the
 HADOOP_CONF_DIR to point to the configuration directory of Hadoop
 configuration.

 My use-case is the Linear Least Regression MLlib example of Apache Spark
 (link:
 http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression).
 The only difference in the code is that I give the text file to be an HDFS
 file.

 However, I get a Runtime Exception: Error in configuring object.

 So my question is the following:

 Does Spark work with a Hadoop distribution without Yarn?
 If yes, am I doing it right? If no, can I build Spark with
 SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false?

 Thank you,
 Nick





Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in
all the nodes irrespective of Hadoop/YARN.
Cheers
k/


On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote:

 I have a Spark app which runs well on local master.  I'm now ready to
 put it on a cluster.  What needs to be installed on the master? What
 needs to be installed on the workers?

 If the cluster already has Hadoop or YARN or Cloudera, does it still
 need an install of Spark?



Number of executors change during job running

2014-07-09 Thread Bill Jay
Hi all,

I have a Spark streaming job running on yarn. It consumes data from Kafka
and groups the data by a certain field. The data size is 480k lines per
minute and the batch size is 1 minute.

For some batches, the program takes more than 3 minutes to finish
the groupBy operation, which seems slow to me. I allocated 300 workers and
specified 300 as the partition number for groupBy. When I checked the slow
stage *combineByKey at ShuffledDStream.scala:42,* there were sometimes only 2
executors allocated for this stage. However, during other batches, the
executors can number several hundred for the same stage, which means the number
of executors for the same operation changes.

Does anyone know how Spark allocates the number of executors for different
stages and how to increase the efficiency of this task? Thanks!

Bill


CoarseGrainedExecutorBackend: Driver Disassociated‏

2014-07-09 Thread Sameer Tilak



Hi,

This time, instead of manually starting the worker node using

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

I used the start-slaves script on the master node. I also enabled -v (the verbose flag)
in ssh. Here is the output that I see. The log file for the worker node was not
created. I will switch back to the manual process for starting the cluster.

bash-4.1$ ./start-slaves.sh
172.16.48.44: OpenSSH_5.3p1, OpenSSL 1.0.0-fips 29 
Mar 2010172.16.48.44: debug1: Reading configuration data 
/users/userid/.ssh/config172.16.48.44: debug1: Reading configuration data 
/etc/ssh/ssh_config172.16.48.44: debug1: Applying options for *172.16.48.44: 
debug1: Connecting to 172.16.48.44 [172.16.48.44] port 22.172.16.48.44: debug1: 
Connection established.172.16.48.44: debug1: identity file 
/users/p529444/.ssh/identity type -1172.16.48.44: debug1: identity file 
/users/p529444/.ssh/identity-cert type -1172.16.48.44: debug1: identity file 
/users/p529444/.ssh/id_rsa type 1172.16.48.44: debug1: identity file 
/users/p529444/.ssh/id_rsa-cert type -1172.16.48.44: debug1: identity file 
/users/p529444/.ssh/id_dsa type -1172.16.48.44: debug1: identity file 
/users/p529444/.ssh/id_dsa-cert type -1172.16.48.44: debug1: Remote protocol 
version 2.0, remote software version OpenSSH_5.2p1_q17.gM-hpn13v6172.16.48.44: 
debug1: match: OpenSSH_5.2p1_q17.gM-hpn13v6 pat OpenSSH*172.16.48.44: debug1: 
Enabling compatibility mode for protocol 2.0172.16.48.44: debug1: Local version 
string SSH-2.0-OpenSSH_5.3172.16.48.44: debug1: SSH2_MSG_KEXINIT 
sent172.16.48.44: debug1: SSH2_MSG_KEXINIT received172.16.48.44: debug1: kex: 
server-client aes128-ctr hmac-md5 none172.16.48.44: debug1: kex: 
client-server aes128-ctr hmac-md5 none172.16.48.44: debug1: 
SSH2_MSG_KEX_DH_GEX_REQUEST(102410248192) sent172.16.48.44: debug1: expecting 
SSH2_MSG_KEX_DH_GEX_GROUP172.16.48.44: debug1: SSH2_MSG_KEX_DH_GEX_INIT 
sent172.16.48.44: debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY172.16.48.44: 
debug1: Host '172.16.48.44' is known and matches the RSA host key.172.16.48.44: 
debug1: Found key in /users/p529444/.ssh/known_hosts:6172.16.48.44: debug1: 
ssh_rsa_verify: signature correct172.16.48.44: debug1: SSH2_MSG_NEWKEYS 
sent172.16.48.44: debug1: expecting SSH2_MSG_NEWKEYS172.16.48.44: debug1: 
SSH2_MSG_NEWKEYS received172.16.48.44: debug1: SSH2_MSG_SERVICE_REQUEST 
sent172.16.48.44: debug1: SSH2_MSG_SERVICE_ACCEPT received172.16.48.44: 
172.16.48.44: 
This is a private computer system. Access to and use requires172.16.48.44: 
explicit current authorization and is limited to business use.172.16.48.44: All 
users express consent to monitoring by system personnel to172.16.48.44: detect 
improper use of or access to the system, system personnel172.16.48.44: may 
provide evidence of such conduct to law enforcement172.16.48.44: officials 
and/or company management.172.16.48.44: 
172.16.48.44: 
UAM R2 account support: http://ussweb.crdc.kp.org/UAM/172.16.48.44: 
172.16.48.44: 
For password resets, please call the Helpdesk 888-457-4872172.16.48.44: 
172.16.48.44: 
debug1: Authentications that can continue: 
gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive172.16.48.44:
 debug1: Next authentication method: gssapi-keyex172.16.48.44: debug1: No valid 
Key exchange context172.16.48.44: debug1: Next authentication method: 
gssapi-with-mic172.16.48.44: debug1: Unspecified GSS failure.  Minor code may 
provide more information172.16.48.44: Cannot find KDC for requested 
realm172.16.48.44:172.16.48.44: debug1: Unspecified GSS failure.  Minor code 
may provide more information172.16.48.44: Cannot find KDC for requested 
realm172.16.48.44:172.16.48.44: debug1: Unspecified GSS failure.  Minor code 
may provide more information172.16.48.44:172.16.48.44:172.16.48.44: debug1: 
Authentications that can continue: 
gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive172.16.48.44:
 debug1: Next authentication method: publickey172.16.48.44: debug1: Trying 
private key: /users/userid/.ssh/identity172.16.48.44: debug1: Offering public 
key: /users/userid/.ssh/id_rsa172.16.48.44: debug1: Server accepts key: pkalg 
ssh-rsa blen 277172.16.48.44: debug1: read PEM private key done: type 
RSA172.16.48.44: debug1: Authentication succeeded (publickey).172.16.48.44: 
debug1: channel 0: new [client-session]172.16.48.44: debug1: Requesting 
no-more-sessions@openssh.com172.16.48.44: debug1: Entering interactive 
session.172.16.48.44: debug1: Sending environment.172.16.48.44: debug1: Sending 
env LANG = en_US.UTF-8172.16.48.44: debug1: Sending command: cd 
/apps/software/spark-1.0.0-bin-hadoop1/sbin/.. ; 
/apps/software/spark-1.0.0-bin-hadoop1/sbin/start-slave.sh 1 

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-09 Thread vs
The Hortonworks Tech Preview of Spark is for Spark on YARN. It does not
require Spark to be installed on all nodes manually. When you submit the
Spark assembly jar it will have all its dependencies. YARN will instantiate
Spark App Master & Containers based on this jar.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p9246.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


spark1.0 principal component analysis

2014-07-09 Thread fintis
Hi,

Can anyone please shed more light on the PCA implementation in Spark? The
documentation leaves me unsure that I understand the output.
According to the docs, the output is a local matrix with the columns as
principal components, and the columns sorted in descending order of covariance.
This is a bit confusing for me, as I need to compute other statistics like the
standard deviation of the principal components. How do I match the principal
components to the actual features, since there is some sorting? What about
eigenvectors and eigenvalues?

Any help shedding light on the output, how to use it further, and the PCA
implementation in Spark in general is appreciated.

Thank you in earnest



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
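
A sketch of how the output can be used, based on RowMatrix.computePrincipalComponents
in MLlib; the input vectors are toy placeholders. The rows of the returned local matrix
stay in the original feature order, so only the components (the columns) are sorted,
which is how components can still be related back to features. Projecting the data and
summarizing the projected columns gives per-component variance and hence standard
deviation:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy data: each Vector is one observation over three features.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, 1.0, 4.0),
  Vectors.dense(4.0, 3.0, 5.0)))
val mat = new RowMatrix(rows)

// k leading principal components: a local matrix whose row i corresponds to
// original feature i and whose column j is the j-th component.
val pc = mat.computePrincipalComponents(2)

// Project the data onto the components, then summarize the projected columns.
val projected = mat.multiply(pc)
val stats = projected.computeColumnSummaryStatistics()
val stddevs = stats.variance.toArray.map(math.sqrt)   // per-component standard deviation
stddevs.foreach(println)

Up to the sample-variance convention, those projected column variances play the role of
the eigenvalues, and the columns of pc are the corresponding eigenvectors of the
covariance matrix.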


Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Public service announcement:

If you're trying to do some stream processing on Twitter data, you'll need
version 3.0.6 of twitter4j http://twitter4j.org/archive/. That should
work with the Spark Streaming 1.0.0 Twitter library.

The latest version of twitter4j, 4.0.2, appears to have breaking changes in
its API for us.

Nick


On Wed, Jul 9, 2014 at 3:34 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Awww ye. That worked! Thank you Sameer.

 Is this documented somewhere? I feel there there's a slight doc deficiency
 here.

 Nick


 On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote:

 Hi Nicholas,

 I am using Spark 1.0 and I use this method to specify the additional
 jars. First jar is the dependency and the second one is my application.
 Hope this will work for you.

  ./spark-shell --jars
 /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar



 --
 Date: Wed, 9 Jul 2014 11:44:27 -0700
 From: nicholas.cham...@gmail.com
 To: u...@spark.incubator.apache.org
 Subject: How should I add a jar?


 I’m just starting to use the Scala version of Spark’s shell, and I’d like
 to add in a jar I believe I need to access Twitter data live, twitter4j
 http://twitter4j.org/en/index.html. I’m confused over where and how to
 add this jar in.

 SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions
 two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext
 also has an addJar method and a jars property, the latter of which does
 not have an associated doc
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
 .

 What’s the difference between all these jar-related things, and what do I
 need to do to add this Twitter jar in correctly?

 Nick
  ​

 --
 View this message in context: How should I add a jar?
 http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.





Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill,

I haven't worked with Yarn, but I would try adding a repartition() call
after you receive your data from Kafka. I would be surprised if that didn't
help.
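
Roughly what I have in mind (ssc, zkQuorum, groupId, the topic map and the
partition count are placeholders):

  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, groupId, Map("mytopic" -> 1))
  // spread the received blocks over the cluster before the per-batch work
  val repartitioned = kafkaStream.repartition(300)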


On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi Tobias,

 I was using Spark 0.9 before and the master I used was yarn-standalone. In
 Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not
 sure whether it is the reason why more machines do not provide better
 scalability. What is the difference between these two modes in terms of
 efficiency? Thanks!


 On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know which
 cluster you use, but with Mesos you could check client logs in the web
 interface.) You might want to try something like repartition(N) or
 repartition(N*2) (with N the number of your nodes) after you receive your
 data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried to add more nodes from 300 to
 400. It seems the running time did not get improved.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem of using Spark Streaming to accept input data and
 update a result.

 The input of the data is from Kafka and the output is to report a map
 which is updated by historical data in every minute. My current method is
 to set batch size as 1 minute and use foreachRDD to update this map and
 output the map at the end of the foreachRDD function. However, the current
 issue is the processing cannot be finished within one minute.

 I am thinking of updating the map whenever the new data come instead
 of doing the update when the whole RDD comes. Is there any idea on how to
 achieve this in a better running time? Thanks!

 Bill








Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias,

Now I did the repartitioning and ran the program again. I found a bottleneck
in the whole program. In the streaming job, there is a stage marked as
*combineByKey at ShuffledDStream.scala:42* in the Spark UI. This stage is
repeatedly executed. However, during some batches, the number of executors
allocated to this stage is only 2, although I used 300 workers and specified
the partition number as 300. In this case, the program is very slow even
though the data being processed are not big.
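
One thing I plan to try is passing the partition count straight into the *ByKey
operation behind that stage, roughly like this (kafkaStream and extractKey are
placeholders for my actual stream and key logic):

  val counts = kafkaStream
    .repartition(300)
    .map(record => (extractKey(record), 1L))  // extractKey is a placeholder
    .reduceByKey(_ + _, 300)                  // force 300 partitions for the shuffle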

Do you know how to solve this issue?

Thanks!


On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 I haven't worked with Yarn, but I would try adding a repartition() call
 after you receive your data from Kafka. I would be surprised if that didn't
 help.


 On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 I was using Spark 0.9 before and the master I used was yarn-standalone.
 In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am
 not sure whether it is the reason why more machines do not provide better
 scalability. What is the difference between these two modes in terms of
 efficiency? Thanks!


 On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know
 which cluster you use, but with Mesos you could check client logs in the
 web interface.) You might want to try something like repartition(N) or
 repartition(N*2) (with N the number of your nodes) after you receive your
 data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried to add more nodes from 300 to
 400. It seems the running time did not get improved.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem of using Spark Streaming to accept input data and
 update a result.

 The input of the data is from Kafka and the output is to report a map
 which is updated by historical data in every minute. My current method is
 to set batch size as 1 minute and use foreachRDD to update this map and
 output the map at the end of the foreachRDD function. However, the 
 current
 issue is the processing cannot be finished within one minute.

 I am thinking of updating the map whenever the new data come instead
 of doing the update when the whole RDD comes. Is there any idea on how to
 achieve this in a better running time? Thanks!

 Bill









Re: Some question about SQL and streaming

2014-07-09 Thread Tobias Pfeiffer
Siyuan,

I do it like this:

// get data from Kafka
val ssc = new StreamingContext(...)
val kvPairs = KafkaUtils.createStream(...)
// implicit conversion RDD[case class] -> SchemaRDD, needed for registerAsTable()
import sqlc.createSchemaRDD
// we need to wrap the data in a case class for registerAsTable() to succeed
case class StringWrapper(s: String)
val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
val result = lines.transform((rdd, time) => {
  // register this batch's RDD as a table and run the SQL statement on it
  rdd.registerAsTable("data")
  sqlc.sql(query)
})

Don't know if it is the best way, but it works.
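
In case it helps, the way I then consume the result is simply (a sketch):

  result.print()          // dump a few rows of each batch's query result
  ssc.start()
  ssc.awaitTermination()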

Tobias


On Thu, Jul 10, 2014 at 4:21 AM, hsy...@gmail.com hsy...@gmail.com wrote:

 Hi guys,

 I'm a new user to spark. I would like to know is there an example of how
 to user spark SQL and spark streaming together? My use case is I want to do
 some SQL on the input stream from kafka.
 Thanks!

 Best,
 Siyuan



Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill,

good to know you found your bottleneck. Unfortunately, I don't know how to
solve this; until now, I have used Spark only with embarrassingly parallel
operations such as map or filter. I hope someone else might provide more
insight here.

Tobias


On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi Tobias,

 Now I did the repartitioning and ran the program again. I found a bottleneck
 in the whole program. In the streaming job, there is a stage marked as
 *combineByKey at ShuffledDStream.scala:42* in the Spark UI. This stage is
 repeatedly executed. However, during some batches, the number of executors
 allocated to this stage is only 2, although I used 300 workers and specified
 the partition number as 300. In this case, the program is very slow even
 though the data being processed are not big.

 Do you know how to solve this issue?

 Thanks!


 On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 I haven't worked with Yarn, but I would try adding a repartition() call
 after you receive your data from Kafka. I would be surprised if that didn't
 help.


 On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 I was using Spark 0.9 before and the master I used was yarn-standalone.
 In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am
 not sure whether it is the reason why more machines do not provide better
 scalability. What is the difference between these two modes in terms of
 efficiency? Thanks!


 On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 do the additional 100 nodes receive any tasks at all? (I don't know
 which cluster you use, but with Mesos you could check client logs in the
 web interface.) You might want to try something like repartition(N) or
 repartition(N*2) (with N the number of your nodes) after you receive your
 data.

 Tobias


 On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi Tobias,

 Thanks for the suggestion. I have tried to add more nodes from 300 to
 400. It seems the running time did not get improved.


 On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 can't you just add more nodes in order to speed up the processing?

 Tobias


 On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I have a problem of using Spark Streaming to accept input data and
 update a result.

 The input of the data is from Kafka and the output is to report a
 map which is updated by historical data in every minute. My current 
 method
 is to set batch size as 1 minute and use foreachRDD to update this map 
 and
 output the map at the end of the foreachRDD function. However, the 
 current
 issue is the processing cannot be finished within one minute.

 I am thinking of updating the map whenever the new data come instead
 of doing the update when the whole RDD comes. Is there any idea on how to
 achieve this in a better running time? Thanks!

 Bill










Spark Streaming - What does Spark Streaming checkpoint?

2014-07-09 Thread Yan Fang
Hi guys,

I am a little confused by the checkpointing in Spark Streaming. It
checkpoints the intermediate data for the stateful operations, for sure.
Does it also checkpoint the information of the StreamingContext? Because it
seems we can recreate the StreamingContext from the checkpoint in a driver
node failure scenario. When I looked at the checkpoint directory, I did not
find much of a clue. Any help? Thank you very much.
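
From what I have read so far (please correct me if this is wrong), the checkpoint
directory also holds the metadata needed to rebuild the StreamingContext (its
configuration, the DStream graph, and the batches that were queued but not yet
completed), and that is what the recovery pattern below relies on (conf, the
batch interval and the checkpoint path are placeholders):

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")
    // ... set up the DStreams here ...
    ssc
  }

  // on a restart, this rebuilds the StreamingContext from the checkpoint
  // metadata instead of calling createContext()
  val ssc = StreamingContext.getOrCreate("hdfs:///tmp/streaming-checkpoints", createContext _)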

Best,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


Restarting a Streaming Context

2014-07-09 Thread Nick Chammas
So I do this from the Spark shell:

// set things up
// snipped

ssc.start()
// let things happen for a few minutes

ssc.stop(stopSparkContext = false, stopGracefully = true)

Then I want to restart the Streaming Context:

ssc.start() // still in the shell; Spark Context is still alive

Which yields:

org.apache.spark.SparkException: StreamingContext has already been stopped

How come? Is there any way in the interactive shell to restart a Streaming
Context once it is stopped?

Nick
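
My current understanding (happy to be corrected) is that a stopped
StreamingContext can never be started again; the workaround in the shell seems
to be building a fresh one on the SparkContext that was kept alive, roughly:

  val ssc2 = new StreamingContext(sc, Seconds(10))  // batch interval is a placeholder
  // ... re-create the DStreams on ssc2 ...
  ssc2.start()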
​




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Restarting-a-Streaming-Context-tp9256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Purpose of spark-submit?

2014-07-09 Thread Andrew Or
I don't see why using SparkSubmit.scala as your entry point would be any
different, because all that does is invoke the main class of Client.scala
(e.g. for Yarn) after setting up all the class paths and configuration
options. (Though I haven't tried this myself)
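
For standalone mode, the programmatic route discussed below really just sets the
same things spark-submit would otherwise set for you, e.g. (app name, master URL
and jar path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("my-app")
    .setMaster("spark://master:7077")
    .setJars(Seq("/path/to/my-app-assembly.jar"))
    .set("spark.executor.memory", "2g")
  val sc = new SparkContext(conf)

On YARN the submission path does more (staging the jars, launching the
application master), which is where Client.scala comes in.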


2014-07-09 9:40 GMT-07:00 Ron Gonzalez zlgonza...@yahoo.com:

 I am able to use Client.scala or LauncherExecutor.scala as my programmatic
 entry point for Yarn.

 Thanks,
 Ron

 Sent from my iPad

 On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote:

 +1 as well for being able to submit jobs programmatically without using a
 shell script.

 We also experience issues when submitting jobs programmatically without
 using spark-submit. In fact, even in the Hadoop world, I rarely used
 hadoop jar to submit jobs from the shell.



 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com
 wrote:

 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourselves with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in using
  spark-submit versus SparkConf/SparkContext if you are willing to set your
  own config?

  If there are any gaps, +1 on having parity within SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best option for
  us.

  So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com
 wrote:
 
  not sure I understand why unifying how you submit an app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?

  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.

  on spark standalone I can launch a program just fine with just SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be doing a
  few things extra there I am missing... which makes things more difficult
  because I am not sure it's realistic to expect every application that needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:
 
  It fulfills a few different functions. The main one is giving users a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark. So
  a user can bundle an application and then use spark-submit to send it
  to different types of clusters (or using different versions of Spark).
 
  It also unifies the way you bundle and submit an app for Yarn, Mesos,
  etc... this was something that became very fragmented over time before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote:
   What is the purpose of spark-submit? Does it do anything outside of
   the standard val conf = new SparkConf ... val sc = new SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH AVENUE, 11TH FLOOR
  NEW YORK, NY 10001
  O: (917) 525-2466 ext. 105
  F: 646.349.4063
  E: suren.hiraman@velos.io
  W: www.velos.io
 




