Standalone cluster on Windows
Hi, I wanted to set up a standalone cluster on a Windows machine, but unfortunately the spark-master.cmd file is not available. Can someone suggest how to proceed, or is the spark-master.cmd file missing from spark-1.0.0? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Standalone-cluster-on-Windows-tp9135.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark Driver
Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote: Hi all, I have one master and two slave node, I did not set any ip for spark driver because I thought it uses its default ( localhost). In my etc/hosts I got the following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2 127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do something in hosts? or should I set a ip to spark-local-ip? I got the following in my stderr: Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: spark Driver
Can you also paste a little bit more stacktrace? Thanks Best Regards On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote: I have the following in spark-env.sh SPARK_MASTER_IP=master SPARK_MASTER_port=7077 Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com On Wednesday, July 9, 2014 2:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote: Hi all, I have one master and two slave node, I did not set any ip for spark driver because I thought it uses its default ( localhost). In my etc/hosts I got the following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2 127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do something in hosts? or should I set a ip to spark-local-ip? I got the following in my stderr: Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: Spark Installation
Hi Srikrishna, the reason for this issue is that you uploaded the assembly jar to HDFS twice. Pasting your command would allow a better diagnosis. 田毅 === 橘云 Platform Product Line, Big Data Product Department, AsiaInfo-Linkage Technologies (China), Inc. Mobile: 13910177261 Tel: 010-82166322 Fax: 010-82166617 QQ: 20057509 MSN: yi.t...@hotmail.com Address: AsiaInfo-Linkage Building, East Area, Yard 10, Dongbeiwang West Road, Haidian District, Beijing === On Jul 9, 2014, at 3:03 AM, Srikrishna S srikrishna...@gmail.com wrote: Hi All, I tried the make distribution script and it worked well. I was able to compile the spark binary on our CDH5 cluster. Once I compiled Spark, I copied over the binaries in the dist folder to all the other machines in the cluster. However, I run into an issue while submitting a job in yarn-client mode. I get an error message that says the following: Resource file:/opt/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar changed on src filesystem (expected 1404845211000, was 1404845404000) My end goal is to submit a job (that uses MLLib) in our Yarn cluster. Any thoughts anyone? Regards, Krishna On Tue, Jul 8, 2014 at 9:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Srikrishna, The binaries are built with something like mvn package -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 -Dyarn.version=2.3.0-cdh5.0.1 -Sandy On Tue, Jul 8, 2014 at 3:14 AM, 田毅 tia...@asiainfo.com wrote: try this command: make-distribution.sh --hadoop 2.3.0-cdh5.0.0 --with-yarn --with-hive 田毅 === 橘云 Platform Product Line, Big Data Product Department, AsiaInfo-Linkage Technologies (China), Inc. Mobile: 13910177261 Tel: 010-82166322 Fax: 010-82166617 QQ: 20057509 MSN: yi.t...@hotmail.com Address: AsiaInfo-Linkage Building, East Area, Yard 10, Dongbeiwang West Road, Haidian District, Beijing === On Jul 8, 2014, at 11:53 AM, Krishna Sankar ksanka...@gmail.com wrote: Couldn't find any reference to CDH in pom.xml - profiles or the hadoop.version. Am also wondering how the CDH-compatible artifact was compiled. Cheers k/ On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S srikrishna...@gmail.com wrote: Hi All, Does anyone know what the command line arguments to mvn are to generate the pre-built binary for Spark on Hadoop 2-CDH5? I would like to pull in a recent bug fix in spark-master and rebuild the binaries in the exact same way that was used for the one provided on the website. I have tried the following: mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 And it doesn't quite work. Any thoughts anyone?
Re: spark Driver
This is exactly what I got Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. I have posted in stackoverflow and I revived this answer http://stackoverflow.com/questions/24571922/apache-spark-stderr-and-stdout/24594576#24594576 I am not sure whether I need to set a ip address to driver ? do I need a separate machine for driver ? Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com On Wednesday, July 9, 2014 2:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you also paste a little bit more stacktrace? Thanks Best Regards On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote: I have the following in spark-env.sh SPARK_MASTER_IP=master SPARK_MASTER_port=7077 Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com On Wednesday, July 9, 2014 2:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote: Hi all, I have one master and two slave node, I did not set any ip for spark driver because I thought it uses its default ( localhost). In my etc/hosts I got the following : 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2 127.0.0.0 local host and 127.0.1.1 virtualbox . Should I do something in hosts? or should I set a ip to spark-local-ip? I got the following in my stderr: Spark Executor Command: java -cp :: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@master:54477/user/CoarseGrainedScheduler 0 slave2 1 akka.tcp://sparkWorker@slave2:41483/user/Worker app-20140704174955-0002 14/07/04 17:50:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave2:33758] - [akka.tcp://spark@master:54477] disassociated! Shutting down. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
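For reference, a minimal sketch of pinning the master URL and the driver's advertised address from the application itself via SparkConf, rather than relying on defaults; the host names and port here are assumptions mirroring the /etc/hosts entries in this thread, not verified values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical host names matching the thread's /etc/hosts entries.
    val conf = new SparkConf()
      .setAppName("driver-host-example")
      .setMaster("spark://master:7077")       // must match the address the master binds to
      .set("spark.driver.host", "master")     // address the executors use to call back to the driver
    val sc = new SparkContext(conf)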
Need advice to create an objectfile of set of images from Spark
Hi all, I need to run a spark job that needs a set of images as input. I need something that loads these images as an RDD, but I just don't know how to do that. Do some of you have any idea? Cheers, Jao
Re: Requirements for Spark cluster
You can use the spark-ec2/bdutil scripts to set it up on the AWS/GCE cloud quickly. If you want to set it up on your own then these are the things that you will need to do: 1. Make sure you have java (7) installed on all machines. 2. Install and configure spark (add all slave nodes in conf/slaves file) 3. Rsync spark directory to all slaves and start it. If you are already having hadoop, then you might need to get that version of spark which is compiled with that hadoop version (or you can compile it with option SPARK_HADOOP_VERSION) Thanks Best Regards On Wed, Jul 9, 2014 at 6:54 AM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a cluster. What needs to be installed on the master? What needs to be installed on the workers? If the cluster already has Hadoop or YARN or Cloudera, does it still need an install of Spark?
How to clear the list of Completed Appliations in Spark web UI?
Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?
Re: Spark Streaming using File Stream in Java
Try this out:

    JavaStreamingContext ssc = new JavaStreamingContext(...);
    JavaDStream<String> lines = ssc.textFileStream("whatever");
    JavaDStream<String> words = lines.flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
      });
    JavaPairDStream<String, Integer> ones = words.map(
      new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
      });
    JavaPairDStream<String, Integer> counts = ones.reduceByKey(
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 + i2; }
      });

Actually modified from https://spark.apache.org/docs/0.9.1/java-programming-guide.html#example Thanks Best Regards On Wed, Jul 9, 2014 at 6:03 AM, Aravind aravindb...@gmail.com wrote: Hi all, I am trying to run the NetworkWordCount.java file in the streaming examples. The example shows how to read from a network socket. But my use case is that I have a local log file which is a stream and continuously updated (say /Users/.../Desktop/mylog.log). I would like to write the same NetworkWordCount.java using this filestream jssc.fileStream(dataDirectory); Question: 1. How do I write a mapreduce function for the above to measure wordcounts (in java, not scala)? 2. Also, does the streaming application stop if the file is not updating, or does it continuously poll for file updates? I am a new user of Apache Spark Streaming. Kindly help me as I am totally stuck. Thanks in advance. Regards Aravind -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-File-Stream-in-Java-tp9115.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: error when spark access hdfs with Kerberos enable
Hi Cheney, I haven't heard of anybody deploying non-secure YARN on top of secure HDFS. It's conceivable that you might be able to get work, but my guess is that you'd run into some issues. Also, without authentication on in YARN, you could be leaving your HDFS tokens exposed, which others could steal and get to your data. -Sandy On Tue, Jul 8, 2014 at 7:28 PM, Cheney Sun sun.che...@gmail.com wrote: Hi Sandy, We are also going to grep data from a security enabled (with kerberos) HDFS in our Spark application. Per you answer, we have to switch Spark on YARN to achieve this. We plan to deploy a different Hadoop cluster(with YARN) only to run Spark. Is it necessary to deploy YARN with security enabled? Or is it possible to access data within a security HDFS from no-security enabled Spark on YARN? On Wed, Jul 9, 2014 at 4:19 AM, Sandy Ryza sandy.r...@cloudera.com wrote: That's correct. Only Spark on YARN supports Kerberos. -Sandy On Tue, Jul 8, 2014 at 12:04 PM, Marcelo Vanzin van...@cloudera.com wrote: Someone might be able to correct me if I'm wrong, but I don't believe standalone mode supports kerberos. You'd have to use Yarn for that. On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote: Hi all, I encounter a strange issue when using spark 1.0 to access hdfs with Kerberos I just have one spark test node for spark and HADOOP_CONF_DIR is set to the location containing the hdfs configuration files(hdfs-site.xml and core-site.xml) When I use spark-shell with local mode, the access to hdfs is successfully . However, If I use spark-shell which connects to the stand alone cluster (I configured the spark as standalone cluster mode with only one node). The access to the hdfs fails with the following error: “Can't get Master Kerberos principal for use as renewer” Anyone have any ideas on this ? Thanks a lot. Regards, Xiaowei -- Marcelo
Re: Re: Pig 0.13, Spark, Spork
Hi Bertrand, We've updated the document http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0 This is our working Github repo https://github.com/sigmoidanalytics/spork/tree/spork-0.9 Feel free to open issues over here https://github.com/sigmoidanalytics/spork/issues Thanks Best Regards On Tue, Jul 8, 2014 at 2:33 PM, Bertrand Dechoux decho...@gmail.com wrote: @Mayur : I won't fight with the semantic of a fork but at the moment, no Spork does take the standard Pig as dependency. On that, we should agree. As for my use of Pig, I have no limitation. I am however interested to see the rise of a 'no-sql high level non programming language' for Spark. @Zhang : Could you elaborate your reference about Twitter? Bertrand Dechoux On Tue, Jul 8, 2014 at 4:04 AM, 张包峰 pelickzh...@qq.com wrote: Hi guys, previously I checked out the old spork and updated it to Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1, see github project of mine https://github.com/pelick/flare-spork It it also highly experimental, and just directly mapping pig physical operations to spark RDD transformations/actions. It works for simple requests. :) I am also interested on the progress of spork, is it undergoing in Twitter in an un open-source way? -- Thanks Zhang Baofeng Blog http://blog.csdn.net/pelick | Github https://github.com/pelick | Weibo http://weibo.com/pelickzhang | LinkedIn http://www.linkedin.com/pub/zhang-baofeng/70/609/84 -- 原始邮件 -- *发件人:* Mayur Rustagi;mayur.rust...@gmail.com; *发送时间:* 2014年7月7日(星期一) 晚上11:55 *收件人:* user@spark.apache.orguser@spark.apache.org; *主题:* Re: Pig 0.13, Spark, Spork That version is old :). We are not forking pig but cleanly separating out pig execution engine. Let me know if you are willing to give it a go. Also would love to know what features of pig you are using ? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux decho...@gmail.com wrote: I saw a wiki page from your company but with an old version of Spark. http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 I have no reason to use it yet but I am interested in the state of the initiative. What's your point of view (personal and/or professional) about the Pig 0.13 release? Is the pluggable execution engine flexible enough in order to avoid having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D As a (for now) external observer, I am glad to see competition in that space. It can only be good for the community in the end. Bertrand Dechoux On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi, We have fixed many major issues around Spork deploying it with some customers. Would be happy to provide a working version to you to try out. We are looking for more folks to try it out submit bugs. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux decho...@gmail.com wrote: Hi, I was wondering what was the state of the Pig+Spark initiative now that the execution engine of Pig is pluggable? Granted, it was done in order to use Tez but could it be used by Spark? I know about a 'theoretical' project called Spork but I don't know any stable and maintained version of it. Regards Bertrand Dechoux
Re: How to clear the list of Completed Appliations in Spark web UI?
There isn't currently a way to do this, but it will start dropping older applications once more than 200 are stored. On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote: Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?
how to host the drive node
I have one master and two slave nodes, I did not set any ip for spark driver. My question is should I set a ip for spark driver and can I host the driver inside the cluster in master node? if so, how to host it? will it be hosted automatically in that node we submit the application by spark-submit? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-host-the-drive-node-tp9149.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to host spark driver
I have one master and two slave nodes, I did not set any ip for spark driver. My question is should I set a ip for spark driver and can I host the driver inside the cluster in master node? if so, how to host it? will it be hosted automatically in that node we submit the application by spark-submit? Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia H#x2F;P : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: Purpose of spark-submit?
It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
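As a rough illustration of the "configure it yourself" alternative discussed in this thread, here is a sketch of setting the same kinds of values directly on SparkConf instead of passing them to spark-submit; the specific keys, values, and master URL are only examples, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    // Values that spark-submit would normally inject, hard-coded here for illustration.
    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://master:7077")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)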
Re: Why doesn't the driver node do any work?
I have one master and two slave nodes, I did not set any ip for spark driver. My question is should I set a ip for spark driver and can I host the driver inside the cluster in master node? if so, how to host it? will it be hosted automatically in that node we submit the application by spark-submit? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-doesn-t-the-driver-node-do-any-work-tp3909p9153.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Comparative study
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the Hadoop ecosystem. Impala really shines when your (It was not built to replace Hive. It's purpose-built to make interactive use with a BI tool feasible -- single-digit-second queries on huge data sets. It's very memory hungry. Hive's architecture choices and legacy code have been throughput-oriented and can't really get below minutes at scale, but it remains the right choice when you are in fact doing ETL!)
Re: Which is the best way to get a connection to an external database per task in Spark Streaming?
Hi Jerry, it's all clear to me now, I will try with something like Apache DBCP for the connection pool Thanks a lot for your help! 2014-07-09 3:08 GMT+02:00 Shao, Saisai saisai.s...@intel.com: Yes, that would be the Java equivalence to use static class member, but you should carefully program to prevent resource leakage. A good choice is to use third-party DB connection library which supports connection pool, that will alleviate your programming efforts. Thanks Jerry *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com] *Sent:* Tuesday, July 08, 2014 6:54 PM *To:* user@spark.apache.org *Subject:* Re: Which is the best way to get a connection to an external database per task in Spark Streaming? Hi Jerry, thanks for your answer. I'm using Spark Streaming for Java, and I only have rudimentary knowledge about Scala, how could I recreate in Java the lazy creation of a singleton object that you propose for Scala? Maybe a static class member in Java for the connection would be the solution? Thanks again for your help, Best Regards, Juan 2014-07-08 11:44 GMT+02:00 Shao, Saisai saisai.s...@intel.com: I think you can maintain a connection pool or keep the connection as a long-lived object in executor side (like lazily creating a singleton object in object { } in Scala), so your task can get this connection each time executing a task, not creating a new one, that would be good for your scenario, since create a connection is quite expensive for each task. Thanks Jerry *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com] *Sent:* Tuesday, July 08, 2014 5:19 PM *To:* Tobias Pfeiffer *Cc:* user@spark.apache.org *Subject:* Re: Which is the best way to get a connection to an external database per task in Spark Streaming? Hi Tobias, thanks for your help. I understand that with that code we obtain a database connection per partition, but I also suspect that with that code a new database connection is created per each execution of the function used as argument for mapPartitions(). That would be very inefficient because a new object and a new database connection would be created for each batch of the DStream. But my knowledge about the lifecycle of Functions in Spark Streaming is very limited, so maybe I'm wrong, what do you think? Greetings, Juan 2014-07-08 3:30 GMT+02:00 Tobias Pfeiffer t...@preferred.jp: Juan, I am doing something similar, just not insert into SQL database, but issue some RPC call. I think mapPartitions() may be helpful to you. You could do something like dstream.mapPartitions(iter = { val db = new DbConnection() // maybe only do the above if !iter.isEmpty iter.map(item = { db.call(...) // do some cleanup if !iter.hasNext here item }) }).count() // force output Keep in mind though that the whole idea about RDDs is that operations are idempotent and in theory could be run on multiple hosts (to take the result from the fastest server) or multiple times (to deal with failures/timeouts) etc., which is maybe something you want to deal with in your SQL. Tobias On Tue, Jul 8, 2014 at 3:40 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi list, I'm writing a Spark Streaming program that reads from a kafka topic, performs some transformations on the data, and then inserts each record in a database with foreachRDD. I was wondering which is the best way to handle the connection to the database so each worker, or even each task, uses a different connection to the database, and then database inserts/updates would be performed in parallel. 
- I understand that using a final variable in the driver code is not a good idea because then the communication with the database would be performed in the driver code, which leads to a bottleneck, according to http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/ - I think creating a new connection in the call() method of the Function passed to foreachRDD is also a bad idea, because then I wouldn't be reusing the connection to the database for each batch RDD in the DStream - I'm not sure that a broadcast variable with the connection handler is a good idea in case the target database is distributed, because if the same handler is used for all the nodes of the Spark cluster then than could have a negative effect in the data locality of the connection to the database. - From http://apache-spark-user-list.1001560.n3.nabble.com/Database-connection-per-worker-td1280.html I understand that by using an static variable and referencing it in the call() method of the Function passed to foreachRDD we get a different connection per Spark worker, I guess it's because there is a different JVM per worker. But then all the tasks in the same worker would share the same database handler object, am I right? - Another idea is
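A rough sketch of the lazily-initialized, per-JVM connection holder discussed above (the Scala equivalent of a Java static member), combined with foreachPartition so each partition reuses one connection; the JDBC URL, credentials, table, and SQL are hypothetical placeholders, and a pooling library such as DBCP could back the holder instead of a single connection:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical holder: the connection is created at most once per executor JVM,
    // lazily on first use, and reused by subsequent tasks on that executor.
    object ConnectionHolder {
      lazy val conn = java.sql.DriverManager.getConnection(
        "jdbc:mysql://db-host/mydb", "user", "pass")   // placeholder URL and credentials
    }

    def save(dstream: DStream[String]): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val stmt = ConnectionHolder.conn.createStatement()
          records.foreach { r =>
            stmt.executeUpdate("INSERT INTO events VALUES ('" + r + "')") // placeholder SQL
          }
          stmt.close()
        }
      }
    }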
Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?
Thanks for your input, Koert and DB. Rebuilding with 9.x didn't seem to work. For now we've downgraded dropwizard to 0.6.2 which uses a compatible version of jetty. Not optimal, but it works for now. On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai dbt...@dbtsai.com wrote: We're doing similar thing to lunch spark job in tomcat, and I opened a JIRA for this. There are couple technical discussions there. https://issues.apache.org/jira/browse/SPARK-2100 In this end, we realized that spark uses jetty not only for Spark WebUI, but also for distributing the jars and tasks, so it really hard to remove the web dependency in Spark. In the end, we lunch our spark job in yarn-cluster mode, and in the runtime, the only dependency in our web application is spark-yarn which doesn't contain any spark web stuff. PS, upgrading the spark jetty 8.x to 9.x in spark may not be straightforward by just changing the version in spark build script. Jetty 9.x required Java 7 since the servlet api (servlet 3.1) requires Java 7. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Jul 8, 2014 at 8:43 AM, Koert Kuipers ko...@tresata.com wrote: do you control your cluster and spark deployment? if so, you can try to rebuild with jetty 9.x On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Digging a bit more I see that there is yet another jetty instance that is causing the problem, namely the BroadcastManager has one. I guess this one isn't very wise to disable... It might very well be that the WebUI is a problem as well, but I guess the code doesn't get far enough. Any ideas on how to solve this? Spark seems to use jetty 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source of the problem. Any ideas? On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Hi! I am building a web frontend for a Spark app, allowing users to input sql/hql and get results back. When starting a SparkContext from within my server code (using jetty/dropwizard) I get the error java.lang.NoSuchMethodError: org.eclipse.jetty.server.AbstractConnector: method init()V not found when Spark tries to fire up its own jetty server. This does not happen when running the same code without my web server. This is probably fixable somehow(?) but I'd like to disable the webUI as I don't need it, and ideally I would like to access that information programatically instead, allowing me to embed it in my own web application. Is this possible? -- Best regards, Martin Gammelsæter -- Mvh. Martin Gammelsæter 92209139 -- Mvh. Martin Gammelsæter 92209139
Re: issues with ./bin/spark-shell for standalone mode
Hey Mikhail, I think (hope?) the -em and -dm options were never in an official Spark release. They were just in the master branch at some point. Did you use these during a previous Spark release or were you just on master? - Patrick On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov streb...@gmail.com wrote: Thanks Andrew, ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30 --executor-memory 20g --driver-memory 10g works well, just wanted to make sure that I'm not missing anything -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Initial job has not accepted any resources means many things
It seems like the Initial job has not accepted any resources; message shows up for a wide variety of different errors (for example the obvious one where you've requested more memory than is available), but also, for example, in the case where the worker nodes do not have the appropriate code on their class path. Debugging from this error is very hard as errors do not show up in the logs on the workers. Is this a known issue? I'm having issues with getting the code to the workers without using addJar (my code is a fairly static application, and I'd like to avoid using addJar every time the app starts up, and instead manually add the jar to the classpath of every worker), but I can't seem to find out how. -- Best regards, Martin Gammelsæter
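One way to get application code onto the workers without calling addJar on every start is to list the jar when the context is built; a minimal sketch, assuming the jar path and master URL below stand in for the real ones:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("static-app")
      .setMaster("spark://master:7077")                  // assumed master URL
      .setJars(Seq("/opt/myapp/myapp-assembly.jar"))     // hypothetical path to the application jar
    val sc = new SparkContext(conf)
    // The listed jar is shipped once per application, so tasks find the classes on the worker classpath.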
RE: Kryo is slower, and the size saving is minimal
Thank you for your response. Maybe that applies to my case. In my test case, the types of almost all of the data are either primitive types, joda DateTime, or String. But I'm somewhat disappointed with the speed. At least it should not be slower than the default Java serializer... -Original Message- From: wxhsdp [mailto:wxh...@gmail.com] Sent: Wednesday, July 09, 2014 5:47 PM To: u...@spark.incubator.apache.org Subject: Re: Kryo is slower, and the size saving is minimal I'm not familiar with Kryo and my opinion may not be right. In my case, Kryo only saves about 5% of the original size when dealing with primitive types such as Arrays. I'm not sure whether that is the common case. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-is-slower-and-the-size-saving-is-minimal-tp9131p9160.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
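For what it's worth, Kryo usually helps more once the concrete classes are registered; a sketch of a registrator for the kinds of types mentioned above, where the registrator class name and the registered types are assumptions, not the poster's actual setup:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical registrator; register the classes that dominate the data.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[org.joda.time.DateTime])
        kryo.register(classOf[Array[Double]])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")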
Re: TaskContext stageId = 0
Oh well, never mind. The problem is that ResultTask's stageId is immutable and is used to construct the Task superclass. Anyway, my solution now is to use this.id for the rddId and to gather all rddIds using a spark listener on stage completed to clean up for any activity registered for those rdds. I could use TaskContext's hook but I'd have to add some more messaging so I can clear state that may live on a different executor than the one my partition is on, but since I don't know that the executor will succeed, this is not safe. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TaskContext-stageId-0-tp9152p9162.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
How to run a job on all workers?
Is it possible to run a job that assigns work to every worker in the system? My bootleg right now is to have a spark listener hear whenever a block manager is added and to increase a split count by 1. It runs a spark job with that split count and hopes that it will at least run on the newest worker. There's some weirdness with block managers being removed sometimes allowing my count to go negative, so I just keep monotonically increasing my split count. Anyone have a way that doesn't suck? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-a-job-on-all-workers-tp9163.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Error using MLlib-NaiveBayes : Matrices are not aligned
I am using Naive Bayes in MLlib . Below I have printed log of *model.theta*. after training on train data. You can check that it contains 9 features for 2 class classification. print numpy.log(model.theta) [[ 0.31618962 0.16636852 0.07200358 0.05411449 0.08542039 0.17620751 0.03711986 0.07110912 0.02146691] [ 0.26598639 0.23809524 0.06258503 0.01904762 0.05714286 0.18231293 0.08911565 0.03061224 0.05510204]] I am giving the same no. of features for prediction but I am getting error: *matrices not alligned.* *ERROR:* Traceback (most recent call last): File .\naive_bayes_analyser.py, line 192, in module prediction = model.predict(array([float(pos_count),float(neg_count),float(need_count),float(goal_count),float(try_co unt),float(means_count),float(persist_count),float(complete_count),float(fail_count)])) File F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py, line 101, in predict return numpy.argmax(self.pi + dot(x, self.theta)) ValueError: matrices are not aligned Can someone suggest me the possible error/mistake? Thanks in advance -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Filtering data during the read
Hi all, I wondered if you could help me to clarify the following situation: in the classic example

    val file = spark.textFile("hdfs://...")
    val errors = file.filter(line => line.contains("ERROR"))

As I understand it, the data is read into memory first, and after that the filtering is applied. Is there any way to apply the filtering during the read step, and not put all objects into memory? Thank you, Konstantin Kudryavtsev
Re: Filtering data during the read
Hi, Spark does that out of the box for you :) It compresses down the execution steps as much as possible. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi all, I wondered if you could help me to clarify the next situation: in the classic example val file = spark.textFile(hdfs://...) val errors = file.filter(line = line.contains(ERROR)) As I understand, the data is read in memory in first, and after that filtering is applying. Is it any way to apply filtering during the read step? and don't put all objects into memory? Thank you, Konstantin Kudryavtsev
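To make that concrete, a small sketch: both textFile and filter are lazy transformations, and nothing is read until an action runs, so the unfiltered lines are streamed through the filter rather than all being held in memory (the path and pattern are just examples):

    val file = sc.textFile("hdfs://...")                        // lazy: nothing is read yet
    val errors = file.filter(line => line.contains("ERROR"))    // lazy: only records the transformation
    val count = errors.count()                                   // action: lines are read and filtered
                                                                 // partition by partition here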
Docker Scripts
Hi, Regarding the docker scripts, I know I can change the base image easily, but is there any specific reason why the base image is hadoop_1.2.1? Why is this preferred to Hadoop 2 (HDP2, CDH5) distributions? Now that Amazon supports Docker, could this replace the ec2 scripts? Kind regards Dimitri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Docker-Scripts-tp9167.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
FW: memory question
Hi, Does anyone know if it is possible to call the MetadataCleaner on demand? i.e. rather than set spark.cleaner.ttl and have this run periodically, I'd like to run it on demand. The problem with periodic cleaning is that it can remove RDDs that we still require (some calcs are short, others very long). We're using Spark 0.9.0 with the Cloudera distribution. I have a simple test calculation in a loop as follows:

    val test = new TestCalc(sparkContext)
    for (i <- 1 to 10) {
      val (x) = test.evaluate(rdd)
    }

Where TestCalc is defined as:

    class TestCalc(sparkContext: SparkContext) extends Serializable {
      def aplus(a: Double, b: Double): Double = a + b
      def evaluate(rdd: RDD[Double]) = {
        /* do some dummy calc. */
        val x = rdd.groupBy(x => x / 2.0)
        val y = x.fold((0.0, Seq[Double]()))((a, b) => (aplus(a._1, b._1), Seq()))
        val z = y._1
        /* try with/without this... */
        val e: SparkEnv = SparkEnv.getThreadLocal
        e.blockManager.master.removeRdd(x.id, true) // still see memory consumption go up...
        (z)
      }
    }

What I can see on the cluster is that the memory usage on the node executing this continually climbs. I'd expect it to level off and not jump up over 1G... I thought that putting in the 'removeRdd' line might help, but it doesn't seem to make a difference. Regards, Mike
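A hedged aside on the on-demand part of the question: explicitly cached RDDs can be dropped on demand with unpersist, without enabling the periodic TTL cleaner; a small sketch under that assumption (it only covers blocks that were cached in the first place):

    val x = rdd.groupBy(v => v / 2.0).cache()   // cache only if the result is reused
    // ... use x in the computation ...
    x.unpersist()                               // drop the cached blocks on demand, no spark.cleaner.ttl needed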
Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
Hello, While trying to run the example below I am getting errors. I have built Spark using the following command:

    $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

Running the example using spark-shell:

    $ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

    scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    case class Person(name: String, age: Int)
    val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")
    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Error:

    java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
    at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-java-lang-NoClassDefFoundError-Could-not-initialize-class-line10-read-tp9170.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Purpose of spark-submit?
not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
Re: Purpose of spark-submit?
Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: Need advice to create an objectfile of set of images from Spark
The idea is to run a job that uses images as input, so that each worker will process a subset of the images. On Wed, Jul 9, 2014 at 2:30 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: RDD can only keep objects. How do you plan to encode these images so that they can be loaded? Keeping the whole image as a single object in one RDD would perhaps not be super optimized. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 12:17 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I need to run a spark job that needs a set of images as input. I need something that loads these images as an RDD, but I just don't know how to do that. Do some of you have any idea? Cheers, Jao
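A minimal sketch of one way to do this without any built-in binary-file input: parallelize the list of image paths and have each worker read its own subset; the shared directory, partition count, and downstream processing are placeholders, and the files must be reachable from every worker:

    import java.nio.file.{Files, Paths}

    // Hypothetical directory visible to every worker (shared filesystem or identical local copies).
    val imagePaths = new java.io.File("/shared/images").listFiles.map(_.getAbsolutePath).toSeq

    // Each worker reads and decodes only the image files in its own partitions.
    val images = sc.parallelize(imagePaths, 32)
      .map(path => (path, Files.readAllBytes(Paths.get(path))))   // (file name, raw image bytes)

    // Placeholder processing: compute the size of each image.
    val sizes = images.map { case (path, bytes) => (path, bytes.length) }.collect()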
Re: Spark job tracker.
    var sem = 0
    sc.addSparkListener(new SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart) {
        sem += 1
      }
    })

sc = spark context Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 4:34 AM, abhiguruvayya sharath.abhis...@gmail.com wrote: Hello Mayur, How can I implement these methods mentioned below? Do you have any clue on this? Please let me know. public void onJobStart(SparkListenerJobStart arg0) { } @Override public void onStageCompleted(SparkListenerStageCompleted arg0) { } @Override public void onStageSubmitted(SparkListenerStageSubmitted arg0) { } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p9104.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
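Along the same lines, a sketch of a listener that fills in the three callbacks asked about, written in Scala; what is done inside each callback (the println calls) is purely illustrative:

    import org.apache.spark.scheduler._

    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart) {
        println("job started: " + jobStart.jobId)
      }
      override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted) {
        println("stage submitted: " + stageSubmitted.stageInfo.name)
      }
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
        println("stage completed: " + stageCompleted.stageInfo.name)
      }
    })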
Cassandra driver Spark question
Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call myDStream.foreachRDD so that I can collect each RDD in the driver app runtime, and inside I do something like this:

    myDStream.foreachRDD(rdd => {
      var someCol = Seq[MyType]()
      foreach(kv => {
        someCol :+ rdd._2 // I only want the RDD value and not the key
      }
      val collectionRDD = sc.parallelize(someCol) // THIS IS WHY IT FAILS TRYING TO RUN THE WORKER
      collectionRDD.saveToCassandra(...)
    }

I get the NotSerializableException while trying to run the Node (also tried someCol as a shared variable). I believe this happens because the myDStream doesn't exist yet when the code is pushed to the Node, so the parallelize doesn't have any structure to relate to it. Inside this foreachRDD I should only do RDD calls which are only related to other RDDs. I guess this was just a desperate attempt. So I have a question: Using the Cassandra Spark driver - can we only write to Cassandra from an RDD? In my case I only want to write once all the computation is finished, in a single batch on the driver app. Thanks in advance. Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
how to convert JavaDStreamString to JavaRDDString
Hi Team, Could you please help me to resolve the below query? My use case is: I'm using JavaStreamingContext to read text files from a Hadoop HDFS directory: JavaDStream<String> lines_2 = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/"); How do I convert the JavaDStream<String> result to JavaRDD<String>, if we can convert? Then I can use the collect() method on the JavaRDD and process my text file. I'm not able to find a collect method on JavaRDD<String>. Thank you very much in advance. Regards, Rajesh
Re: Cassandra driver Spark question
Is MyType serializable? Everything inside the foreachRDD closure has to be serializable. 2014-07-09 14:24 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com: Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call a myDStream.foreachRDD so that I can collect each RDD in the driver app runtime and inside I do something like this: myDStream.foreachRDD(rdd = { var someCol = Seq[MyType]() foreach(kv ={ someCol :+ rdd._2 //I only want the RDD value and not the key } val collectionRDD = sc.parallelize(someCol) //THIS IS WHY IT FAILS TRYING TO RUN THE WORKER collectionRDD.saveToCassandra(...) } I get the NotSerializableException while trying to run the Node (also tried someCol as shared variable). I believe this happens because the myDStream doesn't exist yet when the code is pushed to the Node so the parallelize doens't have any structure to relate to it. Inside this foreachRDD I should only do RDD calls which are only related to other RDDs. I guess this was just a desperate attempt So I have a question Using the Cassandra Spark driver - Can we only write to Cassandra from an RDD? In my case I only want to write once all the computation is finished in a single batch on the driver app. tnks in advance. Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
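If the goal is simply to persist each batch's values, a hedged sketch with the spark-cassandra-connector; the keyspace, table, and element type are assumptions, and the RDD elements (case class or tuple) must be serializable and map onto the table's columns:

    import com.datastax.spark.connector._

    // Keep only the values and write each batch from inside foreachRDD; the write itself
    // goes through the RDD, which is the connector's supported write path.
    myDStream.map(_._2).foreachRDD { rdd =>
      rdd.saveToCassandra("my_keyspace", "my_table")   // hypothetical keyspace and table
    }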
Re: Purpose of spark-submit?
+1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: controlling the time in spark-streaming
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala Regards, Laeeq On Friday, May 23, 2014 10:33 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Well its hard to use text data as time of input. But if you are adament here's what you would do. Have a Dstream object which works in on a folder using filestream/textstream Then have another process (spark streaming or cron) read through the files you receive push them into the folder in order of time. Mostly your data would be produced at t, you would get it at t + say 5 sec, you can push it in get processed at t + say 10 sec. Then you can timeshift your calculations. If you are okay with broad enough time frame you should be fine. Another way is to use queue processing. QueueJavaRDDInteger rddQueue = new LinkedListJavaRDDInteger(); Create a Dstream to consume from Queue of RDD, keep looking into the folder of files create rdd's from them at a min level push them into thee queue. This would cause you to go through your data atleast twice just provide order guarantees , processing time is still going to vary with how quickly you can process your RDD. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, May 22, 2014 at 9:08 PM, Ian Holsman i...@holsman.com.au wrote: Hi. I'm writing a pilot project, and plan on using spark's streaming app for it. To start with I have a dump of some access logs with their own timestamps, and am using the textFileStream and some old files to test it with. One of the issues I've come across is simulating the windows. I would like use the timestamp from the access logs as the 'system time' instead of the real clock time. I googled a bit and found the 'manual' clock which appears to be used for testing the job scheduler.. but I'm not sure what my next steps should be. I'm guessing I'll need to do something like 1. use the textFileStream to create a 'DStream' 2. have some kind of DStream that runs on top of that that creates the RDDs based on the timestamps Instead of the system time 3. the rest of my mappers. Is this correct? or do I need to create my own 'textFileStream' to initially create the RDDs and modify the system clock inside of it. I'm not too concerned about out-of-order messages, going backwards in time, or being 100% in sync across workers.. as this is more for 'development'/prototyping. Are there better ways of achieving this? I would assume that controlling the windows RDD buckets would be a common use case. TIA Ian -- Ian Holsman i...@holsman.com.au PH: + 61-3-9028 8133 / +1-(425) 998-7083
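A small sketch of that queue-based approach in Scala, condensed from the linked example; the batch contents, batch interval, and timing loop are placeholders for whatever simulates the application's own time:

    import scala.collection.mutable.SynchronizedQueue
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(sc, Seconds(1))
    val rddQueue = new SynchronizedQueue[RDD[Int]]()

    // The stream consumes one queued RDD per batch interval.
    val inputStream = ssc.queueStream(rddQueue)
    inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

    ssc.start()
    // Push batches in whatever order and pacing simulates the log's own timestamps.
    for (i <- 1 to 5) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
      Thread.sleep(1000)
    }
    ssc.stop()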
RDD Cleanup
Hi, I am using spark 1.0.0 with the Ooyala Job Server, for a low-latency query system. Basically a long-running context is created, which makes it possible to run multiple jobs under the same context and hence share data. It was working fine in 0.9.1. However, in the spark 1.0 release, the RDDs created and cached by Job-1 get cleaned up by the BlockManager (I can see log statements saying cleaning up RDD), and so the cached RDDs are not available for Job-2, even though both Job-1 and Job-2 are running under the same spark context. I tried using the spark.cleaner.referenceTracking = false setting, however this causes the issue that unpersisted RDDs are not cleaned up properly and keep occupying Spark's memory. Has anybody faced an issue like this before? If so, any advice would be greatly appreciated. Also, is there any way to mark an RDD as being in use under a context, even though the job using it has finished (so subsequent jobs can use that RDD)? Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
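One workaround sometimes used for this behavior is to keep a strong, driver-side reference to any RDD that later jobs should reuse, so the reference-tracking cleaner does not consider it garbage between jobs; a sketch, where the registry object is an assumption rather than a Job Server feature:

    import org.apache.spark.rdd.RDD

    // Hypothetical driver-side registry that outlives individual jobs in the shared context.
    object SharedRdds {
      private val rdds = scala.collection.mutable.Map[String, RDD[_]]()
      def put(name: String, rdd: RDD[_]): Unit = synchronized { rdds(name) = rdd }
      def get(name: String): Option[RDD[_]] = synchronized { rdds.get(name) }
    }

    // Job 1: cache and register.
    val parsed = sc.textFile("hdfs://.../events").map(_.split("\t")).cache()
    SharedRdds.put("events", parsed)

    // Job 2 (same context, later): look up the still-referenced, still-cached RDD.
    val events = SharedRdds.get("events").get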
Re: how to convert JavaDStream<String> to JavaRDD<String>
Hi, First use foreachRDD and then use collect, as in DStream.foreachRDD(rdd => { rdd.collect.foreach(...) }). Also it's better to use Scala. Less verbose. Regards, Laeeq On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Could you please help me to resolve the below query. My use case is: I'm using JavaStreamingContext to read text files from a Hadoop HDFS directory: JavaDStream<String> lines_2 = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/"); How do I convert the JavaDStream<String> result to JavaRDD<String>, if we can convert? I can use the collect() method on a JavaRDD and process my text file. I'm not able to find the collect method on JavaRDD<String>. Thank you very much in advance. Regards, Rajesh
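A minimal Scala sketch of the foreachRDD pattern suggested above (the DStream name and the per-line processing are placeholders):

  import org.apache.spark.rdd.RDD

  // assumes an existing DStream[String] called `lines`, e.g. from ssc.textFileStream(...)
  lines.foreachRDD { rdd: RDD[String] =>
    // collect() pulls each micro-batch back to the driver; only reasonable for small batches
    rdd.collect().foreach(line => println(line))
  }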
Re: window analysis with Spark and Spark streaming
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala Regards, Laeeq, PhD candidate, KTH, Stockholm. On Sunday, July 6, 2014 10:20 AM, alessandro finamore alessandro.finam...@polito.it wrote: On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List] [hidden email] wrote: Key idea is to simulate your app time as you enter data. So you can connect Spark Streaming to a queue and insert data in it spaced by time. Easier said than done :). I see. I'll try to implement this solution as well so that I can compare it with my current Spark implementation. I'm interested in seeing if this is faster... as I assume it should be :) What are the parallelism issues you are hitting with your static approach? In my current Spark implementation, whenever I need to get the aggregated stats over the window, I'm re-mapping all the current bins to have the same key so that they can be reduced altogether. This means that data needs to be shipped to a single reducer. As a result, adding nodes/cores to the application does not really affect the total time :( On Friday, July 4, 2014, alessandro finamore [hidden email] wrote: Thanks for the replies. What is not completely clear to me is how time is managed. I can create a DStream from a file. But if I set the window property, that will be bound to the application time, right? If I got it right, with a receiver I can control the way DStreams are created. But how can I then apply the windowing already shipped with the framework if this is bound to the application time? I would like to define a window of N files, but the window() function requires a duration as input... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/window-analysis-with-Spark-and-Spark-streaming-tp8806p8824.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Sent from Gmail Mobile -- -- Alessandro Finamore, PhD Politecnico di Torino -- Office: +39 0115644127 Mobile: +39 3280251485 SkypeId: alessandro.finamore --- View this message in context: Re: window analysis with Spark and Spark streaming Sent from the Apache Spark User List mailing list archive at Nabble.com.
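To make the single-reducer bottleneck described above concrete, the re-keying pattern looks roughly like this (a toy sketch; the bin names and counts are placeholders standing in for the real window stats):

  import org.apache.spark.SparkContext._

  // assumes an existing SparkContext `sc`
  val bins = sc.parallelize(Seq(("bin-1", 10L), ("bin-2", 7L), ("bin-3", 3L)))
  val windowTotal = bins
    .map { case (_, count) => ("window", count) }   // collapse everything onto one key...
    .reduceByKey(_ + _)                             // ...so the final merge is funneled through a single reducer
  windowTotal.collect().foreach(println)            // ("window", 20)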
Re: Purpose of spark-submit?
+1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: Cassandra driver Spark question
Hi Luis, Yes it's actually an output of the previous RDD. Have you ever used the Cassandra Spark Driver on the driver app? I believe these limitations go around that - it's designed to save RDDs from the nodes. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: RDD Cleanup
did you explicitly cache the rdd? we cache rdds and share them between jobs just fine within one context in spark 1.0.x. but we do not use the ooyala job server... On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote: Hi, I using spark 1.0.0 , using Ooyala Job Server, for a low latency query system. Basically a long running context is created, which enables to run multiple jobs under the same context, and hence sharing of the data. It was working fine in 0.9.1. However in spark 1.0 release, the RDD's created and cached by a Job-1 gets cleaned up by BlockManager (can see log statements saying cleaning up RDD) and so the cached RDD's are not available for Job-2, though Both Job-1 and Job-2 are running under same spark context. I tried using the spark.cleaner.referenceTracking = false setting, how-ever this causes the issue that unpersisted RDD's are not cleaned up properly, and occupying the Spark's memory.. Had anybody faced issue like this before? If so, any advice would be greatly appreicated. Also is there any way, to mark an RDD as being used under a context, event though the job using that had been finished (so subsequent jobs can use that RDD). Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Cassandra driver Spark question
Yes, I'm using it to count concurrent users from a Kafka stream of events without problems. I'm currently testing it in local mode, but any serialization problem would have already appeared, so I don't expect any serialization issues when I deploy to my cluster. 2014-07-09 15:39 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com: Hi Luis, Yes it's actually an output of the previous RDD. Have you ever used the Cassandra Spark Driver on the driver app? I believe these limitations go around that - it's designed to save RDDs from the nodes. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark Streaming and Storm
Xichen_tju, I recently evaluated Storm for a period of months (using 2U servers with 2.4GHz CPUs and 24GB RAM, 3 servers in total) and was not able to achieve a realistic scale for my business domain needs. Storm is really only a framework, which allows you to put in code to do whatever it is you need for a distributed system…so it’s completely flexible and distributable, but it comes at a price. In Storm, one of the biggest performance hits came down to how the “acks” work within the tuple trees. You can have the framework ack messages between spouts and/or bolts by default, but in the end you most likely want to manage acks yourself, depending on how much reliability your system will need (to replay messages…). All this means is that if you don’t have massive amounts of data that you need to process within a few seconds (which I do), then Storm may work well for you, but your performance will diminish as you add in more and more business rules (unless of course you add in more servers for processing). If you need to ingest at least 1GBps+, then you may want to reevaluate, since your server scale may not mesh with your overall processing needs. I recently started using Spark Streaming with Kafka and have been quite impressed at the performance level that’s being achieved. I particularly like the fact that Spark isn’t just a framework; it provides you with simple tools and API convenience methods. Some of those features are reduceByKey (map/reduce), sliding and aggregate sub time windows, etc. Also, in my environment, I believe it’s going to be a great fit since we use Hadoop already and Spark should fit into that environment well. You should look into both Storm and Spark Streaming, but in the end it just depends on your needs. If you're not looking for streaming aspects, then Spark on Hadoop is a great option, since Spark will cache the dataset in memory for all queries, which will be much faster than running Hive/Pig on top of Hadoop. But I’m assuming you need some sort of streaming system for data flow; if it doesn’t need to be real-time or near real-time, you may want to simply look at Hadoop, which you could always use Spark on top of for real-time queries. Hope this helps… Dan On Jul 8, 2014, at 7:25 PM, Shao, Saisai saisai.s...@intel.com wrote: You may get the performance comparison results from the Spark Streaming paper and meetup ppt, just google it. Actually performance comparison is case by case and relies on your workload design, hardware and software configurations. There is no actual winner across all scenarios. Thanks Jerry From: xichen_tju@126 [mailto:xichen_...@126.com] Sent: Wednesday, July 09, 2014 9:17 AM To: user@spark.apache.org Subject: Spark Streaming and Storm hi all I am a newbie to Spark Streaming, and used Storm before. Have you tested the performance of both of them, and which one is better? xichen_tju@126
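A minimal Scala sketch of the Kafka + Spark Streaming pattern described above (the ZooKeeper quorum, consumer group, topic name, and window sizes are placeholder assumptions, not values from this thread):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("KafkaEventCounts")
  val ssc = new StreamingContext(conf, Seconds(2))

  // receiver-based Kafka stream: (zkQuorum, consumer group, topic -> receiver threads)
  val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", Map("events" -> 1))

  val counts = messages
    .map { case (_, value) => (value, 1L) }                   // key each message payload
    .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))    // 60s sliding window, evaluated every 10s

  counts.print()
  ssc.start()
  ssc.awaitTermination()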
Re: RDD Cleanup
Hi, Yes, I am caching the RDDs by calling the cache method. May I ask how you are sharing RDDs across jobs in the same context? By the RDD name? I tried printing the RDDs of the Spark context, and when referenceTracking is enabled, I get an empty list after the cleanup. Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: RDD Cleanup
we simply hold on to the reference to the rdd after it has been cached. so we have a single Map[String, RDD[X]] for cached RDDs for the application On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote: Hi, Yes . I am caching the RDD's by calling cache method.. May i ask, how you are sharing RDD's across jobs in same context? By the RDD name. I tried printing the RDD's of the Spark context, and when the referenceTracking is enabled, i get empty list after the clean up. Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
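A small sketch of the pattern Koert describes (the map key type and helper names are placeholders), holding strong references to cached RDDs so later jobs in the same context can reuse them:

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD

  // one registry per application/SparkContext; jobs look up cached RDDs by name
  val cachedRdds = mutable.Map.empty[String, RDD[_]]

  // after a job builds an RDD, cache it and keep the reference so it is not garbage collected
  def register(name: String, rdd: RDD[_]): Unit = cachedRdds(name) = rdd.cache()

  // subsequent jobs retrieve the already-cached RDD by name
  def lookup(name: String): Option[RDD[_]] = cachedRdds.get(name)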
Spark on Yarn: Connecting to Existing Instance
I am trying to get my head around using Spark on Yarn from the perspective of a cluster. I can start a Spark Shell with no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have set up Spark clusters in standalone mode, and then in the examples they connect to this cluster in standalone mode. This is often done using the spark:// string for connection. Cool. But what I don't understand is: how do I set up a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is an instance I can connect a spark-shell to, and have the instance stay up. I'd like to be able to run other things on that instance etc. Is that possible with Yarn? I know there may be long-running job challenges with Yarn, but I am just testing, and I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Re: Purpose of spark-submit?
One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? 
Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
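For concreteness, the embedded/programmatic setup being contrasted with spark-submit in this thread is roughly the following sketch (the master URL, jar path, and input path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("EmbeddedApp")
    .setMaster("spark://master-host:7077")        // standalone master; YARN is where this gets harder
    .setJars(Seq("/path/to/app-assembly.jar"))    // ship the application jar to the executors
    .set("spark.executor.memory", "2g")
  val sc = new SparkContext(conf)
  println(sc.textFile("hdfs:///data/input").count())
  sc.stop()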
Re: Purpose of spark-submit?
Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. 
Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: Spark on Yarn: Connecting to Existing Instance
The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
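As a small illustration of the point about $HADOOP_CONF_DIR (a sketch only; the property shown is the standard YARN resource manager address key), YarnConfiguration reads yarn-site.xml from the classpath, which is how the submitter finds the resource manager:

  import org.apache.hadoop.yarn.conf.YarnConfiguration

  val yarnConf = new YarnConfiguration()                 // loads *-site.xml files found on the classpath
  println(yarnConf.get(YarnConfiguration.RM_ADDRESS))    // prints the resource manager address if the config was picked up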
Mechanics of passing functions to Spark?
Greetings, The documentation at http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark says: 'Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method.' First, could someone clarify what is meant by 'sending the object' here? How is the object sent to (presumably) the nodes of the cluster? Is it a one-time operation per node? Why does this sound like (according to the doc) a less preferred option compared to a singleton's function? Would the nodes not require the singleton object anyway? Some clarification would really help. Regards, Seref ps: the expression 'the object that contains that class' sounds a bit unusual; is the intended meaning 'the object that is the instance of that class'?
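For what it's worth, a minimal Scala sketch of the two cases the documentation contrasts (the classes, the input path, and the SparkContext `sc` are hypothetical):

  object MyFunctions {                       // singleton object: only the function itself is referenced
    def clean(s: String): String = s.trim
  }

  class MyHelper extends Serializable {      // class instance: the whole instance travels with the closure
    val suffix = "!"
    def decorate(s: String): String = s + suffix
  }

  val rdd = sc.textFile("hdfs:///some/input")   // assumes an existing SparkContext `sc`
  rdd.map(MyFunctions.clean)                    // passing a singleton object's method
  val helper = new MyHelper
  rdd.map(helper.decorate)                      // `helper` (the containing object) is serialized into the
                                                // closure and shipped with the tasks to the executors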
Re: Purpose of spark-submit?
sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... 
which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH
Re: Spark on Yarn: Connecting to Existing Instance
To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Error with Stream Kafka Kryo
Hi My setup is to use localMode standalone, Sprak 1.0.0 release version, scala 2.10.4 I made a job that receive serialized object from Kafka broker. The objects are serialized using kryo. The code : val sparkConf = new SparkConf().setMaster(local[4]).setAppName(SparkTest) .set(spark.serializer, org.apache.spark.serializer.JavaSerializer) .set(spark.kryo.registrator, com.inneractive.fortknox.kafka.EventDetailRegistrator) val ssc = new StreamingContext(sparkConf, Seconds(20)) ssc.checkpoint(checkpoint) val topicMap = topic.split(,).map((_,partitions)).toMap // Create a Stream using my Decoder EventKryoEncoder val events = KafkaUtils.createStream[String, EventDetails, StringDecoder, EventKryoEncoder] (ssc, kafkaMapParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2) val data = events.map(e = (e.getPublisherId, 1L)) val counter = data.reduceByKey(_ + _) counter.print() ssc.start() ssc.awaitTermination() When I run this I get java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) ~[na:1.7.0_60] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) ~[na:1.7.0_60] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_60] at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) ~[na:1.7.0_60] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) ~[na:1.7.0_60] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) ~[na:1.7.0_60] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) ~[na:1.7.0_60] at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ~[spark-core_2.10-1.0.0.jar:1.0.0] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na] at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.scheduler.Task.run(Task.scala:51) ~[spark-core_2.10-1.0.0.jar:1.0.0] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) ~[spark-core_2.10-1.0.0.jar:1.0.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60] I've check that my decoder is working I can trace that the deserialization is OK thus sprark must get ready to use object My setup work if I use JSON and not Kryo serialized object Thanks for help because I don't what to do next -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-with-Stream-Kafka-Kryo-tp9200.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Purpose of spark-submit?
Sandy, I experienced the similar behavior as Koert just mentioned. I don't understand why there is a difference between using spark-submit and programmatic execution. Maybe there is something else we need to add to the spark conf/spark context in order to launch spark jobs programmatically that are not needed before? On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote: sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? 
for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going
Apache Spark, Hadoop 2.2.0 without Yarn Integration
Hello, I am currently learning Apache Spark and I want to see how it integrates with an existing Hadoop cluster. My current Hadoop configuration is version 2.2.0 without Yarn. I have built Apache Spark (v1.0.0) following the instructions in the README file, only setting SPARK_HADOOP_VERSION=1.2.1. Also, I export HADOOP_CONF_DIR to point to the Hadoop configuration directory. My use-case is the Linear Least Squares MLlib example of Apache Spark (link: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression). The only difference in the code is that the input text file is an HDFS file. However, I get a Runtime Exception: Error in configuring object. So my questions are the following: Does Spark work with a Hadoop distribution without Yarn? If yes, am I doing it right? If no, can I build Spark with SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false? Thank you, Nick
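For reference, a minimal sketch of that MLlib example reading from HDFS (the HDFS path and data layout are placeholder assumptions; the API calls follow the linked guide):

  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
  import org.apache.spark.mllib.linalg.Vectors

  // assumes an existing SparkContext `sc` and a file of "label,feat1 feat2 ..." lines
  val data = sc.textFile("hdfs://namenode:9000/user/nick/lpsa.data")
  val parsed = data.map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }
  val numIterations = 100
  val model = LinearRegressionWithSGD.train(parsed, numIterations)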
Re: Purpose of spark-submit?
Koert, Yeah I had the same problems trying to do programmatic submission of spark jobs to my Yarn cluster. I was ultimately able to resolve it by reviewing the classpath and debugging through all the different things that the Spark Yarn client (Client.scala) did for submitting to Yarn (like env setup, local resources, etc), and I compared it to what spark-submit had done. I have to admit though that it was far from trivial to get it working out of the box, and perhaps some work could be done in that regards. In my case, it had boiled down to the launch environment not having the HADOOP_CONF_DIR set, which prevented the app master from registering itself with the Resource Manager. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 9:25 AM, Jerry Lam chiling...@gmail.com wrote: Sandy, I experienced the similar behavior as Koert just mentioned. I don't understand why there is a difference between using spark-submit and programmatic execution. Maybe there is something else we need to add to the spark conf/spark context in order to launch spark jobs programmatically that are not needed before? On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote: sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. 
On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using
Re: Purpose of spark-submit?
I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: issues with ./bin/spark-shell for standalone mode
Hi Patrick, I used 1.0 branch, but it was not an official release, I just git pulled whatever was there and compiled. Thanks, Mikhail -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Terminal freeze during SVM
By latest branch do you mean Apache Spark 1.0.0? And what do you mean by master? Because I am using v1.0.0 - Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.
According to me, there is a bug in the MLlib Naive Bayes implementation in Spark 0.9.1. Whom should I report this to, or with whom should I discuss it? I can discuss this over a call as well. My Skype ID: rahul.bhijwani Phone no: +91-9945197359 Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Re: Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Make a JIRA with enough detail to reproduce the error ideally: https://issues.apache.org/jira/browse/SPARK and then even more ideally open a PR with a fix: https://github.com/apache/spark On Wed, Jul 9, 2014 at 5:57 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: According to me there is BUG in MLlib Naive Bayes implementation in spark 0.9.1. Whom should I report this to or with whom should I discuss? I can discuss this over call as well. My Skype ID : rahul.bhijwani Phone no: +91-9945197359 Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Re: Terminal freeze during SVM
It means pulling the code from the latest development branch of the git repository. On Jul 9, 2014 9:45 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by master? Because I am using v 1.0.0 - Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Compilation error in Spark 1.0.0
Hi everyone, I am new to Spark and I'm having problems making my code compile. I have the feeling I might be misunderstanding the functions, so I would be very glad to get some insight into what could be wrong. The problematic code is the following: JavaRDD<Body> bodies = lines.map(l -> {Body b = new Body(); b.parse(l);} ); JavaPairRDD<Partition, Iterable<Body>> partitions = bodies.mapToPair(b -> b.computePartitions(maxDistance)).groupByKey(); Partition and Body are defined inside the driver class. Body contains the following definition: protected Iterable<Tuple2<Partition, Body>> computePartitions(int maxDistance) The idea is to reproduce the following schema: The first map results in: body1, body2, ... The mapToPair should output several of these: (partition_i, body1), (partition_i, body2) ... which are gathered by key as follows: (partition_i, (body1, body_n)), (partition_i', (body2, body_n')) ... Thanks in advance. Regards, Silvina
Re: Comparative study
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote: On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the hadoop ecosystem. Impala really shines when your (It was not built to replace Hive. It's purpose-built to make interactive use with a BI tool feasible -- single-digit second queries on huge data sets. It's very memory hungry. Hive's architecture choices and legacy code have been throughput-oriented, and can't really get below minutes at scale, but, remains a right choice when you are in fact doing ETL!)
Re: Spark on Yarn: Connecting to Existing Instance
Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Re: Spark on Yarn: Connecting to Existing Instance
So basically, I have Spark on Yarn running (spark shell) how do I connect to it with another tool I am trying to test using the spark://IP:7077 URL it's expecting? If that won't work with spark shell, or yarn-client mode, how do I setup Spark on Yarn to be able to handle that? Thanks! On Wed, Jul 9, 2014 at 12:41 PM, John Omernik j...@omernik.com wrote: Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. 
Thanks!
Re: Spark on Yarn: Connecting to Existing Instance
Spark doesn't currently offer you anything special to do this. I.e. if you want to write a Spark application that fires off jobs on behalf of remote processes, you would need to implement the communication between those remote processes and your Spark application code yourself. On Wed, Jul 9, 2014 at 10:41 AM, John Omernik j...@omernik.com wrote: Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
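Since Spark itself does not provide this, one way to approximate the long-lived service John is after is an ordinary Spark application whose driver keeps the SparkContext alive and accepts work over a channel you define yourself. A very rough sketch, with all names made up for illustration and error handling omitted:

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object LongLivedDriver {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("long-lived-driver"))
    val server = new ServerSocket(9999)   // listen for simple text requests
    while (true) {
      val socket = server.accept()
      // each request is a single line containing an input path
      val path = Source.fromInputStream(socket.getInputStream).getLines().next()
      val count = sc.textFile(path).count()   // run a job on the shared context
      val out = new PrintWriter(socket.getOutputStream, true)
      out.println(path + " has " + count + " lines")
      socket.close()
    }
  }
}

Submitted once (for example with spark-submit in a YARN mode), this keeps a single application and SparkContext running, and any process that can reach the driver's port can ask it to run jobs; that request/response plumbing is exactly the part Sandy says you have to build yourself.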
Re: Execution stalls in LogisticRegressionWithSGD
We have the maven-enforcer-plugin defined in the pom. I don't know why it didn't work for you. Could you try rebuilding with maven2 and confirm that there is no error message? If that is the case, please create a JIRA for it. Thanks! -Xiangrui On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Xiangrui, Thanks for all the help in resolving this issue. The cause turned out to be the build environment rather than runtime configuration. The build process had picked up maven2 while building Spark. Using binaries that were rebuilt using m3, the entire processing went through fine. While I'm aware that the build instruction page specifies m3 as the minimum requirement, declaratively preventing accidental m2 usage (e.g. through something like the maven enforcer plugin?) might help other developers avoid such issues. -Bharath On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng men...@gmail.com wrote: It seems to me to be a setup issue. I just tested news20.binary (1355191 features) on a 2-node EC2 cluster and it worked well. I added one line to conf/spark-env.sh: export SPARK_JAVA_OPTS="-Dspark.akka.frameSize=20" and launched spark-shell with --driver-memory 20g. Could you re-try with an EC2 setup? If it still doesn't work, please attach all your code and logs. Best, Xiangrui On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Xiangrui, 1) Yes, I used the same build (compiled locally from source) on the host that has (master, slave1) and on the second host with slave2. 2) The execution was successful when run in local mode with a reduced number of partitions. Does this imply issues communicating/coordinating across processes (i.e. driver, master and workers)? Thanks, Bharath On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, 1) Did you sync the spark jar and conf to the worker nodes after the build? 2) Since the dataset is not large, could you try local mode first using `spark-submit --driver-memory 12g --master local[*]`? 3) Try to use a smaller number of partitions, say 5. If the problem is still there, please attach the full master/worker log files. Best, Xiangrui On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: Xiangrui, Leaving the frameSize unspecified led to an error message (and failure) stating that the task size (~11M) was larger than the frame size. I hence set it to an arbitrarily large value (I realize 500 was unrealistic and unnecessary in this case). I've now set the size to 20M and repeated the runs. The earlier runs were on an uncached RDD. Caching the RDD (and setting spark.storage.memoryFraction=0.5) resulted in a marginal speed-up of execution, but the end result remained the same.
The cached RDD size is as follows:

RDD Name | Storage Level | Cached Partitions | Fraction Cached | Size in Memory | Size in Tachyon | Size on Disk
1084 | Memory Deserialized 1x Replicated | 80 | 100% | 165.9 MB | 0.0 B | 0.0 B

The corresponding master logs were:

14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/1 because it is EXITED
14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/2 on worker worker-20140630124441-slave1-40182
14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/0 because it is EXITED
14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/3 on worker worker-20140630102913-slave2-44735
14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/2 because it is EXITED
14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/4 on worker worker-20140630124441-slave1-40182
14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/3 because it is EXITED
14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/5 on worker worker-20140630102913-slave2-44735
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
14/07/04 06:29:39 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260] was not delivered. [39] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
14/07/04 06:29:39 INFO Master:
Re: Spark on Yarn: Connecting to Existing Instance
So how do I do the long-lived server continually satisfying requests in the Cloudera application? I am very confused by that at this point. On Wed, Jul 9, 2014 at 12:49 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark doesn't currently offer you anything special to do this. I.e. if you want to write a Spark application that fires off jobs on behalf of remote processes, you would need to implement the communication between those remote processes and your Spark application code yourself. On Wed, Jul 9, 2014 at 10:41 AM, John Omernik j...@omernik.com wrote: Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests So, if I wanted to use a long-lived server continually satisfying requests and then start a shell that connected to that context, how would I do that in Yarn? That's the problem I am having right now, I just want there to be that long lived service that I can utilize. Thanks! On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ -Sandy On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using yarn-cluster as the master. The key point here is that once you have YARN setup, the spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone mode, and then in the examples they connect to this cluster in Stand Alone mode. This is done often times using the spark:// string for connection. Cool. s But what I don't understand is how do I setup a Yarn instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is a instance I can connect a spark-shell too, and have the instance stay up. I'd like to be able run other things on that instance etc. 
Is that possible with Yarn? I know there may be long running job challenges with Yarn, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
At first glance that looks like an error with the class shipping in the spark shell (i.e. the lines that you type into the spark shell are compiled into classes and then shipped to the executors, where they run). Are you able to run other spark examples with closures in the same shell? Michael On Wed, Jul 9, 2014 at 4:28 AM, gil...@gmail.com gil...@gmail.com wrote: Hello, While trying to run the example below I am getting errors. I have built Spark using the following command: $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

--- Running the example using spark-shell ---

$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

--- error ---

java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

-- View this message in context: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$ http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-java-lang-NoClassDefFoundError-Could-not-initialize-class-line10-read-tp9170.html Sent from the Apache Spark User List mailing list archive
http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Spark Streaming - two questions about the streamingcontext
I am using Spark Streaming and have the following two questions: 1. If more than one output operation is put in the same StreamingContext (basically, I mean I put all the output operations in the same class), are they processed one by one in the order they appear in the class, or are they actually processed in parallel? 2. If one DStream batch takes longer than the interval time, does a new batch wait in the queue until the previous one is fully processed? Is there any parallelism that may process the two batches at the same time? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
How should I add a jar?
I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark Streaming - two questions about the streamingcontext
1. Multiple output operations are processed in the order they are defined. That is because, by default, only one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter, spark.streaming.concurrentJobs, which is by default set to 1. 2. Yes, the output operations (and the Spark jobs that are involved with them) get queued up. TD On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang yanfang...@gmail.com wrote: I am using the Spark Streaming and have the following two questions: 1. If more than one output operations are put in the same StreamingContext (basically, I mean, I put all the output operations in the same class), are they processed one by one as the order they appear in the class? Or they are actually processes parallely? 2. If one DStream takes longer than the interval time, does a new DStream wait in the queue until the previous DStream is fully processed? Is there any parallelism that may process the two DStream at the same time? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
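To make the above concrete, here is a minimal sketch; the batch interval, the socket source, and the host/port are made up for illustration. The two foreachRDD calls are two output operations and will run one after the other for each batch unless spark.streaming.concurrentJobs is raised:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("two-output-ops")
  .set("spark.streaming.concurrentJobs", "2")   // undocumented; defaults to 1

val ssc = new StreamingContext(conf, Seconds(60))
val lines = ssc.socketTextStream("localhost", 9999)

// Output operation 1: runs first for each batch
lines.count().foreachRDD(rdd => println("lines in batch: " + rdd.first()))
// Output operation 2: runs after operation 1 unless concurrentJobs > 1
lines.map(_.length).foreachRDD(rdd => println("total chars: " + rdd.fold(0)(_ + _)))

ssc.start()
ssc.awaitTermination()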
Re: Spark Streaming - two questions about the streamingcontext
Great. Thank you! Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Wed, Jul 9, 2014 at 11:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote: 1. Multiple output operations are processed in the order they are defined. That is because by default each one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter spark.streaming.concurrentJobs which is by default set to 1. 2. Yes, the output operations (and the spark jobs that are involved with them) gets queued up. TD On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang yanfang...@gmail.com wrote: I am using the Spark Streaming and have the following two questions: 1. If more than one output operations are put in the same StreamingContext (basically, I mean, I put all the output operations in the same class), are they processed one by one as the order they appear in the class? Or they are actually processes parallely? 2. If one DStream takes longer than the interval time, does a new DStream wait in the queue until the previous DStream is fully processed? Is there any parallelism that may process the two DStream at the same time? Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
Spark streaming - tasks and stages continue to be generated when using reduce by key
Hi Folks: I am working on an application which uses Spark Streaming (version 1.1.0 snapshot on a standalone cluster) to process text files and save counters in Cassandra based on fields in each row. I am testing the application in two modes:

* Process each row and save the counter in Cassandra. In this scenario, after the text file has been consumed, there are no tasks/stages seen in the Spark UI.

* If instead I use reduce by key before saving to Cassandra, the Spark UI shows continuous generation of tasks/stages even after processing of the file has been completed.

I believe this is because the reduce by key requires merging of data from different partitions. But I was wondering if anyone has any insights/pointers for understanding this difference in behavior and how to avoid generating tasks/stages when there is no data (new file) available. Thanks Mans
RE: How should I add a jar?
Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar Date: Wed, 9 Jul 2014 11:44:27 -0700 From: nicholas.cham...@gmail.com To: u...@spark.incubator.apache.org Subject: How should I add a jar? I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j. I’m confused over where and how to add this jar in. SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc. What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick View this message in context: How should I add a jar? Sent from the Apache Spark User List mailing list archive at Nabble.com.
Some question about SQL and streaming
Hi guys, I'm a new user to spark. I would like to know is there an example of how to user spark SQL and spark streaming together? My use case is I want to do some SQL on the input stream from kafka. Thanks! Best, Siyuan
Re: How should I add a jar?
Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote: Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar -- Date: Wed, 9 Jul 2014 11:44:27 -0700 From: nicholas.cham...@gmail.com To: u...@spark.incubator.apache.org Subject: How should I add a jar? I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick -- View this message in context: How should I add a jar? http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: Compilation error in Spark 1.0.0
Right, the compile error is a casting issue telling me I cannot assign a JavaPairRDD<Partition, Body> to a JavaPairRDD<Object, Object>. It happens in the mapToPair() method. On 9 July 2014 19:52, Sean Owen so...@cloudera.com wrote: You forgot the compile error! On Wed, Jul 9, 2014 at 6:14 PM, Silvina Caíno Lores silvi.ca...@gmail.com wrote: Hi everyone, I am new to Spark and I'm having problems making my code compile. I have the feeling I might be misunderstanding the functions, so I would be very glad to get some insight into what could be wrong. The problematic code is the following:

JavaRDD<Body> bodies = lines.map(l -> { Body b = new Body(); b.parse(l); });
JavaPairRDD<Partition, Iterable<Body>> partitions = bodies.mapToPair(b -> b.computePartitions(maxDistance)).groupByKey();

Partition and Body are defined inside the driver class. Body contains the following definition:

protected Iterable<Tuple2<Partition, Body>> computePartitions(int maxDistance)

The idea is to reproduce the following schema: The first map results in: *body1, body2, ...* The mapToPair should output several of these: *(partition_i, body1), (partition_i, body2)...* Which are gathered by key as follows: *(partition_i, (body1, body_n)), (partition_i', (body2, body_n'))...* Thanks in advance. Regards, Silvina
SPARK_CLASSPATH Warning
Hello, I have installed Apache Spark v1.0.0 in a machine with a proprietary Hadoop Distribution installed (v2.2.0 without yarn). Due to the fact that the Hadoop Distribution that I am using, uses a list of jars , I do the following changes to the conf/spark-env.sh #!/usr/bin/env bash export HADOOP_CONF_DIR=/path-to-hadoop-conf/hadoop-conf export SPARK_LOCAL_IP=impl41 export SPARK_CLASSPATH=/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/* ... Also, to make sure that I have everything working I execute the Spark shell as follows: [biadmin@impl41 spark]$ ./bin/spark-shell --jars /path-to-proprietary-hadoop-lib/lib/*.jar 14/07/09 13:37:28 INFO spark.SecurityManager: Changing view acls to: biadmin 14/07/09 13:37:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(biadmin) 14/07/09 13:37:28 INFO spark.HttpServer: Starting HTTP Server 14/07/09 13:37:29 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:29 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:44292 Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.0.0 /_/ Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0) Type in expressions to have them evaluated. Type :help for more information. 14/07/09 13:37:36 WARN spark.SparkConf: SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*'). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with --driver-class-path to augment the driver classpath - spark.executor.extraClassPath to augment the executor classpath 14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as a work-around. 14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as a work-around. 14/07/09 13:37:36 INFO spark.SecurityManager: Changing view acls to: biadmin 14/07/09 13:37:36 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(biadmin) 14/07/09 13:37:37 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/07/09 13:37:37 INFO Remoting: Starting remoting 14/07/09 13:37:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@impl41:46081] 14/07/09 13:37:37 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@impl41:46081] 14/07/09 13:37:37 INFO spark.SparkEnv: Registering MapOutputTracker 14/07/09 13:37:37 INFO spark.SparkEnv: Registering BlockManagerMaster 14/07/09 13:37:37 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140709133737-798b 14/07/09 13:37:37 INFO storage.MemoryStore: MemoryStore started with capacity 307.2 MB. 
14/07/09 13:37:38 INFO network.ConnectionManager: Bound socket to port 16685 with id = ConnectionManagerId(impl41,16685) 14/07/09 13:37:38 INFO storage.BlockManagerMaster: Trying to register BlockManager 14/07/09 13:37:38 INFO storage.BlockManagerInfo: Registering block manager impl41:16685 with 307.2 MB RAM 14/07/09 13:37:38 INFO storage.BlockManagerMaster: Registered BlockManager 14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server 14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:38 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:21938 14/07/09 13:37:38 INFO broadcast.HttpBroadcast: Broadcast server started at http://impl41:21938 14/07/09 13:37:38 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-91e8e040-f2ca-43dd-b574-805033f476c7 14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server 14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:38 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52678 14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT 14/07/09 13:37:38 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 14/07/09 13:37:38 INFO ui.SparkUI: Started SparkUI at http://impl41:4040 14/07/09 13:37:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/09 13:37:39 INFO spark.SparkContext: Added JAR file:/opt/ibm/biginsights/IHC/lib/adaptive-mr.jar at http://impl41:52678/jars/adaptive-mr.jar with timestamp 1404938259526 14/07/09 13:37:39 INFO executor.Executor: Using REPL class URI: http://impl41:44292 14/07/09 13:37:39 INFO repl.SparkILoop: Created spark context.. Spark context available as sc. scala So, my question is the following: Am I including my libraries correctly? Why do I get the message that the SPARK_CLASSPATH method is deprecated? Also, when I execute the following example: scala val file = sc.textFile(hdfs://lpsa.dat) 14/07/09 13:41:43 WARN util.SizeEstimator: Failed to
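The warning itself points at the replacement mechanism. Assuming Spark 1.0 semantics, roughly the same effect can be had without SPARK_CLASSPATH by passing the driver classpath on the command line and putting the executor classpath in conf/spark-defaults.conf (the paths below are the same placeholders used above):

./bin/spark-shell --driver-class-path "/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*"

# conf/spark-defaults.conf
spark.executor.extraClassPath /path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*

Note that --jars ships specific jar files to the executors and adds them to the application classpath, whereas the extraClassPath entries above are read as local paths on each node.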
Understanding how to install in HDP
Hi everybody, We have a Hortonworks cluster with many nodes, and we want to test a deployment of Spark. What's the recommended path to follow? I mean, we can compile the sources on the Name Node, but I don't really understand how to pass the executable jar and configuration to the rest of the nodes. Thanks!!! Abel
Re: Use Spark Streaming to update result whenever data come
Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration
Krishna, Ok, thank you. I just wanted to make sure that this can be done. Cheers, Nick On Wed, Jul 9, 2014 at 3:30 PM, Krishna Sankar ksanka...@gmail.com wrote: Nick, AFAIK, you can compile with yarn=true and still run spark in stand alone cluster mode. Cheers k/ On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis kat...@cs.pitt.edu wrote: Hello, I am currently learning Apache Spark and I want to see how it integrates with an existing Hadoop Cluster. My current Hadoop configuration is version 2.2.0 without Yarn. I have build Apache Spark (v1.0.0) following the instructions in the README file. Only setting the SPARK_HADOOP_VERSION=1.2.1. Also, I export the HADOOP_CONF_DIR to point to the configuration directory of Hadoop configuration. My use-case is the Linear Least Regression MLlib example of Apache Spark (link: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression). The only difference in the code is that I give the text file to be an HDFS file. However, I get a Runtime Exception: Error in configuring object. So my question is the following: Does Spark work with a Hadoop distribution without Yarn? If yes, am I doing it right? If no, can I build Spark with SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false? Thank you, Nick
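For what it's worth, the build commands being discussed would look roughly like this (mirroring the sbt invocation style used elsewhere in these threads; the exact flags for a given Spark version should be checked against its README):

$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt clean assembly

An assembly built this way can still be used against a standalone Spark cluster; the YARN code path is only exercised when the master is set to a YARN mode, which matches Krishna's point above.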
Re: Requirements for Spark cluster
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers k/ On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a cluster. What needs to be installed on the master? What needs to be installed on the workers? If the cluster already has Hadoop or YARN or Cloudera, does it still need an install of Spark?
Number of executors change during job running
Hi all, I have a Spark streaming job running on YARN. It consumes data from Kafka and groups the data by a certain field. The data size is 480k lines per minute, where the batch size is 1 minute. For some batches, the program sometimes takes more than 3 minutes to finish the groupBy operation, which seems slow to me. I allocated 300 workers and specified 300 as the partition number for groupBy. When I checked the slow stage *combineByKey at ShuffledDStream.scala:42*, there are sometimes only 2 executors allocated for this stage. However, during other batches, the executors can be several hundred for the same stage, which means the number of executors for the same operations changes. Does anyone know how Spark allocates the number of executors for different stages and how to increase the efficiency of the tasks? Thanks! Bill
CoarseGrainedExecutorBackend: Driver Disassociated
Hi, This time instead of manually starting the worker node using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT, I used the start-slaves script on the master node. I also enabled -v (verbose flag) in ssh. Here is the output that I see. The log file for the worker node was not created. I will switch back to the manual process for starting the cluster.

bash-4.1$ ./start-slaves.sh
172.16.48.44: OpenSSH_5.3p1, OpenSSL 1.0.0-fips 29 Mar 2010
172.16.48.44: debug1: Reading configuration data /users/userid/.ssh/config
172.16.48.44: debug1: Reading configuration data /etc/ssh/ssh_config
172.16.48.44: debug1: Applying options for *
172.16.48.44: debug1: Connecting to 172.16.48.44 [172.16.48.44] port 22.
172.16.48.44: debug1: Connection established.
172.16.48.44: debug1: identity file /users/p529444/.ssh/identity type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/identity-cert type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_rsa type 1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_rsa-cert type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_dsa type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_dsa-cert type -1
172.16.48.44: debug1: Remote protocol version 2.0, remote software version OpenSSH_5.2p1_q17.gM-hpn13v6
172.16.48.44: debug1: match: OpenSSH_5.2p1_q17.gM-hpn13v6 pat OpenSSH*
172.16.48.44: debug1: Enabling compatibility mode for protocol 2.0
172.16.48.44: debug1: Local version string SSH-2.0-OpenSSH_5.3
172.16.48.44: debug1: SSH2_MSG_KEXINIT sent
172.16.48.44: debug1: SSH2_MSG_KEXINIT received
172.16.48.44: debug1: kex: server-client aes128-ctr hmac-md5 none
172.16.48.44: debug1: kex: client-server aes128-ctr hmac-md5 none
172.16.48.44: debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
172.16.48.44: debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
172.16.48.44: debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
172.16.48.44: debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
172.16.48.44: debug1: Host '172.16.48.44' is known and matches the RSA host key.
172.16.48.44: debug1: Found key in /users/p529444/.ssh/known_hosts:6
172.16.48.44: debug1: ssh_rsa_verify: signature correct
172.16.48.44: debug1: SSH2_MSG_NEWKEYS sent
172.16.48.44: debug1: expecting SSH2_MSG_NEWKEYS
172.16.48.44: debug1: SSH2_MSG_NEWKEYS received
172.16.48.44: debug1: SSH2_MSG_SERVICE_REQUEST sent
172.16.48.44: debug1: SSH2_MSG_SERVICE_ACCEPT received
172.16.48.44:
172.16.48.44: This is a private computer system. Access to and use requires
172.16.48.44: explicit current authorization and is limited to business use.
172.16.48.44: All users express consent to monitoring by system personnel to
172.16.48.44: detect improper use of or access to the system, system personnel
172.16.48.44: may provide evidence of such conduct to law enforcement
172.16.48.44: officials and/or company management.
172.16.48.44:
172.16.48.44: UAM R2 account support: http://ussweb.crdc.kp.org/UAM/
172.16.48.44:
172.16.48.44: For password resets, please call the Helpdesk 888-457-4872
172.16.48.44:
172.16.48.44: debug1: Authentications that can continue: gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive
172.16.48.44: debug1: Next authentication method: gssapi-keyex
172.16.48.44: debug1: No valid Key exchange context
172.16.48.44: debug1: Next authentication method: gssapi-with-mic
172.16.48.44: debug1: Unspecified GSS failure. Minor code may provide more information
172.16.48.44: Cannot find KDC for requested realm
172.16.48.44:
172.16.48.44: debug1: Unspecified GSS failure. Minor code may provide more information
172.16.48.44: Cannot find KDC for requested realm
172.16.48.44:
172.16.48.44: debug1: Unspecified GSS failure. Minor code may provide more information
172.16.48.44:
172.16.48.44:
172.16.48.44: debug1: Authentications that can continue: gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive
172.16.48.44: debug1: Next authentication method: publickey
172.16.48.44: debug1: Trying private key: /users/userid/.ssh/identity
172.16.48.44: debug1: Offering public key: /users/userid/.ssh/id_rsa
172.16.48.44: debug1: Server accepts key: pkalg ssh-rsa blen 277
172.16.48.44: debug1: read PEM private key done: type RSA
172.16.48.44: debug1: Authentication succeeded (publickey).
172.16.48.44: debug1: channel 0: new [client-session]
172.16.48.44: debug1: Requesting no-more-sessions@openssh.com
172.16.48.44: debug1: Entering interactive session.
172.16.48.44: debug1: Sending environment.
172.16.48.44: debug1: Sending env LANG = en_US.UTF-8
172.16.48.44: debug1: Sending command: cd /apps/software/spark-1.0.0-bin-hadoop1/sbin/.. ; /apps/software/spark-1.0.0-bin-hadoop1/sbin/start-slave.sh 1
Re: Unable to run Spark 1.0 SparkPi on HDP 2.0
The Hortonworks Tech Preview of Spark is for Spark on YARN. It does not require Spark to be installed on all nodes manually. When you submit the Spark assembly jar it will have all its dependencies. YARN will instantiate Spark App Master Containers based on this jar. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p9246.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
spark1.0 principal component analysis
Hi, Can anyone please shed more light on the PCA implementation in Spark? The documentation is a bit lacking, as I am not sure I understand the output. According to the docs, the output is a local matrix whose columns are the principal components, with the columns sorted in descending order of covariance. This is a bit confusing for me, as I need to compute other statistics, like the standard deviation of the principal components. How do I match the principal components to the actual features, since there is some sorting? How about eigenvectors and eigenvalues? Any help shedding light on the output, how to use it further, and the PCA implementation in Spark in general is appreciated. Thank you in earnest -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
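For anyone trying to piece the 1.0 MLlib API together, a small sketch of how these calls fit; the input path and the choice of k = 3 are made up for illustration. The principal components come back as a local matrix with one row per original feature (the rows are not reordered, only the component columns are sorted), so row i of that matrix tells you how feature i contributes to each component:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// each input line: space-separated feature values
val rows = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val mat = new RowMatrix(rows)

// columns of pc are the top 3 principal components, ordered by decreasing variance explained
val pc = mat.computePrincipalComponents(3)

// project the data onto the components and read off per-component statistics;
// the standard deviation of each component is the square root of its variance
val projected = mat.multiply(pc)
val summary = projected.computeColumnSummaryStatistics()
println(summary.variance)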
Re: How should I add a jar?
Public service announcement: If you're trying to do some stream processing on Twitter data, you'll need version 3.0.6 of twitter4j http://twitter4j.org/archive/. That should work with the Spark Streaming 1.0.0 Twitter library. The latest version of twitter4j, 4.0.2, appears to have breaking changes in its API for us. Nick On Wed, Jul 9, 2014 at 3:34 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote: Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar -- Date: Wed, 9 Jul 2014 11:44:27 -0700 From: nicholas.cham...@gmail.com To: u...@spark.incubator.apache.org Subject: How should I add a jar? I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext also has an addJar method and a jars property, the latter of which does not have an associated doc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . What’s the difference between all these jar-related things, and what do I need to do to add this Twitter jar in correctly? Nick -- View this message in context: How should I add a jar? http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
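Putting the two threads together, a spark-shell session for Twitter streaming might be started along these lines; the jar file names are illustrative (they depend on how you obtained the spark-streaming-twitter module and twitter4j 3.0.6), and the OAuth values are placeholders:

./bin/spark-shell --jars spark-streaming-twitter_2.10-1.0.0.jar,twitter4j-core-3.0.6.jar,twitter4j-stream-3.0.6.jar

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

System.setProperty("twitter4j.oauth.consumerKey", "<consumer-key>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumer-secret>")
System.setProperty("twitter4j.oauth.accessToken", "<access-token>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<access-token-secret>")

val ssc = new StreamingContext(sc, Seconds(10))
val tweets = TwitterUtils.createStream(ssc, None)   // None = read OAuth from system properties
tweets.map(_.getText).print()
ssc.start()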
Re: Use Spark Streaming to update result whenever data come
Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Re: Use Spark Streaming to update result whenever data come
Hi Tobias, Now I did the re-partition and ran the program again. I find a bottleneck of the whole program. In the streaming, there is a stage marked as *combineByKey at ShuffledDStream.scala:42 *in spark UI. This stage is repeatedly executed. However, during some batches, the number of executors allocated to this step is only 2 although I used 300 workers and specified the partition number as 300. In this case, the program is very slow although the data that are processed are not big. Do you know how to solve this issue? Thanks! On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Re: Some question about SQL and streaming
Siyuan, I do it like this:

// get data from Kafka
val ssc = new StreamingContext(...)
val kvPairs = KafkaUtils.createStream(...)

// we need to wrap the data in a case class for registerAsTable() to succeed
val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
val result = lines.transform((rdd, time) => {
  // execute statement
  rdd.registerAsTable("data")
  sqlc.sql(query)
})

Don't know if it is the best way, but it works. Tobias On Thu, Jul 10, 2014 at 4:21 AM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys, I'm a new user to spark. I would like to know is there an example of how to user spark SQL and spark streaming together? My use case is I want to do some SQL on the input stream from kafka. Thanks! Best, Siyuan
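A more self-contained version of the same idea, written as a standalone application rather than shell snippets; the Kafka connection details, the StringWrapper case class, and the query are all made up for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// wrapping each line in a case class gives the RDD a schema for registerAsTable()
case class StringWrapper(line: String)

object StreamingSqlExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("streaming-sql-example")
    val ssc = new StreamingContext(conf, Seconds(60))
    val sqlc = new SQLContext(ssc.sparkContext)
    import sqlc._

    // zookeeper quorum, consumer group, and topic map are placeholders
    val kvPairs = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    val lines = kvPairs.map(_._2).map(s => StringWrapper(s))

    val result = lines.transform { rdd =>
      rdd.registerAsTable("data")
      sqlc.sql("SELECT line FROM data WHERE line LIKE '%error%'")
    }
    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}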
Re: Use Spark Streaming to update result whenever data come
Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until know, I have used Spark only with embarassingly parallel operations such as map or filter. I hope someone else might provide more insight here. Tobias On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Now I did the re-partition and ran the program again. I find a bottleneck of the whole program. In the streaming, there is a stage marked as *combineByKey at ShuffledDStream.scala:42 *in spark UI. This stage is repeatedly executed. However, during some batches, the number of executors allocated to this step is only 2 although I used 300 workers and specified the partition number as 300. In this case, the program is very slow although the data that are processed are not big. Do you know how to solve this issue? Thanks! On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in terms of efficiency? Thanks! On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem of using Spark Streaming to accept input data and update a result. The input of the data is from Kafka and the output is to report a map which is updated by historical data in every minute. My current method is to set batch size as 1 minute and use foreachRDD to update this map and output the map at the end of the foreachRDD function. However, the current issue is the processing cannot be finished within one minute. I am thinking of updating the map whenever the new data come instead of doing the update when the whoe RDD comes. Is there any idea on how to achieve this in a better running time? Thanks! Bill
Spark Streaming - What does Spark Streaming checkpoint?
Hi guys, I am a little confused by the checkpointing in Spark Streaming. It checkpoints the intermediate data for the stateful operations, for sure. Does it also checkpoint the information of the StreamingContext? It seems we can recreate the SC from the checkpoint in a driver node failure scenario. When I looked at the checkpoint directory, I did not find much of a clue. Any help? Thank you very much. Best, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108
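For context (general Spark Streaming behavior rather than a reply from this thread): the checkpoint directory holds both metadata checkpoints (the configuration, the DStream graph, and incomplete batches, which is what allows the StreamingContext to be rebuilt after a driver failure) and data checkpoints for stateful operations. A rough sketch of how that is typically used, with the HDFS path as a placeholder:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// when the application first starts: enable checkpointing of metadata and state
val ssc = new StreamingContext(sc, Seconds(60))
ssc.checkpoint("hdfs:///checkpoints/my-app")
// ...define stateful DStream operations and output operations here...
ssc.start()

// after a driver failure, a new driver can rebuild the context from the checkpoint
val recovered = new StreamingContext("hdfs:///checkpoints/my-app")
recovered.start()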
Restarting a Streaming Context
So I do this from the Spark shell:

// set things up
// snipped
ssc.start()
// let things happen for a few minutes
ssc.stop(stopSparkContext = false, stopGracefully = true)

Then I want to restart the Streaming Context:

ssc.start() // still in the shell; Spark Context is still alive

Which yields:

org.apache.spark.SparkException: StreamingContext has already been stopped

How come? Is there any way in the interactive shell to restart a Streaming Context once it is stopped? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Restarting-a-Streaming-Context-tp9256.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
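One detail that matters here (a general property of StreamingContext rather than something stated in a reply to this thread): a StreamingContext that has been stopped cannot be started again. Since the SparkContext was kept alive with stopSparkContext = false, the usual pattern is to build a fresh StreamingContext on top of it and re-declare the streams, roughly:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// sc is the SparkContext that survived ssc.stop(stopSparkContext = false, ...)
val ssc2 = new StreamingContext(sc, Seconds(10))   // batch interval is illustrative
// ...re-create the DStreams and output operations on ssc2 here...
ssc2.start()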
Re: Purpose of spark-submit?
I don't see why using SparkSubmit.scala as your entry point would be any different, because all that does is invoke the main class of Client.scala (e.g. for Yarn) after setting up all the class paths and configuration options. (Though I haven't tried this myself) 2014-07-09 9:40 GMT-07:00 Ron Gonzalez zlgonza...@yahoo.com: I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? 
Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
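For completeness, the programmatic path mentioned above (plain SparkConf/SparkContext on a standalone cluster) looks roughly like this; the master URL, jar name, and memory setting are placeholders. spark-submit layers the YARN/Mesos submission logic, classpath assembly, and dynamic configuration on top of this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")          // standalone master URL
  .setAppName("my-app")
  .setJars(Seq("target/my-app-assembly.jar"))     // shipped to the executors
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)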