Re: Spark-SQL JDBC driver
Thanks Judy, this is exactly what I'm looking for. However, and please forgive me if it's a dumb question: it seems to me that Thrift is the same as the HiveServer2 JDBC driver; does this mean that starting the Thrift server will start Hive on the server as well? On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash judyn...@exchange.microsoft.com wrote: You can use thrift server for this purpose then test it with beeline. See doc: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server *From:* Anas Mosaad [mailto:anas.mos...@incorta.com] *Sent:* Monday, December 8, 2014 11:01 AM *To:* user@spark.apache.org *Subject:* Spark-SQL JDBC driver Hello Everyone, I'm brand new to spark and was wondering if there's a JDBC driver to access spark-SQL directly. I'm running spark in standalone mode and don't have hadoop in this environment. -- *Best Regards/أطيب المنى,* *Anas Mosaad* -- *Best Regards/أطيب المنى,* *Anas Mosaad* *Incorta Inc.* *+20-100-743-4510*
Re: do not assemble the spark example jar
Can this assembly get faster if we do not need Spark SQL or some other component of Spark? For example, if we only need the core of Spark. On Wed, Nov 26, 2014 at 3:37 PM, lihu lihu...@gmail.com wrote: Matei, sorry for my last typo error. And the tip saves about 30s on my computer. On Wed, Nov 26, 2014 at 3:34 PM, lihu lihu...@gmail.com wrote: Mater, thank you very much! After taking your advice, the assembly time went from about 20min down to 6min on my computer. That's a very big improvement. On Wed, Nov 26, 2014 at 12:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW as another tip, it helps to keep the SBT console open as you make source changes (by just running sbt/sbt with no args). It's a lot faster the second time it builds something. Matei On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can do sbt/sbt assembly/assembly to assemble only the main package. Matei On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote: Hi, The Spark assembly is time-consuming. If I only need spark-assembly-1.1.0-hadoop2.3.0.jar and do not need spark-examples-1.1.0-hadoop2.3.0.jar, how can I configure Spark to avoid assembling the examples jar? I know the *export SPARK_PREPEND_CLASSES=true* method can reduce the assembly time, but I do not develop locally. Any advice? -- *Best Wishes!* -- *Best Wishes!* *Li Hu(李浒) | Graduate Student* *Institute for Interdisciplinary Information Sciences(IIIS http://iiis.tsinghua.edu.cn/)* *Tsinghua University, China* *Email: lihu...@gmail.com lihu...@gmail.com* *Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ http://iiis.tsinghua.edu.cn/zh/lihu/* -- *Best Wishes!* *Li Hu(李浒) | Graduate Student* *Institute for Interdisciplinary Information Sciences(IIIS http://iiis.tsinghua.edu.cn/)* *Tsinghua University, China* *Email: lihu...@gmail.com lihu...@gmail.com* *Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ http://iiis.tsinghua.edu.cn/zh/lihu/* -- *Best Wishes!* *Li Hu(李浒) | Graduate Student* *Institute for Interdisciplinary Information Sciences(IIIS http://iiis.tsinghua.edu.cn/)* *Tsinghua University, China* *Email: lihu...@gmail.com lihu...@gmail.com* *Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/ http://iiis.tsinghua.edu.cn/zh/lihu/*
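For reference, the two tips above as commands; a minimal sketch assuming they are run from the top of a Spark source checkout:

    # Assemble only the main Spark assembly, skipping the examples assembly
    sbt/sbt assembly/assembly

    # Or keep an interactive SBT session open so repeated builds stay incremental
    sbt/sbt
    > assembly/assembly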
Re: Spark-SQL JDBC driver
Essentially, the Spark SQL JDBC Thrift server is just a Spark port of HiveServer2. You don't need to run Hive, but you do need a working Metastore. On 12/9/14 3:59 PM, Anas Mosaad wrote: Thanks Judy, this is exactly what I'm looking for. However, and plz forgive me if it's a dump question is: It seems to me that thrift is the same as hive2 JDBC driver, does this mean that starting thrift will start hive as well on the server? On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash judyn...@exchange.microsoft.com mailto:judyn...@exchange.microsoft.com wrote: You can use thrift server for this purpose then test it with beeline. See doc: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server *From:*Anas Mosaad [mailto:anas.mos...@incorta.com mailto:anas.mos...@incorta.com] *Sent:* Monday, December 8, 2014 11:01 AM *To:* user@spark.apache.org mailto:user@spark.apache.org *Subject:* Spark-SQL JDBC driver Hello Everyone, I'm brand new to spark and was wondering if there's a JDBC driver to access spark-SQL directly. I'm running spark in standalone mode and don't have hadoop in this environment. -- *Best Regards/أطيب المنى,* *Anas Mosaad* -- *Best Regards/أطيب المنى,* * * *Anas Mosaad* *Incorta Inc.* *+20-100-743-4510*
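For anyone following along, a minimal sketch of that workflow from the linked guide; the master URL and the default Thrift port 10000 are assumptions about the local setup:

    # Start the Spark SQL Thrift JDBC server that ships with Spark
    ./sbin/start-thriftserver.sh --master spark://your-master:7077

    # Test it with the bundled beeline client (default port 10000 assumed)
    ./bin/beeline -u jdbc:hive2://localhost:10000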
Re: How can I create an RDD with millions of entries created programmatically
Ah... I think you're right about the flatMap then :). Or you could use mapPartitions. (I'm not sure if it makes a difference.) On Mon, Dec 8, 2014 at 10:09 PM, Steve Lewis lordjoe2...@gmail.com wrote: looks good but how do I say that in Java as far as I can see sc.parallelize (in Java) has only one implementation which takes a List - requiring an in memory representation On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Hi, I think you have the right idea. I would not even worry about flatMap. val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x = generateRandomObject(x)) Then when you try to evaluate something on this RDD, it will happen partition-by-partition. So 1000 random objects will be generated at a time per executor thread. On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis lordjoe2...@gmail.com wrote: I have a function which generates a Java object and I want to explore failures which only happen when processing large numbers of these object. the real code is reading a many gigabyte file but in the test code I can generate similar objects programmatically. I could create a small list, parallelize it and then use flatmap to inflate it several times by a factor of 1000 (remember I can hold a list of 1000 items in memory but not a million) Are there better ideas - remember I want to create more objects than can be held in memory at once. -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
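A minimal Java sketch of the parallelize-then-inflate idea discussed above; the generated strings stand in for the poster's real generateRandomObject output, and the counts are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    public class GenerateLargeRdd {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[4]", "GenerateLargeRdd");

            // Small in-memory seed list: 1000 seeds, each expanded to 1000 objects later.
            List<Integer> seeds = new ArrayList<Integer>();
            for (int i = 0; i < 1000; i++) {
                seeds.add(i);
            }

            // Spread the seeds over many partitions, then inflate each seed inside
            // flatMap, so only one partition's worth of objects is materialized
            // at a time per executor thread.
            JavaRDD<String> rdd = sc.parallelize(seeds, 1000).flatMap(
                new FlatMapFunction<Integer, String>() {
                    public Iterable<String> call(Integer seed) {
                        List<String> out = new ArrayList<String>();
                        for (int j = 0; j < 1000; j++) {
                            out.add("object-" + (seed * 1000 + j)); // stand-in for the real object factory
                        }
                        return out;
                    }
                });

            // 1,000,000 elements, never all held in driver memory at once.
            System.out.println(rdd.count());
            sc.stop();
        }
    }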
KafkaUtils explicit acks
Hello Experts, I'm working on a spark app which reads data from kafka and persists it in hbase. The Spark documentation states the below *[1]*, that in case of worker failure we can lose some data. If that is the case, how can I make my kafka stream more reliable? I have seen there is a simple consumer *[2]* but I'm not sure if it has been used/tested extensively. I was wondering if there is a way to explicitly acknowledge the kafka offsets once they are replicated in the memory of other worker nodes (if it's not already done) to tackle this issue. Any help is appreciated in advance. 1. *Using any input source that receives data through a network* - For network-based data sources like *Kafka *and Flume, the received input data is replicated in memory between nodes of the cluster (default replication factor is 2). So if a worker node fails, then the system can recompute the lost data from the left over copy of the input data. However, if the *worker node where a network receiver was running fails, then a tiny bit of data may be lost*, that is, the data received by the system but not yet replicated to other node(s). The receiver will be started on a different node and it will continue to receive data. 2. https://github.com/dibbhatt/kafka-spark-consumer Txz, *Mukesh Jha me.mukesh@gmail.com*
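Not a direct answer to the acknowledgement question, but for completeness, a minimal sketch of the receiver-based stream with an explicitly replicated storage level, which is what point [1] above relies on; the ZooKeeper address, group id and topic name are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaToHBase")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Replicate received blocks to two nodes so a plain worker failure does not
    // lose data; the receiver-node window described in [1] still applies.
    val stream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER_2)

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()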
Re: Spark-SQL JDBC driver
Thanks Cheng, I thought spark-sql is using the same exact metastore, right? However, it didn't work as expected. Here's what I did: in spark-shell, I loaded a csv file and registered the table, say countries. Started the thrift server. Connected using beeline. When I run show tables or !tables, I get an empty list of tables, as follows:

*0: jdbc:hive2://localhost:1> !tables*
*+------------+--------------+-------------+-------------+----------+*
*| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  | TABLE_TYPE  | REMARKS  |*
*+------------+--------------+-------------+-------------+----------+*
*+------------+--------------+-------------+-------------+----------+*
*0: jdbc:hive2://localhost:1> show tables ;*
*+---------+*
*| result  |*
*+---------+*
*+---------+*
*No rows selected (0.106 seconds)*
*0: jdbc:hive2://localhost:1>*

Kindly advise, what am I missing? I want to read the RDD using SQL from outside spark-shell (i.e. like any other relational database) On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian lian.cs@gmail.com wrote: Essentially, the Spark SQL JDBC Thrift server is just a Spark port of HiveServer2. You don't need to run Hive, but you do need a working Metastore. On 12/9/14 3:59 PM, Anas Mosaad wrote: Thanks Judy, this is exactly what I'm looking for. However, and plz forgive me if it's a dump question is: It seems to me that thrift is the same as hive2 JDBC driver, does this mean that starting thrift will start hive as well on the server? On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash judyn...@exchange.microsoft.com wrote: You can use thrift server for this purpose then test it with beeline. See doc: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server *From:* Anas Mosaad [mailto:anas.mos...@incorta.com] *Sent:* Monday, December 8, 2014 11:01 AM *To:* user@spark.apache.org *Subject:* Spark-SQL JDBC driver Hello Everyone, I'm brand new to spark and was wondering if there's a JDBC driver to access spark-SQL directly. I'm running spark in standalone mode and don't have hadoop in this environment. -- *Best Regards/أطيب المنى,* *Anas Mosaad* -- *Best Regards/أطيب المنى,* *Anas Mosaad* *Incorta Inc.* *+20-100-743-4510* -- *Best Regards/أطيب المنى,* *Anas Mosaad* *Incorta Inc.* *+20-100-743-4510*
Re: Spark-SQL JDBC driver
How did you register the table under spark-shell? Two things to notice: 1. To interact with Hive, HiveContext instead of SQLContext must be used. 2. `registerTempTable` doesn't persist the table into Hive metastore, and the table is lost after quitting spark-shell. Instead, you must use `saveAsTable`. On 12/9/14 5:27 PM, Anas Mosaad wrote: Thanks Cheng, I thought spark-sql is using the same exact metastore, right? However, it didn't work as expected. Here's what I did. In spark-shell, I loaded a csv files and registered the table, say countries. Started the thrift server. Connected using beeline. When I run show tables or !tables, I get empty list of tables as follow: /0: jdbc:hive2://localhost:1 !tables/ /++--+-+-+--+/ /| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS |/ /++--+-+-+--+/ /++--+-+-+--+/ /0: jdbc:hive2://localhost:1 show tables ;/ /+-+/ /| result |/ /+-+/ /+-+/ /No rows selected (0.106 seconds)/ /0: jdbc:hive2://localhost:1 / Kindly advice, what am I missing? I want to read the RDD using SQL from outside spark-shell (i.e. like any other relational database) On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: Essentially, the Spark SQL JDBC Thrift server is just a Spark port of HiveServer2. You don't need to run Hive, but you do need a working Metastore. On 12/9/14 3:59 PM, Anas Mosaad wrote: Thanks Judy, this is exactly what I'm looking for. However, and plz forgive me if it's a dump question is: It seems to me that thrift is the same as hive2 JDBC driver, does this mean that starting thrift will start hive as well on the server? On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash judyn...@exchange.microsoft.com mailto:judyn...@exchange.microsoft.com wrote: You can use thrift server for this purpose then test it with beeline. See doc: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server *From:*Anas Mosaad [mailto:anas.mos...@incorta.com mailto:anas.mos...@incorta.com] *Sent:* Monday, December 8, 2014 11:01 AM *To:* user@spark.apache.org mailto:user@spark.apache.org *Subject:* Spark-SQL JDBC driver Hello Everyone, I'm brand new to spark and was wondering if there's a JDBC driver to access spark-SQL directly. I'm running spark in standalone mode and don't have hadoop in this environment. -- *Best Regards/أطيب المنى,* *Anas Mosaad* -- *Best Regards/أطيب المنى,* * * *Anas Mosaad* *Incorta Inc.* *+20-100-743-4510* -- *Best Regards/أطيب المنى,* * * *Anas Mosaad* *Incorta Inc.* *+20-100-743-4510*
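A minimal spark-shell sketch of the two points above; the file path, case class and column handling are placeholders for the real countries data:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

    case class Country(id: String, name: String)
    val countries = sc.textFile("countries.csv")
      .map(_.split(","))
      .map(c => Country(c(0), c(1)))

    // registerTempTable only lives for this session; saveAsTable, as suggested
    // above, persists the table into the Hive metastore, so beeline connected to
    // the Thrift server (using the same metastore configuration) can see it.
    countries.saveAsTable("countries")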
spark 1.1.1 Maven dependency
Dear All, I am using Spark Streaming. It was working fine when I was using Spark 1.0.2; now I am repeatedly getting a few issues like class not found. I am using the same pom.xml with the updated version for all the Spark modules I use: spark-core, streaming, and streaming with kafka. It constantly keeps throwing errors about missing commons-configuration, commons-lang and commons-logging. How do I get all the dependencies needed for running Spark Streaming? Is there any way, or do we just have to find them by trial and error? My pom dependencies:

<dependencies>
  <dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>servlet-api</artifactId>
    <version>2.5</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.0.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.0.2</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.5</version>
  </dependency>
  <dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.1.1</version>
  </dependency>
  <dependency>
    <groupId>commons-configuration</groupId>
    <artifactId>commons-configuration</artifactId>
    <version>1.6</version>
  </dependency>
</dependencies>

Am I missing something here? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-1-Maven-dependency-tp20590.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Issues on schemaRDD's function got stuck
I checked my code again, and located the issue that, if we do the `load data inpath` before select statement, the application will get stuck, if don't, it won't get stuck.Log info: 14/12/09 17:29:33 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:34 WARN io.nio: java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:34 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:35 WARN io.nio: java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:35 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down ActorSystem [sparkDriver] From: lin_q...@outlook.com To: u...@spark.incubator.apache.org Subject: Issues on schemaRDD's function got stuck Date: Tue, 9 Dec 2014 15:54:14 +0800 Hi all:I was running HiveFromSpark on yarn-cluster. While I got the hive select's result schemaRDD and tried to run `collect()` on it, the application got stuck and don't know what's wrong with it. Here is my code: val sqlStat = sSELECT * FROM $TABLE_NAME val result = hiveContext.hql(sqlStat) // got the select's result schemaRDDval rows = result.collect() // This is where the application getting stuck It was ok when running on yarn-client mode. Here is the Log===14/12/09 15:40:58 WARN util.AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) 14/12/09 15:41:31 WARN util.AkkaUtils: Error sending message in 2 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) 14/12/09 15:42:04 WARN util.AkkaUtils: Error sending message in 3 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) 14/12/09 15:42:07 WARN executor.Executor: Issue communicating with driver in heartbeater 
org.apache.spark.SparkException: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;@a810606,BlockManagerId(2, longzhou-hdp1.lz.dscc, 53356, 0))] at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) ... 1 more 14/12/09 15:42:47 WARN util.AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at
RE: Issues on schemaRDD's function got stuck
I checked my code again, and located the issue that, if we do the `load data inpath` before select statement, the application will get stuck, if don't, it won't get stuck.Get stuck code: val sqlLoadData = sLOAD DATA INPATH '$currentFile' OVERWRITE INTO TABLE $tableName hiveContext.hql(sqlLoadData) val sqlStat = sSELECT * FROM $TABLE_NAME val result = hiveContext.hql(sqlStat) // got the select's result schemaRDD val rows = result.collect() // This is where the application getting stuckLog info: 14/12/09 17:29:33 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:34 WARN io.nio: java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:34 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:35 WARN io.nio: java.lang.OutOfMemoryError: PermGen space 14/12/09 17:29:35 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down ActorSystem [sparkDriver] From: lin_q...@outlook.com To: u...@spark.incubator.apache.org Subject: Issues on schemaRDD's function got stuck Date: Tue, 9 Dec 2014 15:54:14 +0800 Hi all:I was running HiveFromSpark on yarn-cluster. While I got the hive select's result schemaRDD and tried to run `collect()` on it, the application got stuck and don't know what's wrong with it. Here is my code: val sqlStat = sSELECT * FROM $TABLE_NAME val result = hiveContext.hql(sqlStat) // got the select's result schemaRDDval rows = result.collect() // This is where the application getting stuck It was ok when running on yarn-client mode. 
Here is the Log===14/12/09 15:40:58 WARN util.AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) 14/12/09 15:41:31 WARN util.AkkaUtils: Error sending message in 2 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) 14/12/09 15:42:04 WARN util.AkkaUtils: Error sending message in 3 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) 14/12/09 15:42:07 WARN executor.Executor: Issue communicating with driver in heartbeater org.apache.spark.SparkException: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;@a810606,BlockManagerId(2, longzhou-hdp1.lz.dscc, 53356, 0))] at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
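The repeated `PermGen space` errors above point at the driver's permanent generation being exhausted rather than at the query itself; a hedged sketch of raising it at submit time follows, where 128m is an arbitrary starting point rather than a tuned value, and the class and jar names are placeholders:

    spark-submit --master yarn-cluster \
      --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=128m \
      --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=128m \
      --class your.main.Class your-app.jar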
NoSuchMethodError: writing spark-streaming data to cassandra
Hi, I am intending to save the streaming data from kafka into Cassandra, using spark-streaming: But there seems to be problem with line javaFunctions(data).writerBuilder(testkeyspace, test_table, mapToRow(TestTable.class)).saveToCassandra(); I am getting NoSuchMethodError. The code, the error-log and POM.xml dependencies are listed below: Please help me find the reason as to why is this happening. public class SparkStream { static int key=0; public static void main(String args[]) throws Exception { if(args.length != 3) { System.out.println(SparkStream zookeeper_ip group_nm topic1,topic2,...); System.exit(1); } Logger.getLogger(org).setLevel(Level.OFF); Logger.getLogger(akka).setLevel(Level.OFF); MapString,Integer topicMap = new HashMapString,Integer(); String[] topic = args[2].split(,); for(String t: topic) { topicMap.put(t, new Integer(3)); } /* Connection to Spark */ SparkConf conf = new SparkConf(); JavaSparkContext sc = new JavaSparkContext(local[4], SparkStream,conf); JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(3000)); /* Receive Kafka streaming inputs */ JavaPairReceiverInputDStreamString, String messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap ); /* Create DStream */ JavaDStreamTestTable data = messages.map(new FunctionTuple2String,String, TestTable () { public TestTable call(Tuple2String, String message) { return new TestTable(new Integer(++key), message._2() ); } } ); /* Write to cassandra */ javaFunctions(data,TestTable.class).saveToCassandra(testkeyspace,test_table); // data.print(); //creates console output stream. jssc.start(); jssc.awaitTermination(); } } class TestTable implements Serializable { Integer key; String value; public TestTable() {} public TestTable(Integer k, String v) { key=k; value=v; } public Integer getKey(){ return key; } public void setKey(Integer k){ key=k; } public String getValue(){ return value; } public void setValue(String v){ value=v; } public String toString(){ return MessageFormat.format(TestTable'{'key={0},value={1}'}', key, value); } } The output log is: Exception in thread main java.lang.NoSuchMethodError: com.datastax.spark.connector.streaming.DStreamFunctions.init(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;)V at com.datastax.spark.connector.DStreamJavaFunctions.init(DStreamJavaFunctions.java:17) at com.datastax.spark.connector.CassandraJavaUtil.javaFunctions(CassandraJavaUtil.java:89) at com.spark.SparkStream.main(SparkStream.java:83) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) And the POM dependencies are: dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming-kafka_2.10/artifactId version1.1.0/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.1.0/version /dependency dependency groupIdcom.datastax.spark/groupId artifactIdspark-cassandra-connector_2.10/artifactId version1.1.0/version /dependency dependency groupIdcom.datastax.spark/groupId artifactIdspark-cassandra-connector-java_2.10/artifactId version1.0.4/version /dependency dependency 
groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.1.1/version /dependency dependency groupIdcom.msiops.footing/groupId
Re: Spark-SQL JDBC driver
Back to the first question, this will mandate that hive is up and running? When I try it, I get the following exception. The documentation says that this method works only on SchemaRDD. I though that countries.saveAsTable did not work for that a reason so I created a tmp that contains the results from the registered temp table. Which I could validate that it's a SchemaRDD as shown below. *@Judy,* I do really appreciate your kind support and I want to understand and off course don't want to wast your time. If you can direct me the documentation describing this details, this will be great. scala val tmp = sqlContext.sql(select * from countries) tmp: org.apache.spark.sql.SchemaRDD = SchemaRDD[12] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == PhysicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36 scala tmp.saveAsTable(Countries) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree: 'CreateTableAsSelect None, Countries, false, None Project [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29] Subquery countries LogicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.saveAsTable(SchemaRDDLike.scala:126) at org.apache.spark.sql.SchemaRDD.saveAsTable(SchemaRDD.scala:108) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:22) at $iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC.init(console:29) at $iwC$$iwC.init(console:31) at $iwC.init(console:33) at init(console:35) at .init(console:39) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
Re: Mllib native netlib-java/OpenBLAS
+1 with 1.3-SNAPSHOT. On Mon, Dec 1, 2014 at 5:49 PM, agg212 alexander_galaka...@brown.edu wrote: Thanks for your reply, but I'm still running into issues installing/configuring the native libraries for MLlib. Here are the steps I've taken, please let me know if anything is incorrect. - Download Spark source - unzip and compile using `mvn -Pnetlib-lgpl -DskipTests clean package ` - Run `sbt/sbt publish-local` The last step fails with the following error (full stack trace is attached here: error.txt http://apache-spark-user-list.1001560.n3.nabble.com/file/n20110/error.txt ): [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode) Do I still have to install OPENBLAS/anything else if I build Spark from the source using the -Pnetlib-lgpl flag? Also, do I change the Spark version (from 1.1.0 to 1.2.0-SNAPSHOT) in the .sbt file for my app? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p20110.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
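In case it helps, the build step under discussion as commands; a minimal sketch assuming a Debian/Ubuntu worker for the OS package name (other distributions differ):

    # Build Spark with the netlib-lgpl profile so the netlib-java native bridge is bundled
    mvn -Pnetlib-lgpl -DskipTests clean package

    # A native BLAS must still be present on every worker, e.g. on Ubuntu:
    sudo apt-get install libopenblas-base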
Using S3 block file system
Hi, I'm trying to use S3 Block file system in spark, i.e. s3:// urls (*not* s3n://). And I always get the following error: Py4JJavaError: An error occurred while calling o3188.saveAsParquetFile. : org.apache.hadoop.fs.s3.S3FileSystemException: Not a Hadoop S3 file. at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.checkMetadata(Jets3tFileSystemStore.java:206) at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:165) at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:221) at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at com.sun.proxy.$Proxy24.retrieveINode(Unknown Source) at org.apache.hadoop.fs.s3.S3FileSystem.mkdir(S3FileSystem.java:158) at org.apache.hadoop.fs.s3.S3FileSystem.mkdirs(S3FileSystem.java:151) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815) at org.apache.hadoop.fs.s3.S3FileSystem.create(S3FileSystem.java:234) [.. snip ..] I believe that I must somehow initialize file system (in particular the metadata), but I can't find out how to do it. I use spark 1.2.0rc1 with hadoop 2.4 and Riak CS (instead of S3) if that matters. The s3n:// protocol with same settings work. Thanks. -- Paul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
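For reference, a hedged sketch of the credential keys the legacy s3:// block filesystem reads; note this does not by itself address the "Not a Hadoop S3 file" error above, which usually indicates the bucket contents were not written by the block filesystem itself:

    // Credentials for the legacy s3:// block filesystem (distinct from the
    // fs.s3n.* keys used by s3n://); endpoint settings for a non-AWS store
    // such as Riak CS depend on the jets3t properties in use.
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "SECRET_KEY")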
Re: NoSuchMethodError: writing spark-streaming data to cassandra
You're using two conflicting versions of the connector: the Scala version at 1.1.0 and the Java version at 1.0.4. I don't use Java, but I guess you only need the java dependency for your job - and with the version fixed. dependency groupIdcom.datastax.spark/groupId artifactIdspark-cassandra-connector_2.10/artifactId version*1.1.0*/version /dependency dependency groupIdcom.datastax.spark/groupId artifactIdspark-cassandra-connector-java_2.10/artifactId version*1.0.4*/version /dependency On Tue, Dec 9, 2014 at 11:16 AM, m.sar...@accenture.com wrote: Hi, I am intending to save the streaming data from kafka into Cassandra, using spark-streaming: But there seems to be problem with line javaFunctions(data).writerBuilder(testkeyspace, test_table, mapToRow(TestTable.class)).saveToCassandra(); I am getting NoSuchMethodError. The code, the error-log and POM.xml dependencies are listed below: Please help me find the reason as to why is this happening. public class SparkStream { static int key=0; public static void main(String args[]) throws Exception { if(args.length != 3) { System.out.println(SparkStream zookeeper_ip group_nm topic1,topic2,...); System.exit(1); } Logger.getLogger(org).setLevel(Level.OFF); Logger.getLogger(akka).setLevel(Level.OFF); MapString,Integer topicMap = new HashMapString,Integer(); String[] topic = args[2].split(,); for(String t: topic) { topicMap.put(t, new Integer(3)); } /* Connection to Spark */ SparkConf conf = new SparkConf(); JavaSparkContext sc = new JavaSparkContext(local[4], SparkStream,conf); JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(3000)); /* Receive Kafka streaming inputs */ JavaPairReceiverInputDStreamString, String messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap ); /* Create DStream */ JavaDStreamTestTable data = messages.map(new FunctionTuple2String,String, TestTable () { public TestTable call(Tuple2String, String message) { return new TestTable(new Integer(++key), message._2() ); } } ); /* Write to cassandra */ javaFunctions(data,TestTable.class).saveToCassandra(testkeyspace,test_table); // data.print(); //creates console output stream. 
jssc.start(); jssc.awaitTermination(); } } class TestTable implements Serializable { Integer key; String value; public TestTable() {} public TestTable(Integer k, String v) { key=k; value=v; } public Integer getKey(){ return key; } public void setKey(Integer k){ key=k; } public String getValue(){ return value; } public void setValue(String v){ value=v; } public String toString(){ return MessageFormat.format(TestTable'{'key={0},value={1}'}', key, value); } } The output log is: Exception in thread main java.lang.NoSuchMethodError: com.datastax.spark.connector.streaming.DStreamFunctions.init(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;)V at com.datastax.spark.connector.DStreamJavaFunctions.init(DStreamJavaFunctions.java:17) at com.datastax.spark.connector.CassandraJavaUtil.javaFunctions(CassandraJavaUtil.java:89) at com.spark.SparkStream.main(SparkStream.java:83) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) And the POM dependencies are: dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming-kafka_2.10/artifactId version1.1.0/version /dependency dependency
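A hedged sketch of the aligned POM entries; this assumes a 1.1.x release of the Java module is available for your connector line, so please check Maven Central for the exact version:

    <dependency>
      <groupId>com.datastax.spark</groupId>
      <artifactId>spark-cassandra-connector_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>
    <dependency>
      <groupId>com.datastax.spark</groupId>
      <artifactId>spark-cassandra-connector-java_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>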
Stack overflow Error while executing spark SQL
Hi I am getting Stack overflow Error Exception in main java.lang.stackoverflowerror scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) while executing the following code sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) The complete code is from github https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala import com.google.gson.{GsonBuilder, JsonParser} import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.mllib.clustering.KMeans /** * Examine the collected tweets and trains a model based on them. */ object ExamineAndTrain { val jsonParser = new JsonParser() val gson = new GsonBuilder().setPrettyPrinting().create() def main(args: Array[String]) { // Process program arguments and set properties /*if (args.length 3) { System.err.println(Usage: + this.getClass.getSimpleName + tweetInput outputModelDir numClusters numIterations) System.exit(1) } * */ val outputModelDir=C:\\MLModel val tweetInput=C:\\MLInput val numClusters=10 val numIterations=20 //val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters), Utils.IntParam(numIterations)) = args val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster(local[4]) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) // Pretty print some of the tweets. val tweets = sc.textFile(tweetInput) println(Sample JSON Tweets---) for (tweet - tweets.take(5)) { println(gson.toJson(jsonParser.parse(tweet))) } val tweetTable = sqlContext.jsonFile(tweetInput).cache() tweetTable.registerTempTable(tweetTable) println(--Tweet table Schema---) tweetTable.printSchema() println(Sample Tweet Text-) sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) println(--Sample Lang, Name, text---) sqlContext.sql(SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000).collect().foreach(println) println(--Total count by languages Lang, count(*)---) sqlContext.sql(SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25).collect.foreach(println) println(--- Training the model and persist it) val texts = sqlContext.sql(SELECT text from tweetTable).map(_.head.toString) // Cache the vectors RDD since it will be used for all the KMeans iterations. val vectors = texts.map(Utils.featurize).cache() vectors.count() // Calls an action on the RDD to populate the vectors cache. val model = KMeans.train(vectors, numClusters, numIterations) sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir) val some_tweets = texts.take(100) println(Example tweets from the clusters) for (i - 0 until numClusters) { println(s\nCLUSTER $i:) some_tweets.foreach { t = if (model.predict(Utils.featurize(t)) == i) { println(t) } } } } } Thanks Regards Jishnu Menath Prathap
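Not a confirmed fix, but worth noting: a StackOverflowError in the parser means the JVM thread stack was exhausted. If the recursion is merely deep (a very large query) rather than infinite, enlarging the stack may help; a hedged sketch with an arbitrary 4m value is below, with the application jar name as a placeholder. If a trivial query like the one shown overflows, it more often points to mismatched Scala or Spark versions on the classpath.

    spark-submit --conf spark.driver.extraJavaOptions=-Xss4m \
      --class com.databricks.apps.twitter_classifier.ExamineAndTrain your-app.jar
    # or pass -Xss4m directly to the JVM when running the class locally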
Re: spark 1.1.1 Maven dependency
Are you using the Commons libs in your app? then you need to depend on them directly, and not rely on them accidentally being provided by Spark. There is no trial and error; you must declare all the dependencies you use in your own code. Otherwise perhaps you are not actually running with the Spark assembly JAR at runtime but somehow only trying to run against the Spark jar itself. This has none of the dependencies that Spark needs. On Tue, Dec 9, 2014 at 3:46 AM, sivarani whitefeathers...@gmail.com wrote: Dear All, I am using spark streaming, It was working fine when i was using spark1.0.2, now i repeatedly getting few issue Like class not found, i am using the same pom.xml with the updated version for all spark modules i am using spark-core,streaming, streaming with kafka modules.. Its constantly keeps throwing errors for no commons-configuation, commons-langs, logging How to get all the dependencies for running spark streaming.. Is there any way or we just have to find by trial and error methord? my pom dependencies dependencies dependency groupIdjavax.servlet/groupId artifactIdservlet-api/artifactId version2.5/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.0.2/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.0.2/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming-kafka_2.10/artifactId version1.0.2/version /dependency dependency groupIdorg.slf4j/groupId artifactIdslf4j-log4j12/artifactId version1.7.5/version /dependency dependency groupIdcommons-logging/groupId artifactIdcommons-logging/artifactId version1.1.1/version /dependency dependency groupIdcommons-configuration/groupId artifactIdcommons-configuration/artifactId version1.6/version /dependency Am i missing something here? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-1-Maven-dependency-tp20590.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
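A hedged sketch of what declaring the Commons libraries directly can look like; the versions below are illustrative and not verified against Spark 1.1.1:

    <dependency>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
      <version>2.6</version>
    </dependency>
    <dependency>
      <groupId>commons-configuration</groupId>
      <artifactId>commons-configuration</artifactId>
      <version>1.6</version>
    </dependency>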
Re: Saving Data only if Dstream is not empty
I don't believe you can do this unless you implement the save to HDFS logic yourself. To keep the semantics consistent, these saveAs* methods will always output a file per partition. On Mon, Dec 8, 2014 at 11:53 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi Experts! I want to save DStream to HDFS only if it is not empty such that it contains some kafka messages to be stored. What is an efficient way to do this. var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, params, topicMap, StorageLevel.MEMORY_ONLY).map(_._2) val streams = data.window(Seconds(interval*4), Seconds(interval*2)).map(x = new String(x)) //streams.foreachRDD(rdd=rdd.foreach(println)) //what condition can be applied here to store only non empty DStream streams.saveAsTextFiles(sink, msg) Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Saving-Data-only-if-Dstream-is-not-empty-tp20587.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
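A minimal sketch of the "implement the save logic yourself" approach, reusing the `streams` DStream from the quoted code; the output path naming is a placeholder, and take(1) is used because RDD.isEmpty is not available in this Spark version:

    streams.foreachRDD { (rdd, time) =>
      // Only write a directory for this batch if the window actually holds data.
      if (rdd.take(1).nonEmpty) {
        rdd.saveAsTextFile("sink-" + time.milliseconds + ".msg")
      }
    }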
reg JDBCRDD code
Hi All, I am new to Spark. I tried to connect to MySQL using Spark. I want to write the code in Java but am getting a runtime exception. I guess the issue is with the Function0 and Function1 objects being passed to JdbcRDD. I tried my level best and attached the code; can you please help us fix the issue? Thanks Deepa =-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you JdbcRddTest.java Description: Binary data - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
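Since the attached JdbcRddTest.java does not come through in the archive, here is a hedged Scala sketch of the JdbcRDD constructor arguments (the Java version needs the same pieces wrapped in Scala Function0/Function1 objects); the URL, credentials, table and bounds are placeholders:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password"),
      // The query must contain exactly two '?' placeholders for the partition bounds.
      "SELECT id, name FROM people WHERE ? <= id AND id <= ?",
      lowerBound = 1, upperBound = 1000, numPartitions = 10,
      mapRow = rs => (rs.getInt("id"), rs.getString("name")))

    rdd.take(5).foreach(println)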
registerTempTable: Table not found
Hi, I am using Spark SQL on 1.2.1-snapshot. Here is a problem I encountered. Basically, I want to save a SchemaRDD into Hive through HiveContext:

val scm = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("cnt", IntegerType, nullable = false)))
val schRdd = hiveContext.applySchema(ranked, scm) // ranked above is RDD[Row] whose rows contain 2 fields
schRdd.registerTempTable("schRdd")
hiveContext sql "select count(name) from schRdd limit 20" // => ok
hiveContext sql "create table t as select * from schRdd" // => table not found

A query like the SELECT works well and gives the correct answer, but when I try to save the temp table into the HiveContext via createTableAsSelect, it does not work. *Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Table not found 'schRdd' at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243)* I thought that was caused by registerTempTable, so I replaced it with saveAsTable. That does not work either. *Exception in thread main java.lang.AssertionError: assertion failed: No plan for CreateTableAsSelect Some(sephcn), schRdd, false, None LogicalRDD [name#6,cnt#7], MappedRDD[3] at map at Creation.scala:70 at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)* I also checked the source code of QueryPlanner:

def apply(plan: LogicalPlan): Iterator[PhysicalPlan] = {
  // Obviously a lot to do here still...
  val iter = strategies.view.flatMap(_(plan)).toIterator
  assert(iter.hasNext, s"No plan for $plan")
  iter
}

The comment shows that there is still some work to do here. Any help is appreciated. Thx. Hao -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/registerTempTable-Table-not-found-tp20592.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Help related with spark streaming usage error
Hi , I am running spark streaming in standalone mode on twitter source into single machine(using HDP virtual box) I receive status from streaming context and I can print the same but when I try to save those statuses as RDD into Hadoop using rdd.saveAsTextFiles or saveAsHadoopFiles(hdfs://10.20.32.204:50070/user/hue/test,txt) I get below connection error. My Hadoop version:2.4.0.2.1.1.0-385 Spark 1.1.0 ERROR- 14/12/09 04:45:12 ERROR scheduler.JobScheduler: Error running job streaming job 141812911 ms.1 java.io.IOException: Call to /10.20.32.204:50070 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:1107) at org.apache.hadoop.ipc.Client.call(Client.java:1075) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy9.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at org.apache.hadoop.mapred.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:193) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:685) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:572) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:894) at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:762) at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:760) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:811) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:749) [error] (run-main-0) java.io.IOException: Call to /10.20.32.204:50070 failed on local exception: java.io.EOFException java.io.IOException: Call to /10.20.32.204:50070 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:1107) at org.apache.hadoop.ipc.Client.call(Client.java:1075) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy9.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at org.apache.hadoop.mapred.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:193) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:685) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:572) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:894) at
Re: registerTempTable: Table not found
Looks like this issue has been fixed very recently and should be available in next RC :- http://apache-spark-developers-list.1001551.n3.nabble.com/CREATE-TABLE-AS-SELECT-does-not-work-with-temp-tables-in-1-2-0-td9662.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/registerTempTable-Table-not-found-tp20592p20593.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Specifying number of executors in Mesos
Hi, We have a number of Spark Streaming/Kafka jobs that would benefit from an even spread of consumers over physical hosts in order to maximize network usage. As far as I can see, the Spark Mesos scheduler accepts resource offers until all of the required memory + CPU allocation has been satisfied. This basic resource allocation policy results in large executors spread over few nodes, which puts many Kafka consumers on a single node (e.g. out of 12 consumers, I've seen allocations of 7/3/2). Is there a way to tune this behavior to achieve executor allocation on a given number of hosts? -kr, Gerard.
Re: registerTempTable: Table not found
Thank you.
Re: Spark-SQL JDBC driver
According to the stack trace, you were still using SQLContext rather than HiveContext. To interact with Hive, HiveContext *must* be used. Please refer to this page: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables On 12/9/14 6:26 PM, Anas Mosaad wrote: Back to the first question: this will mandate that Hive is up and running? When I try it, I get the following exception. The documentation says that this method works only on a SchemaRDD. I thought that countries.saveAsTable did not work for that reason, so I created a tmp that contains the results from the registered temp table, which I could validate is a SchemaRDD as shown below. @Judy, I really appreciate your kind support; I want to understand and of course don't want to waste your time. If you can direct me to the documentation describing these details, that would be great. scala> val tmp = sqlContext.sql("select * from countries") tmp: org.apache.spark.sql.SchemaRDD = SchemaRDD[12] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == PhysicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36 scala> tmp.saveAsTable("Countries") org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree: 'CreateTableAsSelect None, Countries, false, None Project [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29] Subquery countries LogicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.saveAsTable(SchemaRDDLike.scala:126) at org.apache.spark.sql.SchemaRDD.saveAsTable(SchemaRDD.scala:108) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:22) at $iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC.init(console:29) at $iwC$$iwC.init(console:31) at $iwC.init(console:33) at init(console:35) at .init(console:39) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console)
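Following the pointer above, a minimal sketch of switching from SQLContext to HiveContext so that saveAsTable can be resolved. This assumes Spark was built with Hive support and that a metastore is reachable (with hive-site.xml on the classpath; otherwise a local Derby metastore is created in the working directory). The data source line is illustrative only; however the countries data was originally built, it should be created through the HiveContext rather than the plain SQLContext:

import org.apache.spark.sql.hive.HiveContext

// saveAsTable needs a HiveContext; a plain SQLContext has no catalog to persist into.
val hiveContext = new HiveContext(sc)

val countries = hiveContext.jsonFile("countries.json")   // illustrative source only
countries.registerTempTable("countries")
hiveContext.sql("SELECT * FROM countries").saveAsTable("countries_copy")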
Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
Hi, I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I get the following exception: 14/12/09 06:54:24 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 14/12/09 06:54:24 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 14/12/09 06:54:24 INFO ui.SparkUI: Started SparkUI at http://hdname:4040 14/12/09 06:54:25 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/ java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) Any idea why ? Thanks, Daniel
Submit application to spark on mesos cluster
Hi, I have a little problem submitting our application to a Mesos cluster. Basically the Mesos cluster is configured and spark-shell works correctly against it. Then I tried to launch our application jar (an uber jar built with sbt assembly, containing all dependencies): bin/spark-submit --master mesos://10.192.222.232:5050 --jars $ADDITIONAL_JARS --class my.package.BenchmarkDriver file:///home/jclouds/spark-benchmark-assembly-0.1.0-SNAPSHOT.jar --application_configs. So you can see I follow the documentation: I provide the application jar in file://... format, I make sure the jar is available at that path on each worker of the cluster, and I also provide it with --jars. However, there is always a ClassNotFoundException in the worker: java.lang.ClassNotFoundException: my.package.UrlLink or, if I try a custom Kryo serializer: org.apache.spark.SparkException: Failed to invoke my.package.CustomKryoRegistrator. I've tried using hdfs://... for the application jar, but it seems to be ignored completely by spark-submit. I'm using Spark 1.1.1 on Hadoop 2.4. Any suggestions? How should I submit the application jar? -- *JU Han* Data Engineer @ Botify.com +33 061960
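One thing worth double-checking (a sketch under assumptions, not a confirmed diagnosis): besides passing the assembly as the application jar, its location can also be registered on the SparkConf so the executors fetch it themselves. The path must be resolvable from every Mesos slave, either a local path that exists on each node or an hdfs:// / http:// URI:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch; the jar path mirrors the one used in the thread.
val conf = new SparkConf()
  .setAppName("BenchmarkDriver")
  .setJars(Seq("/home/jclouds/spark-benchmark-assembly-0.1.0-SNAPSHOT.jar"))
val sc = new SparkContext(conf)

// Alternatively, after the context exists:
// sc.addJar("hdfs:///jars/spark-benchmark-assembly-0.1.0-SNAPSHOT.jar")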
Re: Programmatically running spark jobs using yarn-client
Thanks Akhil. I was wondering why it wasn't able to find the class even though it existed in the same class loader as SparkContext. As a workaround, I used the following code to add all dependent jars in a Play Framework application to the Spark context.

@tailrec
private def addClassPathJars(sparkContext: SparkContext, classLoader: ClassLoader): Unit = {
  classLoader match {
    case urlClassLoader: URLClassLoader => {
      urlClassLoader.getURLs.foreach(classPathUrl => {
        if (classPathUrl.toExternalForm.endsWith(".jar")) {
          LOGGER.debug(s"Added $classPathUrl to spark context $sparkContext")
          sparkContext.addJar(classPathUrl.toExternalForm)
        } else {
          LOGGER.debug(s"Ignored $classPathUrl while adding to spark context $sparkContext")
        }
      })
    }
    case _ => LOGGER.debug(s"Ignored class loader $classLoader as it does not subclass URLClassLoader")
  }
  if (classLoader.getParent != null) {
    addClassPathJars(sparkContext, classLoader.getParent)
  }
}

On Mon Dec 08 2014 at 21:39:42 Akhil Das ak...@sigmoidanalytics.com wrote: How are you submitting the job? You need to create a jar of your code (sbt package will give you one inside target/scala-*/projectname-*.jar) and then use it while submitting. If you are not using spark-submit then you can simply add this jar to spark by sc.addJar("/path/to/target/scala-*/projectname-*.jar") Thanks Best Regards On Mon, Dec 8, 2014 at 7:23 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am trying to create (yet another) Spark as a service tool that lets you submit jobs via REST APIs. I think I have nearly gotten it to work, barring a few issues. Some of them seem already fixed in 1.2.0 (like SPARK-2889) but I have hit a road block with the following issue. I have created a simple Spark job as follows:

class StaticJob {
  import SparkContext._
  override def run(sc: SparkContext): Result = {
    val array = Range(1, 1000).toArray
    val rdd = sc.parallelize(array)
    val paired = rdd.map(i => (i % 1, i)).sortByKey()
    val sum = paired.countByKey()
    SimpleResult(sum)
  }
}

When I submit this job programmatically, it gives me a class not found error: 2014-12-08 05:41:18,421 [Result resolver thread-0] [warn] o.a.s.s.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, localhost.localdomain): java.lang.ClassNotFoundException: com.blah.server.examples.StaticJob$$anonfun$1 java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) I decompiled the StaticJob$$anonfun$1 class and it seems to point to the closure 'rdd.map(i => (i % 1, i))'. I am not sure why this is happening. Any help will be greatly appreciated.
Re: Saving Data only if Dstream is not empty
We have a similar case in which we don't want to save data to Cassandra if the data is empty. In our case, we filter the initial DStream to process messages that go to a given table. To do so, we're using something like this:

dstream.foreachRDD { (rdd, time) =>
  tables.foreach { table =>
    val filteredRdd = rdd.filter(record => /* predicate to assign records to tables */)
    filteredRdd.cache
    if (filteredRdd.count > 0) {
      filteredRdd.saveAsFoo(...) // we do saveToCassandra here; you could do saveAsTextFile(s"$path/$time")
    }
    filteredRdd.unpersist
  }
}

Using the 'time' parameter you can build a unique name based on the timestamp for the saveAsTextFile(filename) call, which is what DStream.saveAsTextFiles(...) gives you. (So it boils down to what Sean said... you implement the saveAs yourself.) -kr, Gerard. @maasg On Tue, Dec 9, 2014 at 1:56 PM, Sean Owen so...@cloudera.com wrote: I don't believe you can do this unless you implement the save-to-HDFS logic yourself. To keep the semantics consistent, these saveAs* methods will always output a file per partition. On Mon, Dec 8, 2014 at 11:53 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi Experts! I want to save a DStream to HDFS only if it is not empty, i.e. it contains some Kafka messages to be stored. What is an efficient way to do this?

var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, params, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
val streams = data.window(Seconds(interval*4), Seconds(interval*2)).map(x => new String(x))
// streams.foreachRDD(rdd => rdd.foreach(println))
// what condition can be applied here to store only a non-empty DStream?
streams.saveAsTextFiles(sink, msg)

Thanks
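If running a full count() on every batch just to test for emptiness is too heavy, a cheaper probe is take(1). A minimal sketch along the same lines as the snippet above; keep and path are placeholder names, not part of the original thread:

dstream.foreachRDD { (rdd, time) =>
  val filtered = rdd.filter(record => keep(record)) // `keep` is your predicate
  filtered.cache()                                  // avoid recomputing the filter twice
  if (filtered.take(1).nonEmpty) {                  // stops as soon as one record is found
    filtered.saveAsTextFile(s"$path/$time")
  }
  filtered.unpersist()
}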
Re: Spark on YARN memory utilization
Thanks Sandy! On Mon, Dec 8, 2014 at 23:15 Sandy Ryza sandy.r...@cloudera.com wrote: Another thing to be aware of is that YARN will round up containers to the nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults to 1024. -Sandy On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee denny.g@gmail.com wrote: Got it - thanks! On Sat, Dec 6, 2014 at 14:56 Arun Ahuja aahuj...@gmail.com wrote: Hi Denny, This is due the spark.yarn.memoryOverhead parameter, depending on what version of Spark you are on the default of this may differ, but it should be the larger of 1024mb per executor or .07 * executorMemory. When you set executor memory, the yarn resource request is executorMemory + yarnOverhead. - Arun On Sat, Dec 6, 2014 at 4:27 PM, Denny Lee denny.g@gmail.com wrote: This is perhaps more of a YARN question than a Spark question but i was just curious to how is memory allocated in YARN via the various configurations. For example, if I spin up my cluster with 4GB with a different number of executors as noted below 4GB executor-memory x 10 executors = 46GB (4GB x 10 = 40 + 6) 4GB executor-memory x 4 executors = 19GB (4GB x 4 = 16 + 3) 4GB executor-memory x 2 executors = 10GB (4GB x 2 = 8 + 2) The pattern when observing the RM is that there is a container for each executor and one additional container. From the basis of memory, it looks like there is an additional (1GB + (0.5GB x # executors)) that is allocated in YARN. Just wondering why is this - or is this just an artifact of YARN itself? Thanks!
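A rough sketch of the YARN request arithmetic described in this thread: each executor container asks for executorMemory plus the memory overhead, YARN rounds the request up to a multiple of yarn.scheduler.minimum-allocation-mb, and one extra container is allocated for the application master. The constants below (overhead minimum, 0.07 factor, 1024 MB rounding, AM size) are assumptions that vary by Spark/YARN version and configuration, so treat the function as illustrative only:

// Ballpark estimate of total YARN memory for a Spark-on-YARN application.
def roundUp(mb: Long, increment: Long): Long = ((mb + increment - 1) / increment) * increment

def totalYarnMemoryMB(executorMemMB: Long, numExecutors: Int,
                      overheadMinMB: Long = 384, overheadFraction: Double = 0.07,
                      minAllocationMB: Long = 1024, amContainerMB: Long = 1024): Long = {
  val overhead = math.max(overheadMinMB, (executorMemMB * overheadFraction).toLong)
  val perExecutor = roundUp(executorMemMB + overhead, minAllocationMB)
  numExecutors * perExecutor + roundUp(amContainerMB, minAllocationMB)
}

// e.g. totalYarnMemoryMB(4096, 10) gives a ballpark for the "4GB x 10 executors" case above.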
RE: Learning rate or stepsize automation
Thanks! Will try it out. From: Debasish Das [mailto:debasish.da...@gmail.com] Sent: Monday, December 08, 2014 5:13 PM To: Bui, Tri Cc: user@spark.apache.org Subject: Re: Learning rate or stepsize automation Hi Bui, Please use BFGS-based solvers... For BFGS you don't have to specify a step size, since the line search will find sufficient decrease each time... For regularization you still have to do a grid search... it's not possible to automate that, but on master you will find nice ways to automate grid search... Thanks. Deb On Mon, Dec 8, 2014 at 3:04 PM, Bui, Tri tri@verizonwireless.com.invalid wrote: Hi, Is there any way to auto-calculate the optimum learning rate or step size via MLlib for SGD? Thx tri
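For reference, a minimal sketch of the L-BFGS route mentioned above, assuming MLlib's LogisticRegressionWithLBFGS (available from Spark 1.1) and an illustrative libsvm input path; regularization still has to be grid-searched separately:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// No step size to tune: the L-BFGS line search chooses it internally.
val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt") // illustrative path
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)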
Re: NullPointerException When Reading Avro Sequence Files
Hi Cristovao, I have seen a very similar issue that I have posted about in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html I think your main issue here is somewhat similar, in that the MapWrapper Scala class is not registered. This gets registered by the Twitter chill-scala AllScalaRegistrar class that you are currently not using. As far as I understand, in order to use Avro with Spark, you also have to use Kryo. This means you have to use the Spark KryoSerializer. This in turn uses Twitter chill. I posted the basic code that I am using here: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491 Maybe there is a simpler solution to your problem but I am not that much of an expert yet. I hope this helps. Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro cristovao.corde...@cern.ch wrote: Hi Simone, thanks but I don't think that's it. I've tried several libraries within the --jar argument. Some do give what you said. But other times (when I put the right version I guess) I get the following: 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) Which is odd since I am reading a Avro I wrote...with the same piece of code: https://gist.github.com/MLnick/5864741781b9340cb211 Cumprimentos / Best regards, Cristóvão José Domingues Cordeiro IT Department - 28/R-018 CERN -- *From:* Simone Franzini [captainfr...@gmail.com] *Sent:* 06 December 2014 15:48 *To:* Cristovao Jose Domingues Cordeiro *Subject:* Re: NullPointerException When Reading Avro Sequence Files java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected That is a sign that you are mixing up versions of Hadoop. This is particularly an issue when dealing with AVRO. If you are using Hadoop 2, you will need to get the hadoop 2 version of avro-mapred. In Maven you would do this with the classifier hadoop2 /classifier tag. Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote: Hi all, I've tried the above example on Gist, but it doesn't work (at least for me). 
Did anyone get this: 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main] java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at
Re: reg JDBCRDD code
Hi Deepa, In Scala, you would do something like https://gist.github.com/akhld/ccafb27f098163bea622 With the Java API it would be something like https://gist.github.com/akhld/0d9299aafc981553bc34 Thanks Best Regards On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Hi All, I am new to Spark. I tried to connect to MySQL using Spark. I want to write the code in Java but am getting a runtime exception. I guess that the issue is with the Function0 and Function1 objects being passed to JdbcRDD. I tried my level best and attached the code; can you please help us fix the issue? Thanks Deepa
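In case the gists above go stale, here is a self-contained sketch of JdbcRDD from Scala; the JDBC URL, table and column names are made-up placeholders. Note that the SQL must contain two '?' placeholders, which JdbcRDD binds to each partition's lower and upper bound:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password"),
  "SELECT id, name FROM people WHERE ? <= id AND id <= ?",
  1,    // lowerBound
  100,  // upperBound
  3,    // numPartitions
  rs => (rs.getInt("id"), rs.getString("name"))
)
rdd.collect().foreach(println)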
Re: NullPointerException When Reading Avro Sequence Files
You can use this Maven dependency: dependency groupIdcom.twitter/groupId artifactIdchill-avro/artifactId version0.4.0/version /dependency Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini On Tue, Dec 9, 2014 at 9:53 AM, Cristovao Jose Domingues Cordeiro cristovao.corde...@cern.ch wrote: Thanks for the reply! I've tried in fact your code. But I lack the twiter chill package and I can not find it online. So I am now trying this http://spark.apache.org/docs/latest/tuning.html#data-serialization . But in case I can't do it, could you tell me where to get that Twiter package you used? Thanks Cumprimentos / Best regards, Cristóvão José Domingues Cordeiro IT Department - 28/R-018 CERN -- *From:* Simone Franzini [captainfr...@gmail.com] *Sent:* 09 December 2014 16:42 *To:* Cristovao Jose Domingues Cordeiro; user *Subject:* Re: NullPointerException When Reading Avro Sequence Files Hi Cristovao, I have seen a very similar issue that I have posted about in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html I think your main issue here is somewhat similar, in that the MapWrapper Scala class is not registered. This gets registered by the Twitter chill-scala AllScalaRegistrar class that you are currently not using. As far as I understand, in order to use Avro with Spark, you also have to use Kryo. This means you have to use the Spark KryoSerializer. This in turn uses Twitter chill. I posted the basic code that I am using here: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491 Maybe there is a simpler solution to your problem but I am not that much of an expert yet. I hope this helps. Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro cristovao.corde...@cern.ch wrote: Hi Simone, thanks but I don't think that's it. I've tried several libraries within the --jar argument. Some do give what you said. But other times (when I put the right version I guess) I get the following: 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) Which is odd since I am reading a Avro I wrote...with the same piece of code: https://gist.github.com/MLnick/5864741781b9340cb211 Cumprimentos / Best regards, Cristóvão José Domingues Cordeiro IT Department - 28/R-018 CERN -- *From:* Simone Franzini [captainfr...@gmail.com] *Sent:* 06 December 2014 15:48 *To:* Cristovao Jose Domingues Cordeiro *Subject:* Re: NullPointerException When Reading Avro Sequence Files java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected That is a sign that you are mixing up versions of Hadoop. This is particularly an issue when dealing with AVRO. If you are using Hadoop 2, you will need to get the hadoop 2 version of avro-mapred. In Maven you would do this with the classifier hadoop2 /classifier tag. Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote: Hi all, I've tried the above example on Gist, but it doesn't work (at least for me). 
Did anyone get this: 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
Re: PySpark elasticsearch question
Thanks Nick... still no luck. If I use ?q=somerandomcharsfields=title,_source I get an exception about empty collection, which seems to indicate it is actually using the supplied es.query, but somehow when I do rdd.take(1) or take(10), all I get is the id and an empty dict, apparently... maybe something to do how my index is setup in ES ? In [19]: es_rdd.take(4) 14/12/09 16:25:17 INFO SparkContext: Starting job: runJob at PythonRDD.scala:300 14/12/09 16:25:17 INFO DAGScheduler: Got job 18 (runJob at PythonRDD.scala:300) with 1 output partitions (allowLocal=true) 14/12/09 16:25:17 INFO DAGScheduler: Final stage: Stage 18(runJob at PythonRDD.scala:300) 14/12/09 16:25:17 INFO DAGScheduler: Parents of final stage: List() 14/12/09 16:25:17 INFO DAGScheduler: Missing parents: List() 14/12/09 16:25:17 INFO DAGScheduler: Submitting Stage 18 (PythonRDD[30] at RDD at PythonRDD.scala:43), which has no missing parents 14/12/09 16:25:17 INFO MemoryStore: ensureFreeSpace(4776) called with curMem=1979220, maxMem=278302556 14/12/09 16:25:17 INFO MemoryStore: Block broadcast_32 stored as values in memory (estimated size 4.7 KB, free 263.5 MB) 14/12/09 16:25:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 18 (PythonRDD[30] at RDD at PythonRDD.scala:43) 14/12/09 16:25:17 INFO TaskSchedulerImpl: Adding task set 18.0 with 1 tasks 14/12/09 16:25:17 INFO TaskSetManager: Starting task 0.0 in stage 18.0 (TID 19, localhost, ANY, 24823 bytes) 14/12/09 16:25:17 INFO Executor: Running task 0.0 in stage 18.0 (TID 19) 14/12/09 16:25:17 INFO NewHadoopRDD: Input split: ShardInputSplit [node=[VKgl4LAgRZyFaSopAWQL5Q/rap-es2-12|141.161.88.237:9200],shard=2] 14/12/09 16:25:17 WARN EsInputFormat: Cannot determine task id... 14/12/09 16:25:17 INFO PythonRDD: Times: total = 289, boot = 5, init = 284, finish = 0 14/12/09 16:25:17 ERROR NetworkClient: Node [Socket closed] failed ( 141.161.88.237:9200); selected next node [141.161.88.233:9200] 14/12/09 16:25:17 INFO Executor: Finished task 0.0 in stage 18.0 (TID 19). 1886 bytes result sent to driver 14/12/09 16:25:17 INFO TaskSetManager: Finished task 0.0 in stage 18.0 (TID 19) in 316 ms on localhost (1/1) 14/12/09 16:25:17 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks have all completed, from pool 14/12/09 16:25:17 INFO DAGScheduler: Stage 18 (runJob at PythonRDD.scala:300) finished in 0.324 s 14/12/09 16:25:17 INFO SparkContext: Job finished: runJob at PythonRDD.scala:300, took 0.337848207 s Out[19]: [(u'en_20040726_fbis_116728340038', {}), (u'en_20040726_fbis_116728320448', {}), (u'en_20040726_fbis_116728330192', {}), (u'en_20040726_fbis_116728330145', {})] In [20]: On Tue, Dec 9, 2014 at 10:18 AM, Nick wrote: try es.query something like ?q=*fields=title,_source for a match all query. you need the q=* which is actually the query part of the query On Tue, Dec 9, 2014 at 3:15 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Hello, Following a couple of tutorials, I cant seem to get pysprak to get any fields from ES other than the document id? I tried like so: es_rdd = sc.newAPIHadoopRDD(inputFormatClass=org.elasticsearch.hadoop.mr.EsInputFormat,keyClass=org.apache.hadoop.io.NullWritable,valueClass=org.elasticsearch.hadoop.mr.LinkedMapWritable,conf={ es.resource : en_2004/doc,es.nodes:rap-es2.uis,es.query : ?fields=title,_source }) es_rdd.take(1) Always shows: Out[13]: [(u'en_20040726_fbis_116728340038', {})] How does one get more fields? Thanks, Mohamed.
Re: PySpark elasticsearch question
found a format that worked, kind of accidentally: es.query : {query:{match_all:{}},fields:[title,_source]} Thanks, Mohamed. On Tue, Dec 9, 2014 at 11:27 AM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Thanks Nick... still no luck. If I use ?q=somerandomcharsfields=title,_source I get an exception about empty collection, which seems to indicate it is actually using the supplied es.query, but somehow when I do rdd.take(1) or take(10), all I get is the id and an empty dict, apparently... maybe something to do how my index is setup in ES ? In [19]: es_rdd.take(4) 14/12/09 16:25:17 INFO SparkContext: Starting job: runJob at PythonRDD.scala:300 14/12/09 16:25:17 INFO DAGScheduler: Got job 18 (runJob at PythonRDD.scala:300) with 1 output partitions (allowLocal=true) 14/12/09 16:25:17 INFO DAGScheduler: Final stage: Stage 18(runJob at PythonRDD.scala:300) 14/12/09 16:25:17 INFO DAGScheduler: Parents of final stage: List() 14/12/09 16:25:17 INFO DAGScheduler: Missing parents: List() 14/12/09 16:25:17 INFO DAGScheduler: Submitting Stage 18 (PythonRDD[30] at RDD at PythonRDD.scala:43), which has no missing parents 14/12/09 16:25:17 INFO MemoryStore: ensureFreeSpace(4776) called with curMem=1979220, maxMem=278302556 14/12/09 16:25:17 INFO MemoryStore: Block broadcast_32 stored as values in memory (estimated size 4.7 KB, free 263.5 MB) 14/12/09 16:25:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 18 (PythonRDD[30] at RDD at PythonRDD.scala:43) 14/12/09 16:25:17 INFO TaskSchedulerImpl: Adding task set 18.0 with 1 tasks 14/12/09 16:25:17 INFO TaskSetManager: Starting task 0.0 in stage 18.0 (TID 19, localhost, ANY, 24823 bytes) 14/12/09 16:25:17 INFO Executor: Running task 0.0 in stage 18.0 (TID 19) 14/12/09 16:25:17 INFO NewHadoopRDD: Input split: ShardInputSplit [node=[VKgl4LAgRZyFaSopAWQL5Q/rap-es2-12|141.161.88.237:9200],shard=2] 14/12/09 16:25:17 WARN EsInputFormat: Cannot determine task id... 14/12/09 16:25:17 INFO PythonRDD: Times: total = 289, boot = 5, init = 284, finish = 0 14/12/09 16:25:17 ERROR NetworkClient: Node [Socket closed] failed ( 141.161.88.237:9200); selected next node [141.161.88.233:9200] 14/12/09 16:25:17 INFO Executor: Finished task 0.0 in stage 18.0 (TID 19). 1886 bytes result sent to driver 14/12/09 16:25:17 INFO TaskSetManager: Finished task 0.0 in stage 18.0 (TID 19) in 316 ms on localhost (1/1) 14/12/09 16:25:17 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks have all completed, from pool 14/12/09 16:25:17 INFO DAGScheduler: Stage 18 (runJob at PythonRDD.scala:300) finished in 0.324 s 14/12/09 16:25:17 INFO SparkContext: Job finished: runJob at PythonRDD.scala:300, took 0.337848207 s Out[19]: [(u'en_20040726_fbis_116728340038', {}), (u'en_20040726_fbis_116728320448', {}), (u'en_20040726_fbis_116728330192', {}), (u'en_20040726_fbis_116728330145', {})] In [20]: On Tue, Dec 9, 2014 at 10:18 AM, Nick wrote: try es.query something like ?q=*fields=title,_source for a match all query. you need the q=* which is actually the query part of the query On Tue, Dec 9, 2014 at 3:15 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Hello, Following a couple of tutorials, I cant seem to get pysprak to get any fields from ES other than the document id? 
I tried like so: es_rdd = sc.newAPIHadoopRDD(inputFormatClass=org.elasticsearch.hadoop.mr.EsInputFormat,keyClass=org.apache.hadoop.io.NullWritable,valueClass=org.elasticsearch.hadoop.mr.LinkedMapWritable,conf={ es.resource : en_2004/doc,es.nodes:rap-es2.uis,es.query : ?fields=title,_source }) es_rdd.take(1) Always shows: Out[13]: [(u'en_20040726_fbis_116728340038', {})] How does one get more fields? Thanks, Mohamed.
Re: NoSuchMethodError: writing spark-streaming data to cassandra
Hi, @Gerard - Thanks, I added one more dependency for conf.set("spark.cassandra.connection.host", "localhost"). Now I am able to create a connection, but the data is not getting inserted into the Cassandra table, and the logs show it getting connected and then disconnected the next second. The full code, logs and dependencies are below:

public class SparkStream {
    static int key = 0;
    public static void main(String args[]) throws Exception {
        if (args.length != 3) {
            System.out.println("parameters not given properly");
            System.exit(1);
        }
        Logger.getLogger("org").setLevel(Level.OFF);
        Logger.getLogger("akka").setLevel(Level.OFF);
        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        String[] topic = args[2].split(",");
        for (String t : topic) {
            topicMap.put(t, new Integer(3));
        }

        /* Connection to Spark */
        SparkConf conf = new SparkConf();
        conf.set("spark.cassandra.connection.host", "localhost");
        JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream", conf);
        JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(5000));

        /* connection to cassandra */
        CassandraConnector connector = CassandraConnector.apply(sc.getConf());
        System.out.println("+++cassandra connector created");

        /* Receive Kafka streaming inputs */
        JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
        System.out.println("+streaming Connection done!+++");

        /* Create DStream */
        JavaDStream<TestTable> data = messages.map(new Function<Tuple2<String, String>, TestTable>() {
            public TestTable call(Tuple2<String, String> message) {
                return new TestTable(new Integer(++key), message._2());
            }
        });
        System.out.println("JavaDStream<TestTable> created");

        /* Write to cassandra */
        javaFunctions(data).writerBuilder("testkeyspace", "test_table", mapToRow(TestTable.class)).saveToCassandra();

        jssc.start();
        jssc.awaitTermination();
    }
}

class TestTable implements Serializable {
    Integer key;
    String value;
    public TestTable() {}
    public TestTable(Integer k, String v) { key = k; value = v; }
    public Integer getKey() { return key; }
    public void setKey(Integer k) { key = k; }
    public String getValue() { return value; }
    public void setValue(String v) { value = v; }
    public String toString() {
        return MessageFormat.format("TestTable'{'key={0}, value={1}'}'", key, value);
    }
}

The log is:
+++cassandra connector created
+streaming Connection done!+++
JavaDStream<TestTable> created
14/12/09 12:07:33 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:33 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:34 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:45 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:46 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster

The POM.xml dependencies are:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.10</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector-java_2.10</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.1.1</version>
</dependency>
<dependency>
  <groupId>com.msiops.footing</groupId>
  <artifactId>footing-tuple</artifactId>
  <version>0.2</version>
</dependency>
<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-core</artifactId>
  <version>2.1.3</version>
</dependency>

Thanks and Regards, Md. Aiman Sarosh. Accenture Services Pvt. Ltd. Mob #: (+91) - 9836112841. From: Gerard Maas gerard.m...@gmail.com Sent: Tuesday, December 9, 2014 4:39 PM To: Sarosh, M. Cc: spark users Subject: Re: NoSuchMethodError: writing spark-streaming data to
pyspark sc.textFile uses only 4 out of 32 threads per node
I am having an issue with PySpark launched on EC2 (using spark-ec2) with 5 r3.4xlarge machines, each with 32 threads and 240GB of RAM. When I use sc.textFile to load data from a number of gz files, it does not progress as fast as expected. When I log in to a child node and run top, I see only 4 threads at 100% CPU; the remaining 28 cores are idle. This is not an issue when processing the strings after loading, when all the cores are used. Can you please help me with this? What setting can be changed to get the CPU usage back up to full?
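One possible explanation (an assumption, not confirmed in the thread): gzip files are not splittable, so each .gz file becomes exactly one partition and only that many tasks can run during the load phase; repartitioning right after the read, or using more (or splittable) input files, spreads the downstream work over all cores. A sketch in Scala; the same textFile/repartition calls exist in PySpark:

// Path and partition count are illustrative.
val raw = sc.textFile("hdfs:///data/logs/*.gz")   // one partition per .gz file
val spread = raw.repartition(5 * 32)              // roughly the total core count of the cluster
spread.map(_.length).count()                      // downstream work now uses all cores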
Re: spark sql - save to Parquet file - Unsupported datatype TimestampType
Not yet unfortunately. You could cast the timestamp to a long if you don't need nanosecond precision. Here is a related JIRA: https://issues.apache.org/jira/browse/SPARK-4768 On Mon, Dec 8, 2014 at 11:27 PM, ZHENG, Xu-dong dong...@gmail.com wrote: I meet the same issue. Any solution? On Wed, Nov 12, 2014 at 2:54 PM, tridib tridib.sama...@live.com wrote: Hi Friends, I am trying to save a json file to parquet. I got error Unsupported datatype TimestampType. Is not parquet support date? Which parquet version does spark uses? Is there any work around? Here the stacktrace: java.lang.RuntimeException: Unsupported datatype TimestampType at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2$$anonfun$3.apply(ParquetTypes.scala:320) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2$$anonfun$3.apply(ParquetTypes.scala:320) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:319) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76) at org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42) Thanks Regards Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-save-to-Parquet-file-Unsupported-datatype-TimestampType-tp18691.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- 郑旭东 ZHENG, Xu-dong
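A small sketch of the cast workaround suggested above, assuming a registered table named events with a timestamp column ts; the names are illustrative, and the exact unit the cast produces (seconds vs. fractional seconds) should be verified on your Spark version:

// Write the timestamp out as a long instead of an (unsupported) TimestampType column.
val withLongTs = sqlContext.sql("SELECT id, CAST(ts AS BIGINT) AS ts_long FROM events")
withLongTs.saveAsParquetFile("/tmp/events_parquet")   // illustrative output path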
PySpark and UnsupportedOperationException
While trying simple examples of PySpark code, I systematically get these failures when I try this.. I dont see any prior exceptions in the output... How can I debug further to find root cause? es_rdd = sc.newAPIHadoopRDD( inputFormatClass=org.elasticsearch.hadoop.mr.EsInputFormat, keyClass=org.apache.hadoop.io.NullWritable, valueClass=org.elasticsearch.hadoop.mr.LinkedMapWritable, conf={ es.resource : en_2014/doc, es.nodes:rap-es2, es.query : {query:{match_all:{}},fields:[title], size: 100} } ) titles=es_rdd.map(lambda d: d[1]['title'][0]) counts = titles.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add) output = counts.collect() ... 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 93 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_93 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_93 of size 2448 dropped from memory (free 274984768) 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 93 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 92 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_92 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_92 of size 163391 dropped from memory (free 275148159) 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 92 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 91 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_91 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_91 of size 163391 dropped from memory (free 275311550) 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 91 14/12/09 19:27:30 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 72) java.lang.UnsupportedOperationException at java.util.AbstractMap.put(AbstractMap.java:203) at java.util.AbstractMap.putAll(AbstractMap.java:273) at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:373) at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:322) at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:299) at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:227) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913) at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929) at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969) at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) 
14/12/09 19:27:30 INFO TaskSetManager: Starting task 2.0 in stage 67.0 (TID 74, localhost, ANY, 26266 bytes) 14/12/09 19:27:30 INFO Executor: Running task 2.0 in stage 67.0 (TID 74) 14/12/09 19:27:30 WARN TaskSetManager: Lost task 0.0 in stage 67.0 (TID 72, localhost): java.lang.UnsupportedOperationException:
Re: PhysicalRDD problem?
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema) This line is throwing away the logical information about existingSchemaRDD, and thus Spark SQL can't know how to push down projections or predicates past this operator. Can you describe in more detail the problems you see if you don't do this reapplication of the schema?
Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)
You might also try out the recently added support for views. On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ah... I see. Thanks for pointing it out. Then it means we cannot mount external table using customized column names. hmm... Then the only option left is to use a subquery to add a bunch of column alias. I'll try it later. Thanks, Jianshi On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com wrote: This is by hive's design. From the Hive documentation: The column change command will only modify Hive's metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition. On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, found another possible bug in Hive. My current solution is to use ALTER TABLE CHANGE to rename the column names. The problem is after renaming the column names, the value of the columns became all NULL. Before renaming: scala sql(select `sorted::cre_ts` from pmt limit 1).collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala sql(alter table pmt change `sorted::cre_ts` cre_ts string) res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala sql(select cre_ts from pmt limit 1).collect res16: Array[org.apache.spark.sql.Row] = Array([null]) I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4781 Jianshi On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Very interesting, the line doing drop table will throws an exception. After removing it all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Here's the solution I got after talking with Liancheng: 1) using backquote `..` to wrap up all illegal characters val rdd = parquetFile(file) val schema = rdd.schema.fields.map(f = s`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}).mkString(,\n) val ddl_13 = s |CREATE EXTERNAL TABLE $name ( | $schema |) |STORED AS PARQUET |LOCATION '$file' .stripMargin sql(ddl_13) 2) create a new Schema and do applySchema to generate a new SchemaRDD, had to drop and register table val t = table(name) val newSchema = StructType(t.schema.fields.map(s = s.copy(name = s.name.replaceAll(.*?::, sql(sdrop table $name) applySchema(t, newSchema).registerTempTable(name) I'm testing it for now. Thanks for the help! Jianshi On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I had to use Pig for some preprocessing and to generate Parquet files for Spark to consume. However, due to Pig's limitation, the generated schema contains Pig's identifier e.g. sorted::id, sorted::cre_ts, ... I tried to put the schema inside CREATE EXTERNAL TABLE, e.g. create external table pmt ( sorted::id bigint ) stored as parquet location '...' Obviously it didn't work, I also tried removing the identifier sorted::, but the resulting rows contain only nulls. Any idea how to create a table in HiveContext from these Parquet files? 
Thanks, Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
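Since the inline snippets above lost some formatting in transit, here is a compact sketch of the rename-and-re-register approach discussed in this thread: strip the Pig "xxx::" prefixes from the column names and apply the cleaned schema. The Parquet path is illustrative, the context could equally be a HiveContext, and the StructType import location shown is the Spark 1.1/1.2 one:

import org.apache.spark.sql.catalyst.types.StructType // moved to org.apache.spark.sql.types in later releases

val t = sqlContext.parquetFile("/path/to/pig/output") // illustrative path
val cleaned = StructType(t.schema.fields.map(f => f.copy(name = f.name.replaceAll(".*?::", ""))))
sqlContext.applySchema(t, cleaned).registerTempTable("pmt")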
yarn log on EMR
Dear all, I would like to run a simple Spark job on EMR with YARN. My job is as follows:

public void EMRRun() {
    SparkConf sparkConf = new SparkConf().setAppName("RunEMR").setMaster("yarn-cluster");
    sparkConf.set("spark.executor.memory", "13000m");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    System.out.println(ctx.appName());
    List<Integer> list = new LinkedList<Integer>();
    for (int i = 0; i < 1; i++) {
        list.add(i);
    }
    JavaRDD<Integer> listRDD = ctx.parallelize(list);
    List<Integer> results = listRDD.collect();
    for (Integer i : results) {
        System.out.println(i);
    }
    ctx.stop();
}

public static void main(String[] args) {
    SparkTest sp = new SparkTest();
    sp.EMRRun();
}

On EMR I run Spark with spark-submit as follows: ./spark-submit --class com.collokia.ml.stackoverflow.usertags.browserhistory.sparkTestJava.SparkTest --master yarn-cluster --executor-memory 512m --num-executors 10 /home/hadoop/MLyBigData.jar After that finished I tried to see the YARN log, but I got this: yarn logs -applicationId application_1418123020170_0032 14/12/09 20:29:26 INFO client.RMProxy: Connecting to ResourceManager at /172.31.3.155:9022 Logs not available at /tmp/logs/hadoop/logs/application_1418123020170_0032 Log aggregation has not completed or is not enabled. But I modified yarn-site.xml as:

<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>30</value></property>

I use AMI version 3.2.3, Spark 1.1.0 on Hadoop 2.4. Any suggestions on how I can see the YARN logs? Thanks, Istvan
Caching RDDs with shared memory - bug or feature?
If all RDD elements within a partition contain pointers to a single shared object, Spark persists as expected when the RDD is small. However, if the RDD is more than 200 elements then Spark reports requiring much more memory than it actually does. This becomes a problem for large RDDs, as Spark refuses to persist even though it can. Is this a bug or is there a feature that I'm missing? Cheers, Luke

val n = ???
class Elem(val s: Array[Int])
val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
  val sharedArray = Array.ofDim[Int](1000) // Should require ~40MB
  (1 to n).toIterator.map(_ => new Elem(sharedArray))
}).cache().count() // force computation

For n = 100: MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.1 MB, free 898.7 MB)
For n = 200: MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 898.7 MB)
For n = 201: MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 860.2 MB)
For n = 5000: MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 781.3 MB so far)

Note: For medium sized n (where n > 200 but Spark can still cache), the actual application memory still stays where it should - Spark just seems to vastly overreport how much memory it's using.
spark shell and hive context problem
Hi I'm working on Spark that comes with CDH 5.2.0 I'm trying to get a hive context in the shell and I'm running into problems that I don't understand. I have added hive-site.xml to the conf folder under /usr/lib/spark/conf as indicated elsewhere Here is what I see.Pls help --- scala import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveContext scala val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) error: bad symbolic reference. A signature in HiveContext.class refers to term hive in package org.apache.hadoop which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveContext.class. error: while compiling: console during phase: erasure library version: version 2.10.4 compiler version: version 2.10.4 reconstructed args: last tree to typer: Apply(value $outer) symbol: value $outer (flags: method synthetic stable expandedname triedcooking) symbol definition: val $outer(): $iwC.$iwC.type tpe: $iwC.$iwC.type symbol owners: value $outer - class $iwC - class $iwC - class $iwC - class $read - package $line9 context owners: class $iwC - class $iwC - class $iwC - class $iwC - class $read - package $line9 == Enclosing template or block == ClassDef( // class $iwC extends Serializable 0 $iwC [] Template( // val local $iwC: notype, tree.tpe=$iwC java.lang.Object, scala.Serializable // parents ValDef( private _ tpt empty ) // 5 statements DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC method triedcooking init [] // 1 parameter list ValDef( // $outer: $iwC.$iwC.$iwC.type $outer tpt // tree.tpe=$iwC.$iwC.$iwC.type empty ) tpt // tree.tpe=$iwC Block( // tree.tpe=Unit Apply( // def init(): Object in class Object, tree.tpe=Object $iwC.super.init // def init(): Object in class Object, tree.tpe=()Object Nil ) () ) ) ValDef( // private[this] val hiveCtx: org.apache.spark.sql.hive.HiveContext private local triedcooking hiveCtx tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext Apply( // def init(sc: org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class HiveContext, tree.tpe=org.apache.spark.sql.hive.HiveContext new org.apache.spark.sql.hive.HiveContext.init // def init(sc: org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class HiveContext, tree.tpe=(sc: org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext Apply( // val sc(): org.apache.spark.SparkContext, tree.tpe=org.apache.spark.SparkContext $iwC.this.$line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc // val sc(): org.apache.spark.SparkContext, tree.tpe=()org.apache.spark.SparkContext Nil ) ) ) DefDef( // val hiveCtx(): org.apache.spark.sql.hive.HiveContext method stable accessor hiveCtx [] List(Nil) tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext $iwC.this.hiveCtx // private[this] val hiveCtx: org.apache.spark.sql.hive.HiveContext, tree.tpe=org.apache.spark.sql.hive.HiveContext ) ValDef( // protected val $outer: $iwC.$iwC.$iwC.type protected synthetic paramaccessor triedcooking $outer tpt // tree.tpe=$iwC.$iwC.$iwC.type empty ) DefDef( // val $outer(): $iwC.$iwC.$iwC.type method synthetic stable expandedname triedcooking $line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer [] List(Nil) tpt // tree.tpe=Any $iwC.this.$outer // protected val $outer: $iwC.$iwC.$iwC.type, tree.tpe=$iwC.$iwC.$iwC.type ) ) ) == Expanded type of tree == ThisType(class $iwC) 
uncaught exception during compilation: scala.reflect.internal.Types$TypeError scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveContext.class refers to term conf in value org.apache.hadoop.hive which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveContext.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last one. [y/n] -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-and-hive-context-problem-tp20597.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail:
Re: spark shell and hive context problem
Hello, In CDH 5.2 you need to manually add Hive classes to the classpath of your Spark job if you want to use the Hive integration. Also, be aware that since Spark 1.1 doesn't really support the version of Hive shipped with CDH 5.2, this combination is to be considered extremely experimental. On Tue, Dec 9, 2014 at 2:07 PM, minajagi chetan.v.minaj...@jpmorgan.com wrote: Hi I'm working on Spark that comes with CDH 5.2.0 I'm trying to get a hive context in the shell and I'm running into problems that I don't understand. I have added hive-site.xml to the conf folder under /usr/lib/spark/conf as indicated elsewhere Here is what I see.Pls help --- scala import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveContext scala val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) error: bad symbolic reference. A signature in HiveContext.class refers to term hive in package org.apache.hadoop which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveContext.class. error: while compiling: console during phase: erasure library version: version 2.10.4 compiler version: version 2.10.4 reconstructed args: last tree to typer: Apply(value $outer) symbol: value $outer (flags: method synthetic stable expandedname triedcooking) symbol definition: val $outer(): $iwC.$iwC.type tpe: $iwC.$iwC.type symbol owners: value $outer - class $iwC - class $iwC - class $iwC - class $read - package $line9 context owners: class $iwC - class $iwC - class $iwC - class $iwC - class $read - package $line9 == Enclosing template or block == ClassDef( // class $iwC extends Serializable 0 $iwC [] Template( // val local $iwC: notype, tree.tpe=$iwC java.lang.Object, scala.Serializable // parents ValDef( private _ tpt empty ) // 5 statements DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC method triedcooking init [] // 1 parameter list ValDef( // $outer: $iwC.$iwC.$iwC.type $outer tpt // tree.tpe=$iwC.$iwC.$iwC.type empty ) tpt // tree.tpe=$iwC Block( // tree.tpe=Unit Apply( // def init(): Object in class Object, tree.tpe=Object $iwC.super.init // def init(): Object in class Object, tree.tpe=()Object Nil ) () ) ) ValDef( // private[this] val hiveCtx: org.apache.spark.sql.hive.HiveContext private local triedcooking hiveCtx tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext Apply( // def init(sc: org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class HiveContext, tree.tpe=org.apache.spark.sql.hive.HiveContext new org.apache.spark.sql.hive.HiveContext.init // def init(sc: org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class HiveContext, tree.tpe=(sc: org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext Apply( // val sc(): org.apache.spark.SparkContext, tree.tpe=org.apache.spark.SparkContext $iwC.this.$line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc // val sc(): org.apache.spark.SparkContext, tree.tpe=()org.apache.spark.SparkContext Nil ) ) ) DefDef( // val hiveCtx(): org.apache.spark.sql.hive.HiveContext method stable accessor hiveCtx [] List(Nil) tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext $iwC.this.hiveCtx // private[this] val hiveCtx: org.apache.spark.sql.hive.HiveContext, tree.tpe=org.apache.spark.sql.hive.HiveContext ) ValDef( // protected val $outer: $iwC.$iwC.$iwC.type protected synthetic paramaccessor 
triedcooking $outer tpt // tree.tpe=$iwC.$iwC.$iwC.type empty ) DefDef( // val $outer(): $iwC.$iwC.$iwC.type method synthetic stable expandedname triedcooking $line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer [] List(Nil) tpt // tree.tpe=Any $iwC.this.$outer // protected val $outer: $iwC.$iwC.$iwC.type, tree.tpe=$iwC.$iwC.$iwC.type ) ) ) == Expanded type of tree == ThisType(class $iwC) uncaught exception during compilation: scala.reflect.internal.Types$TypeError scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveContext.class refers to term conf in value org.apache.hadoop.hive which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when
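A quick way to verify that diagnosis from inside spark-shell (a small sketch, not CDH-specific): if the class lookup below throws ClassNotFoundException, the Hive client classes that HiveContext was compiled against really are missing from the classpath.

// Run in the spark-shell REPL; HiveConf lives in org.apache.hadoop.hive.conf,
// the package the "bad symbolic reference ... term conf" error above complains about.
Class.forName("org.apache.hadoop.hive.conf.HiveConf")

If it does throw, launching the shell with the Hive client jars on the driver classpath (for example via --driver-class-path, or SPARK_CLASSPATH in Spark 1.1) is the usual remedy; the exact jar locations depend on how CDH was installed.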
implement query to sparse vector representation in spark
I know quite a lot about machine learning, but new to scala and spark. Got stuck due to Spark API, so please advise. I have a txt file with each line format like this #label \t # query, a strong of words, delimited by space 1 wireless amazon kindle 2 apple iPhone 5 1 kindle fire 8G 2 apple iPad first field is the label, second field is the string My plan is to split the data into label and feature, transform the string into sparse vector using build in function Word2Vec(I assume it is using bag of words to get dict first), then classify using SVMWithSGD to train object QueryClassification { def main(args: Array[String]) { val conf = new SparkConf().setAppName(Query Classification).setMaster(local) val sc = new SparkContext(conf) val input = sc.textFile(spark_data.txt) val word2vec = new Word2Vec() val parsedData = input.map {line = val parts = line.split(\t) ## How to write code here? I need to parse into feature vector ## properly and then apply word2vec function after the map *LabeledPoint(parts(0).toDouble, )* } ## * is the item I got from parsing parts(1) above word2vec.fit(*) val numIterations = 20 val model = SVMWithSGD.train(parsedData,numIterations) } } Thanks a lot
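Since the question is mostly about the parsing step, here is a minimal sketch under stated assumptions: it uses HashingTF (term-frequency hashing, available in MLlib 1.1) instead of Word2Vec, because SVMWithSGD needs one fixed-length vector per query while Word2Vec by itself only yields per-word vectors that would still have to be combined (for example averaged); it also remaps the 1/2 labels to the 0/1 labels SVMWithSGD expects. The file name and the label mapping are assumptions, not something taken from the original post.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object QueryClassificationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("Query Classification").setMaster("local"))

    // Hash each query's words into a fixed-length term-frequency vector.
    val hashingTF = new HashingTF()

    val parsedData = sc.textFile("spark_data.txt").map { line =>
      val parts = line.split("\t", 2)               // parts(0) = label, parts(1) = query text
      val features = hashingTF.transform(parts(1).split(" ").toSeq)
      // SVMWithSGD expects labels 0.0/1.0, so "1" and "2" are remapped here (an assumption).
      LabeledPoint(if (parts(0).trim == "1") 1.0 else 0.0, features)
    }.cache()

    val model = SVMWithSGD.train(parsedData, 20)    // 20 iterations, as in the original plan
    println(model.weights)
    sc.stop()
  }
}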
equivalent to sql in
I have an RDD I want to filter, and for a single term all works good, i.e. dataRDD.filter(x => x._2 == "apple"). How can I use multiple values? For example, if I wanted to filter my RDD to take out apples and oranges and pears without using . This could get long winded as there may be quite a few. Can you filter using a set or a list? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: equivalent to sql in
This is more of a Scala-specific question. I would look at the List contains implementation. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599p20600.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: query classification using spark Mlib
the format is bad, the question link is here http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/query-classification-using-spark-Mlib-tp20601p20602.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava
To report back how I ultimately solved this issue and someone else can do: 1) Check each jar class path and make sure the jars are listed in the order of Guava class version (i.e. spark-assembly needs to list before Hadoop 2.4 because spark-assembly has guava 14 and Hadoop 2.4 has guava 11). May require update compute-classpath.sh to get the ordering right. 2) If the other jars uses a higher version, bump spark guava library to higher version. Guava supposedly to be very backward compatible. Hope this helps. -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, December 2, 2014 11:35 AM To: Judy Nash Cc: Patrick Wendell; Denny Lee; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava On Tue, Dec 2, 2014 at 11:22 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Any suggestion on how can user with custom Hadoop jar solve this issue? You'll need to include all the dependencies for that custom Hadoop jar to the classpath. Those will include Guava (which is not included in its original form as part of the Spark dependencies). -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Sunday, November 30, 2014 11:06 PM To: Judy Nash Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Thanks Judy. While this is not directly caused by a Spark issue, it is likely other users will run into this. This is an unfortunate consequence of the way that we've shaded Guava in this release, we rely on byte code shading of Hadoop itself as well. And if the user has their own Hadoop classes present it can cause issues. On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Thanks Patrick and Cheng for the suggestions. The issue was Hadoop common jar was added to a classpath. After I removed Hadoop common jar from both master and slave, I was able to bypass the error. This was caused by a local change, so no impact on the 1.2 release. -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Wednesday, November 26, 2014 8:17 AM To: Judy Nash Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Just to double check - I looked at our own assembly jar and I confirmed that our Hadoop configuration class does use the correctly shaded version of Guava. My best guess here is that somehow a separate Hadoop library is ending up on the classpath, possible because Spark put it there somehow. tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar cd org/apache/hadoop/ javap -v Configuration | grep Precond Warning: Binary file Configuration contains org.apache.hadoop.conf.Configuration #497 = Utf8 org/spark-project/guava/common/base/Preconditions #498 = Class #497 // org/spark-project/guava/common/base/Preconditions #502 = Methodref #498.#501// org/spark-project/guava/common/base/Preconditions.checkArgument:(ZL j ava/lang/Object;)V 12: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLj a va/lang/Object;)V 50: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLj a va/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? 
The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should not do this. - Patrick On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like a config issue. I ran spark-pi job and still failing with the same guava error Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 --executor-memory 1G --num-executors 1 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100 Had used the same build steps on spark 1.1 and had no issue. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Tuesday, November 25, 2014 5:47 PM To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I
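When chasing this kind of classpath-ordering problem, a small diagnostic sketch like the one below (run in spark-shell or in the affected driver) shows which jar actually supplies Guava's Preconditions, and therefore which copy is winning on the classpath:

// Prints the jar (or directory) the un-shaded Guava Preconditions class is loaded from;
// getCodeSource can be null for bootstrap-classpath classes, hence the guard.
val src = Class.forName("com.google.common.base.Preconditions")
  .getProtectionDomain.getCodeSource
println(if (src != null) src.getLocation else "bootstrap classpath")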
Fwd: Please add us to the Spark users page
Hi, My name is Abhik Majumdar and I am a co-founder of Vidora Corp. We use Spark at Vidora to power our machine learning stack and we are requesting to be included on your Powered by Spark page: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark Here is the information you requested: *organization name:* Vidora *URL:* http://www.vidora.com *a list of which Spark components you are using:* Spark Core, MLlib, Spark Streaming. *a short description of your use case:* Vidora personalizes the online experience for content companies and provides a platform to tailor and adapt their consumer experiences to each of their users. Our machine learning stack, running on Spark, is able to completely personalize the entire webpage or mobile app for any kind of content with the objective of optimizing the metric that the customer cares about. Please let me know if there is any additional information we can provide. Thanks Abhik Abhik Majumdar, Co-Founder Vidora Website : www.vidora.com E-mail : ab...@vidora.com Follow us on Twitter https://twitter.com/#!/vidoracorp or LinkedIn http://www.linkedin.com/company/vidora --
RE: equivalent to sql in
Option 1: dataRDD.filter(x => (x._2 == "apple") || (x._2 == "orange")) Option 2: val fruits = Set("apple", "orange", "pear") dataRDD.filter(x => fruits.contains(x._2)) Mohammed -Original Message- From: dizzy5112 [mailto:dave.zee...@gmail.com] Sent: Tuesday, December 9, 2014 2:16 PM To: u...@spark.incubator.apache.org Subject: equivalent to sql in I have an RDD I want to filter, and for a single term all works good, i.e. dataRDD.filter(x => x._2 == "apple"). How can I use multiple values? For example, if I wanted to filter my RDD to take out apples and oranges and pears without using . This could get long winded as there may be quite a few. Can you filter using a set or a list? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
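If the term list grows large, one possible refinement of Option 2 (a sketch, not from the thread) is to broadcast the set so it is shipped to each executor once rather than serialized with every task closure:

val fruits = sc.broadcast(Set("apple", "orange", "pear"))
val filtered = dataRDD.filter(x => fruits.value.contains(x._2))   // same semantics as Option 2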
Re: dockerized spark executor on mesos?
We have dockerized Spark Master and worker(s) separately and are using it in our dev environment. We don't use Mesos though, running it in Standalone mode, but adding Mesos should not be that difficult I think. Regards Venkat -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/dockerized-spark-executor-on-mesos-tp20276p20603.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Cluster getting a null pointer error
I have set up a cluster on AWS and am trying a really simple hello world program as a test. The cluster was built using the ec2 scripts that come with Spark. Anyway, I have output the error message (using --verbose) below. The source code is further below that. Any help would be greatly appreciated. Thanks, Eric *Error code:* r...@ip-xx.xx.xx.xx ~]$ ./spark/bin/spark-submit --verbose --class com.je.test.Hello --master spark://xx.xx.xx.xx:7077 Hello-assembly-1.0.jar Spark assembly has been built with Hive, including Datanucleus jars on classpath Using properties file: /root/spark/conf/spark-defaults.conf Adding default property: spark.executor.memory=5929m Adding default property: spark.executor.extraClassPath=/root/ephemeral-hdfs/conf Adding default property: spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/ Using properties file: /root/spark/conf/spark-defaults.conf Adding default property: spark.executor.memory=5929m Adding default property: spark.executor.extraClassPath=/root/ephemeral-hdfs/conf Adding default property: spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/ Parsed arguments: master spark://xx.xx.xx.xx:7077 deployMode null executorMemory 5929m executorCores null totalExecutorCores null propertiesFile /root/spark/conf/spark-defaults.conf extraSparkPropertiesMap() driverMemorynull driverCores null driverExtraClassPathnull driverExtraLibraryPath null driverExtraJavaOptions null supervise false queue null numExecutorsnull files null pyFiles null archivesnull mainClass com.je.test.Hello primaryResource file:/root/Hello-assembly-1.0.jar namecom.je.test.Hello childArgs [] jarsnull verbose true Default properties from /root/spark/conf/spark-defaults.conf: spark.executor.extraLibraryPath - /root/ephemeral-hdfs/lib/native/ spark.executor.memory - 5929m spark.executor.extraClassPath - /root/ephemeral-hdfs/conf Using properties file: /root/spark/conf/spark-defaults.conf Adding default property: spark.executor.memory=5929m Adding default property: spark.executor.extraClassPath=/root/ephemeral-hdfs/conf Adding default property: spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/ Main class: com.je.test.Hello Arguments: System properties: spark.executor.extraLibraryPath - /root/ephemeral-hdfs/lib/native/ spark.executor.memory - 5929m SPARK_SUBMIT - true spark.app.name - com.je.test.Hello spark.jars - file:/root/Hello-assembly-1.0.jar spark.executor.extraClassPath - /root/ephemeral-hdfs/conf spark.master - spark://xxx.xx.xx.xxx:7077 Classpath elements: file:/root/Hello-assembly-1.0.jar *Actual Error:* Exception in thread main java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) *Source Code:* package com.je.test import org.apache.spark.{SparkConf, SparkContext} class Hello { def main(args: Array[String]): Unit = { val conf = new SparkConf(true)//.set(spark.cassandra.connection.host, xxx.xx.xx.xxx) val sc = new SparkContext(spark://xxx.xx.xx.xxx:7077, Season, conf) println(Hello World) } }
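One likely explanation, offered here as an assumption since the thread contains no resolution: spark-submit invokes the application's main method reflectively as a static method, and Hello above is declared as a class rather than an object, so the reflective invoke has no instance to call the method on and fails with the NullPointerException seen inside SparkSubmit.launch. A sketch of the corrected entry point:

package com.je.test

import org.apache.spark.{SparkConf, SparkContext}

// Declaring the entry point as an object gives spark-submit the static main() it
// looks up via reflection; a plain class leaves it with nothing to invoke.
object Hello {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
    val sc = new SparkContext(conf)   // master is already supplied via --master on the command line
    println("Hello World")
    sc.stop()
  }
}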
Reading Yarn log on EMR
Dear all, I would like to run a simple Spark job on EMR with YARN. My job is as follows: public void EMRRun() { SparkConf sparkConf = new SparkConf().setAppName("RunEMR").setMaster("yarn-cluster"); sparkConf.set("spark.executor.memory", "13000m"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); System.out.println(ctx.appName()); List<Integer> list = new LinkedList<Integer>(); for (int i = 0; i < 1; i++) { list.add(i); } JavaRDD<Integer> listRDD = ctx.parallelize(list); List<Integer> results = listRDD.collect(); for (Integer i : results) { System.out.println(i); } ctx.stop(); } public static void main(String[] args) { SparkTest sp = new SparkTest(); sp.EMRRun(); } On EMR I run Spark with spark-submit as follows: ./spark-submit --class com.collokia.ml.stackoverflow.usertags.browserhistory.sparkTestJava.SparkTest --master yarn-cluster --executor-memory 512m --num-executors 10 /home/hadoop/MLyBigData.jar After that finished I tried to see the YARN log, but I got this: yarn logs -applicationId application_1418123020170_0032 14/12/09 20:29:26 INFO client.RMProxy: Connecting to ResourceManager at /172.31.3.155:9022 Logs not available at /tmp/logs/hadoop/logs/application_1418123020170_0032 Log aggregation has not completed or is not enabled. But I modified the yarn-site.xml as: <property><name>yarn.log-aggregation-enable</name><value>true</value></property> <property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property> <property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>30</value></property> I use AMI version 3.2.3, Spark 1.1.0 on Hadoop 2.4. Any suggestions on how I can see the YARN logs? Thanks, Istvan attachment: nistvan.vcf - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Reading Yarn log on EMR
https://issues.apache.org/jira/browse/YARN-321 There is not a generic application history server yet. The current one works for MR. On Tue, Dec 9, 2014 at 4:48 PM, Nagy István tyson...@gmail.com wrote: Dear all, I would like to run a simple spark job on EMR with yarn. My job is the follows: public void EMRRun() { SparkConf sparkConf = new SparkConf().setAppName(RunEMR).setMaster(yarn-cluster); sparkConf.set(spark.executor.memory, 13000m);JavaSparkContext ctx = new JavaSparkContext(sparkConf);System.out.println(ctx.appName()); ListInteger list = new LinkedListInteger();for (int i =0; i1; i++){ list.add(i);} JavaRDDInteger listRDD = ctx.parallelize(list);ListInteger results = listRDD.collect();for (Integer i : results){ System.out.println(i);} ctx.stop();} public static void main(String[] args) { SparkTest sp = new SparkTest();sp.EMRRun();} On EMR I run the spark with spark-submit with the following: ./spark-submit --class com.collokia.ml.stackoverflow.usertags.browserhistory.sparkTestJava.SparkTest --master yarn-cluster --executor-memory 512m --num-executors 10 /home/hadoop/MLyBigData.jar After that finished I tried to see yarn log, but I got this: yarn logs -applicationId application_1418123020170_0032 14/12/09 20:29:26 INFO client.RMProxy: Connecting to ResourceManager at / 172.31.3.155:9022 Logs not available at /tmp/logs/hadoop/logs/application_1418123020170_0032 Log aggregation has not completed or is not enabled. But I modified the yarn-site.xml as: propertynameyarn.log-aggregation-enable/namevaluetrue/value/property propertynameyarn.log-aggregation.retain-seconds/namevalue-1/value/property propertynameyarn.log-aggregation.retain-check-interval-seconds/namevalue30/value/property I use AMI version of 3.2.3, spark 1.1.0 on hadoop 2.4 Any suggestions how can I see the logs of the yarn? Thanks, Istvan - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Workers keep dying on EC2 Spark cluster: PriviledgedActionException
Hi Spark users, I've been attempting to get flambo https://github.com/yieldbot/flambo/blob/develop/README.md, a Clojure library for Spark, working with my codebase. After getting things to build with this very simple interface: (ns sharknado.core (:require [flambo.conf :as conf] [flambo.api :as spark])) (defn configure [master-url app-name] (- (conf/spark-conf) (conf/master master-url) (conf/app-name app-name))) (defn get-context [master-url app-name] (spark/spark-context (configure master-url app-name))) I run in the lein repl: (use 'sharknado.core) (def cx (get-context spark://MASTER-URL.compute-1.amazonaws.com:7077 flambo-test)) This connects to the master and successfully creates an app; however, the app's workers all die after several seconds. It looks like user Saiph Kappa had similar problems about a month ago. Someone suggested that the cluster and submitted spark application might be using different versions of Spark; that's definitely not the case here. I've tried with both 1.1.0 and 1.1.1 on both ends. With Spark 1.1.0, after all workers die, the application exits. With spark 1.1.1, after each worker dies, another is automatically created; at the moment the app detail screen in the UI is showing 150 exited and 5 running workers. Anyone have any ideas? Example trace from a worker below. Thanks, Jeff 14/12/10 01:22:09 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/12/10 01:22:10 INFO spark.SecurityManager: Changing view acls to: root,Jeff 14/12/10 01:22:10 INFO spark.SecurityManager: Changing modify acls to: root,Jeff 14/12/10 01:22:10 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, Jeff); users with modify permissions: Set(root, Jeff) 14/12/10 01:22:10 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/12/10 01:22:10 INFO Remoting: Starting remoting 14/12/10 01:22:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@ip-address.ec2.internal:49050] 14/12/10 01:22:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@ip-address.ec2.internal:49050] 14/12/10 01:22:10 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 49050. 14/12/10 01:22:40 ERROR security.UserGroupInformation: PriviledgedActionException as:Jeff cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ... 
4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52) ... 7 more
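The timeout happens while each freshly launched executor tries to fetch the driver's properties, which very often means the workers cannot open a connection back to the driver (for example a REPL running outside EC2, behind NAT or a restrictive security group). That is an assumption about this particular cluster, but the usual first step is to make the driver's address and port explicit and reachable from the workers; shown below as plain SparkConf calls, with placeholder values:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://MASTER-URL.compute-1.amazonaws.com:7077")
  .setAppName("flambo-test")
  // Executors connect back to these; both must be reachable from the worker nodes
  // and open in the security group. Host and port below are placeholders.
  .set("spark.driver.host", "driver-host-reachable-from-workers.example.com")
  .set("spark.driver.port", "50000")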
Can HiveContext be used without using Hive?
From 1.1.1 documentation, it seems one can use HiveContext instead of SQLContext without having a Hive installation. The benefit is richer SQL dialect. Is my understanding correct ? Thanks
Re: Can HiveContext be used without using Hive?
That is correct. If you have not configured Hive, the HiveContext will create an embedded metastore in the current directory. On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel manojsamelt...@gmail.com wrote: From 1.1.1 documentation, it seems one can use HiveContext instead of SQLContext without having a Hive installation. The benefit is richer SQL dialect. Is my understanding correct ? Thanks
Re: Can HiveContext be used without using Hive?
In that case, what should be the behavior of saveTableAs? On Dec 10, 2014 4:03 AM, Michael Armbrust mich...@databricks.com wrote: That is correct. It the hive context will create an embedded metastore in the current directory if you have not configured hive. On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel manojsamelt...@gmail.com wrote: From 1.1.1 documentation, it seems one can use HiveContext instead of SQLContext without having a Hive installation. The benefit is richer SQL dialect. Is my understanding correct ? Thanks
RE: Can HiveContext be used without using Hive?
It works exactly like Create Table As Select (CTAS) in Hive. Cheng Hao From: Anas Mosaad [mailto:anas.mos...@incorta.com] Sent: Wednesday, December 10, 2014 11:59 AM To: Michael Armbrust Cc: Manoj Samel; user@spark.apache.org Subject: Re: Can HiveContext be used without using Hive? In that case, what should be the behavior of saveTableAs? On Dec 10, 2014 4:03 AM, Michael Armbrust mich...@databricks.commailto:mich...@databricks.com wrote: That is correct. It the hive context will create an embedded metastore in the current directory if you have not configured hive. On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel manojsamelt...@gmail.commailto:manojsamelt...@gmail.com wrote: From 1.1.1 documentation, it seems one can use HiveContext instead of SQLContext without having a Hive installation. The benefit is richer SQL dialect. Is my understanding correct ? Thanks
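For completeness, a small end-to-end sketch of what this looks like with no Hive installation; note the SchemaRDD method is spelled saveAsTable, and people.json here is a hypothetical input file:

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
// With no hive-site.xml on the classpath, this creates an embedded Derby metastore
// (a metastore_db directory) and a local warehouse under the current working directory.
hiveCtx.jsonFile("people.json").registerTempTable("people")
// Behaves like Hive's CREATE TABLE AS SELECT: the query result is written out and
// registered as a managed table in that embedded metastore.
hiveCtx.sql("SELECT name FROM people WHERE age >= 21").saveAsTable("adults")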
Mllib error
I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: object mllib is not a member of package org.apache.spark. Please note I just migrated from 1.0.2 to 1.1.1. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
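The most common cause of this particular error after a version bump, stated here as an assumption since the build file is not shown, is that spark-mllib is a separate Maven artifact that spark-core does not pull in. A minimal build.sbt sketch for Spark 1.1.1 on Scala 2.10:

// build.sbt (sketch) -- spark-mllib must be declared explicitly alongside spark-core
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.1",
  "org.apache.spark" %% "spark-mllib" % "1.1.1"
)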
Re: PhysicalRDD problem?
Hi Michael, I think I have found the exact problem in my case. I see that we have written something like following in Analyzer.scala :- // TODO: pass this in as a parameter. val fixedPoint = FixedPoint(100) and Batch(Resolution, fixedPoint, ResolveReferences :: ResolveRelations :: ResolveSortReferences :: NewRelationInstances :: ImplicitGenerate :: StarExpansion :: ResolveFunctions :: GlobalAggregates :: UnresolvedHavingClauseAttributes :: TrimGroupingAliases :: typeCoercionRules ++ extendedRules : _*), Perhaps in my case, it reaches the 100 iterations and break out of while loop in RuleExecutor.scala and thus, doesn't resolve all the attributes. Exception in my logs :- 14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached for batch Resolution 14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in context with path [] threw exception [Servlet execution threw an exception] with root cause org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS DOWN_BYTESHTTPSUBCR#6567, tree: 'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS DOWN_BYTESHTTPSUBCR#6567] ... ... ... at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86) at org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67) at org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85) at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50) at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490) I think the solution here is to have the FixedPoint constructor argument as configurable/parameterized (also written as TODO). Do we have a plan to do this in 1.2 release? Or I can take this up as a task for myself if you want (since this is very crucial for our release). 
Thanks -Nitin On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich...@databricks.com wrote: val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema) This line is throwing away the logical information about existingSchemaRDD and thus Spark SQL can't know how to push down projections or predicates past this operator. Can you describe more the problems that you see if you don't do this reapplication of the schema. -- Regards Nitin Goyal
Stack overflow Error while executing spark SQL
HI I am getting Stack overflow Error Exception in main java.lang.stackoverflowerror scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) Exact line where exception occurs. sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) The complete code is from github https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala import com.google.gson.{GsonBuilder, JsonParser} import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.mllib.clustering.KMeans /** * Examine the collected tweets and trains a model based on them. */ object ExamineAndTrain { val jsonParser = new JsonParser() val gson = new GsonBuilder().setPrettyPrinting().create() def main(args: Array[String]) { // Process program arguments and set properties /*if (args.length 3) { System.err.println(Usage: + this.getClass.getSimpleName + tweetInput outputModelDir numClusters numIterations) System.exit(1) } * */ val outputModelDir=C:\\MLModel val tweetInput=C:\\MLInput val numClusters=10 val numIterations=20 //val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters), Utils.IntParam(numIterations)) = args val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster(local[4]) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) // Pretty print some of the tweets. val tweets = sc.textFile(tweetInput) println(Sample JSON Tweets---) for (tweet - tweets.take(5)) { println(gson.toJson(jsonParser.parse(tweet))) } val tweetTable = sqlContext.jsonFile(tweetInput).cache() tweetTable.registerTempTable(tweetTable) println(--Tweet table Schema---) tweetTable.printSchema() println(Sample Tweet Text-) sqlContext.sql(SELECT text FROM tweetTable LIMIT 10).collect().foreach(println) println(--Sample Lang, Name, text---) sqlContext.sql(SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000).collect().foreach(println) println(--Total count by languages Lang, count(*)---) sqlContext.sql(SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25).collect.foreach(println) println(--- Training the model and persist it) val texts = sqlContext.sql(SELECT text from tweetTable).map(_.head.toString) // Cache the vectors RDD since it will be used for all the KMeans iterations. val vectors = texts.map(Utils.featurize).cache() vectors.count() // Calls an action on the RDD to populate the vectors cache. 
val model = KMeans.train(vectors, numClusters, numIterations) sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir) val some_tweets = texts.take(100) println(Example tweets from the clusters) for (i - 0 until numClusters) { println(s\nCLUSTER $i:) some_tweets.foreach { t = if (model.predict(Utils.featurize(t)) == i) { println(t) } } } } } Any Help would be appreciated.. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Stack-overflow-Error-while-executing-spark-SQL-tp20604.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
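One workaround worth trying, under the assumption that the overflow comes from the recursive-descent SQL parser exhausting the default JVM thread stack rather than from the query itself: run the statement on a thread that is explicitly given a larger stack. The same effect can also be had by passing a larger -Xss through spark-submit's --driver-java-options.

// Runs the failing statement on a thread with a 32 MB stack (size is an arbitrary,
// generous choice); everything else in the job stays unchanged.
val worker = new Thread(null, new Runnable {
  override def run(): Unit =
    sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)
}, "sql-with-large-stack", 32L * 1024 * 1024)
worker.start()
worker.join()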
Re: reg JDBCRDD code
Thanks Akhil but it is expecting Function1 instead of Function .. I tried out writing a new class by implementing Function1 but got an error . can you please help us to get it resolved JDBCRDD is created as JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1, getResultset, ClassTag$.MODULE$.apply(String.class)); overridden 'apply' method in Function1 public String apply(ResultSet arg0) { String ss = null; try { ss = (String) ((java.sql.ResultSet) arg0).getString(1); } catch (SQLException e) { // TODO Auto-generated catch block e.printStackTrace(); } System.out.println(ss); return ss; // TODO Auto-generated method stub } Error log Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: java.sql.SQLException: Parameter index out of range (1 number of parameters, which is 0). com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073) com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) com.mysql.jdbc.PreparedStatement.checkBounds( PreparedStatement.java:3709) com.mysql.jdbc.PreparedStatement.setInternal( PreparedStatement.java:3693) com.mysql.jdbc.PreparedStatement.setInternal( PreparedStatement.java:3735) com.mysql.jdbc.PreparedStatement.setLong( PreparedStatement.java:3751) Thanks Deepa From: Akhil Das ak...@sigmoidanalytics.com To: Deepa Jayaveer deepa.jayav...@tcs.com Cc: user@spark.apache.org user@spark.apache.org Date: 12/09/2014 09:30 PM Subject:Re: reg JDBCRDD code Hi Deepa, In Scala, You will do something like https://gist.github.com/akhld/ccafb27f098163bea622 With Java API's it will be something like https://gist.github.com/akhld/0d9299aafc981553bc34 Thanks Best Regards On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Hi All, am new to Spark. I tried to connect to Mysql using Spark. want to write a code in Java but getting runtime exception. I guess that the issue is with the function0 and function1 objects being passed in JDBCRDD . I tried my level best and attached the code, can you please help us to fix the issue. Thanks Deepa =-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: PySprak and UnsupportedOperationException
somehow I now get a better error trace, I believe for the same root issue... any advice of how to narrow this down further highly appreciated: ... 14/12/10 07:15:03 ERROR PythonRDD: Python worker exited unexpectedly (crashed) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /spark/python/pyspark/worker.py, line 75, in main command = pickleSer._read_with_length(infile) File /spark/python/pyspark/serializers.py, line 146, in _read_with_length length = read_int(stream) File /spark/python/pyspark/serializers.py, line 464, in read_int raise EOFError EOFError at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) On Tue, Dec 9, 2014 at 2:32 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: While trying simple examples of PySpark code, I systematically get these failures when I try this.. I dont see any prior exceptions in the output... How can I debug further to find root cause? es_rdd = sc.newAPIHadoopRDD( inputFormatClass=org.elasticsearch.hadoop.mr.EsInputFormat, keyClass=org.apache.hadoop.io.NullWritable, valueClass=org.elasticsearch.hadoop.mr.LinkedMapWritable, conf={ es.resource : en_2014/doc, es.nodes:rap-es2, es.query : {query:{match_all:{}},fields:[title], size: 100} } ) titles=es_rdd.map(lambda d: d[1]['title'][0]) counts = titles.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add) output = counts.collect() ... 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 93 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_93 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_93 of size 2448 dropped from memory (free 274984768) 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 93 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 92 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_92 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_92 of size 163391 dropped from memory (free 275148159) 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 92 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 91 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_91 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_91 of size 163391 dropped from memory (free 275311550) 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 91 14/12/09 19:27:30 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 72) java.lang.UnsupportedOperationException at java.util.AbstractMap.put(AbstractMap.java:203) at java.util.AbstractMap.putAll(AbstractMap.java:273) at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:373) at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:322) at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:299) at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:227) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at 
scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913) at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929) at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969) at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) 14/12/09 19:27:30 INFO
Re: reg JDBCRDD code
Try changing this line *JdbcRDD* rdd = *new* *JdbcRDD*(sc, getConnection, sql, 0, 0, 1, getResultset, ClassTag$.*MODULE$*.apply(String.*class*)); to *JdbcRDD* rdd = *new* *JdbcRDD*(sc, getConnection, sql, 0, 100, 1, getResultset, ClassTag$.*MODULE$*.apply(String.*class*)); Here: 0 - lower bound 100 - upper bound 1 - number of partitions i believe. Thanks Best Regards On Wed, Dec 10, 2014 at 12:45 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Thanks Akhil but it is expecting Function1 instead of Function .. I tried out writing a new class by implementing Function1 but got an error . can you please help us to get it resolved JDBCRDD is created as *JdbcRDD* rdd = *new* *JdbcRDD*(sc, getConnection, sql, 0, 0, 1, getResultset, ClassTag$.*MODULE$*.apply(String.*class*)); overridden 'apply' method in Function1 *public* String apply(ResultSet arg0) { String ss = *null*; *try* { ss = *(String) ((java.sql.ResultSet) arg0).getString(1)*; } *catch* (SQLException e) { // *TODO* Auto-generated catch block e.printStackTrace(); } System.*out*.println(ss); *return* ss; // *TODO* Auto-generated method stub } Error log Exception in thread main *org.apache.spark.SparkException*: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: *java.sql.SQLException*: Parameter index out of range (1 number of parameters, which is 0). com.mysql.jdbc.SQLError.createSQLException(*SQLError.java:1073*) com.mysql.jdbc.SQLError.createSQLException(*SQLError.java:987*) com.mysql.jdbc.SQLError.createSQLException(*SQLError.java:982*) com.mysql.jdbc.SQLError.createSQLException(*SQLError.java:927*) com.mysql.jdbc.PreparedStatement.checkBounds( *PreparedStatement.java:3709*) com.mysql.jdbc.PreparedStatement.setInternal( *PreparedStatement.java:3693*) com.mysql.jdbc.PreparedStatement.setInternal( *PreparedStatement.java:3735*) com.mysql.jdbc.PreparedStatement.setLong( *PreparedStatement.java:3751*) Thanks Deepa From:Akhil Das ak...@sigmoidanalytics.com To:Deepa Jayaveer deepa.jayav...@tcs.com Cc:user@spark.apache.org user@spark.apache.org Date:12/09/2014 09:30 PM Subject:Re: reg JDBCRDD code -- Hi Deepa, In Scala, You will do something like *https://gist.github.com/akhld/ccafb27f098163bea622* https://gist.github.com/akhld/ccafb27f098163bea622 With Java API's it will be something like *https://gist.github.com/akhld/0d9299aafc981553bc34* https://gist.github.com/akhld/0d9299aafc981553bc34 Thanks Best Regards On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer *deepa.jayav...@tcs.com* deepa.jayav...@tcs.com wrote: Hi All, am new to Spark. I tried to connect to Mysql using Spark. want to write a code in Java but getting runtime exception. I guess that the issue is with the function0 and function1 objects being passed in JDBCRDD . I tried my level best and attached the code, can you please help us to fix the issue. Thanks Deepa =-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. 
Thank you - To unsubscribe, e-mail: *user-unsubscr...@spark.apache.org* user-unsubscr...@spark.apache.org For additional commands, e-mail: *user-h...@spark.apache.org* user-h...@spark.apache.org
Re: reg JDBCRDD code
Hi Akhil, Getting the same error . I guess that the issue on Function1 implementation. is it enough if we override apply method in Function1 class? Thanks Deepa From: Akhil Das ak...@sigmoidanalytics.com To: Deepa Jayaveer deepa.jayav...@tcs.com Cc: user@spark.apache.org user@spark.apache.org Date: 12/10/2014 12:55 PM Subject:Re: reg JDBCRDD code Try changing this line JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1, getResultset, ClassTag$.MODULE$.apply(String.class)); to JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 100, 1, getResultset, ClassTag$.MODULE$.apply(String.class)); Here: 0 - lower bound 100 - upper bound 1 - number of partitions i believe. Thanks Best Regards On Wed, Dec 10, 2014 at 12:45 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Thanks Akhil but it is expecting Function1 instead of Function .. I tried out writing a new class by implementing Function1 but got an error . can you please help us to get it resolved JDBCRDD is created as JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1, getResultset, ClassTag$.MODULE$.apply(String.class)); overridden 'apply' method in Function1 public String apply(ResultSet arg0) { String ss = null; try { ss = (String) ((java.sql.ResultSet) arg0).getString(1); } catch (SQLException e) { // TODO Auto-generated catch block e.printStackTrace(); } System.out.println(ss); return ss; // TODO Auto-generated method stub } Error log Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: java.sql.SQLException: Parameter index out of range (1 number of parameters, which is 0). com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073) com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) com.mysql.jdbc.PreparedStatement.checkBounds( PreparedStatement.java:3709) com.mysql.jdbc.PreparedStatement.setInternal( PreparedStatement.java:3693) com.mysql.jdbc.PreparedStatement.setInternal( PreparedStatement.java:3735) com.mysql.jdbc.PreparedStatement.setLong( PreparedStatement.java:3751) Thanks Deepa From:Akhil Das ak...@sigmoidanalytics.com To:Deepa Jayaveer deepa.jayav...@tcs.com Cc:user@spark.apache.org user@spark.apache.org Date:12/09/2014 09:30 PM Subject:Re: reg JDBCRDD code Hi Deepa, In Scala, You will do something like https://gist.github.com/akhld/ccafb27f098163bea622 With Java API's it will be something like https://gist.github.com/akhld/0d9299aafc981553bc34 Thanks Best Regards On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Hi All, am new to Spark. I tried to connect to Mysql using Spark. want to write a code in Java but getting runtime exception. I guess that the issue is with the function0 and function1 objects being passed in JDBCRDD . I tried my level best and attached the code, can you please help us to fix the issue. Thanks Deepa =-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. 
If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
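A detail the original "Parameter index out of range" trace points at, offered as a diagnosis rather than something confirmed in the thread: JdbcRDD always binds each partition's lower and upper bound into the query with setLong, so the SQL string must contain two ? placeholders; a query without them fails with exactly that MySQL error regardless of how the Function0/Function1 wrappers are written. A Scala sketch of the call shape, with hypothetical table and connection details and assuming an existing SparkContext sc:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// The two '?' marks are mandatory: JdbcRDD fills them with each partition's bounds.
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password"),
  "SELECT name FROM people WHERE id >= ? AND id <= ?",
  1L, 100L, 2,                                  // lowerBound, upperBound, numPartitions
  (rs: java.sql.ResultSet) => rs.getString(1))

From Java the same requirement applies; the wrappers only change how the connection factory and the row mapper are passed in.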
Re: PhysicalRDD problem?
I see that somebody has already raised a PR for this, but it hasn't been merged: https://issues.apache.org/jira/browse/SPARK-4339 Can we merge this into the next 1.2 RC? Thanks -Nitin

On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal nitin2go...@gmail.com wrote: Hi Michael, I think I have found the exact problem in my case. I see that we have something like the following in Analyzer.scala:

// TODO: pass this in as a parameter.
val fixedPoint = FixedPoint(100)

and

Batch("Resolution", fixedPoint,
  ResolveReferences ::
  ResolveRelations ::
  ResolveSortReferences ::
  NewRelationInstances ::
  ImplicitGenerate ::
  StarExpansion ::
  ResolveFunctions ::
  GlobalAggregates ::
  UnresolvedHavingClauseAttributes ::
  TrimGroupingAliases ::
  typeCoercionRules ++
  extendedRules : _*),

Perhaps in my case it reaches the 100 iterations, breaks out of the while loop in RuleExecutor.scala, and thus doesn't resolve all the attributes. Exception in my logs:

14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached for batch Resolution
14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in context with path [] threw exception [Servlet execution threw an exception] with root cause org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS DOWN_BYTESHTTPSUBCR#6567, tree: 'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS DOWN_BYTESHTTPSUBCR#6567]
...
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86)
at org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67)
at org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85)
at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50)
at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490)

I think the solution here is to make the FixedPoint constructor argument configurable/parameterized (as the existing TODO already suggests). Do we have a plan to do this in the 1.2 release?
Or I can take this up as a task myself if you want (since this is very crucial for our release). Thanks -Nitin

On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich...@databricks.com wrote:

val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema)

This line is throwing away the logical information about existingSchemaRDD, and thus Spark SQL can't know how to push down projections or predicates past this operator. Can you describe in more detail the problems you see if you don't do this reapplication of the schema?

-- Regards Nitin Goyal
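For reference, a minimal Scala sketch of the pattern Michael is describing (the table and column names are made up):

// `people` carries a full logical plan that the optimizer can see through.
val people = sqlContext.sql("SELECT name, age FROM people")

// Re-applying its own schema treats `people` as an opaque RDD[Row]: the result
// is a fresh leaf node, so projections and predicates can no longer be pushed
// below this point.
val reapplied = sqlContext.applySchema(people, people.schema)

// Using the SchemaRDD directly keeps the logical plan intact, so pushdown
// still works for later queries.
people.registerTempTable("people_projected")
val names = sqlContext.sql("SELECT name FROM people_projected WHERE age > 21")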
Re: PySpark and UnsupportedOperationException
On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: While trying simple examples of PySpark code, I systematically get these failures when I try this. I don't see any prior exceptions in the output. How can I debug further to find the root cause?

from operator import add

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "en_2014/doc",
        "es.nodes": "rap-es2",
        "es.query": '{"query":{"match_all":{}},"fields":["title"], "size": 100}'
    })
titles = es_rdd.map(lambda d: d[1]['title'][0])
counts = titles.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
output = counts.collect()

...
14/12/09 19:27:20 INFO BlockManager: Removing broadcast 93
14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_93
14/12/09 19:27:20 INFO MemoryStore: Block broadcast_93 of size 2448 dropped from memory (free 274984768)
14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 93
14/12/09 19:27:20 INFO BlockManager: Removing broadcast 92
14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_92
14/12/09 19:27:20 INFO MemoryStore: Block broadcast_92 of size 163391 dropped from memory (free 275148159)
14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 92
14/12/09 19:27:20 INFO BlockManager: Removing broadcast 91
14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_91
14/12/09 19:27:20 INFO MemoryStore: Block broadcast_91 of size 163391 dropped from memory (free 275311550)
14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 91
14/12/09 19:27:30 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 72)
java.lang.UnsupportedOperationException
at java.util.AbstractMap.put(AbstractMap.java:203)
at java.util.AbstractMap.putAll(AbstractMap.java:273)
at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:373)

It looks like it's a bug in ElasticSearch (EsInputFormat).

at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:322)
at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:299)
at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:227)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)

This means that the task failed when it read the data in EsInputFormat to feed the Python mapper.
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/12/09 19:27:30 INFO TaskSetManager: Starting task 2.0 in stage 67.0 (TID 74, localhost, ANY, 26266 bytes)
14/12/09 19:27:30 INFO Executor: Running task 2.0 in stage 67.0 (TID 74)
14/12/09 19:27:30 WARN TaskSetManager: Lost task 0.0 in stage 67.0 (TID 72, localhost): java.lang.UnsupportedOperationException
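Since the stack trace points into EsInputFormat on the JVM side rather than into the Python worker, one way to narrow the root cause down is to read the same index through the Scala API and see whether the same UnsupportedOperationException shows up. This is only a sketch: the index, node, query, and key/value classes are copied from the snippet above (not independently verified), and it assumes the elasticsearch-hadoop jar is on the driver and executor classpath.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.hadoop.mr.{EsInputFormat, LinkedMapWritable}

object EsReadCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EsReadCheck").setMaster("local[2]"))

    // Same elasticsearch-hadoop settings as in the PySpark snippet.
    val conf = new Configuration()
    conf.set("es.resource", "en_2014/doc")
    conf.set("es.nodes", "rap-es2")
    conf.set("es.query", """{"query":{"match_all":{}},"fields":["title"],"size":100}""")

    val esRdd = sc.newAPIHadoopRDD(
      conf,
      classOf[EsInputFormat[NullWritable, LinkedMapWritable]],
      classOf[NullWritable],
      classOf[LinkedMapWritable])

    // If this also fails with UnsupportedOperationException, the problem is in
    // elasticsearch-hadoop itself and not in the PySpark layer.
    println(esRdd.count())
    sc.stop()
  }
}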
Re: Spark SQL Stackoverflow error
Hi, unfortunately I am also getting the same error. Has anybody solved it?

Exception in thread "main" java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

The exact line where the exception occurs:

sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)

The complete code is from GitHub: https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

import com.google.gson.{GsonBuilder, JsonParser}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Examines the collected tweets and trains a model based on them.
 */
object ExamineAndTrain {
  val jsonParser = new JsonParser()
  val gson = new GsonBuilder().setPrettyPrinting().create()

  def main(args: Array[String]) {
    // Process program arguments and set properties
    /*
    if (args.length < 3) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        " <tweetInput> <outputModelDir> <numClusters> <numIterations>")
      System.exit(1)
    }
    */
    val outputModelDir = "C:\\MLModel"
    val tweetInput = "C:\\MLInput"
    val numClusters = 10
    val numIterations = 20
    //val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters), Utils.IntParam(numIterations)) = args

    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Pretty print some of the tweets.
    val tweets = sc.textFile(tweetInput)
    println("Sample JSON Tweets---")
    for (tweet <- tweets.take(5)) {
      println(gson.toJson(jsonParser.parse(tweet)))
    }

    val tweetTable = sqlContext.jsonFile(tweetInput).cache()
    tweetTable.registerTempTable("tweetTable")

    println("--Tweet table Schema---")
    tweetTable.printSchema()

    println("Sample Tweet Text-")
    sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)

    println("--Sample Lang, Name, text---")
    sqlContext.sql("SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000").collect().foreach(println)

    println("--Total count by languages Lang, count(*)---")
    sqlContext.sql("SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25").collect.foreach(println)

    println("--- Training the model and persist it")
    val texts = sqlContext.sql("SELECT text from tweetTable").map(_.head.toString)
    // Cache the vectors RDD since it will be used for all the KMeans iterations.
    val vectors = texts.map(Utils.featurize).cache()
    vectors.count() // Calls an action on the RDD to populate the vectors cache.
    val model = KMeans.train(vectors, numClusters, numIterations)
    sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir)

    val some_tweets = texts.take(100)
    println("Example tweets from the clusters")
    for (i <- 0 until numClusters) {
      println(s"\nCLUSTER $i:")
      some_tweets.foreach { t =>
        if (model.predict(Utils.featurize(t)) == i) {
          println(t)
        }
      }
    }
  }
}
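The frames above are deep recursion inside Scala's parser combinators, meaning the default JVM thread stack is exhausted while the query is parsed. This is not from the thread above and it may only mask a deeper problem, but a commonly suggested workaround for such StackOverflowErrors is to give the driver a larger stack, either via spark-submit (for example --driver-java-options "-Xss4m") or, when calling main() directly as in the code above, by running the job on a thread created with an explicit stack size. A rough sketch:

// Hypothetical wrapper around the ExamineAndTrain object quoted above;
// the 64 MB stack size is an arbitrary example value.
object RunWithLargerStack {
  def main(args: Array[String]): Unit = {
    val worker = new Thread(null, new Runnable {
      override def run(): Unit = ExamineAndTrain.main(args)
    }, "driver-with-larger-stack", 64L * 1024 * 1024)
    worker.start()
    worker.join()
  }
}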