Re: Software stack for recommendation engine with Spark MLlib
Thanks Sean, your suggestions and the links provided are just what I needed to start off with. On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote: I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain tradeoffs. You can precompute offline easily, and serve results at large scale easily, but, you are forced to precompute everything -- lots of wasted effort, not completely up to date. The front-end part of the stack looks right. Spark would do the model building; you'd have to write a process to score recommendations and store the result. Mahout is the same thing, really. 500K items isn't all that large. Your requirements aren't driven just by items though. Number of users and latent features matter too. It matters how often you want to build the model too. I'm guessing you would get away with a handful of modern machines for a problem this size. In a way what you describe reminds me of Wibidata, since it built recommender-like solutions on top of data and results published to a NoSQL store. You might glance at the related OSS project Kiji (http://kiji.org/) for ideas about how to manage the schema. You should have a look at things like Nick's architecture for Graphflow, however it's more concerned with computing recommendation on the fly, and describes a shift from an architecture originally built around something like a NoSQL store: http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf This is also the kind of ground the oryx project is intended to cover, something I've worked on personally: https://github.com/OryxProject/oryx -- a layer on and around the core model building in Spark + Spark Streaming to provide a whole recommender (for example), down to the REST API. On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can anyone who has developed recommendation engine suggest what could be the possible software stack for such an application. I am basically new to recommendation engine , I just found out Mahout and Spark Mlib which are available . I am thinking the below software stack. 1. The user is going to use Android app. 2. Rest Api sent to app server from the android app to get recommendations. 3. Spark Mlib core engine for recommendation engine 4. MongoDB database backend. I would like to know more on the cluster configuration( how many nodes etc) part of spark for calculating the recommendations for 500,000 items. This items include products for day care etc. Other software stack suggestions would also be very useful.It has to run on multiple vendor machines. Please suggest. Thanks shashi
Re: [Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class
Currently there’s no convenient way to convert a SchemaRDD/JavaSchemaRDD back to an RDD/JavaRDD of some case class. But you can convert a SchemaRDD into an RDD[Row] using schemaRdd.rdd, and a JavaSchemaRDD into a JavaRDD[Row] using new JavaRDD[Row](schemaRdd.rdd). Cheng On 3/15/15 10:22 PM, Renato Marroquín Mogrovejo wrote: Hi Spark experts, Is there a way to convert a JavaSchemaRDD (for instance loaded from a Parquet file) back to a JavaRDD of a given case class? I read on Stack Overflow [1] that I could do a select over the Parquet file and then get the fields out by reflection, but I guess that would be overkill. Then I saw [2] from 2014, which says that this feature would be available in the future. So could you please let me know how I can accomplish this? Thanks in advance! Renato M. [1] http://stackoverflow.com/questions/26181353/how-to-convert-spark-schemardd-into-rdd-of-my-case-class [2] http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html
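For reference, a minimal sketch of the conversion Cheng describes, written against the Scala API. The Parquet path, the Person case class, and the column order are illustrative assumptions, and sc is assumed to be an existing SparkContext (e.g. the spark-shell's):

```scala
// Hedged sketch: map the Row results of a Parquet read back into a case class by position.
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
val schemaRdd = sqlContext.parquetFile("hdfs:///data/people.parquet")

// Each element is a Row; rebuild the case class from its columns.
val people = schemaRdd.map(row => Person(row.getString(0), row.getInt(1)))
```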
Re: Spark Streaming on Yarn Input from Flume
Have you fixed this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-Yarn-Input-from-Flume-tp11755p22055.html
Re: 1.3 release
I think (I hope) it's because the generic builds just work. Even though these are of course distributed mostly verbatim in CDH5, with tweaks to be compatible with other stuff at the edges, the stock builds should be fine too. Same for HDP as I understand it. The CDH4 build may work on some builds of CDH4, but I think it is lurking there as a Hadoop 2.0.x plus a certain YARN beta build. I'd prefer to rename it that way, myself, since it doesn't actually work with all of CDH4 anyway. Are the MapR builds there because the stock Hadoop build doesn't work on MapR? That would actually surprise me, but then, why are these two builds distributed? On Sun, Mar 15, 2015 at 6:22 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Is there a reason why the prebuilt releases don't include current CDH distros and YARN support? Eric Friedman
[Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class
Hi Spark experts, Is there a way to convert a JavaSchemaRDD (for instance loaded from a Parquet file) back to a JavaRDD of a given case class? I read on Stack Overflow [1] that I could do a select over the Parquet file and then get the fields out by reflection, but I guess that would be overkill. Then I saw [2] from 2014, which says that this feature would be available in the future. So could you please let me know how I can accomplish this? Thanks in advance! Renato M. [1] http://stackoverflow.com/questions/26181353/how-to-convert-spark-schemardd-into-rdd-of-my-case-class [2] http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html
Re: Spark Release 1.3.0 DataFrame API
Thank you for your help. toDF() solved my first problem. And the second issue was a non-issue, since the second example worked without any modification. David On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav ri...@infoobjects.com wrote: Programmatically specifying the schema needs import org.apache.spark.sql.types._ for StructType and StructField to resolve. On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote: Yes, I think this was already just fixed by https://github.com/apache/spark/pull/4977 -- a .toDF() is missing. On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath nick.pentre...@gmail.com wrote: I've found people.toDF gives you a DataFrame (roughly equivalent to the previous Row RDD), and you can then call registerTempTable on that DataFrame. So people.toDF.registerTempTable("people") should work. — Sent from Mailbox On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell jdavidmitch...@gmail.com wrote: I am pleased with the release of the DataFrame API. However, I started playing with it, and neither of the two main examples in the documentation works: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html Specifically: Inferring the Schema Using Reflection, and Programmatically Specifying the Schema. Scala 2.11.6, Spark 1.3.0 prebuilt for Hadoop 2.4 and later. Inferring the Schema Using Reflection:
scala> people.registerTempTable("people")
<console>:31: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person]
       people.registerTempTable("people")
              ^
Programmatically Specifying the Schema:
scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
<console>:41: error: overloaded method value createDataFrame with alternatives:
  (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
  (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame and
  (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], columns: java.util.List[String])org.apache.spark.sql.DataFrame and
  (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame and
  (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
       val df = sqlContext.createDataFrame(people, schema)
Any help would be appreciated. David
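For reference, a minimal sketch of the .toDF() fix discussed in this thread, following the Spark 1.3 SQL programming guide; the people.txt path and Person fields come from that guide's example and sc is assumed to be the shell's SparkContext:

```scala
// Hedged sketch of the missing .toDF() call (Spark 1.3).
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // provides the rdd.toDF() conversion

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// An RDD[Person] has no registerTempTable; convert it to a DataFrame first.
people.toDF().registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.collect().foreach(println)
```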
Saving DStream into a single file
I am running the word count example on a Flume stream and trying to save the output as text files in HDFS, but in the save directory I get multiple subdirectories, each containing many small files. I wonder if there is a way to append to one large file instead of saving many small ones, as I intend to write the output into a Hive HDFS directory so I can query the result using Hive. I hope someone has a workaround for this issue. Thanks in advance -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Saving-Dstream-into-a-single-file-tp22058.html
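One common workaround (not from this thread, so treat it as a hedged sketch) is to write each micro-batch yourself with foreachRDD and coalesce it down to a single part file per interval; the paths and the 30-second interval below are placeholder assumptions:

```scala
// Hedged sketch: one output file per batch interval; paths and interval are placeholders.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(30))
val lines = ssc.textFileStream("hdfs:///incoming")   // stand-in for the Flume stream

val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

counts.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // coalesce(1) forces a single part file per batch; compact or merge later if needed.
    rdd.coalesce(1)
      .map { case (word, count) => s"$word\t$count" }
      .saveAsTextFile(s"hdfs:///user/hive/warehouse/wordcounts/batch-${time.milliseconds}")
  }
}

ssc.start()
ssc.awaitTermination()
```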
Re: Software stack for recommendation engine with Spark MLlib
I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain tradeoffs. You can precompute offline easily, and serve results at large scale easily, but, you are forced to precompute everything -- lots of wasted effort, not completely up to date. The front-end part of the stack looks right. Spark would do the model building; you'd have to write a process to score recommendations and store the result. Mahout is the same thing, really. 500K items isn't all that large. Your requirements aren't driven just by items though. Number of users and latent features matter too. It matters how often you want to build the model too. I'm guessing you would get away with a handful of modern machines for a problem this size. In a way what you describe reminds me of Wibidata, since it built recommender-like solutions on top of data and results published to a NoSQL store. You might glance at the related OSS project Kiji (http://kiji.org/) for ideas about how to manage the schema. You should have a look at things like Nick's architecture for Graphflow, however it's more concerned with computing recommendation on the fly, and describes a shift from an architecture originally built around something like a NoSQL store: http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf This is also the kind of ground the oryx project is intended to cover, something I've worked on personally: https://github.com/OryxProject/oryx -- a layer on and around the core model building in Spark + Spark Streaming to provide a whole recommender (for example), down to the REST API. On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can anyone who has developed recommendation engine suggest what could be the possible software stack for such an application. I am basically new to recommendation engine , I just found out Mahout and Spark Mlib which are available . I am thinking the below software stack. 1. The user is going to use Android app. 2. Rest Api sent to app server from the android app to get recommendations. 3. Spark Mlib core engine for recommendation engine 4. MongoDB database backend. I would like to know more on the cluster configuration( how many nodes etc) part of spark for calculating the recommendations for 500,000 items. This items include products for day care etc. Other software stack suggestions would also be very useful.It has to run on multiple vendor machines. Please suggest. Thanks shashi - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
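As a concrete starting point for the model-building piece Sean describes, here is a minimal, hedged sketch of collaborative filtering with MLlib's ALS; the ratings path, the rank/iterations/lambda values, and user id 42 are illustrative assumptions, not from the thread:

```scala
// Hedged sketch: train an ALS model and score top-N recommendations for one user.
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(",")
  Rating(user.toInt, item.toInt, rating.toDouble)
}

val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

// Top 10 items for user 42; these could be written to MongoDB or computed on demand behind the REST API.
model.recommendProducts(42, 10).foreach(println)
```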
Re: Need Advice about reading lots of text files
Ah most interesting—thanks. So it seems sc.textFile(longFileList) has to read all metadata before starting the read, for partitioning purposes, so what you do is not use it? You create a task per file that reads one file (in parallel) per task without scanning for _all_ metadata. Can't argue with the logic, but perhaps Spark should incorporate something like this in sc.textFile? My case can't be that unusual, especially since I am periodically processing micro-batches from Spark Streaming. Actually I have to scan HDFS to create the longFileList to begin with, so I get file status and therefore probably all the metadata needed by sc.textFile. Your method would save one scan, which is good. Might a better sc.textFile take a beginning URI, a file pattern regex, and a recursive flag? Then one scan could create all metadata automatically for a large subset of people using the function, something like sc.textFile(beginDir: String, filePattern: String = “^part.*”, recursive: Boolean = false). In fact it should be easy to create a BetterSC that overrides the textFile method with a re-implementation that only requires one scan to get metadata. Just thinking on email… On Mar 14, 2015, at 11:11 AM, Michael Armbrust mich...@databricks.com wrote: Here is how I have dealt with many small text files (on S3, though this should generalize) in the past: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E From Michael Armbrust mich...@databricks.com Subject Re: S3NativeFileSystem inefficient implementation when calling sc.textFile Date Thu, 27 Nov 2014 03:20:14 GMT In the past I have worked around this problem by avoiding sc.textFile(). Instead I read the data directly inside of a Spark job. Basically, you start with an RDD where each entry is a file in S3 and then flatMap that with something that reads the files and returns the lines. Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe Using this class you can do something like: sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... :: Nil).flatMap(new ReadLinesSafe(_)) You can also build up the list of files by running a Spark job: https://gist.github.com/marmbrus/15e72f7bc22337cf6653 Michael On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel p...@occamsmachete.com wrote: It's a long story, but there are many dirs with smallish part- files in them, so we create a list of the individual files as input to sparkContext.textFile(fileList). I suppose we could move them and rename them to be contiguous part- files in one dir. Would that be better than passing in a long list of individual filenames? We could also make the part files much larger by collecting the smaller ones. But would any of this make a difference in IO speed? I ask because using the long file list seems to read what amounts to a not very large data set rather slowly. If it were all in large part files in one dir I'd expect it to go much faster, but this is just intuition. On Mar 14, 2015, at 9:58 AM, Koert Kuipers ko...@tresata.com wrote: why can you not put them in a directory and read them as one input?
you will get a task per file, but spark is very fast at executing many tasks (it's not a jvm per task). On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel p...@occamsmachete.com wrote: Any advice on dealing with a large number of separate input files? On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote: We have many text files that we need to read in parallel. We can create a comma-delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large (maybe 1) and is all on hdfs. The question is: what is the most performant way to read them? Should they be broken up and read in groups, appending the resulting RDDs, or should we just pass in the entire list at once? In effect I'm asking whether Spark does some optimization or whether we should do it explicitly. If the latter, what rule might we use depending on our cluster setup?
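A minimal, hedged sketch of the approach described in this thread: parallelize the file list and read each file inside a task, so the driver never scans metadata for every file up front. The paths are placeholders, and each file is assumed small enough to hold in memory per task:

```scala
// Hedged sketch: one task per file, each task reads its file directly from HDFS.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val fileList = Seq("hdfs:///data/dir1/part-00000", "hdfs:///data/dir2/part-00000")

val lines = sc.parallelize(fileList, fileList.size).flatMap { pathStr =>
  val path = new Path(pathStr)
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  try Source.fromInputStream(in).getLines().toList
  finally in.close()
}

println(lines.count())
```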
Re: deploying Spark on standalone cluster
I was having a similar issue, though it was with Spark and Flume integration: I was getting a 'failed to bind' error, but got it fixed by shutting down the firewall on both machines (make sure that 'service iptables status' reports the firewall as stopped). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/deploying-Spark-on-standalone-cluster-tp22049p22057.html
1.3 release
Is there a reason why the prebuilt releases don't include current CDH distros and YARN support? Eric Friedman
Null Pointer Exception due to mapVertices function in GraphX
I get a NullPointerException in aggregateMessages on a graph which is the output of a mapVertices call. The problem seems to be that mapVertices did not affect all the triplets of the graph.

// Initialize the graph: assign each vertex a counter that contains the vertex id only
var anfGraph = graph.mapVertices { case (vid, _) =>
  val counter = new HyperLogLog(5)
  counter.offer(vid)
  counter
}

val nullVertex = anfGraph.triplets.filter(edge => edge.srcAttr == null).first  // there is an edge whose src attr is null
anfGraph.vertices.filter(_._1 == nullVertex).first  // yet I can see that the vertex has a non-null attribute

messages = anfGraph.aggregateMessages(msgFun, mergeMessage)  // -> NullPointerException

My Spark version: 1.2.0
Re: LogisticRegressionWithLBFGS shows ERRORs
In LBFGS version of logistic regression, the data is properly standardized, so this should not happen. Can you provide a copy of your dataset to us so we can test it? If the dataset can not be public, can you have just send me a copy so I can dig into this? I'm the author of LORWithLBFGS. Thanks. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Fri, Mar 13, 2015 at 2:41 PM, cjwang c...@cjwang.us wrote: I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.5 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.25 2015-03-12 17:38:04,036 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.125 2015-03-12 17:38:04,105 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.0625 2015-03-12 17:38:04,176 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.03125 2015-03-12 17:38:04,247 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.015625 2015-03-12 17:38:04,317 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.0078125 2015-03-12 17:38:04,461 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.005859375 2015-03-12 17:38:04,605 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:04,672 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:04,747 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:04,818 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:04,890 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:04,962 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:05,038 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:05,107 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:05,186 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:05,256 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN 2015-03-12 17:38:05,257 ERROR breeze.optimize.LBFGS | Failure! Resetting history: breeze.optimize.FirstOrderException: Line search zoom failed What causes them and how do I fix them? I checked my data and there seemed nothing out of the ordinary. The resulting prediction model seemed acceptable to me. So, are these ERRORs actually WARNINGs? Could we or should we tune the level of these messages down one notch? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LogisticRegressionWithLBFGS-shows-ERRORs-tp22042.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
Re: order preservation with RDDs
Yes I don't think this is entirely reliable in general. I would emit (label,features) pairs and then transform the values. In practice, this may happen to work fine in simple cases. On Sun, Mar 15, 2015 at 3:51 AM, kian.ho hui.kian.ho+sp...@gmail.com wrote: Hi, I was taking a look through the mllib examples in the official spark documentation and came across the following: http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2 specifically the lines: label = data.map(lambda x: x.label) features = data.map(lambda x: x.features) ... ... data1 = label.zip(scaler1.transform(features)) my question: wouldn't it be possible that some labels in the pairs returned by the label.zip(..) operation are not paired with their original features? i.e. are the original orderings of `labels` and `features` preserved after the scaler1.transform(..) and label.zip(..) operations? This issue was also mentioned in http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html I would greatly appreciate some clarification on this, as I've run into this issue whilst experimenting with feature extraction for text classification, where (correct me if I'm wrong) there is no built-in mechanism to keep track of document-ids through the HashingTF and IDF fitting and transformations. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
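A minimal sketch of the safer pattern Sean suggests, keeping each label attached to its own feature vector instead of zipping two separately derived RDDs; it uses the stock sample_libsvm_data.txt path from the MLlib docs and leaves withMean off so sparse vectors pass through unchanged:

```scala
// Hedged sketch: transform the values while keeping (label, features) together.
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))

// Each label stays paired with its own scaled feature vector; no zip is needed.
val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
```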
Software stack for recommendation engine with Spark MLlib
Hi, Can anyone who has developed a recommendation engine suggest a possible software stack for such an application? I am new to recommendation engines; I just found out that Mahout and Spark MLlib are available. I am thinking of the software stack below. 1. The user is going to use an Android app. 2. A REST API call is sent from the Android app to the app server to get recommendations. 3. Spark MLlib as the core recommendation engine. 4. MongoDB as the database backend. I would like to know more about the cluster configuration (how many nodes, etc.) of Spark for calculating the recommendations for 500,000 items. These items include products for day care, etc. Other software stack suggestions would also be very useful. It has to run on multiple vendor machines. Please suggest. Thanks shashi
Re: Explanation on the Hive in the Spark assembly
Spark SQL supports most commonly used features of HiveQL. However, different HiveQL statements are executed in different manners: 1. DDL statements (e.g. CREATE TABLE, DROP TABLE, etc.) and commands (e.g. SET key = value, ADD FILE, ADD JAR, etc.) In most cases, Spark SQL simply delegates these statements to Hive, as they don't need to issue any distributed jobs and don't rely on the computation engine (Spark, MR, or Tez). 2. SELECT queries, CREATE TABLE ... AS SELECT ... statements, and insertions These statements are executed using Spark as the execution engine. The Hive classes packaged in the assembly jar are used to provide entry points to Hive features, for example: 1. the HiveQL parser, 2. talking to the Hive metastore to execute DDL statements, 3. accessing UDFs/UDAFs/UDTFs. As for the differences between Hive on Spark and Spark SQL's Hive support, please refer to this article by Reynold: https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html Cheng On 3/14/15 10:53 AM, bit1...@163.com wrote: Thanks Daoyuan. What do you mean by running some native command? I never thought that Hive would run without a computing engine like Hadoop MR or Spark. Thanks. bit1...@163.com From: Wang, Daoyuan Date: 2015-03-13 16:39 To: bit1...@163.com; user Subject: RE: Explanation on the Hive in the Spark assembly Hi bit1129, 1. Hive in the Spark assembly removed most dependencies of Hive on Hadoop to avoid conflicts. 2. This Hive is used to run some native commands, which do not rely on Spark or MapReduce. Thanks, Daoyuan From: bit1...@163.com Sent: Friday, March 13, 2015 4:24 PM To: user Subject: Explanation on the Hive in the Spark assembly Hi, sparkers, I am kind of confused about Hive in the Spark assembly. I think Hive in the Spark assembly is not the same thing as Hive on Spark (Hive on Spark is meant to run Hive using the Spark execution engine). So, my questions are: 1. What is the difference between Hive in the Spark assembly and Hive on Hadoop? 2. Does Hive in the Spark assembly use the Spark execution engine or the Hadoop MR engine? Thanks. bit1...@163.com
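A small sketch illustrating the split Cheng describes, adapted from the standard Hive-support example in the Spark docs: the DDL and LOAD statements go through the bundled Hive code paths, while the SELECT is planned and run by Spark. The kv1.txt path is the stock example file:

```scala
// Hedged sketch: DDL delegated to Hive, query executed by Spark.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// DDL and data loading: handled by the Hive classes packaged in the assembly (no Spark job).
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Query: parsed with the HiveQL parser but executed as Spark jobs.
hiveContext.sql("SELECT key, count(*) FROM src GROUP BY key").collect().foreach(println)
```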
Benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs Spark SQL
Hi I would like to share with you my comments on Hortonworks' benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs 'Spark SQL'. Please check them in my related blog entry at http://goo.gl/K5mk0U Thanks Slim Baltagi Chicago, IL http://www.SparkBigData.com -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Benchmarks-of-Hive-on-Tez-vs-Hive-on-Spark-vs-Spark-SQL-tp22060.html
Re: Problem connecting to HBase
Hello all, Thank you for your responses. I did try to include the zookeeper.znode.parent property in the hbase-site.xml. It still continues to give the same error. I am using Spark 1.2.0 and hbase 0.98.9. Could you please suggest what else could be done? On Fri, Mar 13, 2015 at 10:25 PM, Ted Yu yuzhih...@gmail.com wrote: In HBaseTest.scala: val conf = HBaseConfiguration.create() You can add some log (for zookeeper.znode.parent, e.g.) to see if the values from hbase-site.xml are picked up correctly. Please use pastebin next time you want to post errors. Which Spark release are you using ? I assume it contains SPARK-1297 Cheers On Fri, Mar 13, 2015 at 7:47 PM, HARIPRIYA AYYALASOMAYAJULA aharipriy...@gmail.com wrote: Hello, I am running a HBase test case. I am using the example from the following: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala I created a very small HBase table with 5 rows and 2 columns. I have attached a screenshot of the error log. I believe it is a problem where the driver program is unable to establish connection to the hbase. The following is my simple.sbt: name := Simple Project version := 1.0 scalaVersion := 2.10.4 libraryDependencies ++= Seq( org.apache.spark %% spark-core % 1.2.0, org.apache.hbase % hbase % 0.98.9-hadoop2 % provided, org.apache.hbase % hbase-client % 0.98.9-hadoop2 % provided, org.apache.hbase % hbase-server % 0.98.9-hadoop2 % provided, org.apache.hbase % hbase-common % 0.98.9-hadoop2 % provided ) I am using a 23 node cluster, did copy hbase-site.xml into /spark/conf folder and set spark.executor.extraClassPath pointing to the /hbase/ folder in the spark-defaults.conf Also, while submitting the spark job I am including the required jars : spark-submit --class HBaseTest --master yarn-cluster --driver-class-path /opt/hbase/0.98.9/lib/hbase-server-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-protocol-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-hadoop2-compat-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-client-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-common-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/htrace-core-2.04.jar /home/priya/usingHBase/Spark/target/scala-2.10/simple-project_2.10-1.0.jar /Priya/sparkhbase-test1 It would be great if you could point where I am going wrong, and what could be done to correct it. Thank you for your time. -- Regards, Haripriya Ayyalasomayajula Graduate Student Department of Computer Science University of Houston Contact : 650-796-7112 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Regards, Haripriya Ayyalasomayajula Graduate Student Department of Computer Science University of Houston Contact : 650-796-7112
Re: Running spark function on parquet without sql
That's an unfortunate documentation bug in the programming guide... We failed to update it after making the change. Cheng On 2/28/15 8:13 AM, Deborah Siegel wrote: Hi Michael, Would you help me understand the apparent difference here? The Spark 1.2.1 programming guide indicates: Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will not be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case. Yet the API doc (https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html) shows: def cache(): SchemaRDD.this.type -- Overridden cache function will always use the in-memory columnar caching. Links: https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD Thanks Sincerely Deb On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust mich...@databricks.com wrote: From Zhan Zhang's reply, yes I still get the parquet's advantage. You will need to at least use SQL or the DataFrame API (coming in Spark 1.3) to specify the columns that you want in order to get the Parquet benefits. The rest of your operations can be standard Spark. My next question is, if I operate on a SchemaRDD, will I get the advantage of Spark SQL's in-memory columnar store when caching the table using cacheTable()? Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and .cache() since Spark 1.2+
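A minimal sketch of Michael's point, written against the Spark 1.3 DataFrame API as an assumption: select only the needed columns so Parquet column pruning applies, then drop back to a plain RDD for arbitrary Spark code. The path, column names, and column types are illustrative:

```scala
// Hedged sketch: prune columns through the DataFrame API, then continue with normal RDD code.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.parquetFile("hdfs:///data/events.parquet")

val pruned = df.select("userId", "timestamp")   // only these columns are read from Parquet

// From here on, ordinary RDD operations; each element is a Row.
val eventsPerUser = pruned.rdd.map(row => (row.getLong(0), 1L)).reduceByKey(_ + _)
```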
Re: Issue with yarn cluster - hangs in accepted state.
Thanks, It worked. -Abhi On Tue, Mar 3, 2015 at 5:15 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote: Do you have enough resource in your cluster? You can check your resource manager to see the usage. Yep, I can confirm that this is a very annoying issue. If there is not enough memory or VCPUs available, your app will just stay in ACCEPTED state until resources are available. You can have a look at https://github.com/jubatus/jubaql-docker/blob/master/hadoop/yarn-site.xml#L35 to see some settings that might help. Tobias
Re: Writing wide parquet file in Spark SQL
This article by Ryan Blue should be helpful for understanding the problem: http://ingest.tips/2015/01/31/parquet-row-group-size/ The TL;DR is, you may decrease parquet.block.size to reduce memory consumption. Anyway, 100K columns is a really big burden for Parquet, but I guess your data should be pretty sparse. Cheng On 3/11/15 4:13 AM, kpeng1 wrote: Hi All, I am currently trying to write a very wide file into Parquet using Spark SQL. I have records with 100K columns that I am trying to write out, but of course I am running into space issues (out of memory - heap space). I was wondering if there are any tweaks or workarounds for this. I am basically calling saveAsParquetFile on the SchemaRDD. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-wide-parquet-file-in-Spark-SQL-tp21995.html
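A hedged sketch of the tweak mentioned above: lower parquet.block.size (in bytes) before writing, so less data is buffered per column writer. The 16 MB value, the toy wideData DataFrame, and the output path are illustrative assumptions; whether the row-group setting is picked up this way depends on how the write job reads the Hadoop configuration:

```scala
// Hedged sketch (Spark 1.3 API): shrink the Parquet row group size before writing a wide table.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Stand-in for the real 100K-column data set.
val wideData = sc.parallelize(1 to 1000).map(i => (i, i.toString)).toDF("id", "value")

// parquet.block.size is in bytes; smaller row groups mean less data buffered in memory per writer.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)

wideData.saveAsParquetFile("hdfs:///out/wide_table.parquet")
```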
Re: Software stack for recommendation engine with Spark MLlib
Thanks Nick, for your suggestions. On Sun, Mar 15, 2015 at 10:41 PM, Nick Pentreath nick.pentre...@gmail.com wrote: As Sean says, precomputing recommendations is pretty inefficient. Though with 500k items its easy to get all the item vectors in memory so pre-computing is not too bad. Still, since you plan to serve these via a REST service anyway, computing on demand via a serving layer such as Oryx or PredictionIO (or the newly open sourced Seldon.io) is a good option. You can also cache the recommendations quite aggressively - once you compute a user or item top-K list, just stick the result in mem cache / redis / whatever and evict it when you recompute your offline model, or every hour or whatever. — Sent from Mailbox https://www.dropbox.com/mailbox On Sun, Mar 15, 2015 at 3:03 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Thanks Sean, your suggestions and the links provided are just what I needed to start off with. On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote: I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain tradeoffs. You can precompute offline easily, and serve results at large scale easily, but, you are forced to precompute everything -- lots of wasted effort, not completely up to date. The front-end part of the stack looks right. Spark would do the model building; you'd have to write a process to score recommendations and store the result. Mahout is the same thing, really. 500K items isn't all that large. Your requirements aren't driven just by items though. Number of users and latent features matter too. It matters how often you want to build the model too. I'm guessing you would get away with a handful of modern machines for a problem this size. In a way what you describe reminds me of Wibidata, since it built recommender-like solutions on top of data and results published to a NoSQL store. You might glance at the related OSS project Kiji (http://kiji.org/) for ideas about how to manage the schema. You should have a look at things like Nick's architecture for Graphflow, however it's more concerned with computing recommendation on the fly, and describes a shift from an architecture originally built around something like a NoSQL store: http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf This is also the kind of ground the oryx project is intended to cover, something I've worked on personally: https://github.com/OryxProject/oryx -- a layer on and around the core model building in Spark + Spark Streaming to provide a whole recommender (for example), down to the REST API. On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can anyone who has developed recommendation engine suggest what could be the possible software stack for such an application. I am basically new to recommendation engine , I just found out Mahout and Spark Mlib which are available . I am thinking the below software stack. 1. The user is going to use Android app. 2. Rest Api sent to app server from the android app to get recommendations. 3. Spark Mlib core engine for recommendation engine 4. MongoDB database backend. I would like to know more on the cluster configuration( how many nodes etc) part of spark for calculating the recommendations for 500,000 items. This items include products for day care etc. 
Other software stack suggestions would also be very useful. It has to run on multiple vendor machines. Please suggest. Thanks shashi
Re: Streaming linear regression example question
Hi again Tried the same examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala from 1.3.0 and getting in case testing file content is: (0.0,[3.0,4.0,3.0]) (0.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[5.0,6.0,6.0]) (6.0,[7.0,4.0,7.0]) (7.0,[8.0,6.0,8.0]) (8.0,[44.0,9.0,9.0]) (9.0,[14.0,30.0,10.0]) and the answer: (0.0,0.0) (0.0,0.0) (0.0,0.0) (4.0,0.0) (5.0,0.0) (6.0,0.0) (7.0,0.0) (8.0,0.0) (9.0,0.0) What is wrong? I can see that model's weights are changing in case I put new data into training dir. Margus (margusja) Roo http://margus.roo.ee skype: margusja +372 51 480 On 14/03/15 09:05, Margus Roo wrote: Hi I try to understand example provided in https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - Streaming linear regression Code: import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD import org.apache.spark.storage.StorageLevel import org.apache.spark.streaming.dstream.DStream object StreamingLinReg { def main(args: Array[String]) { val conf = new SparkConf().setAppName(StreamLinReg).setMaster(local[2]) val ssc = new StreamingContext(conf, Seconds(10)) val trainingData = ssc.textFileStream(/Users/margusja/Documents/workspace/sparcdemo/training/).map(LabeledPoint.parse).cache() val testData = ssc.textFileStream(/Users/margusja/Documents/workspace/sparcdemo/testing/).map(LabeledPoint.parse) val numFeatures = 3 val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(numFeatures)) model.trainOn(trainingData) model.predictOnValues(testData.map(lp = (lp.label, lp.features))).print() ssc.start() ssc.awaitTermination() } } Compiled code and run it Put file contains (1.0,[2.0,2.0,2.0]) (2.0,[3.0,3.0,3.0]) (3.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[6.0,6.0,6.0]) (6.0,[7.0,7.0,7.0]) (7.0,[8.0,8.0,8.0]) (8.0,[9.0,9.0,9.0]) (9.0,[10.0,10.0,10.0]) in to training directory. I can see that models weight change: 15/03/14 08:53:40 INFO StreamingLinearRegressionWithSGD: Current model: weights, [7.333,7.333,7.333] No I can put what ever in to testing directory but I can not understand answer. In example I can put the same file I used for training in to testing directory. File content is (1.0,[2.0,2.0,2.0]) (2.0,[3.0,3.0,3.0]) (3.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[6.0,6.0,6.0]) (6.0,[7.0,7.0,7.0]) (7.0,[8.0,8.0,8.0]) (8.0,[9.0,9.0,9.0]) (9.0,[10.0,10.0,10.0]) And answer will be (1.0,0.0) (2.0,0.0) (3.0,0.0) (4.0,0.0) (5.0,0.0) (6.0,0.0) (7.0,0.0) (8.0,0.0) (9.0,0.0) And in case my file content is (0.0,[2.0,2.0,2.0]) (0.0,[3.0,3.0,3.0]) (0.0,[4.0,4.0,4.0]) (0.0,[5.0,5.0,5.0]) (0.0,[6.0,6.0,6.0]) (0.0,[7.0,7.0,7.0]) (0.0,[8.0,8.0,8.0]) (0.0,[9.0,9.0,9.0]) (0.0,[10.0,10.0,10.0]) the answer will be: (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) I except to get label predicted by model. -- Margus (margusja) Roo http://margus.roo.ee skype: margusja +372 51 480
Submitting spark application using Yarn Rest API
Hi All, I am trying to submit a Spark application using the YARN REST API. I am able to submit the application, but the final status shows as 'UNDEFINED'. A couple of other observations: the user shows as Dr.who, and the application type is empty even though I specify it as Spark. Has anyone had this problem before? I am creating the app id using: http://{cluster host}/ws/v1/cluster/apps/new-application Post request URL: http://{cluster host}/ws/v1/cluster/apps Here is the request body:
{
  "application-id": "application_1426273041023_0055",
  "application-name": "test",
  "am-container-spec": {
    "credentials": {
      "secrets": {
        "entry": [
          { "key": "user.name", "value": "x" }
        ]
      }
    },
    "commands": {
      "command": "%SPARK_HOME%/bin/spark-submit.cmd --class org.apache.spark.examples.SparkPi --conf spark.yarn.jar=hdfs://xxx/apps/spark/spark-1.2.1-hadoop2.6/spark-assembly-1.2.1-hadoop2.6.0.jar --master yarn-cluster hdfs://xxx/apps/spark/spark-1.2.1-hadoop2.6/spark-examples-1.2.1-hadoop2.6.0.jar"
    }
  },
  "application-type": "SPARK"
}
Re: Read Parquet file from scala directly
The parquet-tools code should be pretty helpful (although it's Java): https://github.com/apache/incubator-parquet-mr/tree/master/parquet-tools/src/main/java/parquet/tools/command On 3/10/15 12:25 AM, Shuai Zheng wrote: Hi All, I have a lot of Parquet files, and I am trying to open them directly instead of loading them into an RDD in the driver (so I can optimize some performance through special logic). But I have done some research online and can't find any example of accessing Parquet directly from Scala; has anyone done this before? Regards, Shuai
Re: From Spark web ui, how to prove the parquet column pruning working
Hey Yong, It seems that Hadoop `FileSystem` adds the size of a block to the metrics even if you only touch a fraction of it (reading Parquet metadata, for example). This behavior can be verified by the following snippet:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

import sc._
import sqlContext._

case class KeyValue(key: Int, value: String)

parallelize(1 to 1024 * 1024 * 20).
  flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

hadoopConfiguration.set("parquet.task.side.metadata", "true")
sql("SET spark.sql.parquet.filterPushdown=true")

parquetFile("large.parquet").where('key === 0).queryExecution.toRdd.mapPartitions { _ =>
  new Iterator[Row] {
    def hasNext = false
    def next() = ???
  }
}.collect()
```

Apparently we're reading nothing here (except for Parquet metadata in the footers), but the web UI still suggests that the input size of all tasks equals the file size. Cheng On 3/10/15 3:15 AM, java8964 wrote: Hi, Currently most of the data in our production is stored as Avro + Snappy. I want to test the benefits if we store the data in Parquet format. I changed our ETL to generate Parquet instead of Avro, and want to run a simple SQL query in Spark SQL to verify the benefits from Parquet. I generated the same dataset in both Avro and Parquet in HDFS, and loaded them both in Spark SQL. Now when I run the same query, like select column1 from src_table_avro/parquet where column2=xxx, I can see that for the Parquet format the job runs much faster. The test file size for both formats is around 930M. The Avro job generated 8 tasks to read the data with 21s as the median duration, vs the Parquet job, which generated 7 tasks to read the data with 0.4s as the median duration. Since the dataset has more than 100 columns, I can see the Parquet file really delivers fast reads. But my question is that on the Spark UI, both jobs show 900M as the input size (and 0 for the rest); in this case, how do I know that column pruning really works? I think that is why the Parquet file can be read so fast, but is there any statistic on the Spark UI that can prove it to me? Something like: the total input file size is 900M, but only 10M was really read due to column pruning? Then, in cases where column pruning does not work for Parquet because of some kind of SQL query, I could identify it in the first place. Thanks Yong
Re: Software stack for recommendation engine with Spark MLlib
As Sean says, precomputing recommendations is pretty inefficient. Though with 500k items its easy to get all the item vectors in memory so pre-computing is not too bad. Still, since you plan to serve these via a REST service anyway, computing on demand via a serving layer such as Oryx or PredictionIO (or the newly open sourced Seldon.io) is a good option. You can also cache the recommendations quite aggressively - once you compute a user or item top-K list, just stick the result in mem cache / redis / whatever and evict it when you recompute your offline model, or every hour or whatever. — Sent from Mailbox On Sun, Mar 15, 2015 at 3:03 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Thanks Sean, your suggestions and the links provided are just what I needed to start off with. On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote: I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain tradeoffs. You can precompute offline easily, and serve results at large scale easily, but, you are forced to precompute everything -- lots of wasted effort, not completely up to date. The front-end part of the stack looks right. Spark would do the model building; you'd have to write a process to score recommendations and store the result. Mahout is the same thing, really. 500K items isn't all that large. Your requirements aren't driven just by items though. Number of users and latent features matter too. It matters how often you want to build the model too. I'm guessing you would get away with a handful of modern machines for a problem this size. In a way what you describe reminds me of Wibidata, since it built recommender-like solutions on top of data and results published to a NoSQL store. You might glance at the related OSS project Kiji (http://kiji.org/) for ideas about how to manage the schema. You should have a look at things like Nick's architecture for Graphflow, however it's more concerned with computing recommendation on the fly, and describes a shift from an architecture originally built around something like a NoSQL store: http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf This is also the kind of ground the oryx project is intended to cover, something I've worked on personally: https://github.com/OryxProject/oryx -- a layer on and around the core model building in Spark + Spark Streaming to provide a whole recommender (for example), down to the REST API. On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can anyone who has developed recommendation engine suggest what could be the possible software stack for such an application. I am basically new to recommendation engine , I just found out Mahout and Spark Mlib which are available . I am thinking the below software stack. 1. The user is going to use Android app. 2. Rest Api sent to app server from the android app to get recommendations. 3. Spark Mlib core engine for recommendation engine 4. MongoDB database backend. I would like to know more on the cluster configuration( how many nodes etc) part of spark for calculating the recommendations for 500,000 items. This items include products for day care etc. Other software stack suggestions would also be very useful.It has to run on multiple vendor machines. Please suggest. Thanks shashi
Re: Is there any problem in having a long opened connection to spark sql thrift server
It should be OK. If you encountered problems in having a long opened connection to the Thrift server, it should be a bug. Cheng On 3/9/15 6:41 PM, fanooos wrote: I have some applications developed using PHP and currently we have a problem in connecting these applications to spark sql thrift server. ( Here is the problem I am talking about. http://apache-spark-user-list.1001560.n3.nabble.com/Connection-PHP-application-to-Spark-Sql-thrift-server-td21925.html ) Until I find a solution to this problem, there is a suggestion to make a little java application that connects to spark sql thrift server and provide an API to the PHP applications to executes the required queries. From my little experience, opening a connection and closing it for each query is not recommended (I am talking from my experience in working with CRUD applications the deals with some kind of database). 1- Is the same recommendation applied to working with spark sql thrift server ? 2- If yes, Is there any problem in having one connection connected for a long time with the spark sql thrift server? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-in-having-a-long-opened-connection-to-spark-sql-thrift-server-tp21967.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
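A hedged sketch of what the small shim application could look like: one long-lived JDBC connection to the Thrift server (which speaks the HiveServer2 protocol), reused across queries. The host, port, credentials, and query are placeholder assumptions:

```scala
// Hedged sketch: reuse a single JDBC connection to the Spark SQL Thrift server.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")

def runQuery(query: String): Seq[String] = {
  val stmt = conn.createStatement()
  try {
    val rs = stmt.executeQuery(query)
    val rows = scala.collection.mutable.ArrayBuffer[String]()
    while (rs.next()) rows += rs.getString(1)
    rows
  } finally stmt.close()
}

// The connection stays open across many calls; close it only when the shim shuts down.
runQuery("SELECT count(*) FROM some_table").foreach(println)
```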
Re: Problem connecting to HBase
org.apache.hbase % hbase % 0.98.9-hadoop2 % provided, There is no module in hbase 0.98.9 called hbase. But this would not be the root cause of the error. Most likely hbase-site.xml was not picked up. Meaning this is classpath issue. On Sun, Mar 15, 2015 at 10:04 AM, HARIPRIYA AYYALASOMAYAJULA aharipriy...@gmail.com wrote: Hello all, Thank you for your responses. I did try to include the zookeeper.znode.parent property in the hbase-site.xml. It still continues to give the same error. I am using Spark 1.2.0 and hbase 0.98.9. Could you please suggest what else could be done? On Fri, Mar 13, 2015 at 10:25 PM, Ted Yu yuzhih...@gmail.com wrote: In HBaseTest.scala: val conf = HBaseConfiguration.create() You can add some log (for zookeeper.znode.parent, e.g.) to see if the values from hbase-site.xml are picked up correctly. Please use pastebin next time you want to post errors. Which Spark release are you using ? I assume it contains SPARK-1297 Cheers On Fri, Mar 13, 2015 at 7:47 PM, HARIPRIYA AYYALASOMAYAJULA aharipriy...@gmail.com wrote: Hello, I am running a HBase test case. I am using the example from the following: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala I created a very small HBase table with 5 rows and 2 columns. I have attached a screenshot of the error log. I believe it is a problem where the driver program is unable to establish connection to the hbase. The following is my simple.sbt: name := Simple Project version := 1.0 scalaVersion := 2.10.4 libraryDependencies ++= Seq( org.apache.spark %% spark-core % 1.2.0, org.apache.hbase % hbase % 0.98.9-hadoop2 % provided, org.apache.hbase % hbase-client % 0.98.9-hadoop2 % provided, org.apache.hbase % hbase-server % 0.98.9-hadoop2 % provided, org.apache.hbase % hbase-common % 0.98.9-hadoop2 % provided ) I am using a 23 node cluster, did copy hbase-site.xml into /spark/conf folder and set spark.executor.extraClassPath pointing to the /hbase/ folder in the spark-defaults.conf Also, while submitting the spark job I am including the required jars : spark-submit --class HBaseTest --master yarn-cluster --driver-class-path /opt/hbase/0.98.9/lib/hbase-server-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-protocol-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-hadoop2-compat-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-client-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-common-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/htrace-core-2.04.jar /home/priya/usingHBase/Spark/target/scala-2.10/simple-project_2.10-1.0.jar /Priya/sparkhbase-test1 It would be great if you could point where I am going wrong, and what could be done to correct it. Thank you for your time. -- Regards, Haripriya Ayyalasomayajula Graduate Student Department of Computer Science University of Houston Contact : 650-796-7112 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Regards, Haripriya Ayyalasomayajula Graduate Student Department of Computer Science University of Houston Contact : 650-796-7112
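One way to rule out a classpath problem with hbase-site.xml, as a hedged sketch: set the ZooKeeper connection properties directly on the configuration in code. The quorum hosts, port, znode parent, and table name below are placeholders for the cluster described above:

```scala
// Hedged sketch: configure the HBase connection explicitly instead of relying on hbase-site.xml.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("zookeeper.znode.parent", "/hbase")
conf.set(TableInputFormat.INPUT_TABLE, "test_table")

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

println(hBaseRDD.count())
```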
Re: Spark 1.2 – How to change Default (Random) port ….
Hi SM, Apologies for the delayed response. No, the issue is with Spark 1.2.0; there is a bug in Spark 1.2.0. Spark recently put out the 1.3.0 release, so it might be fixed there. I am not planning to test it soon, maybe after some time. You can try it. Regards, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306p22063.html
Re: Re: Explanation on the Hive in the Spark assembly
Thanks Cheng for the great explanation! bit1...@163.com From: Cheng Lian Date: 2015-03-16 00:53 To: bit1...@163.com; Wang, Daoyuan; user Subject: Re: Explanation on the Hive in the Spark assembly Spark SQL supports most commonly used features of HiveQL. However, different HiveQL statements are executed in different manners: DDL statements (e.g. CREATE TABLE, DROP TABLE, etc.) and commands (e.g. SET key = value, ADD FILE, ADD JAR, etc.) In most cases, Spark SQL simply delegates these statements to Hive, as they don’t need to issue any distributed jobs and don’t rely on the computation engine (Spark, MR, or Tez). SELECT queries, CREATE TABLE ... AS SELECT ... statements and insertions These statements are executed using Spark as the execution engine. The Hive classes packaged in the assembly jar are used to provide entry points to Hive features, for example: HiveQL parser Talking to Hive metastore to execute DDL statements Accessing UDF/UDAF/UDTF As for the differences between Hive on Spark and Spark SQL’s Hive support, please refer to this article by Reynold: https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html Cheng On 3/14/15 10:53 AM, bit1...@163.com wrote: Thanks Daoyuan. What do you mean by running some native command, I never thought that hive will run without an computing engine like Hadoop MR or spark. Thanks. bit1...@163.com From: Wang, Daoyuan Date: 2015-03-13 16:39 To: bit1...@163.com; user Subject: RE: Explanation on the Hive in the Spark assembly Hi bit1129, 1, hive in spark assembly removed most dependencies of hive on Hadoop to avoid conflicts. 2, this hive is used to run some native command, which does not rely on spark or mapreduce. Thanks, Daoyuan From: bit1...@163.com [mailto:bit1...@163.com] Sent: Friday, March 13, 2015 4:24 PM To: user Subject: Explanation on the Hive in the Spark assembly Hi, sparkers, I am kind of confused about hive in the spark assembly. I think hive in the spark assembly is not the same thing as Hive On Spark(Hive On Spark, is meant to run hive using spark execution engine). So, my question is: 1. What is the difference between Hive in the spark assembly and Hive on Hadoop? 2. Does Hive in the spark assembly use Spark execution engine or Hadoop MR engine? Thanks. bit1...@163.com
Re: RE: Building spark over specified tachyon
Thanks, Jerry. I will go that way. I just wanted to make sure whether there is some option to directly specify the Tachyon version. fightf...@163.com From: Shao, Saisai Date: 2015-03-16 11:10 To: fightf...@163.com CC: user Subject: RE: Building spark over specified tachyon I think you could change the pom file under the Spark project to update the Tachyon-related dependency version and rebuild it again (in case the API is compatible and the behavior is the same). I'm not sure whether there is any command you can use to compile against a specific Tachyon version. Thanks Jerry From: fightf...@163.com Sent: Monday, March 16, 2015 11:01 AM To: user Subject: Building spark over specified tachyon Hi, all Noting that the current Spark releases are built with Tachyon 0.5.0: if we want to recompile Spark with Maven, targeting a specific Tachyon version (let's say the most recent 0.6.0 release), how should that be done? What should the Maven compile command look like? Thanks, Sun. fightf...@163.com
Re: Input validation for LogisticRegressionWithSGD
Can you share some sample data? On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote: Hi, I am trying to run LogisticRegressionWithSGD on an RDD of LabeledPoints loaded using loadLibSVMFile: val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "s3n://logistic-regression/epsilon_normalized") val model = LogisticRegressionWithSGD.train(logistic, 100) It gives an input validation error after about 10 minutes: org.apache.spark.SparkException: Input validation failed. at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:162) at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146) at org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:157) at org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:192) From reading this bug report (https://issues.apache.org/jira/browse/SPARK-2575), since I am loading a LibSVM-format file there should be only 0/1 labels in the dataset, so I should not be facing the issue described in the bug report. Is there something else I'm missing here? Thanks!
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Hi Sparkers, I am not able to run spark-sql on Spark. Please find the following error: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Regards, Sandeep.v
Re: Slides of my talk in LA: 'Spark or Hadoop: is it an either-or proposition?'
Hi, The video recording of this talk, "Spark or Hadoop: is it an either-or proposition?", given at the Los Angeles Spark Users Group on March 12, 2015, is now available on YouTube at this link: http://goo.gl/0iJZ4n Thanks, Slim Baltagi http://www.SparkBigData.com
Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Can you provide more information, such as the version of Spark you're using and the command line? Thanks On Mar 15, 2015, at 9:51 PM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, I am not able to run spark-sql on Spark. Please find the following error: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Regards, Sandeep.v
RE: Building spark over specified tachyon
I think you could change the pom file in the Spark project to update the Tachyon-related dependency version and rebuild again (provided the API is compatible and the behavior is the same). I'm not sure whether there is any command-line option you can use to compile against a specific Tachyon version. Thanks, Jerry From: fightf...@163.com [mailto:fightf...@163.com] Sent: Monday, March 16, 2015 11:01 AM To: user Subject: Building spark over specified tachyon Hi all, Noting that the current Spark releases are built with Tachyon 0.5.0, if we want to recompile Spark with Maven targeting a specific Tachyon version (say, the most recent 0.6.0 release), how should that be done? What should the Maven compile command look like? Thanks, Sun. fightf...@163.com
k-means hang without error/warning
Hi, I am running k-means using Spark in local mode. My data set is about 30k records, and I set k = 1000. The algorithm starts and finishes 13 jobs according to the UI monitor, then it stops working. The last log line I saw was: [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned broadcast 16 There are many similar log lines repeated, but it always seems to stop at the 16th. If I lower the k value, the algorithm terminates. So I just want to know what is wrong with k = 1000. Thanks, David
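For reference, a minimal sketch of the setup described above; the input path and comma-separated parsing are placeholders, not taken from David's message, and the explicit iteration cap just bounds the run rather than fixing any hang.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("KMeansK1000").setMaster("local[4]"))

// Hypothetical input: one comma-separated feature vector per line.
val data = sc.textFile("/path/to/30k-records.txt")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()   // avoid re-reading the input on every iteration

val model = new KMeans()
  .setK(1000)
  .setMaxIterations(20)   // cap iterations instead of relying on convergence
  .setEpsilon(1e-4)
  .run(data)

println(s"Cost for k=1000: ${model.computeCost(data)}")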
Re: Input validation for LogisticRegressionWithSGD
I checked the labels across the entire dataset and it looks like it has -1 and 1 (not the 0 and 1 I originally expected). I will try replacing the -1 with 0 and running it again. On Mon, Mar 16, 2015 at 12:51 AM, Rishi Yadav ri...@infoobjects.com wrote: Can you share some sample data? On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote: [original question and stack trace quoted above]
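A minimal sketch of that remapping (assumed, not taken from the thread; sc is an existing SparkContext), turning the -1/+1 labels into the 0/1 labels that LogisticRegressionWithSGD's input validation expects:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val raw: RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, "s3n://logistic-regression/epsilon_normalized")

// Map -1 (and anything non-positive) to 0.0, keep +1 as 1.0.
val binary = raw.map { lp =>
  LabeledPoint(if (lp.label <= 0.0) 0.0 else 1.0, lp.features)
}.cache()

val model = LogisticRegressionWithSGD.train(binary, 100)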
Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Hi Ted, I am using Spark 1.2.1 and Hive 0.13.1; you can check my configuration files attached below. Error in Spark:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:101)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
... 14 more
Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables: java.lang.reflect.InvocationTargetException
at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:497)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:475)
at
Building spark over specified tachyon
Hi all, Noting that the current Spark releases are built with Tachyon 0.5.0, if we want to recompile Spark with Maven targeting a specific Tachyon version (say, the most recent 0.6.0 release), how should that be done? What should the Maven compile command look like? Thanks, Sun. fightf...@163.com
Re: Trouble launching application that reads files
I figured out how to use local files with file://, but not with either the persistent or the ephemeral HDFS.
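For comparison, a minimal sketch of addressing the same data through the different file systems on a spark-ec2 cluster; the host name and paths are placeholders, sc is an existing SparkContext, and the HDFS ports shown are the spark-ec2 defaults of that era, so adjust them to your cluster.

// Local file on the node running the driver/executors.
val localRdd      = sc.textFile("file:///home/ec2-user/data/input.txt")

// Ephemeral HDFS (spark-ec2 default namenode port 9000).
val ephemeralRdd  = sc.textFile("hdfs://ec2-master-host:9000/data/input.txt")

// Persistent HDFS (spark-ec2 default namenode port 9010).
val persistentRdd = sc.textFile("hdfs://ec2-master-host:9010/data/input.txt")

println(localRdd.count())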
Re: Re: Building spark over specified tachyon
Thanks haoyuan. fightf...@163.com From: Haoyuan Li Date: 2015-03-16 12:59 To: fightf...@163.com CC: Shao, Saisai; user Subject: Re: RE: Building spark over specified tachyon Here is a patch: https://github.com/apache/spark/pull/4867 On Sun, Mar 15, 2015 at 8:46 PM, fightf...@163.com wrote: [earlier messages in this thread quoted above] -- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
Re: Streaming linear regression example question
Hi Margus, thanks for reporting this, I've been able to reproduce it and there does indeed appear to be a bug. I've created a JIRA and have a fix ready, which can hopefully be included in 1.3.1. In the meantime, you can get the desired result using transform:

model.trainOn(trainingData)
testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}.print()

- jeremyfreeman.net @thefreemanlab On Mar 15, 2015, at 2:56 PM, Margus Roo mar...@roo.ee wrote: Hi again. I tried the same examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala from 1.3.0 and, in case the testing file content is: (0.0,[3.0,4.0,3.0]) (0.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[5.0,6.0,6.0]) (6.0,[7.0,4.0,7.0]) (7.0,[8.0,6.0,8.0]) (8.0,[44.0,9.0,9.0]) (9.0,[14.0,30.0,10.0]) the answer is: (0.0,0.0) (0.0,0.0) (0.0,0.0) (4.0,0.0) (5.0,0.0) (6.0,0.0) (7.0,0.0) (8.0,0.0) (9.0,0.0) What is wrong? I can see that the model's weights are changing when I put new data into the training dir. Margus (margusja) Roo http://margus.roo.ee skype: margusja +372 51 480 On 14/03/15 09:05, Margus Roo wrote: Hi, I am trying to understand the example provided in https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - Streaming linear regression. Code:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

object StreamingLinReg {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("StreamLinReg").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val trainingData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/training/").map(LabeledPoint.parse).cache()
    val testData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/testing/").map(LabeledPoint.parse)
    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(numFeatures))
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
    ssc.start()
    ssc.awaitTermination()
  }
}

I compiled the code and ran it, then put a file containing (1.0,[2.0,2.0,2.0]) (2.0,[3.0,3.0,3.0]) (3.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[6.0,6.0,6.0]) (6.0,[7.0,7.0,7.0]) (7.0,[8.0,8.0,8.0]) (8.0,[9.0,9.0,9.0]) (9.0,[10.0,10.0,10.0]) into the training directory. I can see that the model's weights change: 15/03/14 08:53:40 INFO StreamingLinearRegressionWithSGD: Current model: weights, [7.333,7.333,7.333] Now I can put whatever I like into the testing directory, but I cannot understand the answer. For example, I can put the same file I used for training into the testing directory. The file content is (1.0,[2.0,2.0,2.0]) (2.0,[3.0,3.0,3.0]) (3.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[6.0,6.0,6.0]) (6.0,[7.0,7.0,7.0]) (7.0,[8.0,8.0,8.0]) (8.0,[9.0,9.0,9.0]) (9.0,[10.0,10.0,10.0]) and the answer will be (1.0,0.0) (2.0,0.0) (3.0,0.0) (4.0,0.0) (5.0,0.0) (6.0,0.0) (7.0,0.0) (8.0,0.0) (9.0,0.0) And in case my file content is (0.0,[2.0,2.0,2.0]) (0.0,[3.0,3.0,3.0]) (0.0,[4.0,4.0,4.0]) (0.0,[5.0,5.0,5.0]) (0.0,[6.0,6.0,6.0]) (0.0,[7.0,7.0,7.0]) (0.0,[8.0,8.0,8.0]) (0.0,[9.0,9.0,9.0]) (0.0,[10.0,10.0,10.0]) the answer will be: (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) I expect to get the label predicted by the model.
-- Margus (margusja) Roo http://margus.roo.ee skype: margusja +372 51 480
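Putting Jeremy's transform-based workaround together with Margus's original program, a minimal end-to-end sketch; the directory paths are the ones from the thread, so adjust them to your environment.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

object StreamingLinRegWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamLinReg").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Labeled points arrive as text files dropped into these directories.
    val trainingData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/training/")
      .map(LabeledPoint.parse).cache()
    val testData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/testing/")
      .map(LabeledPoint.parse)

    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingData)

    // Workaround for the predictOnValues issue: fetch the latest model inside
    // transform so each batch is scored with the most recently trained weights.
    testData.transform { rdd =>
      val latest = model.latestModel()
      rdd.map(lp => (lp.label, latest.predict(lp.features)))
    }.print()

    ssc.start()
    ssc.awaitTermination()
  }
}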
Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Hi Ted, Did you find any solution? Thanks, Sandeep On Mon, Mar 16, 2015 at 10:44 AM, sandeep vura sandeepv...@gmail.com wrote: Hi Ted, I am using Spark 1.2.1 and Hive 0.13.1; you can check my configuration files attached below. [error and stack trace quoted from the previous message]