Hi Michael,

If I understand correctly, the assembly JAR file is deployed onto HDFS under /user/$USER/.sparkStaging, and those staged files are used by all computing (worker) nodes when people run in yarn-cluster mode. Could you elaborate on what the documentation means by this? It is a bit misleading, and I guess it only applies to standalone mode?

Andrew L
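For anyone who wants to see the staging directory Andrew is describing on their own cluster, here is a minimal sketch (an editor's illustration, not from the Spark docs) that lists the per-application staging directories. It assumes it runs inside spark-shell on the cluster, so the Hadoop configuration is on the classpath, and that the standard .sparkStaging location applies:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // Spark on YARN uploads the assembly (and application) jars into a
    // per-application subdirectory under this path; YARN then localizes
    // them onto the worker nodes, so no manual copying is needed there.
    val staging = new Path(s"/user/${System.getProperty("user.name")}/.sparkStaging")
    if (fs.exists(staging))
      fs.listStatus(staging).foreach(status => println(status.getPath))

In standalone mode there is no such staging step, which is why the docs ask for the Hive assembly to be present on each worker.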
Date: Fri, 25 Jul 2014 15:25:42 -0700
Subject: RE: Spark SQL and Hive tables
From: ssti...@live.com
To: user@spark.apache.org

Thanks! Will do.

-------- Original message --------
From: Michael Armbrust
Date: 07/25/2014 3:24 PM (GMT-08:00)
To: user@spark.apache.org
Subject: Re: Spark SQL and Hive tables

[S]ince Hive has a large number of dependencies, it is not included in the default Spark assembly. In order to use Hive you must first run ‘SPARK_HIVE=true sbt/sbt assembly/assembly’ (or use -Phive for Maven). This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

On Fri, Jul 25, 2014 at 3:20 PM, Sameer Tilak <ssti...@live.com> wrote:

Hi Jerry,

I am having trouble with this. Maybe something is wrong with my import or version, etc.

    scala> import org.apache.spark.sql._
    import org.apache.spark.sql._

    scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    <console>:24: error: object hive is not a member of package org.apache.spark.sql
           val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
                                                      ^

Here is what I see for autocompletion:

    scala> org.apache.spark.sql.
    Row            SQLContext     SchemaRDD      SchemaRDDLike
    api            catalyst       columnar       execution
    package        parquet        test

Date: Fri, 25 Jul 2014 17:48:27 -0400
Subject: Re: Spark SQL and Hive tables
From: chiling...@gmail.com
To: user@spark.apache.org

Hi Sameer,

The blog post you referred to is about Spark SQL; I don't think its intent is to guide you through reading data from Hive via Spark SQL, so don't worry too much about it. The programming guide I referred to demonstrates how to read data from Hive using Spark SQL. It is a good starting point.

Best Regards,

Jerry

On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak <ssti...@live.com> wrote:

Hi Michael,

Thanks. I am not creating a HiveContext; I am creating a SQLContext. I am using CDH 5.1. Can you please let me know which conf/ directory you are talking about?

From: mich...@databricks.com
Date: Fri, 25 Jul 2014 14:34:53 -0700
Subject: Re: Spark SQL and Hive tables
To: user@spark.apache.org

In particular, have you put your hive-site.xml in the conf/ directory? Also, are you creating a HiveContext instead of a SQLContext?

On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam <chiling...@gmail.com> wrote:

Hi Sameer,

Maybe this page will help you: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Best Regards,

Jerry

On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak <ssti...@live.com> wrote:

Hi All,

I am trying to load data from Hive tables using Spark SQL. I am using spark-shell. Here is what I see:

    val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender, demographics.birth_year, demographics.income_group FROM prod p JOIN demographics d ON d.user_id = p.user_id""")

    14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
    14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
    java.lang.RuntimeException: Table Not Found: prod

I have these tables in Hive; I used the "show tables" command to confirm this. Can someone please let me know how I can make them accessible here?
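A quick way to confirm whether the running assembly actually includes the Hive integration, before constructing anything (a sketch; this just probes for the Spark 1.x class that Sameer's autocompletion could not find):

    // Throws ClassNotFoundException if the assembly was built without Hive support.
    scala> Class.forName("org.apache.spark.sql.hive.HiveContext")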
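Putting Michael's and Jerry's advice together, here is a minimal sketch of the working flow for a Spark 1.0.x spark-shell started from a Hive-enabled assembly with hive-site.xml in conf/. The table and column names are taken from Sameer's query; note that the aliases p and d are used consistently, since HiveQL requires the alias once a table has been aliased:

    // HiveContext reads hive-site.xml and talks to the Hive metastore.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._

    // In Spark 1.0.x, hql(...) parses HiveQL and resolves table names against
    // the Hive metastore, which is what makes prod and demographics visible.
    val trainingDataTable = hql("""
      SELECT p.prod_num, d.gender, d.birth_year, d.income_group
      FROM prod p
      JOIN demographics d ON d.user_id = p.user_id
    """)

    trainingDataTable.take(5).foreach(println)

A plain SQLContext's sql(...) only sees tables registered within that context, which is why the original query fails with "Table Not Found: prod".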