A workaround for now would be to save the JSON as parquet and then create a
metastore parquet table.  Using parquet will be much faster for repeated
querying. This function might be helpful:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveMetastoreTypes

def createParquetTable(name: String, file: String, sqlContext: SQLContext): Unit = {
  import sqlContext._

  // Read the parquet file once to recover its schema.
  val rdd = parquetFile(file)

  // Translate each field into a Hive metastore column definition.
  val schema = rdd.schema.fields
    .map(f => s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}")
    .mkString(",\n")

  // Create an external table that points at the existing parquet data.
  val ddl = s"""
    |CREATE EXTERNAL TABLE $name (
    |  $schema
    |)
    |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    |LOCATION '$file'""".stripMargin
  sql(ddl)

  // Let Spark SQL read the table with its native parquet support.
  setConf("spark.sql.hive.convertMetastoreParquet", "true")
}
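
Usage might look something like this (a rough sketch against the Spark 1.1
APIs; the paths and table name are made up):

// Hypothetical paths and table name, for illustration only.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Save the JSON data as parquet once...
val json = hiveContext.jsonFile("/data/events.json")
json.saveAsParquetFile("/data/events.parquet")

// ...then register a metastore table over it for repeated querying.
createParquetTable("events", "/data/events.parquet", hiveContext)
hiveContext.sql("SELECT COUNT(*) FROM events").collect()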

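Separately, for the metastore questions in the thread below: a minimal
conf/hive-site.xml sketch, assuming the default embedded Derby metastore
(the absolute paths are placeholders):

<configuration>
  <!-- Placeholder path: point the embedded Derby metastore at a fixed,
       absolute location so any shell or CLI finds the same tables,
       regardless of the directory it was launched from. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/path/to/metastore_db;create=true</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/path/to/warehouse</value>
  </property>
</configuration>
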
On Tue, Sep 23, 2014 at 10:49 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> You can't directly query JSON tables from the CLI or JDBC server since
> temporary tables only live for the life of the Spark Context.  This PR will
> eventually (targeted for 1.2) let you do what you want in pure SQL:
> https://github.com/apache/spark/pull/2475
>
> On Mon, Sep 22, 2014 at 4:52 PM, Yin Huai <huaiyin....@gmail.com> wrote:
>
>> Hi Gaurav,
>>
>> It seems the metastore directory is created by LocalHiveContext, while
>> metastore_db is created by a regular HiveContext. Can you check whether you
>> are still using LocalHiveContext when you try to access your tables? Also,
>> if you created those tables when you launched the SQL CLI under bin/, you
>> can launch the SQL CLI from the same directory (bin/) and Spark SQL should
>> be able to connect to the metastore without any extra settings.
>>
>> By the way, can you let me know your settings in hive-site.xml?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 22, 2014 at 7:18 PM, Gaurav Tiwari <gtins...@gmail.com>
>> wrote:
>>
>>> Hi ,
>>>
>>> I tried setting the metastore and metastore_db locations in
>>> *conf/hive-site.xml* to the directories created in the spark bin folder
>>> (they were created when I ran the spark shell and used LocalHiveContext),
>>> but it still doesn't work.
>>>
>>> Do I need to save my RDD as a table through the hive context to make this
>>> work?
>>>
>>> Regards,
>>> Gaurav
>>>
>>> On Mon, Sep 22, 2014 at 6:30 PM, Yin Huai <huaiyin....@gmail.com> wrote:
>>>
>>>> Hi Gaurav,
>>>>
>>>> Can you put hive-site.xml in conf/ and try again?
>>>>
>>>> Thanks,
>>>>
>>>> Yin
>>>>
>>>> On Mon, Sep 22, 2014 at 4:02 PM, gtinside <gtins...@gmail.com> wrote:
>>>>
>>>>> Hi ,
>>>>>
>>>>> I have been using the spark shell to execute all SQL. I am connecting to
>>>>> Cassandra, converting the data to JSON, and then running queries on it. I
>>>>> am using HiveContext (and not SQLContext) because of the "explode"
>>>>> functionality in it.
>>>>>
>>>>> I want to see how I can use the Spark SQL CLI to run queries directly on
>>>>> the saved table. I see metastore and metastore_db getting created in the
>>>>> spark bin directory (my hive context is LocalHiveContext). I tried
>>>>> executing queries in the spark-sql CLI after putting in a hive-site.xml
>>>>> with the metastore and metastore_db directories the same as the ones in
>>>>> spark bin, but it doesn't seem to be working. I am getting
>>>>> "org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table
>>>>> test_tbl".
>>>>>
>>>>> Is this possible?
>>>>>
>>>>> Regards,
>>>>> Gaurav
>>>>>
>>>>
>>>
>>
>
