If you are running a Spark version newer than 1.1 you can create external
Parquet tables. I'd also recommend setting
spark.sql.hive.convertMetastoreParquet=true. Here's a helper function that
creates the table and sets that flag for you:

/**
 * Sugar for creating a Hive external table from a parquet path.
 */
def createParquetTable(name: String, file: String): Unit = {
  import org.apache.spark.sql.hive.HiveMetastoreTypes

  // Load the Parquet file so we can read its self-described schema.
  val rdd = parquetFile(file)
  // Render each field as "name type" using Hive metastore type names.
  val schema = rdd.schema.fields
    .map(f => s"${f.name} ${HiveMetastoreTypes.toMetastoreType(f.dataType)}")
    .mkString(",\n")
  val ddl = s"""
    |CREATE EXTERNAL TABLE $name (
    |  $schema
    |)
    |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    |LOCATION '$file'""".stripMargin
  sql(ddl)
  // Have Spark SQL use its native Parquet support when reading this table.
  setConf("spark.sql.hive.convertMetastoreParquet", "true")
}
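
For what it's worth, here's roughly how I'd use it from spark-shell (just a
sketch, assuming a Hive-enabled build; the table name and path below are
placeholders). The helper needs to be pasted in after the import so that
parquetFile, sql and setConf resolve against the HiveContext:

import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext the shell already provides.
val hiveContext = new HiveContext(sc)
import hiveContext._

// ... paste createParquetTable here ...

createParquetTable("people", "hdfs:///user/me/people.parquet")
sql("SELECT COUNT(*) FROM people").collect().foreach(println)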

On Mon, Oct 13, 2014 at 9:20 AM, Sadhan Sood <sadhan.s...@gmail.com> wrote:

> We want to persist table schema of parquet file so as to use spark-sql cli
> on that table later on? Is it possible or is spark-sql cli only good for
> tables in hive metastore ? We are reading parquet data using this example:
>
> // Read in the parquet file created above.  Parquet files are self-describing
> // so the schema is preserved.
> // The result of loading a Parquet file is also a SchemaRDD.
> val parquetFile = sqlContext.parquetFile("people.parquet")
>
> // Parquet files can also be registered as tables and then used in SQL
> // statements.
> parquetFile.registerTempTable("parquetFile")
>
>
