I am just wondering whether CREATE TABLE supports the syntax CREATE TABLE db.tablename, instead of the two-step process of USE db followed by CREATE TABLE tablename?
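For reference, the two forms being compared would look roughly like this (whether the database-qualified form is accepted on the datasource CREATE TABLE path is exactly the open question here; the database, table, and path names are taken from the thread below):

    -- two-step form: select the database first
    USE foo_db;
    CREATE TABLE mytable_parquet
    USING parquet
    OPTIONS (path "/user/foo/data.parquet");

    -- single-statement form with a database-qualified table name
    CREATE TABLE foo_db.mytable_parquet
    USING parquet
    OPTIONS (path "/user/foo/data.parquet");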
On 9 May 2015 08:17, "Michael Armbrust" <mich...@databricks.com> wrote:

> Actually, I was talking about the support for inferring different but
> compatible schemata from various files, automatically merging them into a
> single schema. However, you are right that I think you need to specify the
> columns / types if you create it as a Hive table.
>
> On Fri, May 8, 2015 at 3:11 PM, Carlos Pereira <cpere...@groupon.com> wrote:
>
>> Thanks Michael for the quick reply. I was looking forward to the automatic
>> schema inference (I think that's what you mean by 'schema merging'?), and I
>> think STORED AS would still require me to define the table columns, right?
>>
>> Anyways, I am glad to hear you guys are already working to fix this in
>> future releases.
>>
>> Thanks,
>> Carlos
>>
>> On Fri, May 8, 2015 at 2:43 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> This is an unfortunate limitation of the datasource API, which does not
>>> support multiple databases. For Parquet in particular (if you aren't using
>>> schema merging), you can create a Hive table using STORED AS PARQUET today
>>> (see the sketch after this quoted thread). I hope to fix this limitation
>>> in Spark 1.5.
>>>
>>> On Fri, May 8, 2015 at 2:41 PM, Carlos Pereira <cpere...@groupon.com> wrote:
>>>
>>>> Hi, I would like to create a Hive table on top of an existing Parquet
>>>> file, as described here:
>>>>
>>>> https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
>>>>
>>>> Due to network restrictions, I need to store the metadata definition in a
>>>> different path than '/user/hive/warehouse', so I first create a new
>>>> database in my own HDFS dir:
>>>>
>>>> CREATE DATABASE foo_db LOCATION '/user/foo';
>>>> USE foo_db;
>>>>
>>>> And then I run the following query:
>>>>
>>>> CREATE TABLE mytable_parquet
>>>> USING parquet
>>>> OPTIONS (path "/user/foo/data.parquet")
>>>>
>>>> The problem is that Spark SQL is not using the database defined in the
>>>> shell context, but the default database instead:
>>>>
>>>> ----------------------------
>>>> > CREATE TABLE mytable_parquet USING parquet OPTIONS (path "/user/foo/data.parquet");
>>>> 15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: get_table : *db=foo_db* tbl=mytable_parquet
>>>> 15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr cmd=get_table : db=foo_db tbl=mytable_parquet
>>>> 15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: create_table: Table(tableName:mytable_parquet, *dbName:default,* owner:foo, createTime:1431117741, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1, path=/user/foo/data.parquet}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
>>>> 15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr cmd=create_table: Table(tableName:mytable_parquet, dbName:default, owner:foo, createTime:1431117741, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1, path=/user/foo/data.parquet}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
>>>> 15/05/08 20:42:21 ERROR hive.log: Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=foo, access=WRITE, inode="/user/hive/warehouse":hive:grp_gdoop_hdfs:drwxr-xr-x
>>>> ----------------------------
>>>>
>>>> The permission error above happens because my Linux user does not have
>>>> write access to the default warehouse path. I can work around this issue
>>>> if I use CREATE TEMPORARY TABLE and have no metadata written to disk.
>>>>
>>>> I would like to know if I am doing anything wrong here and if there is
>>>> any additional property I can use to force the database/metastore dir I
>>>> need to write to.
>>>>
>>>> Thanks,
>>>> Carlos
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/CREATE-TABLE-ignores-database-when-using-PARQUET-option-tp22824.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
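A minimal sketch of the STORED AS PARQUET workaround Michael mentions, assuming a Hive version that understands STORED AS PARQUET (0.13+) and a HiveContext or Hive CLI session; the columns (id, name) are placeholders and would have to match the actual Parquet schema, since this DDL path does not infer them:

    CREATE EXTERNAL TABLE foo_db.mytable_parquet (
      id BIGINT,
      name STRING
    )
    STORED AS PARQUET
    -- LOCATION should point at the directory holding the Parquet data
    LOCATION '/user/foo/data.parquet';

EXTERNAL plus LOCATION keeps the data where it already lives instead of moving it under the warehouse directory. For completeness, the metadata-free workaround Carlos describes (nothing is written to the metastore, so the table only exists for the current session) would look like:

    CREATE TEMPORARY TABLE mytable_parquet
    USING parquet
    OPTIONS (path "/user/foo/data.parquet");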