I am just wondering whether CREATE TABLE supports the syntax CREATE TABLE db.tablename, instead of the two-step process of USE db followed by CREATE TABLE tablename?
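For reference, the two forms being compared would look roughly like this (whether the database-qualified form is accepted on the datasource CREATE TABLE path is exactly the open question here; the database, table, and path names are taken from the thread below):

    -- two-step form: select the database first
    USE foo_db;
    CREATE TABLE mytable_parquet
    USING parquet
    OPTIONS (path "/user/foo/data.parquet");

    -- single-statement form with a database-qualified table name
    CREATE TABLE foo_db.mytable_parquet
    USING parquet
    OPTIONS (path "/user/foo/data.parquet");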
On 9 May 2015 08:17, "Michael Armbrust" <mich...@databricks.com> wrote:

> Actually, I was talking about the support for inferring different but
> compatible schemata from various files, automatically merging them into a
> single schema. However, you are right that I think you need to specify the
> columns / types if you create it as a Hive table.
>
> On Fri, May 8, 2015 at 3:11 PM, Carlos Pereira <cpere...@groupon.com> wrote:
>
>> Thanks Michael for the quick reply. I was looking forward to the automatic
>> schema inference (I think that's what you mean by 'schema merging'?), and I
>> think STORED AS would still require me to define the table columns, right?
>>
>> Anyways, I am glad to hear you guys are already working to fix this in
>> future releases.
>>
>> Thanks,
>> Carlos
>>
>> On Fri, May 8, 2015 at 2:43 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> This is an unfortunate limitation of the datasource API, which does not
>>> support multiple databases. For Parquet in particular (if you aren't using
>>> schema merging), you can create a Hive table using STORED AS PARQUET today
>>> (see the sketch after this quoted thread). I hope to fix this limitation
>>> in Spark 1.5.
>>>
>>> On Fri, May 8, 2015 at 2:41 PM, Carlos Pereira <cpere...@groupon.com> wrote:
>>>
>>>> Hi, I would like to create a Hive table on top of an existing Parquet
>>>> file, as described here:
>>>>
>>>> https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
>>>>
>>>> Due to network restrictions, I need to store the metadata definition in a
>>>> different path than '/user/hive/warehouse', so I first create a new
>>>> database in my own HDFS dir:
>>>>
>>>> CREATE DATABASE foo_db LOCATION '/user/foo';
>>>> USE foo_db;
>>>>
>>>> And then I run the following query:
>>>>
>>>> CREATE TABLE mytable_parquet
>>>> USING parquet
>>>> OPTIONS (path "/user/foo/data.parquet")
>>>>
>>>> The problem is that Spark SQL is not using the database defined in the
>>>> shell context, but the default database instead:
>>>>
>>>> ----------------------------
>>>> > CREATE TABLE mytable_parquet USING parquet OPTIONS (path "/user/foo/data.parquet");
>>>> 15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: get_table : *db=foo_db* tbl=mytable_parquet
>>>> 15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr cmd=get_table : db=foo_db tbl=mytable_parquet
>>>> 15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: create_table: Table(tableName:mytable_parquet, *dbName:default,* owner:foo, createTime:1431117741, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1, path=/user/foo/data.parquet}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
>>>> 15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr cmd=create_table: Table(tableName:mytable_parquet, dbName:default, owner:foo, createTime:1431117741, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe, parameters:{serialization.format=1, path=/user/foo/data.parquet}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
>>>> 15/05/08 20:42:21 ERROR hive.log: Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=foo, access=WRITE, inode="/user/hive/warehouse":hive:grp_gdoop_hdfs:drwxr-xr-x
>>>> ----------------------------
>>>>
>>>> The permission error above happens because my Linux user does not have
>>>> write access to the default warehouse path. I can work around this issue
>>>> if I use CREATE TEMPORARY TABLE and have no metadata written to disk.
>>>>
>>>> I would like to know if I am doing anything wrong here and if there is
>>>> any additional property I can use to force the database/metastore dir I
>>>> need to write to.
>>>>
>>>> Thanks,
>>>> Carlos
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/CREATE-TABLE-ignores-database-when-using-PARQUET-option-tp22824.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
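A minimal sketch of the STORED AS PARQUET workaround Michael mentions, assuming a Hive version that understands STORED AS PARQUET (0.13+) and a HiveContext or Hive CLI session; the columns (id, name) are placeholders and would have to match the actual Parquet schema, since this DDL path does not infer them:

    CREATE EXTERNAL TABLE foo_db.mytable_parquet (
      id BIGINT,
      name STRING
    )
    STORED AS PARQUET
    -- LOCATION should point at the directory holding the Parquet data
    LOCATION '/user/foo/data.parquet';

EXTERNAL plus LOCATION keeps the data where it already lives instead of moving it under the warehouse directory. For completeness, the metadata-free workaround Carlos describes (nothing is written to the metastore, so the table only exists for the current session) would look like:

    CREATE TEMPORARY TABLE mytable_parquet
    USING parquet
    OPTIONS (path "/user/foo/data.parquet");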