Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-23 Thread Jianshi Huang
FYI,

The latest Hive 0.14/Parquet will have column renaming support.

Jianshi

On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust mich...@databricks.com
wrote:

 You might also try out the recently added support for views.
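(A hedged sketch of how that view might be generated; the table name `pmt`, the hypothetical view name `pmt_clean`, and the `sorted::` columns are the examples from this thread — this just builds the DDL string:)

```scala
// Sketch (hypothetical names): generate a CREATE VIEW statement that
// aliases the Pig-prefixed columns to clean names. Here the bare name is
// taken as everything after the last "::".
val cols = Seq("sorted::id", "sorted::cre_ts")
val selectList = cols.map(c => s"`$c` AS ${c.split("::").last}").mkString(", ")
val viewDdl = s"CREATE VIEW pmt_clean AS SELECT $selectList FROM pmt"
// viewDdl: CREATE VIEW pmt_clean AS
//          SELECT `sorted::id` AS id, `sorted::cre_ts` AS cre_ts FROM pmt
```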

 On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Ah... I see. Thanks for pointing it out.

 Then it means we cannot mount an external table using customized column
 names. Hmm...

 Then the only option left is to use a subquery to add a bunch of column
 aliases. I'll try it later.
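A minimal sketch of that aliasing subquery, assuming the `sorted::` column names from this thread (plain Scala string construction, no Spark needed):

```scala
// Build a SELECT that aliases the Pig-prefixed names to bare ones,
// suitable for wrapping as a subquery or registering as a temp table.
val cols = Seq("sorted::id", "sorted::cre_ts")
val aliased = cols.map(c => s"`$c` AS ${c.replaceAll(".*?::", "")}").mkString(", ")
val query = s"SELECT $aliased FROM pmt"
// query: SELECT `sorted::id` AS id, `sorted::cre_ts` AS cre_ts FROM pmt
```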

 Thanks,
 Jianshi

 On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com
 wrote:

 This is by Hive's design. From the Hive documentation:

 The column change command will only modify Hive's metadata, and will not
 modify data. Users should make sure the actual data layout of the
 table/partition conforms with the metadata definition.



 On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Ok, found another possible bug in Hive.

 My current solution is to use ALTER TABLE CHANGE to rename the column
 names.

 The problem is that after renaming the columns, their values all became
 NULL.

 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive

 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])

 I created a JIRA for it:

   https://issues.apache.org/jira/browse/SPARK-4781


 Jianshi

 On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hmm... another issue I found doing this approach is that ANALYZE TABLE
 ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
 later broadcast join and such will fail...

 Any idea how to fix this issue?

 Jianshi

 On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com
  wrote:

 Very interesting: the line doing drop table throws an exception.
 After removing it, everything works.

 Jianshi

 On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang 
 jianshi.hu...@gmail.com wrote:

 Here's the solution I got after talking with Liancheng:

 1) Use backquotes `..` to wrap all illegal characters:

 val rdd = parquetFile(file)
 val schema = rdd.schema.fields.map(f => s"`${f.name}`
 ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")

 val ddl_13 = s"""
   |CREATE EXTERNAL TABLE $name (
   |  $schema
   |)
   |STORED AS PARQUET
   |LOCATION '$file'
   """.stripMargin

 sql(ddl_13)

 2) Create a new schema and use applySchema to generate a new
 SchemaRDD; I had to drop and re-register the table:

 val t = table(name)
 val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
   s.name.replaceAll(".*?::", ""))))
 sql(s"drop table $name")
 applySchema(t, newSchema).registerTempTable(name)
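For clarity, the prefix-stripping regex in step 2 behaves like this in plain Scala (no Spark needed; the extra names are made-up examples):

```scala
// ".*?::" lazily matches up to and including the first "::"; replaceAll
// applies it repeatedly, so a nested prefix like "a::b::c" reduces to "c",
// and names without "::" pass through unchanged.
val names = Seq("sorted::id", "sorted::cre_ts", "plain_col", "a::b::c")
val stripped = names.map(_.replaceAll(".*?::", ""))
// stripped: List(id, cre_ts, plain_col, c)
```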

 I'm testing it for now.

 Thanks for the help!


 Jianshi

 On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang 
 jianshi.hu...@gmail.com wrote:

 Hi,

 I had to use Pig for some preprocessing and to generate Parquet
 files for Spark to consume.

 However, due to Pig's limitations, the generated schema contains
 Pig's identifiers,

 e.g.
 sorted::id, sorted::cre_ts, ...

 I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.

   create external table pmt (
     sorted::id bigint
   )
   stored as parquet
   location '...'

 Obviously it didn't work. I also tried removing the identifier
 sorted::, but the resulting rows contained only nulls.

 Any idea how to create a table in HiveContext from these Parquet
 files?

 Thanks,
 Jianshi
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/





