Re: Re-arrange columns

2016-07-26 Thread Furcy Pin
Hi,

I think I had similar issues to yours.

Did you look in the Hive documentation at what the CASCADE keyword does on
ADD or CHANGE COLUMNS statements?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterColumn

>From what I understand, the behavior of Hive when adding or changing
columns is different from standard RDBMS:
I think that adding or moving a column in Hive will only change the
metadata and not the underlying data.
To cope with that, the metastore keeps separate schema information for the
table itself and for its partitions.
This way, when you change the schema of your table, the old partition can
still be read according to the old schema,
and new partitions will be created with the new schema (unless you use
CASCADE)

There are two dangers with this, however:
- if you create a new column, it will not be accessible in the old
partitions.
- if you regenerate (override) an existing partition, the new data will
correspond to the new schema, but the partition's metadata will not be
updated.
   (I believe this could be considered as a bug, with a workaround since
the "ALTER TABLE table_name ADD COLUMNS ..."
   can also be applied to one specific partition at a time: "ALTER TABLE
table_name PARTITION partition_spec ADD COLUMNS ...")

If you use CASCADE, the change you apply to the table will be immediately
applied to its partitions as well.
But if you don't regenerate your existing partitions, I believe you will
have problems as well, since your partition's schema will not match the
underlying data.


So, I guess it mostly depend on if (and when) you plan to regenerate your
partition to add the new column to your existing data.
But you can either:

A. Drop your table and do a msck repair table.

B. To reduce unavailability: create another table, populate it, and then
swap your tables.

[image: Inline image 1]

C. Do your column change with CASCADE, and regenerate your partitions
immediately,
but they might not be correctly readable between the time you make your
change and regenerate them.

D. Do your column change without CASCADE: you can still query the old
partitions (without the new column though)
and after regenerating a partition, change its schema with a "ALTER TABLE
table_name PARTITION partition_spec ADD COLUMN ..."

E. (I'm not sure this one works) Do your column change without CASCADE,
create a copy of the table, generate the partitions there, and then use
EXCHANGE PARTITION (
https://cwiki.apache.org/confluence/display/Hive/Exchange+Partition) to
move the partitions from the new table to the old.

Hope this helps, let me know how it turns out,

Furcy







On Tue, Jul 26, 2016 at 2:34 AM, Binh Nguyen Van  wrote:

> Hi,
>
> I am writing an application that insert data into Hive and it depends on
> the order of columns in table so I have to rearrange my columns to make
> work around that but I have difficulty of doing that. Could you please help?
>
> Here is my problem:
>
> I have table 'tbl' which is created by using this statement:
> CREATE EXTERNAL TABLE tbl (col1 int, col3 string) PARTITIONED BY (dt string) 
> STORED AS PARQUET LOCATION '/wh/db/tb'.
> Now I want to add new column 'col2' with type int to that table and I want to 
> put it after 'col1' and before 'col3' so I use these two statements:
> ALTER TABLE tbl ADD COLUMNS (col2 int);
> ALTER TABLE tbl CHANGE col2 col2 int AFTER col1;
>
> With Hive 1.2.1, these two statements are executed successfully but I got 
> type cast exception when I query data back.
> With Hive 2.x, The second alter statement failed with exception: 'Unable to 
> alter table. The following columns have types incompatible with the existing 
> columns in their respective positions'
>
> I could work around this by drop old table and then create a new table
> with new schema but this requires me to to run MSCK to update metadata for
> that table and this process could be very slow when I have a lot of data
> and a lot of partition so I am looking for a better one.
>
> Please help!
> Thanks
> -Binh
> ​
>


Re-arrange columns

2016-07-25 Thread Binh Nguyen Van
Hi,

I am writing an application that insert data into Hive and it depends on
the order of columns in table so I have to rearrange my columns to make
work around that but I have difficulty of doing that. Could you please help?

Here is my problem:

I have table 'tbl' which is created by using this statement:
CREATE EXTERNAL TABLE tbl (col1 int, col3 string) PARTITIONED BY (dt
string) STORED AS PARQUET LOCATION '/wh/db/tb'.
Now I want to add new column 'col2' with type int to that table and I
want to put it after 'col1' and before 'col3' so I use these two
statements:
ALTER TABLE tbl ADD COLUMNS (col2 int);
ALTER TABLE tbl CHANGE col2 col2 int AFTER col1;

With Hive 1.2.1, these two statements are executed successfully but I
got type cast exception when I query data back.
With Hive 2.x, The second alter statement failed with exception:
'Unable to alter table. The following columns have types incompatible
with the existing columns in their respective positions'

I could work around this by drop old table and then create a new table with
new schema but this requires me to to run MSCK to update metadata for that
table and this process could be very slow when I have a lot of data and a
lot of partition so I am looking for a better one.

Please help!
Thanks
-Binh
​