Brandon,

It's good that my suggested approach served your purpose!

Now, coming back to why your approaches didn't work as expected -
1) In your first approach, you aren't defining a schema for the Parquet
table, so Hive uses its implicit column names (_col0, _col1, ...)
2) In your second approach, you are creating an internal (managed) Hive
table. You should still be able to find the Parquet file under
/user/hive/warehouse/${PARQUET_TABLE_NAME}
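
For point 2, a quick way to confirm where the managed table's data ended
up is to ask Hive for the table's metadata (a minimal sketch, reusing the
${PARQUET_TABLE_NAME} placeholder from above; the output layout varies a
bit by Hive version):

-- Prints the table's metadata; the Location field shows where the
-- managed table's Parquet files live, e.g.
-- hdfs://.../user/hive/warehouse/${PARQUET_TABLE_NAME}
DESCRIBE FORMATTED ${PARQUET_TABLE_NAME};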

- Prasad

On Fri, Jan 26, 2018 at 11:02 AM, Brandon Cooke <brandon.co...@engage.cx>
wrote:

> Prasad and Jorn,
>
> Prasad’s fix did the trick!
> An output table and OVERWRITE statement worked!
>
> Up to this point, what I had been doing was:
> - Create the initial Events table
> - Create Parquet_table ( …. ) STORED AS PARQUET
> - INSERT INTO Parquet_table SELECT * FROM Events
>
> I wasn’t using LOCATION for the Parquet table either.
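>
> In SQL terms, that pattern was roughly (column list elided as above; note
> the missing LOCATION clause):
>
> CREATE TABLE Parquet_table ( …. )
> STORED AS PARQUET;
>
> INSERT INTO Parquet_table SELECT * FROM Events;
>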
> Are you able to tell me why the above steps and my example below (where I
> Overwrite Directory Stored as Parquet) did not work?
>
> Thanks so much by the way!
>
> -Brandon
>
> On Jan 26, 2018, at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Drop the old parquet table first and then create it with explicit
> statements. The above statement keeps using the old parquet table if it
> already existed.
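>
> For example (just a sketch - parquet_table here stands for whatever your
> old table is named):
>
> DROP TABLE IF EXISTS parquet_table;
> -- then re-create it with the explicit column list and STORED AS PARQUET
> -- before running the INSERT again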
>
> On 26. Jan 2018, at 17:35, Brandon Cooke <brandon.co...@engage.cx> wrote:
>
> Hi Prasad,
>
> I actually have tried this, and I had the same result.
> Although I am certainly willing to try again.
>
> Sincerely,
>
> Brandon Cooke
>
> On Jan 26, 2018, at 11:29 AM, Prasad Nagaraj Subramanya <
> prasadn...@gmail.com> wrote:
>
> Hi Brandon,
>
> Have you tried creating an external table with the required column names
> for the Parquet output -
>
> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
>     `release` STRING,
>     `customer` STRING,
>     `cookie` STRING,
>     `category` STRING,
>     `end_time` STRING,
>     `start_time` STRING,
>     `first_name` STRING,
>     `email` STRING,
>     `phone` STRING,
>     `last_name` STRING,
>     `site` STRING,
>     `source` STRING,
>     `subject` STRING,
>     `raw` STRING
> )
> STORED AS PARQUET
> LOCATION '${OUTPUT}';
>
>
> And then inserting data into this table from your CSV table -
>
> INSERT OVERWRITE TABLE EVENTS_PARQUET SELECT * FROM EVENTS;
>
>
> This will create a Parquet file at the specified location (${OUTPUT}).
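>
> As a sanity check, DESCRIBE shows the schema Hive has registered, and a
> quick query by name confirms the columns resolve (running parquet-tools
> schema on a file under ${OUTPUT} shows the names actually written into
> the Parquet footer). Column names below are the ones from the DDL above:
>
> DESCRIBE EVENTS_PARQUET;
>
> SELECT `customer`, `start_time`, `end_time`
> FROM EVENTS_PARQUET
> LIMIT 5;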
>
> Thanks,
> Prasad
>
> On Fri, Jan 26, 2018 at 7:45 AM, Brandon Cooke <brandon.co...@engage.cx>
> wrote:
>
>> Hello,
>>
>> I posted the following on a Cloudera forum but haven’t had much luck.
>> I’m hoping someone here can tell me what step I have probably missed:
>>
>> Hello,
>>
>>
>> I'm using HIVE (v1.2.1) to convert our data files from CSV into Parquet
>> for use in AWS Athena.
>> However, no matter what I try, the resulting Parquet always has column
>> titles *[_col0, _col1, ..., _colN]*
>>
>>
>> After researching, I read that the line *SET
>> parquet.column.index.access=false* was supposed to allow Parquet to use
>> the column titles of my HIVE table; however, it has been unsuccessful so
>> far.
>>
>>
>> Below is an example script I use to create the Parquet from the data:
>>
>>
>>
>> SET parquet.column.index.access=false;
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
>>     `release` STRING,
>>     `customer` STRING,
>>     `cookie` STRING,
>>     `category` STRING,
>>     `end_time` STRING,
>>     `start_time` STRING,
>>     `first_name` STRING,
>>     `email` STRING,
>>     `phone` STRING,
>>     `last_name` STRING,
>>     `site` STRING,
>>     `source` STRING,
>>     `subject` STRING,
>>     `raw` STRING
>> )
>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>> LOCATION '${INPUT}';
>>
>> INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
>> STORED AS PARQUET
>> SELECT *
>> FROM EVENTS;
>>
>>
>> Using parquet-tools, I read the resulting file and below is an example
>> output:
>>
>>
>> _col0 = 0.1.2
>> _col1 = customer1
>> _col2 = NULL
>> _col3 = api
>> _col4 = 2018-01-21T06:57:57Z
>> _col5 = 2018-01-21T06:57:56Z
>> _col6 = Brandon
>> _col7 = bran...@fakesite.com
>> _col8 = 999-999-9999
>> _col9 = Pompei
>> _col10 = Boston
>> _col11 = Wifi
>> _col12 = NULL
>> _col13 = 
>> eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
>>
>>
>> This is problematic because it is impossible to transfer it to an Athena
>> table (or even back to HIVE) without using these index-based column
>> titles. I need HIVE's column titles to transfer over to the Parquet file.
>>
>>
>> I've searched for a very long time and have come up short. Am I doing
>> something wrong?
>> Please let me know if I can provide more information. Thank you!
>>
>> I appreciate your time.
>> Sincerely,
>>
>> Brandon Cooke
>> Software Engineer
>> engage.cx
>> 5500 Interstate N Parkway Suite 130
>>
>
>
>
>
