Hi Brandon,

Have you tried creating an external table with the required column names for Parquet?
CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
  `release` STRING,
  `customer` STRING,
  `cookie` STRING,
  `category` STRING,
  `end_time` STRING,
  `start_time` STRING,
  `first_name` STRING,
  `email` STRING,
  `phone` STRING,
  `last_name` STRING,
  `site` STRING,
  `source` STRING,
  `subject` STRING,
  `raw` STRING
)
STORED AS PARQUET
LOCATION '${OUTPUT}';

And then inserting data into this table from your CSV table:

INSERT OVERWRITE TABLE EVENTS_PARQUET
SELECT * FROM EVENTS;

This will create Parquet files at the specified location (${OUTPUT}).

Thanks,
Prasad

On Fri, Jan 26, 2018 at 7:45 AM, Brandon Cooke <brandon.co...@engage.cx> wrote:
> Hello,
>
> I posted the following on a Cloudera forum but haven't had much luck.
> I'm hoping someone here can tell me what step I have probably missed:
>
> Hello,
>
> I'm using Hive (v1.2.1) to convert our data files from CSV into Parquet
> for use in AWS Athena.
> However, no matter what I try, the resulting Parquet always has column
> titles [_col0, _col1, ..., _colN].
>
> After researching, I read that the line SET parquet.column.index.access=false
> was supposed to allow Parquet to use the column titles of my Hive table;
> however, it has been unsuccessful so far.
> Below is an example script I use to create the Parquet from data:
>
> SET parquet.column.index.access=false;
>
> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
>   `release` STRING,
>   `customer` STRING,
>   `cookie` STRING,
>   `category` STRING,
>   `end_time` STRING,
>   `start_time` STRING,
>   `first_name` STRING,
>   `email` STRING,
>   `phone` STRING,
>   `last_name` STRING,
>   `site` STRING,
>   `source` STRING,
>   `subject` STRING,
>   `raw` STRING
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> LOCATION '${INPUT}';
>
> INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
> STORED AS PARQUET
> SELECT *
> FROM EVENTS;
>
> Using parquet-tools, I read the resulting file and below is an example output:
>
> _col0 = 0.1.2
> _col1 = customer1
> _col2 = NULL
> _col3 = api
> _col4 = 2018-01-21T06:57:57Z
> _col5 = 2018-01-21T06:57:56Z
> _col6 = Brandon
> _col7 = bran...@fakesite.com
> _col8 = 999-999-9999
> _col9 = Pompei
> _col10 = Boston
> _col11 = Wifi
> _col12 = NULL
> _col13 = eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
>
> This is problematic because it is impossible to transfer it to an Athena
> table (or even back to Hive) without using these index-based column titles.
> I need Hive's column titles to transfer over to the Parquet file.
>
> I've searched for a very long time and have come up short. Am I doing
> something wrong?
> Please let me know if I can provide more information. Thank you!
>
> I appreciate your time.
> Sincerely,
>
> Brandon Cooke
> Software Engineer
> engage.cx
> 5500 Interstate N Parkway Suite 130