Hi Dan,

We've run into the same issue with some of our datasets.  The best option is to 
see if the provider can remove the column headers.  We've found that the 
least-worst way to do it ourselves is to run a query that filters out the 
values that we see in the header row.

Example:
CREATE TABLE table_a_clean AS
SELECT *
FROM table_a
WHERE column1 != 'column1_name' AND column2 != 'column2_name';
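
If the header does have to come off before loading, the gunzip / sed / gzip steps from your process can probably be streamed through a single pipeline so that no uncompressed copy is written to disk. This is only a rough sketch, not something tested against your files: strip_header and the file names are made-up placeholders, and gzip -1 simply mirrors the setting you mentioned.

```shell
#!/bin/sh
# Sketch: drop the first line of a gzipped text file in one streaming pass.
# strip_header is a hypothetical helper name; the paths are placeholders.
strip_header() {
    # $1 = input .gz file, $2 = output .gz file
    # gzip -dc   : decompress to stdout, leaving the input file untouched
    # tail -n +2 : pass through everything from line 2 onward
    # gzip -1    : recompress with the fastest compression level
    gzip -dc "$1" | tail -n +2 | gzip -1 > "$2"
}
```

Usage would look like: strip_header table_a.gz table_a_noheader.gz. The data still gets decompressed and recompressed once, but the three stages run concurrently instead of one after another.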

Matt Tucker

From: Dan Y [mailto:[email protected]]
Sent: Wednesday, March 07, 2012 11:01 AM
To: [email protected]
Subject: Need a smart way to delete the first row of my data

Hello,

I have huge gzipped files that I need to drop the header row from before 
loading them into a Hive table.

Right now, my process is:
1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)

I am trying to find a way to speed up this process.  Ideally, it would involve 
loading the data to Hive as a first step and then deleting the first row, to 
avoid the unzip/rezip steps.

Any ideas would be appreciated!

-Dan
