RE: Need a smart way to delete the first row of my data

Tucker, Matt Wed, 07 Mar 2012 08:06:28 -0800

Hi Dan,

We've run into the same issue with some of our datasets.  The best option is to 
see if the provider can remove the column headers.  We've found that the 
least-worst way to do it ourselves is to run a query that filters out the 
values that we see in the header row.


Example:
CREATE TABLE table_a_clean AS
SELECT *
FROM table_a
WHERE column1 != 'column1_name' AND column2 != 'column2_name';
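
One caveat with the filter above: because the two comparisons are joined with AND, a row is kept only when both are true, so a data row that shares a single header value, or that has a NULL in either column, is dropped along with the header. The sketch below shows a stricter variant that removes only rows matching the header in every column. It is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for Hive, the sample rows are made up, and the COALESCE-to-empty-string trick assumes an empty string is never a header name.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (column1 TEXT, column2 TEXT)")

# Made-up sample data: one header row that was loaded as data, plus edge cases.
sample_rows = [
    ("column1_name", "column2_name"),  # header row
    ("a", "b"),                        # ordinary data row
    ("column1_name", "x"),             # data row sharing one header value
    (None, "c"),                       # data row with a NULL
]
conn.executemany("INSERT INTO table_a VALUES (?, ?)", sample_rows)

# Drop a row only when every column matches its header name.
# COALESCE turns NULL into '' so the comparison is never NULL.
conn.execute(
    """
    CREATE TABLE table_a_clean AS
    SELECT *
    FROM table_a
    WHERE NOT (COALESCE(column1, '') = 'column1_name'
           AND COALESCE(column2, '') = 'column2_name')
    """
)

kept = conn.execute("SELECT column1, column2 FROM table_a_clean").fetchall()
```

The WHERE clause uses only NOT, AND, and COALESCE, so the same predicate should carry over to a Hive query, though that has not been checked against a Hive installation here.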

Matt Tucker

From: Dan Y [mailto:[email protected]]
Sent: Wednesday, March 07, 2012 11:01 AM
To: [email protected]
Subject: Need a smart way to delete the first row of my data

Hello,

I have huge gzipped files that I need to drop the header row from before 
loading them into a Hive table.

Right now, my process is:
1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)
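
For reference, steps 1 to 3 can be folded into a single streaming pass, sketched below with Python's standard gzip module. This is only a rough sketch: the function name and file paths are hypothetical, and the data is still decompressed and recompressed, so it saves the intermediate uncompressed file on disk rather than the CPU time.

```python
import gzip
import shutil


def drop_first_line_gz(src_path, dst_path, compresslevel=1):
    """Copy a gzipped file to dst_path, omitting its first line.

    The data is streamed in chunks, so the uncompressed contents are
    never written to disk or held in memory all at once.
    compresslevel=1 mirrors the `gzip -1` setting from step 3.
    """
    with gzip.open(src_path, "rb") as src, \
            gzip.open(dst_path, "wb", compresslevel=compresslevel) as dst:
        src.readline()  # read and discard the header row
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy the rest
```

A call would look like `drop_first_line_gz("data.gz", "data_noheader.gz")`, where both file names are placeholders.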

I am trying to find a way to speed up this process.  Ideally, it would involve 
loading the data to Hive as a first step and then deleting the first row, to 
avoid the unzip/rezip steps.

Any ideas would be appreciated!

-Dan
