RE: Need a smart way to delete the first row of my data

Raghunath, Ranjith Wed, 07 Mar 2012 08:12:58 -0800

Sorry, I didn't notice that you were trying to drop the header row specifically. 
Given that, the solution outlined below is probably not the way to go on 
this.  I like Matt's suggestion; it seems like a better approach.

From: Raghunath, Ranjith [mailto:[email protected]]
Sent: Wednesday, March 07, 2012 10:06 AM
To: [email protected]
Subject: RE: Need a smart way to delete the first row of my data

Given a key column that is unique within your dataset, I think this could 
work.

1. Load the file as is, gunzipped, into a Hive table.

2. Determine the total number of rows.

3. Perform an insert into table .... select * from .... order by 
<col_name> desc limit <total_size - 1>

From: Dan Y [mailto:[email protected]]
Sent: Wednesday, March 07, 2012 10:01 AM
To: [email protected]
Subject: Need a smart way to delete the first row of my data

Hello,

I have huge gzipped files that I need to drop the header row from before 
loading to a hive table.

Right now, my process is:
1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)

I am trying to find a way to speed up this process.  Ideally, it would involve 
loading the data to Hive as a first step and then deleting the first row, to 
avoid the unzip/rezip steps.
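
As a point of comparison, steps 1-3 above can be sketched as a single streaming pass, so that no uncompressed intermediate file is ever written to disk. This is only a minimal sketch: the function name drop_header and the file names are hypothetical, and tail -n +2 is swapped in for the sed command mentioned in step 2. The data is still decompressed and recompressed once, so it reduces disk I/O rather than removing the gzip cost entirely.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): strip the first line of a
# gzipped file in one streaming pass.
# Usage: drop_header input.gz output.gz
drop_header() {
    # gzip -dc  : decompress to stdout, leaving the input file untouched
    # tail -n +2: emit everything from the second line onward
    # gzip -1   : recompress with the fastest setting, as in step 3
    gzip -dc "$1" | tail -n +2 | gzip -1 > "$2"
}
```

For example, drop_header data.gz data_noheader.gz would write a new compressed file without the header row, which could then be used for the Hive table in step 4.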

Any ideas would be appreciated!

-Dan
