Sorry, I didn't read that you were trying to drop the header row specifically. Having said that, the solution outlined below is probably not the way to go on this. I like Matt's suggestion; it seems like a better approach.

From: Raghunath, Ranjith [mailto:[email protected]]
Sent: Wednesday, March 07, 2012 10:06 AM
To: [email protected]
Subject: RE: Need a smart way to delete the first row of my data

Given you have a key column that is unique within your dataset, I think this could work.

1. Load the file as is, gunzipped, into a hive table
2. Determine the total row size.
3. Perform an insert into table .... Select * from .... Order by <col_name> desc limit <total_size -1>

From: Dan Y [mailto:[email protected]]
Sent: Wednesday, March 07, 2012 10:01 AM
To: [email protected]
Subject: Need a smart way to delete the first row of my data

Hello,

I have huge gzipped files that I need to drop the header row from before loading to a hive table.

Right now, my process is:

1. Gunzip the data (...takes forever)
2. Drop the first row using the Unix sed command
3. Re-zip the data with gzip -1 (...takes forever)
4. Create the Hive table (on the compressed file to store it efficiently)

I am trying to find a way to speed up this process. Ideally, it would involve loading the data to Hive as a first step and then deleting the first row, to avoid the unzip/rezip steps.

Any ideas would be appreciated!

-Dan
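
The gunzip, sed, and gzip steps in Dan's current process can also be run as one streaming pass, so the uncompressed data never has to be written to disk. What follows is only a minimal Python sketch of that idea, not something proposed in this thread. The function name `drop_header_gz` and the file paths are hypothetical, and it assumes the header is exactly the first newline-terminated line of the file.

```python
import gzip
import shutil


def drop_header_gz(src_path, dst_path, compresslevel=1):
    """Copy a gzipped file to dst_path, omitting its first line.

    compresslevel=1 mirrors the `gzip -1` setting from the original process.
    """
    with gzip.open(src_path, "rb") as src, \
            gzip.open(dst_path, "wb", compresslevel=compresslevel) as dst:
        src.readline()                # read and discard the header row
        shutil.copyfileobj(src, dst)  # stream the remaining bytes in chunks


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    drop_header_gz("data_with_header.gz", "data_no_header.gz")
```

This still pays the cost of decompressing and recompressing the data; what it saves is the intermediate uncompressed file and the separate passes over it.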
