Hi Chuck

I believe Praveenesh was adding his thoughts to the discussion on preprocessing 
the data using MapReduce itself. If you go with Hadoop Streaming you can use 
the python script as the mapper, and that will do the preprocessing in parallel on 
a large volume of data. Then this preprocessed data can be loaded into a Hive table.
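
As a rough sketch of what that mapper could look like (the parsing below is only an 
assumption based on the sample rows Sandeep posted, and the script name is made up):

#!/usr/bin/env python
# clean_csv.py - illustrative streaming mapper, a sketch rather than tested code.
# Reads the raw quoted, pipe-terminated rows on stdin and writes
# plain tab-separated lines on stdout.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    if line.endswith('|'):
        line = line[:-1]          # drop the '|' row terminator
    # naive parse: assumes the quoted fields never contain embedded commas
    fields = [f.strip('"') for f in line.split(',')]
    sys.stdout.write('\t'.join(fields) + '\n')

You would ship that script as the -mapper of a streaming job with zero reducers (the 
streaming jar location and the exact options depend on your Hadoop distribution). The 
job output is then just ordinary files sitting in an HDFS directory, so Hive is still 
loading files, not individual records: you point a LOAD DATA INPATH statement, or an 
external table's LOCATION, at that directory.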



Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: "Connell, Chuck" <chuck.conn...@nuance.com>
Date: Sat, 8 Sep 2012 12:18:33 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: RE: How to load csv data into HIVE

I would like to hear more about this "hadoop streaming to Hive" idea. I have 
used streaming jobs as mappers, with a python script as map.py. Are you saying 
that such a streaming mapper can load its output into Hive? Can you send some 
example code? Hive wants to load "files" not individual lines/records. How 
would you do this?

Thanks very much,
Chuck


________________________________
From: praveenesh kumar [praveen...@gmail.com]
Sent: Saturday, September 08, 2012 7:54 AM
To: user@hive.apache.org
Subject: Re: How to load csv data into HIVE

You can use Hadoop streaming; that would be much faster... Just run your 
cleaning shell script logic in the map phase and it will be done in just a few 
minutes. That will also keep the data in HDFS.

Regards,
Praveenesh

On Fri, Sep 7, 2012 at 8:37 PM, Sandeep Reddy P 
<sandeepreddy.3...@gmail.com> wrote:
Hi,
Thank you all for your help. I'll try both ways and I'll get back to you.


On Fri, Sep 7, 2012 at 11:02 AM, Mohammad Tariq 
<donta...@gmail.com> wrote:
I said this assuming that a Hadoop cluster is available, since Sandeep is 
planning to use Hive. If that is the case, then MapReduce would be faster for 
such large files.

Regards,
    Mohammad Tariq



On Fri, Sep 7, 2012 at 8:27 PM, Connell, Chuck 
<chuck.conn...@nuance.com> wrote:
I cannot promise which is faster. A lot depends on how clever your scripts are.



From: Sandeep Reddy P [mailto:sandeepreddy.3...@gmail.com]
Sent: Friday, September 07, 2012 10:42 AM
To: user@hive.apache.org
Subject: Re: How to load csv data into HIVE

Hi,
I wrote a shell script to clean the csv data, but when I run that script on a 12GB csv 
it takes a long time. If I run a python script instead, will that be faster?
On Fri, Sep 7, 2012 at 10:39 AM, Connell, Chuck 
<chuck.conn...@nuance.com> wrote:
How about a Python script that changes it into plain tab-separated text? So it 
would look like this…

174969274<tab>14-mar-2006<tab>3522876<tab><tab>14-mar-2006<tab>500000308<tab>65<tab>1<newline>
etc…

Tab-separated with newlines is easy to read and works perfectly on import.
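
A minimal standalone version might look something like this (just a sketch; the file 
names are placeholders and it assumes the quoted fields never contain embedded commas):

#!/usr/bin/env python
# csv_to_tsv.py - sketch of the conversion described above
# Usage: python csv_to_tsv.py input.csv output.tsv   (file names are placeholders)
import sys

def convert(src, dst):
    for line in src:
        line = line.strip()
        if not line:
            continue
        if line.endswith('|'):
            line = line[:-1]      # drop the '|' row terminator
        # naive parse: assumes no commas inside the quoted fields
        fields = [f.strip('"') for f in line.split(',')]
        dst.write('\t'.join(fields) + '\n')

if __name__ == '__main__':
    with open(sys.argv[1]) as src, open(sys.argv[2], 'w') as dst:
        convert(src, dst)

Since it streams line by line it should get through a 12GB file without loading it 
into memory.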

Chuck Connell
Nuance R&D Data Team
Burlington, MA
781-565-4611

From: Sandeep Reddy P [mailto:sandeepreddy.3...@gmail.com]
Subject: How to load csv data into HIVE

Hi,
Here is the sample data
"174969274","14-mar-2006","
3522876","","14-mar-2006","500000308","65","1"|
"174969275","19-jul-2006","3523154","","19-jul-2006","500000308","65","1"|
"174969276","31-dec-2005","3530333","","31-dec-2005","500000308","65","1"|
"174969277","14-apr-2005","3531470","","14-apr-2005","500000308","65","1"|

How to load this kind of data into HIVE?
I'm using a shell script to get rid of the double quotes and the '|', but it's taking a 
very long time on each csv, and they are 12GB each. What is the best way to do 
this?




--
Thanks,
sandeep




--
Thanks,
sandeep


