Yup, Bejoy is correct :-) Just use Hadoop Streaming for what it does best: cleaning, transformations, and validations, in a few simple steps.

Regards,
Praveenesh
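A minimal sketch of the streaming approach being suggested here. The script name (clean_csv.py), the HDFS paths, and the streaming-jar location are illustrative, not from the thread, and the plain comma split assumes no embedded commas, as in the sample data further down:

    #!/usr/bin/env python
    # clean_csv.py (hypothetical name): a streaming mapper that strips the
    # surrounding double quotes and the trailing '|' from each record and
    # emits plain tab-separated text.
    import sys

    for line in sys.stdin:
        line = line.rstrip('\r\n').rstrip('|')   # drop the newline, then the pipe
        # A plain split is enough for the sample data (no embedded commas);
        # switch to the csv module if real records can contain quoted commas.
        fields = [f.strip('"') for f in line.split(',')]
        sys.stdout.write('\t'.join(fields) + '\n')

Run it as a map-only job (zero reducers) so the cleaned records land in HDFS as ordinary part files; the jar path varies by distribution:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/sandeep/raw_csv \
        -output /user/sandeep/clean_csv \
        -mapper clean_csv.py \
        -file clean_csv.py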
On Sat, Sep 8, 2012 at 6:03 PM, Bejoy KS <bejoy...@yahoo.com> wrote:

> Hi Chuck
>
> I believe Praveenesh was adding his thoughts to the discussion on
> preprocessing the data with MapReduce itself. If you go with Hadoop
> Streaming, you can use the Python script as the mapper, and it will do the
> preprocessing in parallel over a large volume of data. The preprocessed
> data can then be loaded into the Hive table.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From:* "Connell, Chuck" <chuck.conn...@nuance.com>
> *Date:* Sat, 8 Sep 2012 12:18:33 +0000
> *To:* user@hive.apache.org
> *Reply-To:* user@hive.apache.org
> *Subject:* RE: How to load csv data into HIVE
>
> I would like to hear more about this "hadoop streaming to Hive" idea. I
> have used streaming jobs as mappers, with a Python script as map.py. Are
> you saying that such a streaming mapper can load its output into Hive? Can
> you send some example code? Hive wants to load "files", not individual
> lines/records. How would you do this?
>
> Thanks very much,
> Chuck
>
> ------------------------------
> *From:* praveenesh kumar [praveen...@gmail.com]
> *Sent:* Saturday, September 08, 2012 7:54 AM
> *To:* user@hive.apache.org
> *Subject:* Re: How to load csv data into HIVE
>
> You can use Hadoop Streaming; that would be much faster. Just run your
> cleaning shell-script logic in the map phase and it will be done in a few
> minutes. That keeps the data in HDFS.
>
> Regards,
> Praveenesh
>
> On Fri, Sep 7, 2012 at 8:37 PM, Sandeep Reddy P <
> sandeepreddy.3...@gmail.com> wrote:
>
>> Hi,
>> Thank you all for your help. I'll try both ways and I'll get back to you.
>>
>> On Fri, Sep 7, 2012 at 11:02 AM, Mohammad Tariq <donta...@gmail.com> wrote:
>>
>>> I said this assuming that a Hadoop cluster is available, since Sandeep
>>> is planning to use Hive. If that is the case, then MapReduce would be
>>> faster for such large files.
>>>
>>> Regards,
>>> Mohammad Tariq
>>>
>>> On Fri, Sep 7, 2012 at 8:27 PM, Connell, Chuck <chuck.conn...@nuance.com> wrote:
>>>
>>>> I cannot promise which is faster. A lot depends on how clever your
>>>> scripts are.
>>>>
>>>> *From:* Sandeep Reddy P [mailto:sandeepreddy.3...@gmail.com]
>>>> *Sent:* Friday, September 07, 2012 10:42 AM
>>>> *To:* user@hive.apache.org
>>>> *Subject:* Re: How to load csv data into HIVE
>>>>
>>>> Hi,
>>>> I wrote a shell script to clean the csv data, but when I run it on a
>>>> 12GB csv it takes a long time. If I run a Python script instead, will
>>>> that be faster?
>>>>
>>>> On Fri, Sep 7, 2012 at 10:39 AM, Connell, Chuck <
>>>> chuck.conn...@nuance.com> wrote:
>>>>
>>>> How about a Python script that changes it into plain tab-separated
>>>> text?
>>>> So it would look like this:
>>>>
>>>> 174969274<tab>14-mar-2006<tab>3522876<tab><tab>14-mar-2006<tab>500000308<tab>65<tab>1<newline>
>>>> etc.
>>>>
>>>> Tab-separated with newlines is easy to read and works perfectly on
>>>> import.
>>>>
>>>> Chuck Connell
>>>> Nuance R&D Data Team
>>>> Burlington, MA
>>>> 781-565-4611
>>>>
>>>> *From:* Sandeep Reddy P [mailto:sandeepreddy.3...@gmail.com]
>>>> *Subject:* How to load csv data into HIVE
>>>>
>>>> Hi,
>>>> Here is the sample data:
>>>>
>>>> "174969274","14-mar-2006","3522876","","14-mar-2006","500000308","65","1"|
>>>> "174969275","19-jul-2006","3523154","","19-jul-2006","500000308","65","1"|
>>>> "174969276","31-dec-2005","3530333","","31-dec-2005","500000308","65","1"|
>>>> "174969277","14-apr-2005","3531470","","14-apr-2005","500000308","65","1"|
>>>>
>>>> How do I load this kind of data into HIVE?
>>>> I'm using a shell script to get rid of the double quotes and the '|',
>>>> but it takes a very long time on each csv, and they are 12GB each.
>>>> What is the best way to do this?
>>>>
>>>> --
>>>> Thanks,
>>>> sandeep
>>
>> --
>> Thanks,
>> sandeep
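On Chuck's point that Hive loads "files", not individual lines/records: the map-only job sketched above writes ordinary part files into its output directory, and LOAD DATA INPATH simply moves those files into the table's warehouse directory. A sketch, with made-up table and column names, since the thread does not name them:

    CREATE TABLE sample_data (
        id STRING, event_date STRING, code STRING, extra STRING,
        event_date2 STRING, account STRING, col7 STRING, col8 STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/user/sandeep/clean_csv' INTO TABLE sample_data;

If the streaming job leaves a _logs subdirectory in its output, you may need to remove it first, since older Hive versions reject subdirectories in LOAD DATA.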