You can use Hadoop Streaming, which should be much faster. Just run your
cleaning shell script logic in the map phase and it should finish in just a
few minutes. That also keeps the data in HDFS.
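
For illustration, here is a minimal sketch of what such a mapper could look
like in Python instead of shell. The file name mapper.py and the function name
clean_line are made up for this example, and it assumes every record sits on
one line, ends with '|', and has no tabs inside the fields, as in the sample
data quoted below.

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper (mapper.py): quoted CSV -> tab-separated."""
import csv
import sys


def clean_line(line):
    """Convert one '"a","b",...,"z"|' record into a tab-separated string."""
    record = line.rstrip("\r\n")
    # Drop the trailing '|' record terminator seen in the sample data.
    if record.endswith("|"):
        record = record[:-1]
    # Let the csv module strip the double quotes (and cope with embedded commas).
    fields = next(csv.reader([record]))
    return "\t".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Streaming feeds input lines on stdin and collects output from stdout.
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(clean_line(line) + "\n")


if __name__ == "__main__":
    main()
```

You can try it locally by piping a few lines of the CSV into the script, then
run it as a map-only streaming job (zero reducers) so the cleaned output lands
directly in HDFS.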

Regards,
Praveenesh

On Fri, Sep 7, 2012 at 8:37 PM, Sandeep Reddy P <sandeepreddy.3...@gmail.com> wrote:

> Hi,
> Thank you all for your help. I'll try both ways and I'll get back to you.
>
>
> On Fri, Sep 7, 2012 at 11:02 AM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> I said this assuming that a Hadoop cluster is available since Sandeep is
>> planning to use Hive. If that is the case then MapReduce would be faster
>> for such large files.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>>
>> On Fri, Sep 7, 2012 at 8:27 PM, Connell, Chuck
>> <chuck.conn...@nuance.com> wrote:
>>
>>> I cannot promise which is faster. A lot depends on how clever your
>>> scripts are.
>>>
>>> *From:* Sandeep Reddy P [mailto:sandeepreddy.3...@gmail.com]
>>> *Sent:* Friday, September 07, 2012 10:42 AM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: How to load csv data into HIVE
>>>
>>> Hi,
>>> I wrote a shell script to clean the CSV data, but when I run that script
>>> on a 12 GB CSV file it takes a long time. Would a Python script be faster?
>>>
>>> On Fri, Sep 7, 2012 at 10:39 AM, Connell, Chuck <
>>> chuck.conn...@nuance.com> wrote:
>>>
>>> How about a Python script that changes it into plain tab-separated text?
>>> So it would look like this…
>>>
>>> 174969274<tab>14-mar-2006<tab>3522876<tab><tab>14-mar-2006<tab>500000308<tab>65<tab>1<newline>
>>> etc…
>>>
>>> Tab-separated with newlines is easy to read and works perfectly on
>>> import.
>>>
>>> Chuck Connell
>>> Nuance R&D Data Team
>>> Burlington, MA
>>> 781-565-4611
>>>
>>> *From:* Sandeep Reddy P [mailto:sandeepreddy.3...@gmail.com]
>>> *Subject:* How to load csv data into HIVE
>>>
>>> Hi,
>>> Here is the sample data
>>> "174969274","14-mar-2006","3522876","","14-mar-2006","500000308","65","1"|
>>> "174969275","19-jul-2006","3523154","","19-jul-2006","500000308","65","1"|
>>> "174969276","31-dec-2005","3530333","","31-dec-2005","500000308","65","1"|
>>> "174969277","14-apr-2005","3531470","","14-apr-2005","500000308","65","1"|
>>>
>>> How do I load this kind of data into Hive?
>>> I'm using a shell script to get rid of the double quotes and '|', but it's
>>> taking a very long time on each CSV file, and the files are 12 GB each.
>>> What is the best way to do this?
>>>
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> sandeep
>>>
>>
>>
>
>
> --
> Thanks,
> sandeep
>
>
