You could attach the hadoop dfs command via a bootstrap action: http://stackoverflow.com/questions/12055595/emr-how-to-join-files-into-one
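A rough sketch of what such a merge script could look like (not from the thread): it assumes the job writes its part-* files to /user/hadoop/output on HDFS and that the script runs on the master node; both paths are placeholders.

    #!/bin/bash
    # merge_output.sh - merge the job's part-* files into one local file.
    # OUTPUT_DIR and LOCAL_FILE are hypothetical paths; adjust to the cluster.
    set -e
    OUTPUT_DIR=/user/hadoop/output        # HDFS directory holding part-00000, part-00001, ...
    LOCAL_FILE=/home/hadoop/merged.txt    # merged result on the master node's local disk
    hadoop fs -getmerge "$OUTPUT_DIR" "$LOCAL_FILE"

A script along these lines could also be attached to the cluster as a custom step (for example via the EMR console or "aws emr add-steps") so it runs once the processing steps have finished.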
BR,
Alex

> On 23 Feb 2015, at 08:10, Jonathan Aquilina <[email protected]> wrote:
>
> Thanks Alex. Where would that command be placed: in a mapper, in a reducer, or run
> as a standalone command? Here at work we are looking to use Amazon EMR to do our
> number crunching, and we have access to the master node but not really to the rest
> of the cluster. Can this be added as a step to be run after the initial processing?
>
> ---
> Regards,
> Jonathan Aquilina
> Founder Eagle Eye T
>
> On 2015-02-23 08:05, Alexander Alten-Lorenz wrote:
>
>> Hi,
>>
>> You can use a single reducer
>> (http://wiki.apache.org/hadoop/HowManyMapsAndReduces) for smaller datasets,
>> or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name
>>
>> BR,
>> Alex
>>
>>> On 23 Feb 2015, at 08:00, Jonathan Aquilina <[email protected]> wrote:
>>>
>>> Hey all,
>>>
>>> I understand that the purpose of splitting files is to distribute the data
>>> to multiple core and task nodes in a cluster. My question is: after the
>>> output is complete, is there a way to combine all the parts into a
>>> single file?
>>>
>>> --
>>> Regards,
>>> Jonathan Aquilina
>>> Founder Eagle Eye T
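For the single-reducer route Alexander mentions above, a minimal sketch assuming a Hadoop streaming job; the streaming jar location, mapper/reducer scripts, and HDFS paths are placeholders, not taken from the thread:

    # Force a single reducer so the job emits one part-00000 file instead of many parts.
    # Jar path and all input/output paths below are hypothetical; adjust to the cluster.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapreduce.job.reduces=1 \
        -input /user/hadoop/input \
        -output /user/hadoop/output \
        -mapper my_mapper.py \
        -reducer my_reducer.py \
        -file my_mapper.py -file my_reducer.py

As noted above, this is only sensible for smaller datasets, since the single reducer has to process all of the map output on its own.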
