You could attach the hadoop dfs command via a bootstrap action: http://stackoverflow.com/questions/12055595/emr-how-to-join-files-into-one
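A rough sketch of what such a merge script could look like (not from the thread): it assumes the job writes its part-* files to /user/hadoop/output on HDFS and that the script runs on the master node; both paths are placeholders.

    #!/bin/bash
    # merge_output.sh - merge the job's part-* files into one local file.
    # OUTPUT_DIR and LOCAL_FILE are hypothetical paths; adjust to the cluster.
    set -e
    OUTPUT_DIR=/user/hadoop/output        # HDFS directory holding part-00000, part-00001, ...
    LOCAL_FILE=/home/hadoop/merged.txt    # merged result on the master node's local disk
    hadoop fs -getmerge "$OUTPUT_DIR" "$LOCAL_FILE"

A script along these lines could also be attached to the cluster as a custom step (for example via the EMR console or "aws emr add-steps") so it runs once the processing steps have finished.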
BR,
Alex

> On 23 Feb 2015, at 08:10, Jonathan Aquilina <[email protected]> wrote:
>
> Thanks Alex. Where would that command be placed: in a mapper, in a reducer, or run
> as a standalone command? Here at work we are looking to use Amazon EMR to do our
> number crunching, and we have access to the master node but not really to the rest
> of the cluster. Can this be added as a step to be run after the initial processing?
>
> ---
> Regards,
> Jonathan Aquilina
> Founder Eagle Eye T
>
> On 2015-02-23 08:05, Alexander Alten-Lorenz wrote:
>
>> Hi,
>>
>> You can use a single reducer
>> (http://wiki.apache.org/hadoop/HowManyMapsAndReduces) for smaller datasets,
>> or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name
>>
>> BR,
>> Alex
>>
>>> On 23 Feb 2015, at 08:00, Jonathan Aquilina <[email protected]> wrote:
>>>
>>> Hey all,
>>>
>>> I understand that the purpose of splitting files is to distribute the data
>>> to multiple core and task nodes in a cluster. My question is: after the
>>> output is complete, is there a way to combine all the parts into a
>>> single file?
>>>
>>> --
>>> Regards,
>>> Jonathan Aquilina
>>> Founder Eagle Eye T
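For the single-reducer route Alexander mentions above, a minimal sketch assuming a Hadoop streaming job; the streaming jar location, mapper/reducer scripts, and HDFS paths are placeholders, not taken from the thread:

    # Force a single reducer so the job emits one part-00000 file instead of many parts.
    # Jar path and all input/output paths below are hypothetical; adjust to the cluster.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapreduce.job.reduces=1 \
        -input /user/hadoop/input \
        -output /user/hadoop/output \
        -mapper my_mapper.py \
        -reducer my_reducer.py \
        -file my_mapper.py -file my_reducer.py

As noted above, this is only sensible for smaller datasets, since the single reducer has to process all of the map output on its own.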
