Thank you. I have explained the problem in more detail below. Is this
possible?

We have a use case where the files are laid out in the directory structure
shown below. The requirement is that files inside the same parent directory
must not be processed in parallel: 1.txt and 2.txt cannot be processed in
parallel because we do some checkpointing and have to process the oldest
file first. However, 1.txt and 5.txt can be processed in parallel. Right
now I am overriding the listStatus method to pick only the oldest file, but
this means I cannot achieve parallelism across parent directories either,
since the number of input splits is always 1. What would be the way to go
about this use case? In short, I want parallelism across parent directories
but not within one. Please advise.
published/
+-- Parent1/
|   +-- 1.txt
|   +-- 2.txt
|   +-- 3.txt
+-- Parent2/
    +-- 4.txt
    +-- 5.txt
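
For illustration, the per-parent selection described above could be sketched
roughly as follows. This is plain Java with no Hadoop dependency: FileEntry
is a hypothetical stand-in for Hadoop's FileStatus, and the modification
times are made up. In a real job the same grouping would presumably sit
inside the overridden listStatus of the custom input format, so that one
file per parent directory is returned instead of one file overall.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OldestPerParent {

    /** Hypothetical stand-in for Hadoop's FileStatus: a path plus its modification time. */
    static final class FileEntry {
        final String path;
        final long modificationTime;

        FileEntry(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }

        /** Everything before the last '/' is treated as the parent directory. */
        String parent() {
            int slash = path.lastIndexOf('/');
            return slash < 0 ? "" : path.substring(0, slash);
        }
    }

    /** Returns the oldest file of each parent directory, ordered by parent name. */
    static List<FileEntry> oldestPerParent(List<FileEntry> files) {
        Map<String, FileEntry> oldest = new TreeMap<>();
        for (FileEntry f : files) {
            // Keep whichever entry has the smaller modification time for this parent.
            oldest.merge(f.parent(), f,
                    (a, b) -> a.modificationTime <= b.modificationTime ? a : b);
        }
        return new ArrayList<>(oldest.values());
    }

    public static void main(String[] args) {
        // Modification times below are invented for the example.
        List<FileEntry> files = Arrays.asList(
                new FileEntry("published/Parent1/1.txt", 100L),
                new FileEntry("published/Parent1/2.txt", 200L),
                new FileEntry("published/Parent1/3.txt", 300L),
                new FileEntry("published/Parent2/4.txt", 150L),
                new FileEntry("published/Parent2/5.txt", 250L));

        for (FileEntry f : oldestPerParent(files)) {
            System.out.println(f.path);
        }
    }
}
```

Since the files are not splittable, returning one file per parent should
give one split per parent, so Parent1 and Parent2 could proceed in parallel
while files inside a parent are still taken one at a time. This assumes the
job is rerun to pick up the next-oldest file in each parent.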
On Wed, Jul 29, 2015 at 5:31 PM, Gera Shegalov <[email protected]> wrote:
> Can you clarify the requirement "processed first"? Maps run in parallel
> without any ordering guarantees. If you want to affect the mapping
> file->split number, you can implement your own getSplits in the custom
> input format and return splits ordered any way you like.
>
> On Wed, Jul 22, 2015 at 12:06 PM, Nishanth S <[email protected]>
> wrote:
>
>> Hey folks,
>>
>> Is there a way to sort the input splits in MapReduce? We have a case
>> where there are two files, file1 and file2, in the input directory. Since
>> we have a custom input format whose isSplitable always returns false, each
>> of these files would be processed by a different mapper. How could I make
>> sure that file1 is processed before file2 (I want the oldest file to be
>> processed first)? Is this possible?
>>
>> Thanks,
>> Nishan
>>
>
>