On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote:
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>
> > Each line from FileA gets compared with every line from FileB1,
> > FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
>
> > > [...] the field you want to do the comparison. At this point you
> > > read the contents of FileA that reached this reducer and, since
> > > its contents were sorted as well, you can quickly [...]
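The reduce-side suggestion quoted above is truncated, but the surviving part (partition on the comparison field, FileA's contents arrive sorted at the reducer) can be sketched in plain Java. This is a minimal illustration, not Hadoop API; the method name `inFileA` and the string records are invented for the example, and the "quickly" step is assumed to be a binary search over the sorted FileA records.

```java
import java.util.Collections;
import java.util.List;

public class ReduceSideSketch {
    // FileA's records for this reducer arrive sorted, so each FileB
    // record can be checked with a binary search instead of a scan.
    static boolean inFileA(List<String> sortedFileA, String fileBRecord) {
        return Collections.binarySearch(sortedFileA, fileBRecord) >= 0;
    }

    public static void main(String[] args) {
        List<String> fileA = List.of("apple", "mango", "pear"); // already sorted
        System.out.println(inFileA(fileA, "mango")); // true
        System.out.println(inFileA(fileA, "kiwi"));  // false
    }
}
```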
> pmg wrote:
> > Thanks owen. Are there any examples that I can look at?
>
> In the general case, I'd define an InputFormat that takes two
> directories, computes the input splits for each directory, and
> generates a new list of InputSplits that is the cross-product of the
> two lists. Each new split would pair a FileSplit for dir1 with a
> FileSplit for dir2, and the record reader would return a TextPair
> with left and right records (ie. lines). Clearly, you read the first
> line of split1 and cross it by each line from split2, then read the
> second line of split1 and process each line from split2, etc.
>
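The pairing logic Owen describes can be sketched without any Hadoop classes. This is a stand-alone illustration: `TextPair` here is a plain record standing in for the writable pair a real record reader would emit, and the lists stand in for the lines of two splits.

```java
import java.util.ArrayList;
import java.util.List;

public class CrossProductSketch {
    // Stand-in for a Hadoop TextPair writable: one line from each directory.
    record TextPair(String left, String right) {}

    // Cross two "splits" (lists of lines): every line of split1 is
    // paired with every line of split2, as the record reader would do.
    static List<TextPair> cross(List<String> split1, List<String> split2) {
        List<TextPair> pairs = new ArrayList<>();
        for (String left : split1) {          // first line of split1 ...
            for (String right : split2) {     // ... against each line of split2
                pairs.add(new TextPair(left, right));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<TextPair> pairs = cross(List.of("a1", "a2"), List.of("b1", "b2", "b3"));
        System.out.println(pairs.size()); // 2 * 3 = 6 pairs
    }
}
```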
> You'll need to ensure that you don't overwhelm the system with too
> many input splits (ie. maps). Also don't forget that N^2/M grows
> much faster with the size of the input (N) than the M machines can
> handle in a fixed amount of time.
>
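A back-of-the-envelope check makes the N^2/M warning concrete for the sizes pmg gives later in the thread (600K and 2M records); the machine count of 20 is an arbitrary assumption for illustration.

```java
public class ScaleCheck {
    public static void main(String[] args) {
        long fileA = 600_000L;            // records in FileA
        long fileB = 2_000_000L;          // records across FileB1, FileB2, ...
        long comparisons = fileA * fileB; // every line by every line
        System.out.println(comparisons);  // 1200000000000

        // Doubling both inputs quadruples the work, while adding
        // machines (M) only divides it linearly.
        long machines = 20;               // assumed cluster size
        System.out.println(comparisons / machines); // 60000000000 per machine
    }
}
```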
> > Two input directories:
> >
> > 1. input1 directory with a single file of 600K records - FileA
> > 2. input2 directory segmented into different files with 2 million
> >    records - FileB1, FileB2 etc.
>
> In this particular case, it would be right to load all of FileA into
> memory and process the chunks of FileB/part-*. Then it would be much
> faster than needing to re-read the file over and over again, but
> otherwise it would be the same.
>
> -- Owen
>
>
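Owen's load-FileA-into-memory suggestion can be sketched as follows. This is an illustration only: in a real mapper FileA would be loaded once in the task's setup, while here plain lists stand in for the files, and the `matches` method with its string-equality comparison is an invented stand-in for whatever comparison the job actually needs.

```java
import java.util.ArrayList;
import java.util.List;

public class MapSideCrossSketch {
    // Hold all of FileA in memory once, then stream each chunk of
    // FileB/part-* past it, instead of re-reading FileA per chunk.
    static List<String> matches(List<String> fileA, List<String> fileBChunk) {
        List<String> out = new ArrayList<>();
        for (String b : fileBChunk) {    // one pass over the FileB chunk
            for (String a : fileA) {     // FileA is already in memory
                if (a.equals(b)) {       // stand-in comparison
                    out.add(a + "=" + b);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fileA = List.of("x", "y", "z");
        List<String> chunk = List.of("y", "q", "z");
        System.out.println(matches(fileA, chunk)); // [y=y, z=z]
    }
}
```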
--
View this message in context:
http://www.nabble.com/multiple-file-input-tp24095358p24105398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
On Jun 18, 2009, at 10:56 AM, pmg wrote:

Each line from FileA gets compared with every line from FileB1,
FileB2, etc. FileB1, FileB2, etc. are in a different input directory.

[...] compares the line with each line from input2?

What is the best way forward? I have seen plenty of examples that map
each record from a single input file and reduce into an output.

thanks
--
View this message in context:
http://www.nabble.com/multiple-file-input-tp24095358p24095358.html