Re: multiple file input

2009-06-22 Thread Erik Paulson
On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote:
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>
> > Each line from FileA gets compared with every line from FileB1,
> > FileB2 etc.
> > etc. FileB1, FileB2 etc. are in a different input directory
>
> In the general case, I'd define an Inp

Re: multiple file input

2009-06-20 Thread pmg
> the field you want to do the comparison. At this point you read the contents of FileA that reached this reducer, and since its contents were sorted as well, you can quickl

Re: multiple file input

2009-06-19 Thread pmg
> pmg wrote:
> > Thanks Owen. Are there any examples that I can look at?

Re: multiple file input

2009-06-19 Thread Tarandeep Singh
> In the general case, I'd define an InputFormat that takes two directories, computes the input splits for each directory and generates a new list of InputSplits that is the cross-product of
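
A minimal sketch of the split cross-product idea quoted above, written in plain Python rather than against the Hadoop API. The split names and the `cross_product_splits` helper are hypothetical; in Hadoop this logic would presumably live in a custom InputFormat that returns the combined splits.

```python
from itertools import product


def cross_product_splits(splits_a, splits_b):
    """Pair every split of the first directory with every split of the second."""
    return list(product(splits_a, splits_b))


# Hypothetical split names, for illustration only.
splits_dir1 = ["input1/FileA:0", "input1/FileA:1"]
splits_dir2 = ["input2/FileB1:0", "input2/FileB2:0", "input2/FileB3:0"]

pairs = cross_product_splits(splits_dir1, splits_dir2)
print(len(pairs))  # number of combined splits, i.e. one map task per pair
```

Each combined split would be handled by one map task, so the number of tasks is the product of the two split counts.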

Re: multiple file input

2009-06-19 Thread pmg
> leSplit for dir2 and the record reader would return a TextPair with left and right records (i.e. lines). Clearly, you read the first line of split1 and cross it by each line from s
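
The record-level pairing described above can be sketched roughly as follows. This is plain Python, not a Hadoop RecordReader; `paired_records` and the sample lines are made up for illustration, and the right-hand side is assumed to be re-readable (a list here, a reopened file in practice).

```python
def paired_records(left_lines, right_lines):
    """Yield (left, right) for every combination of lines, left side outermost.

    right_lines must support repeated iteration, because it is
    scanned once for each line on the left.
    """
    for left in left_lines:
        for right in right_lines:
            yield (left, right)


# Hypothetical contents of two splits.
split1 = ["a1", "a2"]
split2 = ["b1", "b2", "b3"]

record_pairs = list(paired_records(split1, split2))
for left, right in record_pairs:
    print(left, right)
```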

Re: multiple file input

2009-06-19 Thread Tarandeep Singh
> process each line from split2, etc.
>
> You'll need to ensure that you don't overwhelm the system with either too many input splits (i.e. maps). Also don't forget that N^2/M grows much fas

Re: multiple file input

2009-06-19 Thread pmg
> with either too many input splits (i.e. maps). Also don't forget that N^2/M grows much faster with the size of the input (N) than the M machines can handle in a fixed amount of time.
>
> > Two input direc
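
To put rough numbers on the N^2/M remark, here is a small worked calculation using the record counts quoted in this thread (600K and 2 million). The machine count of 10 is purely an assumption for illustration, as is the helper name.

```python
def comparisons_per_machine(n_left, n_right, machines):
    """Pairwise comparisons each machine handles if work is spread evenly."""
    return n_left * n_right // machines


MACHINES = 10  # assumed cluster size, not from the thread

base = comparisons_per_machine(600_000, 2_000_000, MACHINES)
doubled = comparisons_per_machine(1_200_000, 4_000_000, MACHINES)

print(base)             # comparisons per machine at the quoted input sizes
print(doubled // base)  # growth factor when both inputs double
```

Doubling both inputs quadruples the work per machine, so the cluster would have to grow fourfold just to keep the running time flat.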

Re: multiple file input

2009-06-19 Thread Tarandeep Singh
> > directories
> >
> > 1. input1 directory with a single file of 600K records - FileA
> > 2. input2 directory segmented into different files with 2 million records - FileB1, FileB2 etc.
>
> In this parti

Re: multiple file input

2009-06-19 Thread pmg
> > 600K records - FileA
> > 2. input2 directory segmented into different files with 2 million records - FileB1, FileB2 etc.
>
> In this particular case, it would be right to load all of FileA into memory and process the chunks o

Re: multiple file input

2009-06-18 Thread pmg
> lar case, it would be right to load all of FileA into memory and process the chunks of FileB/part-*. Then it would be much faster than needing to re-read the file over and over again, but otherwise it would be the same.
>
> -- Owen

--
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24105398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
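
A rough sketch of the "load FileA into memory, stream the FileB chunks" suggestion, in plain Python rather than a Hadoop mapper. The function name, the sample lines, and the substring test used as the comparison are all assumptions; the thread does not say what the actual comparison is.

```python
def compare_against_file_a(file_a_lines, file_b_chunk, matches):
    """Hold FileA in memory and stream one chunk of FileB past it."""
    in_memory = list(file_a_lines)  # FileA is read a single time
    for b_line in file_b_chunk:
        for a_line in in_memory:
            if matches(a_line, b_line):
                yield (a_line, b_line)


# Hypothetical data and comparison, for illustration only.
file_a = ["apple", "banana"]
chunk = ["banana split", "cherry", "apple pie"]

result = list(compare_against_file_a(file_a, chunk, lambda a, b: a in b))
print(result)
```

Each map task would do this for its own chunk of FileB, so FileA is never re-read from disk inside the inner loop.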

Re: multiple file input

2009-06-18 Thread Owen O'Malley
On Jun 18, 2009, at 10:56 AM, pmg wrote:

> Each line from FileA gets compared with every line from FileB1, FileB2 etc. etc. FileB1, FileB2 etc. are in a different input directory

In the general case, I'd define an InputFormat that takes two directories, computes the input splits for each dir

multiple file input

2009-06-18 Thread pmg
compares the line with each line from input2? What is the best way forward? I have seen plenty of examples that map each record from a single input file and reduce into an output. Thanks.

--
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24095358.html