Hi Harsh,

So that means I will lose some of my input data, because Hadoop's FileSplit splits the input file evenly according to "numSplits". I want to prevent this. Is there any way?

Regards!

Chen

On Wed, Aug 29, 2012 at 9:49 PM, Harsh J <[email protected]> wrote:
> No, what I mean is that your RecordReader should be able to handle a
> case where it may start from middle of a record and hence not be able
> to read any record (i.e. return false or whatever right up front).
>
> On Wed, Aug 29, 2012 at 1:27 PM, Chen He <[email protected]> wrote:
> > Hi Harsh
> >
> > Thank you for your reply. Do you mean I need to change the FileSplit to
> > avoid those errors I mentioned happen?
> >
> > Regards!
> >
> > Chen
> >
> > On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <[email protected]> wrote:
> >>
> >> Hi Chen,
> >>
> >> Does your record reader and mapper handle the case where one map split
> >> may not exactly get the whole record? Your case is not very different
> >> from the newlines logic presented here:
> >> http://wiki.apache.org/hadoop/HadoopMapReduce
> >>
> >> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <[email protected]> wrote:
> >> > Hi guys
> >> >
> >> > I met a interesting problem when I implement my own custom InputFormat
> >> > which extends the FileInputFormat.(I rewrite the RecordReader class but
> >> > not the InputSplit class)
> >> >
> >> > My recordreader will take following format as a basic record: (my
> >> > recordreader extends the LineRecordReader. It returns a record if it
> >> > meets #Trailer# and contains #Header#. I only have one input file that
> >> > is composed of many of following basic record)
> >> >
> >> > #Header#
> >> > .....(many lines, may be 0 lines or 1000 lines, it varies)
> >> > #Trailer#
> >> >
> >> > Everything works fine if above basic input unit in a file is integer
> >> > times of mapper. For example, I use 2 mappers and there are two basic
> >> > records in my input file. Or I use 3 mappers and there are 6 basic units
> >> > in the input file.
> >> >
> >> > However, if I use 4 mappers but there are 3 basic units in the input
> >> > file(not integer times). The final output is incorrect. The "Map Input
> >> > Bytes" in the job counter is also less than the input file size. How can
> >> > I fix it? Do I need to rewrite the inputSplit?
> >> >
> >> > Any reply will be appreciated!
> >> >
> >> > Regards!
> >> >
> >> > Chen
> >>
> >> --
> >> Harsh J
> >
> > --
> Harsh J
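
The behaviour Harsh describes (a reader that may start in the middle of a record, and may end up reading nothing) can be sketched without any Hadoop classes. The following is only an illustration under stated assumptions, not the LineRecordReader implementation: the class SplitBoundaryDemo and its methods readSplit and readAll are hypothetical names, the "file" is an in-memory ASCII string, and body lines are assumed never to equal #Header# or #Trailer#. The rule it applies is that a split owns every record whose #Header# line begins inside [start, end); the reader skips any partial record at the front and reads past the split end to finish the last record it owns. In a real RecordReader, start and end would presumably come from FileSplit.getStart() and getStart() + getLength().

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {
    static final String HEADER = "#Header#";
    static final String TRAILER = "#Trailer#";

    /** Returns the records owned by split [start, end): those whose header line begins inside it. */
    public static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // If the split begins mid-line, move forward to the start of the next line.
        if (pos > 0 && data.charAt(pos - 1) != '\n') {
            int nl = data.indexOf('\n', pos);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        StringBuilder current = null; // non-null while we are inside a record
        while (pos < data.length()) {
            // Stop at the split end only when no record is in progress.
            if (current == null && pos >= end) {
                break;
            }
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            String line = data.substring(pos, lineEnd);
            if (current == null) {
                if (line.equals(HEADER)) {
                    current = new StringBuilder(line).append('\n');
                }
                // Otherwise this line is the tail of a record owned by an earlier split: skip it.
            } else {
                current.append(line).append('\n');
                if (line.equals(TRAILER)) {
                    records.add(current.toString());
                    current = null;
                }
            }
            pos = lineEnd + 1;
        }
        return records;
    }

    /** Cuts the data into numSplits byte ranges of equal size and reads each one independently. */
    public static List<String> readAll(String data, int numSplits) {
        List<String> all = new ArrayList<>();
        int splitSize = (data.length() + numSplits - 1) / numSplits;
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        // Three records, four splits: the case from the original question.
        String data = "#Header#\na\nb\n#Trailer#\n"
                + "#Header#\n#Trailer#\n"
                + "#Header#\nc\n#Trailer#\n";
        System.out.println(readAll(data, 4).size());
    }
}
```

With three records and four splits, the last split starts inside the third record, finds no header of its own and returns nothing, while the third split reads past its end to finish that record. No input is lost even though the split boundaries ignore record boundaries, so the InputSplit itself does not need to change.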
