Or, as another example, I'm writing a program to analyze a large email dump. Each email spans more than one line. TextInputFormat will split them up by line, in addition to deserializing them to text. I'm going to need to customize RecordReader to split based on the MIME metadata length of each email instead of on the newline character, and also to preserve each email in stream form so the parser can parse it properly.
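To make the idea of splitting on a length instead of a newline concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It assumes a made-up layout in which every email is preceded by a "Content-Length: <n>" line followed by exactly n bytes of body; a real dump may well be framed differently. The class name LengthFramedRecords and its readRecords method are hypothetical and not part of Hadoop; inside a job, similar logic would have to live in a custom RecordReader.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framing (an assumption, not a Hadoop or MIME rule):
// each record is "Content-Length: <n>\n" followed by exactly n body bytes.
final class LengthFramedRecords {
    private static final String HEADER = "Content-Length:";

    private LengthFramedRecords() {}

    /** Reads every record body; newlines inside a body are left untouched. */
    static List<byte[]> readRecords(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        String header;
        while ((header = readHeaderLine(data)) != null) {
            if (!header.startsWith(HEADER)) {
                throw new IOException("Expected a length header but found: " + header);
            }
            int length;
            try {
                length = Integer.parseInt(header.substring(HEADER.length()).trim());
            } catch (NumberFormatException e) {
                throw new IOException("Unparseable record length in: " + header, e);
            }
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] body = new byte[length];
            // readFully throws EOFException if the dump is truncated mid-record.
            data.readFully(body);
            records.add(body);
        }
        return records;
    }

    /** Reads bytes up to the next '\n'; returns null at a clean end of stream. */
    private static String readHeaderLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return new String(line.toByteArray(), StandardCharsets.US_ASCII);
            }
            line.write(b);
        }
        if (line.size() == 0) {
            return null; // nothing left to read
        }
        throw new EOFException("Stream ended inside a header line");
    }
}
```

Each returned byte array can be wrapped in a ByteArrayInputStream if the parser wants a stream. Buffering a whole record in memory like this only works while individual emails stay small relative to the heap.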
Or I could subclass InputFormat so that isSplitable() returns false, and then I would only have to handle the part about preserving each email as an InputStream. Incidentally, tips on that are welcome if anyone on the list wants to help.

So there are some reasons isSplitable() can be overridden. I think there is a performance trade-off at some point too, once the files get big, with the mapper having to spill records to disk if the data being mapped gets too big for the JVM memory...

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Wed, Feb 26, 2014 at 6:04 AM, Dieter De Witte wrote:

> If you have a simple one-line record format you should allow files to be
> split, since your simulations will be better balanced.
>
> 2014-02-26 11:31 GMT+01:00 Sugandha Naolekar:
>
>> Oh. Ok. Thanks. So basically, to be on the safer side, one can always set
>> its value to false and keep the data of records consistent. I mean, the
>> length of all the records should be the same.
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>> On Wed, Feb 26, 2014 at 3:57 PM, Dieter De Witte wrote:
>>
>>> No, an example could be that records have a variable number of lines. If
>>> you then allowed a file to be split, your record might be broken, so
>>> you could override isSplitable to always return false.
>>>
>>> 2014-02-26 11:22 GMT+01:00 Sugandha Naolekar:
>>>
>>>> So basically what I can deduce from it is, isSplitable() only applies
>>>> to stream compressed files. Right?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>> On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang wrote:
>>>>
>>>>> Hi Sugandha,
>>>>>
>>>>> Take a gz file as an example. It is not splittable because of the
>>>>> compression algorithm it uses. It cannot guarantee that one record is
>>>>> located in one block; if one record is in 2 blocks, your program will
>>>>> crash since you cannot get the whole record.
>>>>>
>>>>> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> If a single file of size 129 MB is split into two HDFS blocks, as
>>>>>> the max block size is 128 MB, then each of the blocks is read
>>>>>> depending on the InputFormat it supports. So what is the
>>>>>> significance of the isSplitable() method?
>>>>>>
>>>>>> If it is set to false, will the entire block be considered a single
>>>>>> input split? How will TextInputFormat react to it?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
