sc.textFile problem due to newlines within a CSV record

Mohit Jaggi Fri, 12 Sep 2014 19:45:04 -0700

Folks,
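I think this is due to the default TextInputFormat in Hadoop, which ends a
record at every newline, so a CSV record with a newline inside a field gets
split into several records. Any pointers to solutions would be much
appreciated.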
>>
More powerfully, you can define your own *InputFormat* implementations to
format the input to your programs however you want. For example, the
default TextInputFormat reads lines of text files. The key it emits for
each record is the byte offset of the line read (as a LongWritable), and
the value is the contents of the line up to the terminating '\n' character
(as a Text object). If you have multi-line records each separated by a
$ character, you could write your own *InputFormat* that parses files into
records split on this character instead.
>>
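
To make the quoted idea concrete, here is a minimal sketch in plain Python
(not Hadoop or Spark code) of what splitting on a record delimiter instead
of '\n' would produce. The function name split_records, the '$' delimiter,
and the sample data are all hypothetical, chosen only for illustration. It
emits (byte offset, record) pairs in the spirit of the key/value pairs
described above, then hands each record to the standard csv module, which
keeps a quoted newline inside its field.

```python
import csv
import io


def split_records(data: bytes, delimiter: bytes = b"$"):
    """Yield (byte_offset, record) pairs, splitting on `delimiter`, not on newlines."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    offset = 0
    while offset < len(data):
        end = data.find(delimiter, offset)
        if end == -1:
            # No trailing delimiter: the rest of the data is the last record.
            end = len(data)
        yield offset, data[offset:end]
        offset = end + len(delimiter)


# Hypothetical input: two CSV records separated by '$'; the first record
# has a newline inside a quoted field.
raw = b'1,"first line\nsecond line",x$2,"plain",y$'

records = list(split_records(raw))

# Parse each record as a single CSV row. newline="" keeps the embedded
# newline intact so csv.reader can treat it as part of the quoted field.
rows = [
    next(csv.reader(io.StringIO(rec.decode("utf-8"), newline="")))
    for _, rec in records
]
```

With Hadoop itself, it may be possible to get the same effect without a
custom class through the textinputformat.record.delimiter configuration
property, where the Hadoop version in use supports it; that is worth
checking against your setup. Note that this only helps if the data really
has a record separator other than newline. If records end in '\n' and
fields may also contain '\n', the reader has to track quoting instead.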


Thanks,
Mohit
