Recently I needed to read a CSV file with Crunch. Reading the CSV file as a 
text file and then splitting at a delimiter wasn’t an option, because values in 
the CSV file can contain a newline character embedded inside quotes. So a 
teammate and I wrote our own custom input format that reads the file and 
generates splits at the end of a valid CSV line, rather than at the first 
newline character.
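
To illustrate the boundary problem, here is a minimal sketch of the idea in 
plain Java. It is not our actual InputFormat; the class and method names 
(CsvRecordSplitter, splitRecords) are made up for this example, and it assumes 
double-quote quoting with "" as the escaped quote. It only treats a newline as 
the end of a record when it falls outside quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Crunch or Hadoop.
public class CsvRecordSplitter {

    /** Splits CSV text into records, ignoring newlines inside double quotes. */
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // Newline outside quotes: a real record boundary.
                records.add(stripTrailingCr(current));
                current.setLength(0);
            } else {
                // Anything else, including a quoted newline, belongs to the record.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            // Last record without a terminating newline.
            records.add(stripTrailingCr(current));
        }
        return records;
    }

    // Drops the '\r' left over from a "\r\n" line ending.
    private static String stripTrailingCr(StringBuilder sb) {
        int len = sb.length();
        if (len > 0 && sb.charAt(len - 1) == '\r') {
            return sb.substring(0, len - 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain\n";
        for (String record : splitRecords(csv)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

A real InputFormat has the harder job of doing this per split, since a split 
can begin in the middle of a quoted field; the sketch above only shows why a 
plain split on newlines gives the wrong records.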

We started by using From.formattedFile (I wasn’t aware of it until I read the 
user-guide, so thanks Josh for throwing that together) to create the 
TableSource we needed to read the file. After some testing we noticed that the 
getSplits method we had overridden in our InputFormat wasn’t being called. 
After some time debugging we found our way to ‘CrunchInputFormat’ and saw that 
our InputFormat was being replaced with ‘CrunchCombineInputFormat’, which was 
causing our splits to be incorrect. After setting the config key that disables 
‘CrunchCombineInputFormat’, everything worked as it should.

I have two possible requests/suggestions:

  1.  If the desired behavior is to use the CrunchCombineInputFormat by default 
(even if the developer specifies their own InputFormat), can this be mentioned 
in the Source section of the user-guide? The config key for disabling the 
combine is mentioned in the user-guide but not near the Source information, so 
we were unaware of this behavior until we debugged through the code.
  2.  If the developer uses From.formattedFile and explicitly specifies an 
InputFormat, can that be honored, with CrunchCombineInputFormat disabled 
without developer intervention?

I would think option 2 is preferred. My expectation was that my InputFormat 
would be used rather than the code defaulting to a different InputFormat.

Stephen Durfey
Software Engineer|The Record
816-201-2689 | [email protected]
