Recently I needed to read a CSV file with Crunch. Reading the file as plain text and splitting on a delimiter wasn't an option, because values in the CSV could contain newline characters embedded inside quotes. So a teammate and I wrote a custom InputFormat that reads the file and generates splits at the end of a valid CSV record, rather than at the first newline character.
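To illustrate the problem, here is a minimal sketch of quote-aware record splitting. It is not our actual InputFormat: the class and method names are made up for this example, it works on an in-memory string, and it assumes fields are quoted with double quotes and that embedded quotes are escaped by doubling them. A real InputFormat also has to handle splits that begin in the middle of a file, where the quote state is not known up front.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Crunch or Hadoop.
final class CsvRecordSplitter {

    private CsvRecordSplitter() {}

    /**
     * Splits CSV text into records, treating a newline as a record
     * terminator only when it falls outside a double-quoted field.
     */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        boolean inQuotes = false;
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, leaving it unchanged.
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                int end = i;
                // Drop a trailing carriage return so CRLF files behave the same.
                if (end > start && text.charAt(end - 1) == '\r') {
                    end--;
                }
                records.add(text.substring(start, end));
                start = i + 1;
            }
        }
        // Keep a final record that is not terminated by a newline.
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }
}
```

With input such as `id,note\n1,"line one\nline two"\n2,plain\n`, a plain split on the newline character cuts the quoted value in half, while the sketch above keeps it as a single record.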
We started by using From.formattedFile (I wasn't aware of it until I read the user guide, so thanks Josh for throwing that together) to create the TableSource we needed to read the file. After some testing we noticed that the getSplits method we had overridden in our InputFormat wasn't being called. After some time debugging we found our way to 'CrunchInputFormat' and saw that our InputFormat was being replaced with 'CrunchCombineInputFormat', which was causing our splits to be incorrect. Once we set the config key that disables 'CrunchCombineInputFormat', everything worked as it should.

I have two possible requests/suggestions:

1. If the desired behavior is to use CrunchCombineInputFormat by default (even if the developer specifies their own InputFormat), can this be mentioned in the Source section of the user guide? The config key for disabling the combine is mentioned in the user guide, but not near the Source information, so we were unaware of this behavior until we debugged through the code.

2. If the developer uses From.formattedFile and explicitly specifies an InputFormat, can that be honored, with CrunchCombineInputFormat disabled without developer intervention?

I would think option 2 is preferred. My expectation was that my InputFormat would be used, rather than the code defaulting to a different InputFormat.

Stephen Durfey
Software Engineer | The Record
