I think option 2 makes sense; let's file a JIRA for it.

J

On Thu, Jan 23, 2014 at 12:37 PM, Durfey,Stephen <[email protected]> wrote:

> Recently I needed the ability to read in a CSV file with Crunch. Reading
> in the CSV file as a text file and then splitting at a delimiter wasn’t an
> option as the values in the CSV file could have had a new line character
> embedded inside quotes. So, myself and another guy on my team worked on
> creating our own custom input format to read from the file and properly
> generate splits at the end of a valid CSV line, rather than just the first
> new line character.
>
> We started using From.formattedFile (I wasn’t aware of this until the
> user-guide, so thanks Josh for throwing that together) to create the
> TableSource we needed to read the file. After some testing we noticed that
> the getSplits method that we overrode in our InputFormat wasn’t being
> called. After some time debugging we found our way to ‘CrunchInputFormat’,
> and saw that our InputFormat was being replaced with the
> ‘CrunchCombineInputFormat’, and this was causing our splits to be
> incorrect. After disabling the config key so ‘CrunchCombineInputFormat’
> wasn’t used, everything was working as it should.
>
> I have two possible requests/suggestions:
>
> 1. If the desired behavior is to use the CrunchCombineInputFormat by
>    default (even if developer specifies their own InputFormat), can this be
>    mentioned in the Source section in the user-guide? The config key for
>    disabling the combine is mentioned in the user-guide but not near the
>    Source information, so we were unaware of this behavior until we debugged
>    through the code.
> 2. If the developer uses From.formattedFile and specifically uses a
>    certain InputFormat, can that be honored and have the use of
>    CrunchCombineInputFormat be disabled without developer intervention?
>
> I would think option 2 is preferred. My expectation was that my
> InputFormat would be used rather than the code defaulting to a different
> InputFormat.
>
> Stephen Durfey
> Software Engineer | The Record
> 816-201-2689 | [email protected]

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
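
For anyone following the thread, here is a rough sketch of the record-boundary problem described in the quoted message: a newline inside a quoted CSV field must not end the record. This is plain Java with no Crunch or Hadoop dependencies, and it is not the InputFormat from the thread; the class name CsvRecordBoundaries and its method are made up for illustration. It assumes double-quote quoting with "" as the escape and "\n" line endings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Crunch.
public class CsvRecordBoundaries {

    /** Splits CSV text into records, ending a record only at a newline outside double quotes. */
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // Toggling on every quote also covers the escaped form "":
                // it flips the state twice and leaves it unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // A newline outside quotes ends the current record.
                records.add(current.toString());
                current.setLength(0);
            } else {
                // Everything else, including newlines inside quotes, stays in the record.
                current.append(c);
            }
        }
        // Keep a trailing record that has no final newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain\n";
        for (String record : splitRecords(csv)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

A real InputFormat has the harder job of doing this over byte offsets in a file, where a split that starts mid-file cannot know whether it is inside quotes. That is presumably why the splits have to come from the custom getSplits rather than from a generic combining format.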
