And filed: https://issues.apache.org/jira/browse/CRUNCH-331
On Thu, Jan 23, 2014 at 12:41 PM, Josh Wills <[email protected]> wrote:

> I think option 2 makes sense-- let's file a JIRA for it.
>
> J
>
> On Thu, Jan 23, 2014 at 12:37 PM, Durfey,Stephen <[email protected]> wrote:
>
>> Recently I needed the ability to read in a CSV file with Crunch.
>> Reading in the CSV file as a text file and then splitting at a delimiter
>> wasn’t an option as the values in the CSV file could have had a new line
>> character embedded inside quotes. So, myself and another guy on my team
>> worked on creating our own custom input format to read from the file and
>> properly generate splits at the end of a valid CSV line, rather than just
>> the first new line character.
>>
>> We started using From.formattedFile (I wasn’t aware of this until the
>> user-guide, so thanks Josh for throwing that together) to create the
>> TableSource we needed to read the file. After some testing we noticed that
>> the getSplits method that we overrode in our InputFormat wasn’t being
>> called. After some time debugging we found our way to ‘CrunchInputFormat’,
>> and saw that our InputFormat was being replaced with the
>> ‘CrunchCombineInputFormat’, and this was causing our splits to be
>> incorrect. After disabling the config key so ‘CrunchCombineInputFormat’
>> wasn’t used, everything was working as it should.
>>
>> I have two possible requests/suggestions:
>>
>> 1. If the desired behavior is to use the CrunchCombineInputFormat by
>>    default (even if developer specifies their own InputFormat), can this be
>>    mentioned in the Source section in the user-guide? The config key for
>>    disabling the combine is mentioned in the user-guide but not near the
>>    Source information, so we were unaware of this behavior until we debugged
>>    through the code.
>> 2. If the developer uses From.formattedFile and specifically uses a
>>    certain InputFormat, can that be honored and have the use of
>>    CrunchCombineInputFormat be disabled without developer intervention?
>>
>> I would think option 2 is preferred. My expectation was that my
>> InputFormat would be used rather than the code defaulting to a different
>> InputFormat.
>>
>> Stephen Durfey
>> Software Engineer|The Record
>> 816-201-2689 | [email protected]
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
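
For anyone reading this thread later, here is a minimal sketch of the record-boundary problem described in the original message: a newline inside a quoted CSV field must not end the record. It is not the InputFormat from the thread and uses no Crunch or Hadoop APIs. The class CsvRecordSplitter and its splitRecords method are hypothetical names made up for this illustration. It also assumes the whole input is already in memory as a String, which a real getSplits implementation working on byte offsets could not assume.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Crunch or Hadoop.
final class CsvRecordSplitter {

    private CsvRecordSplitter() {}

    /** Splits CSV text into records, ignoring newlines inside double quotes. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, leaving it unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // Drop the '\r' of a CRLF line ending before closing the record.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a newline embedded in a quoted field.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            // Final record when the input has no trailing newline.
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"first line\nsecond line\"\n2,plain\n";
        for (String record : splitRecords(csv)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Splitting the same text on every newline would cut the second record in half. That is the failure the custom input format in the thread was written to avoid, and it is also why replacing its getSplits logic with a different InputFormat produced incorrect splits.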
