I think that's a fair assumption to make. I'll open a JIRA for making quoted string parsing optional and the quote character configurable.
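To make the proposal a bit more concrete, here is a rough sketch of the semantics I have in mind. Everything in it is an assumption for illustration: the class QuotedFieldSketch, the method parseField and its parameters are hypothetical names, not existing Flink API, and this is not how the current parser is implemented.

```java
// Hypothetical sketch only: none of these names exist in Flink.
final class QuotedFieldSketch {

    /**
     * Extracts the value of one string field.
     *
     * @param raw         the field text between two field delimiters
     * @param parseQuotes whether a leading quote character starts a quoted string
     * @param quoteChar   the (configurable) quote character
     */
    static String parseField(String raw, boolean parseQuotes, char quoteChar) {
        String trimmed = raw.trim();

        // Quote handling switched off, or the first non-whitespace character
        // is not the quote character: take the field literally.
        if (!parseQuotes || trimmed.isEmpty() || trimmed.charAt(0) != quoteChar) {
            return raw;
        }

        // Quoted string: find the closing quote character.
        int close = trimmed.indexOf(quoteChar, 1);
        if (close < 0) {
            throw new IllegalArgumentException("Unterminated quoted field: " + raw);
        }

        // Anything after the closing quote is invalid trailing data.
        if (close != trimmed.length() - 1) {
            throw new IllegalArgumentException("Invalid trailing data in field: " + raw);
        }

        // Return the content without the enclosing quote characters.
        return trimmed.substring(1, close);
    }
}
```

In this sketch, with parseQuotes set to false, the second field of Malte's line B comes back as "hhh" xx including the quotes; with parseQuotes set to true and '"' as the quote character, the same field is rejected because of the trailing xx.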
2014-12-09 18:51 GMT+01:00 Max Michels <[email protected]>:

> That sounds like a good idea. Just like setDelimeter("|"), one should be
> able to do a setParseDoubleQuotes(false) to disable the special handling of
> double quotes.
>
> You're right, Fabian, the current implementation treats all String fields
> alike. Maybe we can expect the user to provide a consistently formatted
> input file (i.e. with or without the use of double quotes as identifiers)?
>
> On Tue, Dec 9, 2014 at 2:32 PM, Fabian Hueske <[email protected]> wrote:
>
>> With the current implementation, quoted string parsing kicks in if the
>> first non-whitespace character of a field is a double quote (just as in
>> Malte's case). I think this behaviour can be quite unexpected for users.
>> Wouldn't it be better to make the behaviour of the String parsing more
>> explicit, i.e., add a switch to dis/enable quoted string parsing? With the
>> current implementation, the configuration would affect all String fields in
>> a file, though...
>>
>> Cheers, Fabian
>>
>> 2014-12-09 12:17 GMT+01:00 Max Michels <[email protected]>:
>>
>>> Hi Malte,
>>>
>>> Typically, double quotes are used to identify strings and thus are not
>>> interpreted literally. Any data in a field after a double quoted string is
>>> regarded as invalid trailing data.
>>>
>>> You could replace double quotes with single quotes:
>>>
>>> A|ggg
>>> B|'hhh' xx
>>> C|xxx
>>>
>>> This results in the expected >'hhh' xx< for the second line.
>>>
>>> Best regards,
>>> Max
>>>
>>> On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer <[email protected]> wrote:
>>>
>>>> Hi Stephan,
>>>>
>>>> The result should be >"hhh" xx< as field value. Enclosures should be
>>>> disabled but there seems to be no method to do that.
>>>>
>>>> Malte
>>>>
>>>> From: Stephan Ewen <[email protected]>
>>>> Reply-To: <[email protected]>
>>>> Date: Friday, December 5, 2014 16:28
>>>> To: <[email protected]>
>>>> Subject: Re: Quotes in fields of CsvInputFormat
>>>>
>>>> Hi!
>>>>
>>>> The parser interprets the quotes as quotes for the field. That means
>>>> the second field (the string) stops after the "hhh" and the xx is
>>>> considered invalid trailing data.
>>>>
>>>> What do you expect as the result of parsing that line?
>>>>
>>>> Stephan
>>>>
>>>> On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I’m trying to import a CSV file, but the parser seems to have problems
>>>>> with quotes at the beginning of a field. Is there a way to set or disable
>>>>> enclosures for the CSV input?
>>>>>
>>>>> This is my code:
>>>>>
>>>>> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>>>>>     .fieldDelimiter('|')
>>>>>     .types(String.class, String.class);
>>>>>
>>>>> CSV:
>>>>>
>>>>> A|ggg
>>>>> B|"hhh" xx
>>>>> C|xxx
>>>>>
>>>>> As a result I’m receiving a ParseException for line B:
>>>>>
>>>>> org.apache.flink.api.common.io.ParseException: Line could not be
>>>>> parsed: 'B|"hhh" xx'
>>>>>
>>>>> Thanks,
>>>>> Malte
