With the current implementation, quoted-string parsing kicks in if the first non-whitespace character of a field is a double quote (just as in Malte's case). I think this behaviour can be quite unexpected for users. Wouldn't it be better to make the String parsing behaviour more explicit, i.e., to add a switch that enables or disables quoted-string parsing? With the current implementation, that setting would apply to all String fields in a file, though...
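
To make the idea concrete, here is a rough, self-contained sketch of what such a switch could look like. This is not the actual Flink parser code: the class QuotedFieldSketch, the methods parseStringField and parseLine, and the quotedParsing flag are all made up for illustration, and the line splitting is deliberately naive (it does not handle delimiters inside quotes).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names exist in Flink.
final class QuotedFieldSketch {

    /**
     * Parses a single String field.
     * quotedParsing is the assumed switch: when false, the field is taken literally.
     */
    static String parseStringField(String raw, boolean quotedParsing) {
        String trimmed = raw.trim();
        // Quoted parsing only applies if it is enabled and the first
        // non-whitespace character is a double quote.
        if (!quotedParsing || !trimmed.startsWith("\"")) {
            return raw;
        }
        int close = trimmed.indexOf('"', 1);
        if (close < 0) {
            throw new IllegalArgumentException("Unterminated quoted string: " + raw);
        }
        // Anything after the closing quote counts as invalid trailing data.
        if (close != trimmed.length() - 1) {
            throw new IllegalArgumentException("Invalid trailing data: " + raw);
        }
        return trimmed.substring(1, close);
    }

    /** Naive split on the delimiter; each field goes through parseStringField. */
    static List<String> parseLine(String line, char delimiter, boolean quotedParsing) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == delimiter) {
                fields.add(parseStringField(line.substring(start, i), quotedParsing));
                start = i + 1;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "B|\"hhh\" xx";
        // Switch off: the quotes are kept as part of the field value.
        System.out.println(parseLine(line, '|', false));
        // Switch on: the trailing " xx" makes the line fail to parse.
        try {
            parseLine(line, '|', true);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With the switch off, Malte's line B would yield the literal field value he expects. With it on, the line is rejected as it is today. As noted above, a single flag like this would still apply to every String field in the file.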

Cheers,
Fabian

2014-12-09 12:17 GMT+01:00 Max Michels <[email protected]>:

> Hi Malte,
>
> Typically, double quotes are used to identify strings and thus are not
> interpreted literally. Any data in a field after a double-quoted string is
> regarded as invalid trailing data.
>
> You could replace double quotes with single quotes:
>
> A|ggg
> B|'hhh' xx
> C|xxx
>
> This results in the expected >'hhh' xx< for the second line.
>
> Best regards,
> Max
>
> On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer <[email protected]> wrote:
>
>> Hi Stephan,
>>
>> The result should be >"hhh" xx< as the field value. Enclosures should be
>> disabled, but there seems to be no method to do that.
>>
>> Malte
>>
>> From: Stephan Ewen <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Friday, December 5, 2014, 16:28
>> To: <[email protected]>
>> Subject: Re: Quotes in fields of CsvInputFormat
>>
>> Hi!
>>
>> The parser interprets the quotes as quotes for the field. That means the
>> second field (the string) stops after the "hhh" and the xx is considered
>> invalid trailing data.
>>
>> What do you expect as the result of parsing that line?
>>
>> Stephan
>>
>> On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I’m trying to import a CSV file, but the parser seems to have problems
>>> with quotes at the beginning of a field. Is there a way to set or
>>> disable enclosures for the CSV input?
>>>
>>> This is my code:
>>>
>>> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>>>     .fieldDelimiter('|')
>>>     .types(String.class, String.class);
>>>
>>> CSV:
>>>
>>> A|ggg
>>> B|"hhh" xx
>>> C|xxx
>>>
>>> As a result, I’m receiving a ParseException for line B:
>>>
>>> org.apache.flink.api.common.io.ParseException: Line could not be
>>> parsed: 'B|"hhh" xx'
>>>
>>> Thanks,
>>> Malte
