[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279818#comment-16279818 ]

Apache Spark commented on SPARK-22516:
--------------------------------------

User 'smurakozi' has created a pull request for this issue:
https://github.com/apache/spark/pull/19906

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22516
>                 URL: https://issues.apache.org/jira/browse/SPARK-22516
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Kumaresh C R
>            Priority: Minor
>              Labels: csvparser
>         Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
> Try to read the attached CSV file with the following parse properties:
>
> scala> val csvFile = spark.read.option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv")
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
>
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
>
> Note that this works fine.
>
> If we add the option "multiLine" = "true", it fails with the exception below. This happens only if the "comment" option matches the first character of the input dataset's last line:
>
> scala> val csvFile = spark.read.option("header", "true").option("multiLine", "true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv")
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
> Parser Configuration: CsvParserSettings:
>     Auto configuration enabled=true
>     Autodetect column delimiter=false
>     Autodetect quotes=false
>     Column reordering enabled=true
>     Empty value=null
>     Escape unquoted values=false
>     Header extraction enabled=null
>     Headers=null
>     Ignore leading whitespaces=false
>     Ignore trailing whitespaces=false
>     Input buffer size=128
>     Input reading on separate thread=false
>     Keep escape sequences=false
>     Keep quotes=false
>     Length of content displayed on error=-1
>     Line separator detection enabled=false
>     Maximum number of characters per column=-1
>     Maximum number of columns=20480
>     Normalize escaped line separators=true
>     Null value=
>     Number of records to read=all
>     Processor=none
>     Restricting data in exceptions=false
>     RowProcessor error handler=null
>     Selected fields=none
>     Skip empty lines=true
>     Unescaped quote handling=STOP_AT_DELIMITER
> Format configuration: CsvFormat:
>     Comment character=c
>     Field delimiter=,
>     Line separator (normalized)=\n
>     Line separator sequence=\r\n
>     Quote character="
>     Quote escape character=\
>     Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
>     at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>     at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
>     at ...
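The failing input shape described above can be sketched outside Spark. This is an illustration only: the exact contents of testCommentChar.csv are assumed here (a header, one row, then a comment line with no trailing line terminator), since the attachment is not reproduced in this thread. A reader that simply drops comment lines has no trouble with such a file, which is consistent with the skip-ahead logic being the culprit rather than the data itself:

```python
import csv

# Assumed shape of the failing file: CR LF separators, comment char 'c',
# and a final line that starts with 'c' and has no trailing terminator.
data = "a,b\r\n1,2\r\nc trailing comment"

# Filter comment lines out up front instead of asking the parser to
# "skip" a following line that may not exist.
rows = [r for r in csv.reader(data.splitlines())
        if r and not r[0].startswith("c")]
print(rows)  # -> [['a', 'b'], ['1', '2']]
```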
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262738#comment-16262738 ]

Hyukjin Kwon commented on SPARK-22516:
--------------------------------------

Sure, please go ahead. You could refer to the changes here: https://github.com/apache/spark/pull/19113/files. I opened a similar PR before, bumping the version of the Univocity library in order to resolve an issue fixed in a higher version of it. We probably also need a test case here, likewise.
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262516#comment-16262516 ]

Sandor Murakozi commented on SPARK-22516:
-----------------------------------------

I'm a newbie, but I would be happy to work on it. Would that be OK with you, [~hyukjin.kwon]?
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262334#comment-16262334 ]

Hyukjin Kwon commented on SPARK-22516:
--------------------------------------

This seems fixed in Univocity 2.5.9. We could probably bump up the Univocity library.
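For reference, bumping that dependency would look roughly like this in a Maven build (the coordinates are the library's published ones; the exact location in Spark's pom is not shown here):

```xml
<dependency>
  <groupId>com.univocity</groupId>
  <artifactId>univocity-parsers</artifactId>
  <version>2.5.9</version>
</dependency>
```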
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262189#comment-16262189 ]

Hyukjin Kwon commented on SPARK-22516:
--------------------------------------

This can be reproduced by:

{code}
spark.read.option("header", "true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show()
{code}

The root cause seems to be in the Univocity parser. I filed an issue there: https://github.com/uniVocity/univocity-parsers/issues/213

BTW, let's keep the description and reproducer as clean as we can. I was actually about to say the same as above, but realised it's a separate issue after several close looks.
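A much-simplified model of the reported failure mode may help here. This is illustrative Python, not Univocity's actual code: once a comment line has been consumed, a skip-ahead step still expects a following line to land on, but when the comment was the file's final line there is nothing left to read:

```python
def skip_lines(lines, start, count):
    # Toy version of "skip N lines from position start": it fails when the
    # skip would land past the end of the input, which is the situation
    # when a comment line is the file's final line.
    if start + count >= len(lines):
        raise ValueError(
            f"Unable to skip {count} lines from line {start}. End of input reached")
    return start + count

lines = ["a,b", "1,2", "c trailing comment"]
skip_lines(lines, start=0, count=1)      # skipping the header is fine
try:
    skip_lines(lines, start=2, count=1)  # the comment sits at the last index
except ValueError as e:
    print(e)
```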
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260904#comment-16260904 ]

Marco Gaido commented on SPARK-22516:
-------------------------------------

[~crkumaresh24] I can't reproduce the issue with the new file you have uploaded. I am running on OSX; maybe it depends on the OS:

{code}
scala> val a = spark.read.option("header", "true").option("inferSchema", "true").option("multiLine", "true").option("comment", "c").option("parserLib", "univocity").csv("/Users/mgaido/Downloads/test_file_without_eof_char.csv")
a: org.apache.spark.sql.DataFrame = [abc: string, def: string]

scala> a.show
+---+---+
|abc|def|
+---+---+
|ghi|jkl|
+---+---+
{code}
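One way to take the line-separator question discussed here out of the picture is to normalize CR LF to LF on the raw bytes before handing the file to any parser (a sketch, independent of Spark; the file contents are assumed):

```python
# Assumed CR LF input, ending in a comment line with no terminator.
raw = b"a,b\r\n1,2\r\nc trailing comment"

# Normalize Windows-style separators to LF only.
normalized = raw.replace(b"\r\n", b"\n")
print(normalized)  # -> b'a,b\n1,2\nc trailing comment'
```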
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259157#comment-16259157 ]

Kumaresh C R commented on SPARK-22516:
--------------------------------------

[~mgaido]: Even after I replaced all 'CR LF' with 'LF', the error is still thrown in the following case:

-> when the file does not have 'LF' as the last character of its last line, i.e. there is no line terminator at EOF (note: every other line in the file ends with LF).

I have attached the failing file 'test_file_without_eof_char.csv' for your reference. Is the problem with the parser, or with the input data (which has no line ending as its last character)?
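The missing-terminator condition described above is easy to check and, if changing the data side is acceptable, to repair. A sketch (the file contents are assumed, matching the header and row shown elsewhere in this thread):

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    # A final line with no terminator is the suspected trigger here;
    # appending one is a cheap data-side workaround.
    return data if data.endswith(b"\n") else data + b"\n"

data = b"abc,def\nghi,jkl\ncomment line"   # no line ending at EOF
print(ensure_trailing_newline(data))  # -> b'abc,def\nghi,jkl\ncomment line\n'
```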
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257088#comment-16257088 ]

Marco Gaido commented on SPARK-22516:
-------------------------------------

I'm not sure why, but this is caused by the fact that your file contains 'CR LF' as the line separator instead of only LF.
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251110#comment-16251110 ]

Kumaresh C R commented on SPARK-22516:
--------------------------------------

[~hyukjin.kwon]: Need your help here :)