[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1062158983

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
       super
         .sparkConf
         .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",

Review Comment:
Could you explain why you added the test for this error class in this PR?

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
       super
         .sparkConf
         .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"

Review Comment:
The same value is already defined in the parent class (just make it `protected` there). Please remove it.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ##
@@ -319,15 +319,17 @@ class UnivocityParser(
       throw BadRecordException(
         () => getCurrentInput,
         () => None,
-        QueryExecutionErrors.malformedCSVRecordError())
+        QueryExecutionErrors.malformedCSVRecordError(""))
     }

+    val currentInput = getCurrentInput

Review Comment:
It is not used in the regular case, correct? I don't think we should introduce additional overhead. Please use `getCurrentInput` directly in the errors.
## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
       super
         .sparkConf
         .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )
+    }
+  }
 }

 class CSVv2Suite extends CSVSuite {
   override protected def sparkConf: SparkConf =
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v2") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2064",

Review Comment:
The same question as above. How is this related to assigning a name to `_LEGACY_ERROR_TEMP_2149`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
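The remark about `val currentInput = getCurrentInput` in the review above comes down to where the cost is paid: an eager `val` is evaluated for every record, while a call placed inside the error branch runs only when a record is bad. A minimal sketch of that difference follows. `SketchParser`, `parseNext` and `inputReads` are hypothetical names made up for illustration, and this is not Spark's `UnivocityParser`.

```scala
// Hypothetical stand-in for a record parser; this is NOT Spark's UnivocityParser.
final class SketchParser(lines: Iterator[String]) {
  private var current: String = ""

  // Counts how often the raw input is materialized, to make the cost visible.
  var inputReads: Int = 0

  // Hypothetical analogue of getCurrentInput, assumed to be non-trivial to compute.
  private def getCurrentInput: String = {
    inputReads += 1
    current
  }

  // Returns the tokens, or an error message that embeds the offending input.
  def parseNext(expectedTokens: Int): Either[String, Array[String]] = {
    current = lines.next()
    val tokens = current.split(",", -1)
    if (tokens.length == expectedTokens) {
      Right(tokens) // regular case: getCurrentInput is never called
    } else {
      Left(s"Malformed record: ${getCurrentInput}") // evaluated only on failure
    }
  }
}

val parser = new SketchParser(Iterator("a,b", "c"))
val ok = parser.parseNext(2)
val readsAfterOk = parser.inputReads
val bad = parser.parseNext(2)
```

In this sketch the well-formed record never touches `getCurrentInput`, and only the malformed one does. The existing `() => getCurrentInput` thunk passed to `BadRecordException` in the diff follows the same deferred style.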
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059273564

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ##
@@ -370,8 +370,17 @@ abstract class CSVSuite
           .load(testFile(carsFile)).collect()
       }

-      assert(exception.getMessage.contains("Malformed CSV record"))
-      assert(ExceptionUtils.getRootCause(exception).isInstanceOf[RuntimeException])
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )

Review Comment:
You could move the check to those specific test suites.
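One way to read the suggestion above, together with the remark elsewhere in this thread about making `carsFile` `protected` in the parent class, is that shared fixtures stay in the abstract suite while each concrete suite supplies only what differs. The sketch below uses plain Scala classes rather than ScalaTest. All class and method names are hypothetical, and only the two error-class strings and the file path are taken from the diffs in this thread.

```scala
// Hypothetical sketch; these classes only mimic the shape of CSVSuite and its subclasses.
abstract class BaseSuiteSketch {
  // Declared once as protected so subclasses can reuse it instead of redefining it.
  protected val carsFile: String = "test-data/cars.csv"

  // Each concrete suite states the error class it expects in FAILFAST mode.
  protected def expectedErrorClass: String

  // Stand-in for a suite-specific check: compares the observed error class.
  def checkFailFast(observedErrorClass: String): Boolean =
    observedErrorClass == expectedErrorClass

  // Exposes the shared fixture path for inspection.
  def inputPath: String = carsFile
}

final class V1SuiteSketch extends BaseSuiteSketch {
  protected def expectedErrorClass: String = "_LEGACY_ERROR_TEMP_2177"
}

final class V2SuiteSketch extends BaseSuiteSketch {
  protected def expectedErrorClass: String = "_LEGACY_ERROR_TEMP_2064"
}

val v1 = new V1SuiteSketch
val v2 = new V2SuiteSketch
```

With this layout the file path is defined once, and the v1 and v2 variants differ only in the expected error class.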
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059273190

## core/src/main/resources/error/error-classes.json: ##
@@ -851,6 +851,11 @@
       "Cannot name the managed table as , as its associated location already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens doesn't match the schema ."

Review Comment:
This one processes actual CSV records:
```
private def convert(tokens: Array[String]): Option[InternalRow] = {
  if (tokens == null) {
    throw BadRecordException(
      () => getCurrentInput,
      () => None,
      QueryExecutionErrors.malformedCSVRecordError())
  }
```
and `getCurrentInput` can give you the bad record, can't it?
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058972310

## core/src/main/resources/error/error-classes.json: ##
@@ -851,6 +851,11 @@
       "Cannot name the managed table as , as its associated location already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens doesn't match the schema ."

Review Comment:
Can we output the malformed CSV record? If a user processes billions of CSV lines, how should they figure out which one is bad?
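To make the request above concrete, here is a rough sketch of an error whose message template carries the offending record as a parameter. It assumes a placeholder syntax of the form `<name>`. The template wording, the parameter name `badRecord`, the `SketchError` type and the sample record are all illustrative assumptions, not the text or API adopted in the PR.

```scala
// Hypothetical sketch of a parameterized error message; the template text and the
// parameter name `badRecord` are assumptions, not the wording adopted in the PR.
final case class SketchError(errorClass: String, parameters: Map[String, String]) {
  private val templates = Map(
    "MALFORMED_CSV_RECORD" -> "Malformed CSV record: <badRecord>"
  )

  // Substitutes every <name> placeholder with the matching parameter value.
  def message: String =
    parameters.foldLeft(templates(errorClass)) { case (text, (name, value)) =>
      text.replace(s"<$name>", value)
    }
}

// The record below is made-up sample data.
val err = SketchError("MALFORMED_CSV_RECORD", Map("badRecord" -> "2015,Chevy,Volt"))
```

Keeping the record in a named parameter rather than baking it into a free-form string would also let a test compare the error class and the parameter map separately, in the style of the `checkError(...)` calls quoted in this thread.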
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058438876

## core/src/main/resources/error/error-classes.json: ##
@@ -851,6 +851,11 @@
       "Cannot name the managed table as , as its associated location already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record"

Review Comment:
Can we provide more info to users? At least which record is malformed, and where the problem is: a schema mismatch, a converter failure, and so on.