[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

2023-01-04 Thread GitBox


MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1062158983


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",

Review Comment:
   Could you explain why you added the test for this error class in the PR?



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"

Review Comment:
   The same value is already defined in the parent class; just make it
   `protected` there. Please remove this one.
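
A minimal sketch of that suggestion, assuming a simplified class hierarchy: `BaseSuite`, `V1Suite`, `V2Suite`, and `fixture` are hypothetical stand-ins for illustration and are not the actual Spark test classes. It only shows that a `protected` member declared once in the parent is visible in each subclass, so the subclasses need no copy of their own.

```scala
object SharedFixtureSketch {
  // Hypothetical stand-in for the parent test suite.
  abstract class BaseSuite {
    // Declared once; `protected` makes it visible to subclasses only.
    protected val carsFile: String = "test-data/cars.csv"
  }

  // Hypothetical stand-ins for the v1 and v2 suites: neither redefines carsFile.
  class V1Suite extends BaseSuite {
    def fixture: String = carsFile
  }

  class V2Suite extends BaseSuite {
    def fixture: String = carsFile
  }
}
```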



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -319,15 +319,17 @@ class UnivocityParser(
       throw BadRecordException(
         () => getCurrentInput,
         () => None,
-        QueryExecutionErrors.malformedCSVRecordError())
+        QueryExecutionErrors.malformedCSVRecordError(""))
     }
 
+    val currentInput = getCurrentInput

Review Comment:
   It is not used in the regular (non-error) case, correct? I don't think we
   should introduce additional overhead. Please use `getCurrentInput` directly
   in the errors.
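
The overhead concern can be illustrated with a small sketch. Everything here is a hypothetical stand-in, not Spark's actual code: `BadRecord` only mimics the shape of `BadRecordException`, and the `lookups` counter exists solely to make the evaluation visible. The point is that passing the record as a thunk defers the lookup to the error path, whereas an eager `val currentInput = getCurrentInput` would run for every row.

```scala
object LazyInputSketch {
  // Counts how often the (potentially expensive) input lookup runs.
  var lookups = 0

  // Hypothetical stand-in for the parser's current-input accessor.
  def getCurrentInput: String = {
    lookups += 1
    "2012,Tesla,S"
  }

  // The record is held as a thunk, so it is computed only when someone asks for it.
  final case class BadRecord(record: () => String, reason: String)
      extends Exception(reason)

  def convert(tokens: Array[String], expected: Int): Option[Array[String]] = {
    if (tokens == null || tokens.length != expected) {
      // Error path only: the thunk is created here but not yet evaluated.
      throw BadRecord(() => getCurrentInput, "Malformed CSV record")
    }
    // Regular path: getCurrentInput is never touched.
    Some(tokens)
  }
}
```

With this shape, a well-formed row leaves `lookups` at zero, and a malformed row triggers the lookup only once the handler calls `record()`.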



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )
+    }
+  }
 }
 
 class CSVv2Suite extends CSVSuite {
   override protected def sparkConf: SparkConf =
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v2") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2064",

Review Comment:
   The same question as above: how is this related to assigning a name to
   `_LEGACY_ERROR_TEMP_2149`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

2022-12-29 Thread GitBox


MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059273564


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -370,8 +370,17 @@ abstract class CSVSuite
           .load(testFile(carsFile)).collect()
       }
 
-      assert(exception.getMessage.contains("Malformed CSV record"))
-      assert(ExceptionUtils.getRootCause(exception).isInstanceOf[RuntimeException])
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )

Review Comment:
   You could move the check to those specific test suites.






[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

2022-12-29 Thread GitBox


MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059273190


##
core/src/main/resources/error/error-classes.json:
##
@@ -851,6 +851,11 @@
       "Cannot name the managed table as , as its associated location  already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens  doesn't match the schema ."

Review Comment:
   This one processes actual CSV records:
   ```
     private def convert(tokens: Array[String]): Option[InternalRow] = {
       if (tokens == null) {
         throw BadRecordException(
           () => getCurrentInput,
           () => None,
           QueryExecutionErrors.malformedCSVRecordError())
       }
   ```
   and `getCurrentInput` can give you the bad record, can't it?






[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

2022-12-29 Thread GitBox


MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058972310


##
core/src/main/resources/error/error-classes.json:
##
@@ -851,6 +851,11 @@
       "Cannot name the managed table as , as its associated location  already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens  doesn't match the schema ."

Review Comment:
   Can we output the malformed CSV record? If a user processes billions of CSV
   lines, how should they figure out which one is bad?
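
One way to picture the request is a message template that carries the offending record as a parameter. The sketch below is an illustration only: the template text, the `badRecord` parameter name, and the `render` helper are all assumptions made for this example and are not taken from the PR or from Spark's error framework.

```scala
object ErrorTemplateSketch {
  // Hypothetical template; the parameter name `badRecord` is an assumption.
  val template: String = "Malformed CSV record: <badRecord>"

  // Replaces every <name> placeholder with the matching parameter value.
  def render(template: String, params: Map[String, String]): String =
    params.foldLeft(template) { case (msg, (name, value)) =>
      msg.replace(s"<$name>", value)
    }
}
```

Rendering the template with `Map("badRecord" -> "2012,Tesla,S")` puts the raw line into the message, which is the piece of information a user scanning a very large input would need.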






[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

2022-12-28 Thread GitBox


MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058438876


##
core/src/main/resources/error/error-classes.json:
##
@@ -851,6 +851,11 @@
       "Cannot name the managed table as , as its associated location  already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record"

Review Comment:
   Can we provide more info to users? At least which record is malformed and
   where the problem is: schema mismatch, converter failure, and so on.


