[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107385370

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala

```diff
@@ -17,25 +17,35 @@ package org.apache.spark.sql.catalyst.util

-object ParseModes {
-  val PERMISSIVE_MODE = "PERMISSIVE"
-  val DROP_MALFORMED_MODE = "DROPMALFORMED"
-  val FAIL_FAST_MODE = "FAILFAST"
+import org.apache.spark.internal.Logging

-  val DEFAULT = PERMISSIVE_MODE
+object ParseMode extends Enumeration with Logging {
```

Comment: It seems people usually use `sealed trait` and `case object` to implement enums in Scala; see http://stackoverflow.com/questions/1898932/case-objects-vs-enumerations-in-scala

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
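For reference, the `sealed trait` / `case object` pattern the reviewer links to could look like the following for this PR's parse modes. This is only a sketch with illustrative names and fallback behavior, not the code Spark merged:

```scala
// Sketch of the sealed-trait enum pattern, using this PR's mode names.
sealed trait ParseMode { def name: String }
case object PermissiveMode extends ParseMode { val name: String = "PERMISSIVE" }
case object DropMalformedMode extends ParseMode { val name: String = "DROPMALFORMED" }
case object FailFastMode extends ParseMode { val name: String = "FAILFAST" }

object ParseMode {
  // `name` is a val on a singleton object, so it is a stable identifier
  // and can be used directly in a pattern.
  def fromString(mode: String): ParseMode = mode.toUpperCase match {
    case PermissiveMode.name    => PermissiveMode
    case DropMalformedMode.name => DropMalformedMode
    case FailFastMode.name      => FailFastMode
    case _                      => PermissiveMode // fall back to permissive
  }
}
```

Because the trait is sealed, the compiler can warn about non-exhaustive matches, which is the main advantage of this pattern over `scala.Enumeration`.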
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107385007

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala (the same `object ParseMode extends Enumeration with Logging` hunk shown above)

Comment: It's not public, not a big deal.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107243921

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala (the same `object ParseMode extends Enumeration with Logging` hunk shown above)

Comment: Not sure whether we should use a Java enum instead. cc @cloud-fan
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107235659

Diff: python/pyspark/sql/streaming.py

```diff
@@ -625,6 +625,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                                   value being read. If None is set, it uses the default value,
                                   ``-1`` meaning unlimited length.
+        :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
+                                            Spark will log. However, it does not log them after
+                                            2.2.0. This parameter exists only for backwards
+                                            compatibility for positional arguments.
```

Comment: The same here.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107235501

Diff: python/pyspark/sql/readwriter.py

```diff
@@ -369,10 +369,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                                   value being read. If None is set, it uses the default value,
                                   ``-1`` meaning unlimited length.
-        :param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will
-                                            log for each partition. Malformed records beyond this
-                                            number will be ignored. If None is set, it
-                                            uses the default value, ``10``.
+        :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
+                                            Spark will log. However, it does not log them after
+                                            2.2.0. This parameter exists only for backwards
+                                            compatibility for positional arguments.
```

Comment: Let us simplify it to

> This parameter is no longer used since Spark 2.2.0. If specified, it is ignored.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107169080

Diff: python/pyspark/sql/streaming.py

```diff
@@ -625,6 +625,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                                   value being read. If None is set, it uses the default value,
                                   ``-1`` meaning unlimited length.
+        :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
```

Comment: It seems this documentation was missed. See above - https://github.com/apache/spark/pull/17377/files#diff-1ffa6007687db29eb32770f95d817144L572
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107171523

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala

```diff
@@ -1083,83 +1083,59 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }

   test("Corrupt records: PERMISSIVE mode, without designated column for malformed records") {
-    withTempView("jsonTable") {
-      val schema = StructType(
-        StructField("a", StringType, true) ::
-          StructField("b", StringType, true) ::
-          StructField("c", StringType, true) :: Nil)
+    val schema = StructType(
+      StructField("a", StringType, true) ::
+        StructField("b", StringType, true) ::
+        StructField("c", StringType, true) :: Nil)

-      val jsonDF = spark.read.schema(schema).json(corruptRecords)
-      jsonDF.createOrReplaceTempView("jsonTable")
+    val jsonDF = spark.read.schema(schema).json(corruptRecords)

-      checkAnswer(
-        sql(
-          """
-            |SELECT a, b, c
-            |FROM jsonTable
-          """.stripMargin),
-        Seq(
-          // Corrupted records are replaced with null
-          Row(null, null, null),
-          Row(null, null, null),
-          Row(null, null, null),
-          Row("str_a_4", "str_b_4", "str_c_4"),
-          Row(null, null, null))
-      )
-    }
+    checkAnswer(
+      jsonDF.select($"a", $"b", $"c"),
+      Seq(
+        // Corrupted records are replaced with null
+        Row(null, null, null),
+        Row(null, null, null),
+        Row(null, null, null),
+        Row("str_a_4", "str_b_4", "str_c_4"),
+        Row(null, null, null))
+    )
   }

   test("Corrupt records: PERMISSIVE mode, with designated column for malformed records") {
     // Test if we can query corrupt records.
     withSQLConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD.key -> "_unparsed") {
-      withTempView("jsonTable") {
-        val jsonDF = spark.read.json(corruptRecords)
-        jsonDF.createOrReplaceTempView("jsonTable")
-        val schema = StructType(
-          StructField("_unparsed", StringType, true) ::
+      val jsonDF = spark.read.json(corruptRecords)
+      val schema = StructType(
+        StructField("_unparsed", StringType, true) ::
           StructField("a", StringType, true) ::
           StructField("b", StringType, true) ::
           StructField("c", StringType, true) :: Nil)
-        assert(schema === jsonDF.schema)
```

Comment: Here too. While trying to check related other PRs, I saw some minor comments in https://github.com/apache/spark/pull/14929. The actual changes are as below:

**From**

```
withTempView("jsonTable") {
  ...
  jsonDF.createOrReplaceTempView("jsonTable")
  ...
  sql(
    """
      |SELECT a, b, c, _unparsed
      |FROM jsonTable
    """.stripMargin),
  ...
  sql(
    """
      |SELECT a, b, c
      |FROM jsonTable
      |WHERE _unparsed IS NULL
    """.stripMargin),
  ...
  sql(
    """
      |SELECT _unparsed
      |FROM jsonTable
      |WHERE _unparsed IS NOT NULL
    """.stripMargin),
  ...
}
```

**To**

```
...
jsonDF.select($"a", $"b", $"c", $"_unparsed"),
...
jsonDF.filter($"_unparsed".isNull).select($"a", $"b", $"c"),
...
jsonDF.filter($"_unparsed".isNotNull).select($"_unparsed"),
...
```
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107170585

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala

```diff
@@ -1083,83 +1083,59 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }

   test("Corrupt records: PERMISSIVE mode, without designated column for malformed records") {
-    withTempView("jsonTable") {
-      val schema = StructType(
-        StructField("a", StringType, true) ::
-          StructField("b", StringType, true) ::
-          StructField("c", StringType, true) :: Nil)
+    val schema = StructType(
```

Comment: While trying to check related other PRs, I saw some minor comments in https://github.com/apache/spark/pull/14929. The actual changes are as below:

**From**

```
withTempView("jsonTable") {
  ...
  jsonDF.createOrReplaceTempView("jsonTable")
  ...
  sql(
    """
      |SELECT a, b, c
      |FROM jsonTable
    """.stripMargin),
  ...
}
```

**To**

```
jsonDF.select($"a", $"b", $"c"),
```
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107169305

Diff: python/pyspark/sql/readwriter.py (the same `maxMalformedLogPerPartition` docstring hunk shown above)

Comment: We can't just remove this option. Otherwise, it will break existing Python code that passes these options as positional arguments.
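The compatibility hazard the reviewer describes can be sketched in Scala as well (the same reasoning applies to PySpark's `csv` reader). The signature below is hypothetical, not Spark's actual API:

```scala
// Illustrative only: a hypothetical reader whose now-unused parameter is
// kept in place. If `maxMalformedLogPerPartition` were deleted from the
// signature, the positional call below would bind "FAILFAST" to the wrong
// slot (or fail to compile) -- which is why the ignored parameter stays.
def csv(path: String,
        sep: String = ",",
        maxMalformedLogPerPartition: Int = 10, // no longer used, kept for compatibility
        mode: String = "PERMISSIVE"): String =
  s"csv(path=$path, sep=$sep, mode=$mode)"

// A caller relying on parameter positions rather than names:
val result = csv("/tmp/data.csv", ";", 5, "FAILFAST")
```

In Python the breakage would be silent rather than a compile error, which makes keeping the dead parameter even more important there.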
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107169934

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala

```diff
@@ -17,25 +17,35 @@ package org.apache.spark.sql.catalyst.util

-object ParseModes {
-  val PERMISSIVE_MODE = "PERMISSIVE"
-  val DROP_MALFORMED_MODE = "DROPMALFORMED"
-  val FAIL_FAST_MODE = "FAILFAST"
+import org.apache.spark.internal.Logging

-  val DEFAULT = PERMISSIVE_MODE
+object ParseMode extends Enumeration with Logging {
+  type ParseMode = Value

-  def isValidMode(mode: String): Boolean = {
-    mode.toUpperCase match {
-      case PERMISSIVE_MODE | DROP_MALFORMED_MODE | FAIL_FAST_MODE => true
-      case _ => false
-    }
-  }
+  /**
+   * This mode permissively parses the records.
+   */
+  val Permissive = Value("PERMISSIVE")
+
+  /**
+   * This mode ignores the whole corrupted records.
+   */
+  val DropMalformed = Value("DROPMALFORMED")
+
+  /**
+   * This mode throws an exception when it meets corrupted records.
+   */
+  val FailFast = Value("FAILFAST")

-  def isDropMalformedMode(mode: String): Boolean = mode.toUpperCase == DROP_MALFORMED_MODE
-  def isFailFastMode(mode: String): Boolean = mode.toUpperCase == FAIL_FAST_MODE
-  def isPermissiveMode(mode: String): Boolean = if (isValidMode(mode)) {
-    mode.toUpperCase == PERMISSIVE_MODE
-  } else {
-    true // We default to permissive is the mode string is not valid
+  /**
+   * Returns `ParseMode` enum from the given string.
+   */
+  def fromString(mode: String): ParseMode = mode.toUpperCase match {
+    case "PERMISSIVE" => ParseMode.Permissive
```

Comment: We can't use `Permissive.toString` in the pattern:

```
Error:(34, 33) stable identifier required, but ParseMode.Permissive.toString found.
    case ParseMode.Permissive.toString => ParseMode.Permissive
                              ^
```
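The stable-identifier restriction mentioned above can be reproduced in isolation. A `def` such as `toString` cannot appear inside a pattern, but comparing against it in a guard works; the names below are illustrative, not Spark's code:

```scala
// `Value("PERMISSIVE")` makes the enum value's toString return "PERMISSIVE".
object Modes extends Enumeration {
  val Permissive = Value("PERMISSIVE")
  val FailFast = Value("FAILFAST")
}

def fromString(mode: String): Modes.Value = mode.toUpperCase match {
  // case Modes.Permissive.toString => ...  // rejected by the compiler:
  //   "stable identifier required, but Modes.Permissive.toString found."
  case m if m == Modes.Permissive.toString => Modes.Permissive
  case m if m == Modes.FailFast.toString   => Modes.FailFast
  case _                                   => Modes.Permissive // default
}
```

A `val` holding the name would also qualify as a stable identifier, which is why string literals (or named `val`s) end up in such patterns.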
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/17377

[SPARK-19949][SQL][FOLLOW-UP] Make parse modes as enum and update related comments

## What changes were proposed in this pull request?

This PR proposes to make the `mode` option in both CSV and JSON use an enumeration, and to fix some comments related to the previous fix. It also modifies some tests related to parse modes.

## How was this patch tested?

Modified unit tests in both `CSVSuite.scala` and `JsonSuite.scala`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-19949

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17377.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17377