[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107385370
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
@@ -17,25 +17,35 @@
 
 package org.apache.spark.sql.catalyst.util
 
-object ParseModes {
-  val PERMISSIVE_MODE = "PERMISSIVE"
-  val DROP_MALFORMED_MODE = "DROPMALFORMED"
-  val FAIL_FAST_MODE = "FAILFAST"
+import org.apache.spark.internal.Logging
 
-  val DEFAULT = PERMISSIVE_MODE
+object ParseMode extends Enumeration with Logging {
--- End diff --

It seems people usually use a `sealed trait` with `case object`s to implement enums in Scala; see http://stackoverflow.com/questions/1898932/case-objects-vs-enumerations-in-scala
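
For illustration, a minimal sketch of that pattern applied to these modes (the names and the fallback here are assumptions for the sketch, not code from this PR):

```
sealed trait ParseMode {
  // The option string users pass, e.g. .option("mode", "FAILFAST").
  def name: String
}

case object PermissiveMode extends ParseMode { val name = "PERMISSIVE" }
case object DropMalformedMode extends ParseMode { val name = "DROPMALFORMED" }
case object FailFastMode extends ParseMode { val name = "FAILFAST" }

object ParseMode {
  // Falling back to PERMISSIVE mirrors the old ParseModes behaviour for
  // invalid mode strings.
  def fromString(mode: String): ParseMode = mode.toUpperCase match {
    case "PERMISSIVE"    => PermissiveMode
    case "DROPMALFORMED" => DropMalformedMode
    case "FAILFAST"      => FailFastMode
    case _               => PermissiveMode
  }
}
```

The main benefit over `Enumeration` is compile-time exhaustiveness: the compiler can warn when a `match` on a `ParseMode` value misses a case.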



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107385007
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
@@ -17,25 +17,35 @@
 
 package org.apache.spark.sql.catalyst.util
 
-object ParseModes {
-  val PERMISSIVE_MODE = "PERMISSIVE"
-  val DROP_MALFORMED_MODE = "DROPMALFORMED"
-  val FAIL_FAST_MODE = "FAILFAST"
+import org.apache.spark.internal.Logging
 
-  val DEFAULT = PERMISSIVE_MODE
+object ParseMode extends Enumeration with Logging {
--- End diff --

It's not public, so it's not a big deal.



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107243921
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
@@ -17,25 +17,35 @@
 
 package org.apache.spark.sql.catalyst.util
 
-object ParseModes {
-  val PERMISSIVE_MODE = "PERMISSIVE"
-  val DROP_MALFORMED_MODE = "DROPMALFORMED"
-  val FAIL_FAST_MODE = "FAILFAST"
+import org.apache.spark.internal.Logging
 
-  val DEFAULT = PERMISSIVE_MODE
+object ParseMode extends Enumeration with Logging {
--- End diff --

Not sure whether we should use a Java enum instead. cc @cloud-fan



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107235659
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -625,6 +625,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
     :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                               value being read. If None is set, it uses the default value,
                               ``-1`` meaning unlimited length.
+    :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
+                                        Spark will log. However, it does not log them after
+                                        2.2.0. This parameter exists only for backwards
+                                        compatibility for positional arguments.
--- End diff --

The same here.



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107235501
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -369,10 +369,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
     :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                               value being read. If None is set, it uses the default value,
                               ``-1`` meaning unlimited length.
-    :param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will
-                                        log for each partition. Malformed records beyond this
-                                        number will be ignored. If None is set, it
-                                        uses the default value, ``10``.
+    :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
+                                        Spark will log. However, it does not log them after
+                                        2.2.0. This parameter exists only for backwards
+                                        compatibility for positional arguments.
--- End diff --

Let us simplify it to:
> This parameter is no longer used since Spark 2.2.0. If specified, it is ignored.



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107169080
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -625,6 +625,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
     :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                               value being read. If None is set, it uses the default value,
                               ``-1`` meaning unlimited length.
+    :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
--- End diff --

It seems this documentation was missed. See above: https://github.com/apache/spark/pull/17377/files#diff-1ffa6007687db29eb32770f95d817144L572



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107171523
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -1083,83 +1083,59 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }
 
   test("Corrupt records: PERMISSIVE mode, without designated column for malformed records") {
-    withTempView("jsonTable") {
-      val schema = StructType(
-        StructField("a", StringType, true) ::
-          StructField("b", StringType, true) ::
-          StructField("c", StringType, true) :: Nil)
+    val schema = StructType(
+      StructField("a", StringType, true) ::
+        StructField("b", StringType, true) ::
+        StructField("c", StringType, true) :: Nil)
 
-      val jsonDF = spark.read.schema(schema).json(corruptRecords)
-      jsonDF.createOrReplaceTempView("jsonTable")
+    val jsonDF = spark.read.schema(schema).json(corruptRecords)
 
-      checkAnswer(
-        sql(
-          """
-            |SELECT a, b, c
-            |FROM jsonTable
-          """.stripMargin),
-        Seq(
-          // Corrupted records are replaced with null
-          Row(null, null, null),
-          Row(null, null, null),
-          Row(null, null, null),
-          Row("str_a_4", "str_b_4", "str_c_4"),
-          Row(null, null, null))
-      )
-    }
+    checkAnswer(
+      jsonDF.select($"a", $"b", $"c"),
+      Seq(
+        // Corrupted records are replaced with null
+        Row(null, null, null),
+        Row(null, null, null),
+        Row(null, null, null),
+        Row("str_a_4", "str_b_4", "str_c_4"),
+        Row(null, null, null))
+    )
   }
 
   test("Corrupt records: PERMISSIVE mode, with designated column for malformed records") {
     // Test if we can query corrupt records.
     withSQLConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD.key -> "_unparsed") {
-      withTempView("jsonTable") {
-        val jsonDF = spark.read.json(corruptRecords)
-        jsonDF.createOrReplaceTempView("jsonTable")
-        val schema = StructType(
-          StructField("_unparsed", StringType, true) ::
+      val jsonDF = spark.read.json(corruptRecords)
+      val schema = StructType(
+        StructField("_unparsed", StringType, true) ::
          StructField("a", StringType, true) ::
          StructField("b", StringType, true) ::
          StructField("c", StringType, true) :: Nil)
 
-        assert(schema === jsonDF.schema)
--- End diff --

Here too. The actual changes are as below:

**From**

```
withTempView("jsonTable") {
  ...
  jsonDF.createOrReplaceTempView("jsonTable")
  ...
  sql(
    """
      |SELECT a, b, c, _unparsed
      |FROM jsonTable
    """.stripMargin),
  ...
  sql(
    """
      |SELECT a, b, c
      |FROM jsonTable
      |WHERE _unparsed IS NULL
    """.stripMargin),
  ...
  sql(
    """
      |SELECT _unparsed
      |FROM jsonTable
      |WHERE _unparsed IS NOT NULL
    """.stripMargin),
  ...
}
```

**To**

```
...
jsonDF.select($"a", $"b", $"c", $"_unparsed"),
...
jsonDF.filter($"_unparsed".isNull).select($"a", $"b", $"c"),
...
jsonDF.filter($"_unparsed".isNotNull).select($"_unparsed"),
...
```



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107170585
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -1083,83 +1083,59 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }
 
   test("Corrupt records: PERMISSIVE mode, without designated column for malformed records") {
-    withTempView("jsonTable") {
-      val schema = StructType(
-        StructField("a", StringType, true) ::
-          StructField("b", StringType, true) ::
-          StructField("c", StringType, true) :: Nil)
+    val schema = StructType(
--- End diff --

While checking other related PRs, I saw some minor comments in https://github.com/apache/spark/pull/14929.

The actual changes are as below:

**From**

```
withTempView("jsonTable") {
  ...
  jsonDF.createOrReplaceTempView("jsonTable")
  ...
  sql(
    """
      |SELECT a, b, c
      |FROM jsonTable
    """.stripMargin),
  ...
}
```

**To**

```
jsonDF.select($"a", $"b", $"c"),
```



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107169305
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -369,10 +369,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
     :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                               value being read. If None is set, it uses the default value,
                               ``-1`` meaning unlimited length.
-    :param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will
-                                        log for each partition. Malformed records beyond this
-                                        number will be ignored. If None is set, it
-                                        uses the default value, ``10``.
+    :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
--- End diff --

We can't just remove this option; otherwise, it would break existing Python code that passes these options as positional arguments.
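
To make the concern concrete, here is a Scala analogue (hypothetical signature and names; the API actually at stake is the Python `csv()` reader):

```
object PositionalArgsDemo extends App {
  // A reader-style method with defaults, mirroring the shape of the Python one:
  def csv(path: String,
          maxMalformedLogPerPartition: Int = 10,
          mode: String = "PERMISSIVE"): Unit =
    println(s"reading $path with mode=$mode")

  // An existing caller passing everything positionally:
  csv("data.csv", 10, "FAILFAST")

  // Deleting maxMalformedLogPerPartition from the signature would break this
  // call, because the remaining arguments would shift by one position.
}
```

In Python the shift is worse: with no static types, later positional arguments would silently bind to the wrong parameters. Hence the parameter is kept as an accepted-but-ignored placeholder.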



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17377#discussion_r107169934
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
@@ -17,25 +17,35 @@
 
 package org.apache.spark.sql.catalyst.util
 
-object ParseModes {
-  val PERMISSIVE_MODE = "PERMISSIVE"
-  val DROP_MALFORMED_MODE = "DROPMALFORMED"
-  val FAIL_FAST_MODE = "FAILFAST"
+import org.apache.spark.internal.Logging
 
-  val DEFAULT = PERMISSIVE_MODE
+object ParseMode extends Enumeration with Logging {
+  type ParseMode = Value
 
-  def isValidMode(mode: String): Boolean = {
-    mode.toUpperCase match {
-      case PERMISSIVE_MODE | DROP_MALFORMED_MODE | FAIL_FAST_MODE => true
-      case _ => false
-    }
-  }
+  /**
+   * This mode permissively parses the records.
+   */
+  val Permissive = Value("PERMISSIVE")
+
+  /**
+   * This mode ignores the whole corrupted records.
+   */
+  val DropMalformed = Value("DROPMALFORMED")
+
+  /**
+   * This mode throws an exception when it meets corrupted records.
+   */
+  val FailFast = Value("FAILFAST")
 
-  def isDropMalformedMode(mode: String): Boolean = mode.toUpperCase == DROP_MALFORMED_MODE
-  def isFailFastMode(mode: String): Boolean = mode.toUpperCase == FAIL_FAST_MODE
-  def isPermissiveMode(mode: String): Boolean = if (isValidMode(mode))  {
-    mode.toUpperCase == PERMISSIVE_MODE
-  } else {
-    true // We default to permissive is the mode string is not valid
+  /**
+   * Returns `ParseMode` enum from the given string.
+   */
+  def fromString(mode: String): ParseMode = mode.toUpperCase match {
+    case "PERMISSIVE" => ParseMode.Permissive
--- End diff --

We can't use `Permissive.toString` in the pattern position; it fails with:

```
Error:(34, 33) stable identifier required, but ParseMode.Permissive.toString found.
  case ParseMode.Permissive.toString => ParseMode.Permissive
                                ^
```
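
For context, Scala patterns accept only stable identifiers, not computed expressions like `Permissive.toString`. A self-contained sketch of one workaround (illustrative names, not this PR's code):

```
object ParseModeDemo extends Enumeration {
  type ParseMode = Value
  val Permissive = Value("PERMISSIVE")
  val FailFast = Value("FAILFAST")

  // Binding each string to a capitalized val yields a stable identifier,
  // which is legal in pattern position.
  private val PermissiveName = Permissive.toString
  private val FailFastName = FailFast.toString

  // e.g. ParseModeDemo.fromString("failfast") == ParseModeDemo.FailFast
  def fromString(mode: String): ParseMode = mode.toUpperCase match {
    case PermissiveName => Permissive
    case FailFastName   => FailFast
    case other          => throw new IllegalArgumentException(s"Unknown mode: $other")
  }
}
```

A guard (`case m if m == Permissive.toString => ...`) also works; matching on the literal strings, as done here, is the simplest option.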



[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...

2017-03-21 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17377

[SPARK-19949][SQL][FOLLOW-UP] Make parse modes as enum and update related 
comments

## What changes were proposed in this pull request?

This PR proposes to make the `mode` option in both CSV and JSON use an enumeration, and fixes some comments related to the previous fix.

Also, this PR modifies some tests related to parse modes.

## How was this patch tested?

Modified unit tests in both `CSVSuite.scala` and `JsonSuite.scala`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-19949

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17377.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17377





