[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 > mind fixing PR description accordingly? @HyukjinKwon fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240041107

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---

@@ -347,17 +347,28 @@ class JacksonParser(
       schema: StructType,
       fieldConverters: Array[ValueConverter]): InternalRow = {
     val row = new GenericInternalRow(schema.length)
+    var badRecordException: Option[Throwable] = None
+
     while (nextUntil(parser, JsonToken.END_OBJECT)) {
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
-          row.update(index, fieldConverters(index).apply(parser))
-
+          try {
+            row.update(index, fieldConverters(index).apply(parser))
+          } catch {
+            case NonFatal(e) =>
+              badRecordException = badRecordException.orElse(Some(e))
+              parser.skipChildren()
+          }
         case None =>
           parser.skipChildren()
       }
     }
-    row
+    if (badRecordException.isEmpty) {
+      row
+    } else {
+      throw BadRecordException(
+        () => UTF8String.EMPTY_UTF8, () => Some(row), badRecordException.get)

--- End diff --

I added a separate exception
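The pattern in the diff above (convert every field, remember only the first failure, and still hand back the partially filled row) can be sketched outside Spark in plain Scala. This is only an illustrative sketch: the `Converter` type and all names here are stand-ins, not Spark's API.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for Spark's per-field value converters; each may throw.
type Converter = String => Any

// Convert every field, keep partial results, and remember only the first
// failure: badRecordException.orElse(Some(e)) keeps the earliest exception.
def convertRow(
    values: Seq[String],
    converters: Seq[Converter]): (Array[Any], Option[Throwable]) = {
  val row = new Array[Any](values.length)
  var badRecordException: Option[Throwable] = None
  values.indices.foreach { i =>
    try {
      row(i) = converters(i)(values(i))
    } catch {
      case NonFatal(e) =>
        badRecordException = badRecordException.orElse(Some(e))
        row(i) = null // failed field stays null, as in PERMISSIVE mode
    }
  }
  (row, badRecordException)
}

val converters: Seq[Converter] = Seq(_.toDouble, _.toInt, identity)
val (row, err) = convertRow(Seq("1.5", "not-an-int", "ok"), converters)
assert(row(0) == 1.5 && row(1) == null && row(2) == "ok")
assert(err.isDefined) // first failure is preserved for the caller to rethrow
```

In the actual PR, the preserved row and the first exception are wrapped into a `BadRecordException`-like carrier so the caller can both report the bad record and emit the partial result.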
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r240038837

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

I didn't mean type inference in partition values, but you are probably right: we should follow the same logic for schema inference in datasources and in partition value types. Just wondering how it works now. This code: https://github.com/apache/spark/blob/5a140b7844936cf2b65f08853b8cfd8c499d4f13/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L474-L482 and this: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L163 can use different timestamp patterns. Or is it supposed to work only with default settings? Maybe `inferPartitionColumnValue` should ask the datasource to infer date/timestamp types?
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r240031238

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

Yes, date parsing can be less strict in that case, but if we prefer `TimestampType` over `DateType` for similar date and timestamp patterns, we will consume more memory. Also, we can lift `DateType` to `TimestampType` during schema inference, but the opposite way is impossible.
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240030751

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---

@@ -229,6 +229,11 @@ private[json] trait TestJsonData {
       """{"date": "27/10/2014 18:30"}""" ::
       """{"date": "28/01/2016 20:00"}""" :: Nil))(Encoders.STRING)

+  def badRecords: Dataset[String] =

--- End diff --

moved
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240029821 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -35,7 +35,9 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 3.0, CSV datasource uses java.time API for parsing and generating CSV content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. - - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, the returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, the returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. --- End diff -- `from_csv` was added recently. It didn't exist in 2.4
[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23257 @cloud-fan What do you think of the PR?
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240014440 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. --- End diff -- Fixed for CSV too
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240006479

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---

@@ -2563,4 +2563,18 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(!files.exists(_.getName.endsWith("json")))
     }
   }
+
+  test("return partial result for bad records") {
+    val schema = new StructType()
+      .add("a", DoubleType)
+      .add("b", ArrayType(IntegerType))
+      .add("c", StringType)
+      .add("_corrupt_record", StringType)

--- End diff --

replaced
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23196 @srowen @HyukjinKwon @gatorsmile Could you take a look at the PR?
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r24694

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff --

@cloud-fan This PR proposes changes similar to https://github.com/apache/spark/pull/23120. Could you take a look at it?

> For such behavior change, shall we add a config to roll back to previous behavior?

I don't think it makes sense to introduce a global SQL config for this particular case. The risk of breaking user apps is low because app logic cannot be based only on the presence of all nulls in a row: all nulls do not distinguish bad JSON records from well-formed ones. From my point of view, a note in the migration guide is enough.
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r24411

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

The order can matter if you have the same (or a similar) pattern for dates and timestamps. `DateType` can be preferable because it requires less memory. It seems reasonable to move from `DateType` to `TimestampType` during schema inference, since the opposite move is impossible without losing info.
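The widening direction discussed above (lift `DateType` to `TimestampType` when rows disagree, never the reverse) can be modeled with a toy type lattice in plain Scala. This is a hedged sketch of the idea only; the type names below are stand-ins, not Spark's `DataType` hierarchy:

```scala
// Toy model of the inference lattice: DateType can be widened to
// TimestampType while inferring a schema, but not the other way round.
sealed trait InferredType
case object DateT extends InferredType
case object TimestampT extends InferredType
case object StringT extends InferredType

// When two rows disagree, pick the narrowest type that can hold both values.
def compatibleType(a: InferredType, b: InferredType): InferredType = (a, b) match {
  case (x, y) if x == y => x
  case (DateT, TimestampT) | (TimestampT, DateT) => TimestampT // lossless lift
  case _ => StringT // no lossless widening: fall back to string
}

assert(compatibleType(DateT, DateT) == DateT)
assert(compatibleType(DateT, TimestampT) == TimestampT)
assert(compatibleType(TimestampT, StringT) == StringT)
```

Preferring `DateT` until a timestamp value forces the lift is what saves memory: a date needs only days-since-epoch, while a timestamp stores microseconds.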
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r24119

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff --

I added an example of when there is a difference, and updated the migration guide.
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r239998397

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff --

In the `PERMISSIVE` mode, there is no way. But even at the moment (without the PR) you cannot distinguish a row produced from a bad record from a row produced from a JSON object whose fields are all `null`. A row with all nulls cannot by itself indicate a bad record; an additional flag is needed, and a `null` or non-`null` value in the corrupt column plays that role.
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239998126

--- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---

@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)

--- End diff --

Tests passed with the new parser. I reverted all the settings for `HiveCompatibilitySuite`.
[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23257 @gatorsmile Please, have a look at this PR.
[GitHub] spark pull request #23257: [SPARK-26310][SQL] Verify applicability of JSON o...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23257 [SPARK-26310][SQL] Verify applicability of JSON options

## What changes were proposed in this pull request?

In the PR, I propose additional verification of JSON options, in particular, detection of `read` options used in `write` and vice versa. The verification can be disabled by the new SQL config `spark.sql.verifyDataSourceOptions` (`true` by default). These changes should prevent misuse of JSON options and improve the usage experience of the built-in datasource.

## How was this patch tested?

Added a new test to `JacksonGeneratorSuite` that checks how a read option is handled if it is passed on write. The test also checks that the new SQL config can disable the verification.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 sep-options-in-save-load Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23257

commit edbe0b7e7f29631f6b92255a7b561c916a6c39cc Author: Maxim Gekk Date: 2018-12-07T17:00:11Z Introducing JSONOptionsInWrite

commit af1507093f42a206c2c22f6c40cf4c43290244b8 Author: Maxim Gekk Date: 2018-12-07T20:34:24Z Test for options verification
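The verification the PR describes (rejecting read options passed on write, and vice versa, unless a config disables the check) can be sketched in plain Scala. The option names and the function below are illustrative assumptions, not the PR's actual implementation:

```scala
// Hypothetical split of JSON options by where they apply; the real sets
// live in Spark's JSONOptions / JSONOptionsInWrite classes.
val readOnlyOptions = Set("mode", "columnNameOfCorruptRecord")
val writeOnlyOptions = Set("compression")

// Reject options that are not applicable to the current operation.
// `enabled` models the spark.sql.verifyDataSourceOptions config.
def verifyOptions(
    options: Map[String, String],
    isRead: Boolean,
    enabled: Boolean = true): Unit = {
  if (enabled) {
    val misused =
      if (isRead) options.keySet.intersect(writeOnlyOptions)
      else options.keySet.intersect(readOnlyOptions)
    require(misused.isEmpty, s"Options not applicable: ${misused.mkString(", ")}")
  }
}

verifyOptions(Map("mode" -> "PERMISSIVE"), isRead = true)                    // read option on read: ok
verifyOptions(Map("mode" -> "PERMISSIVE"), isRead = false, enabled = false)  // check disabled: ok
val failed = scala.util.Try(verifyOptions(Map("mode" -> "PERMISSIVE"), isRead = false))
assert(failed.isFailure) // read option passed on write is rejected
```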
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r239853384

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff --

If the row was produced from a bad JSON record, the bad record is placed in the corrupt column; otherwise the corrupt column contains `null`.
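The indicator described above (the corrupt column, not the null-ness of the data fields, marks a bad record) can be illustrated with a minimal sketch. The `ParsedRow` case class is a hypothetical stand-in for a Spark row with a `_corrupt_record` column:

```scala
// Sketch: a parsed row carries the raw record in the corrupt column
// only when parsing (partially) failed.
case class ParsedRow(fields: Map[String, Any], corruptRecord: Option[String])

def isBadRecord(row: ParsedRow): Boolean = row.corruptRecord.isDefined

// A well-formed JSON object whose fields happen to all be null:
val allNullButGood = ParsedRow(Map("a" -> null, "b" -> null), corruptRecord = None)

// A partially parsed bad record: "a" converted, "b" did not:
val partialBad = ParsedRow(
  Map("a" -> 1.5, "b" -> null),
  corruptRecord = Some("""{"a": 1.5, "b": "oops"}"""))

// All-null fields alone do not mean the record was bad:
assert(!isBadRecord(allNullButGood))
assert(isBadRecord(partialBad))
```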
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23253 jenkins, retest this, please
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23253 [SPARK-26303][SQL] Return partial results for bad JSON records ## What changes were proposed in this pull request? In the PR, I propose to return partial results from JSON datasource and JSON functions in the PERMISSIVE mode if some of JSON fields are parsed and converted to desired types successfully. The changes are made only for `StructType`. Whole bad JSON records are placed into the corrupt column specified by the `columnNameOfCorruptRecord` option or SQL config. Partial results are not returned for malformed JSON input. ## How was this patch tested? Added new UT which checks converting JSON strings with one invalid and one valid field at the end of the string. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-bad-record Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23253.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23253 commit 2d2fed923e9567b6bc22cf4355ee3b57f8699cd4 Author: Maxim Gekk Date: 2018-12-06T21:26:42Z Test for bad records commit de4885899b2a5e866b6c4a365e91fc901cdc0f86 Author: Maxim Gekk Date: 2018-12-06T23:11:29Z Added a test commit 101df4e7fcee60c042c21951551bd3376ae325cb Author: Maxim Gekk Date: 2018-12-06T23:16:09Z Return partial results commit 542ae74a9240b2905f4ca1f9de62fd53f1562d64 Author: Maxim Gekk Date: 2018-12-07T09:02:37Z Updating the migration guide commit 439f57ac59a7c1f481317126934a55d20b38f579 Author: Maxim Gekk Date: 2018-12-07T09:04:58Z Revert "Updating the migration guide" This reverts commit 542ae74a9240b2905f4ca1f9de62fd53f1562d64. 
commit 77d767016607056809d5f8c4e9653809838b91dd Author: Maxim Gekk Date: 2018-12-07T09:10:31Z Updating the migration guide - separate notes commit fa431ee293d5ae3a7c3b8b441c4a0b6778ac84e1 Author: Maxim Gekk Date: 2018-12-07T09:12:16Z Updating the migration guide - minor commit e13673de2537fb42bdc2bb8df9849560ae399c89 Author: Maxim Gekk Date: 2018-12-07T10:00:07Z Updating the migration guide - minor 2
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r239547742

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

`DateType` is not inferred at all, but there is other type-inference code that could be shared between JSON and CSV (and maybe elsewhere).
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r239537694

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

Yes, we can do that. There is some common code that could be shared. Can we do it in a separate PR?
[GitHub] spark issue #23201: [SPARK-26246][SQL] Infer date and timestamp types from J...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23201 @cloud-fan May I ask you to look at this PR, please?
[GitHub] spark issue #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23202 @HyukjinKwon @srowen Is there anything which worries you in the PR?
[GitHub] spark pull request #23159: [SPARK-26191][SQL] Control truncation of Spark pl...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23159#discussion_r239269414

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

@@ -1777,7 +1777,7 @@ class Analyzer(
       case p if p.expressions.exists(hasGenerator) =>
         throw new AnalysisException("Generators are not supported outside the SELECT clause, but " +
-          "got: " + p.simpleString)
+          "got: " + p.simpleString((SQLConf.get.maxToStringFields)))

--- End diff --

fixed here and in other places
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r239269170

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +121,18 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText
+        if ((allCatch opt options.timestampFormat.parse(stringValue)).isDefined) {

--- End diff --

I made changes similar to https://github.com/apache/spark/pull/23202: strict `DateType` inference before `TimestampType`.
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23196 jenkins, retest this, please
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 jenkins, retest this, please
[GitHub] spark pull request #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial resul...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23235#discussion_r239055208 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -35,6 +35,8 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 3.0, CSV datasource uses java.time API for parsing and generating CSV content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. + - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode if specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. --- End diff -- you are right. I will remove the part about `StructType`
[GitHub] spark issue #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial results for ...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23235 @cloud-fan Please, have a look at the PR.
[GitHub] spark pull request #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial resul...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23235 [SPARK-26151][SQL][FOLLOWUP] Return partial results for bad CSV records ## What changes were proposed in this pull request? Updated SQL migration guide according to changes in https://github.com/apache/spark/pull/23120 You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 failuresafe-partial-result-followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23235.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23235 commit 8c115f7871d4db66b13ee21ea3a1231f7153791e Author: Maxim Gekk Date: 2018-12-05T12:13:26Z Updating the migration guide
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23120 The PR https://github.com/apache/spark/pull/23235 updates the SQL migration guide.
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239010321

--- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---

@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)

--- End diff --

Our current approach for converting dates is inconsistent in a few places, for example:
- `UTF8String` -> `num days` uses hardcoded `GMT` and ignores the SQL config: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L493
- `String` -> `java.util.Date` ignores Spark's time zone settings and uses the system time zone: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L186
- In many places, even when a function accepts a timeZone parameter, it is not passed (the default time zone is used: not the one from the config but from `TimeZone.getDefault()`). For example: https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L187
- Casting to the date type depends on the type of the argument: if it is `TimestampType`, the expression-wise time zone is used, otherwise `GMT`: https://github.com/apache/spark/blob/d03e0af80d7659f12821cc2442efaeaee94d3985/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L403-L410

I am really thinking of disabling the new parser/formatter outside of the CSV/JSON datasources because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT?
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159

> Rather than change every single call to this method, if this should generally be the value of the argument, then why not make it the default value or something?

The new parameter aims to solve the problem of multiple callers, each of which needs a different maximum number of fields. So a feasible approach is to propagate `maxFields` from the callers to `truncatedString`. Changing a global SQL config does not solve the problem.
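The per-caller `maxFields` idea can be sketched in plain Scala. This is a simplified stand-in for Spark's `truncatedString` utility (its exact truncation message differs); the signature below is an assumption for illustration:

```scala
// Each caller passes its own limit instead of relying on one global config.
def truncatedString[T](seq: Seq[T], sep: String, maxFields: Int): String = {
  if (seq.length <= maxFields) {
    seq.mkString(sep)
  } else {
    // Keep maxFields - 1 items and summarize the rest.
    val shown = seq.take(maxFields - 1).map(_.toString)
    (shown :+ s"... ${seq.length - maxFields + 1} more fields").mkString(sep)
  }
}

assert(truncatedString(Seq(1, 2, 3), ", ", maxFields = 3) == "1, 2, 3")
assert(truncatedString(Seq(1, 2, 3, 4), ", ", maxFields = 3) == "1, 2, ... 2 more fields")
```

A plan printer that renders a whole query tree can then pass a large limit, while a log message can pass a small one, without either caller touching global state.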
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238853314 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- While porting to the new parser/formatter, I faced at least two problems: - The time zone from the SQL config is not taken into account during parsing at all. Basically, the functions used take the default time zone from the JVM settings. It could be fixed by `TimeZone.setDefault` or by using absolute values. - A round trip of parsing a date to `DateType` and back to a string can give a different string because `DateType` stores only days (as `Int`) since the epoch (in `UTC`). Such a representation loses the time zone offset, so exact matching is impossible due to the lack of information. (Round-trip conversion for `TimestampType`, for example, works perfectly.) This is the case for these changes. Previously it worked because the specified time zone was not used at all (it did not impact the number of days while converting a string to `DateType`). With the new parser/formatter it becomes significant, and I had to change the time zone to `GMT` to eliminate the problem of losing time zone offsets (the offset is zero for `GMT`).
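The asymmetry between the two round trips can be shown with plain java.time (a sketch of the storage models only, not Spark code; the sample values are arbitrary):

```scala
import java.time.{Instant, LocalDate, ZonedDateTime, ZoneId, ZoneOffset}

val zone = ZoneId.of("America/Los_Angeles")
val ts = ZonedDateTime.of(2015, 8, 26, 23, 30, 0, 0, zone)

// A timestamp keeps microseconds since the epoch: the exact instant survives,
// so rendering it back in the same zone reproduces the input.
val micros = ts.toInstant.getEpochSecond * 1000000L
val tsBack = Instant.ofEpochSecond(micros / 1000000L).atZone(zone)
assert(tsBack.toLocalDateTime == ts.toLocalDateTime)

// A date keeps only whole days since the epoch in UTC. The LA-local date is
// 2015-08-26, but the instant falls on UTC day 2015-08-27, and the offset
// needed to get back to the local date is gone.
val utcDays = ts.toInstant.atZone(ZoneOffset.UTC).toLocalDate.toEpochDay
val dateBack = LocalDate.ofEpochDay(utcDays)
assert(ts.toLocalDate.toString == "2015-08-26")
assert(dateBack.toString == "2015-08-27")
```

With `GMT` the local and UTC dates coincide, which is exactly why fixing the test zone to `GMT` makes the round trip lossless.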
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 @HyukjinKwon @dongjoon-hyun @srowen @zsxwing Do you have any objections to this PR?
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238696317 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- Done. Please, have a look at the changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21363 @srowen There is another PR with related changes - inferring `DateType` from CSV: https://github.com/apache/spark/pull/23202 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238194142 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- Just in case, I did an exact match here as well, see https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174 . If the date parser didn't consume all of the input (`pos.getIndex != field.length`), it fails. If I move it up in the type-inferring pipeline, it should work.
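The exact-match check described here can be sketched with `java.text` parsing (a simplified illustration of the `pos.getIndex != field.length` guard, not the PR's actual code; the helper name is hypothetical):

```scala
import java.text.{ParsePosition, SimpleDateFormat}

// Accept a value only if the pattern consumed the whole input; otherwise a
// timestamp-looking string would be mistaken for a bare date.
def parseExact(pattern: String, field: String): Option[java.util.Date] = {
  val format = new SimpleDateFormat(pattern)
  format.setLenient(false)
  val pos = new ParsePosition(0)
  val result = format.parse(field, pos)
  // reject partial matches: some input was left unconsumed
  if (result != null && pos.getIndex == field.length) Some(result) else None
}

assert(parseExact("yyyy-MM-dd", "2015-08-26").isDefined)
// The date pattern matches a prefix but leaves " 18:00" unconsumed -> reject
assert(parseExact("yyyy-MM-dd", "2015-08-26 18:00").isEmpty)
```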
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238116870 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- @gatorsmile What would you recommend to improve the text? I can add the links above so a user can figure out what the difference is in their particular case. Our tests don't show any difference with our default timestamp/date patterns, but a user can use something more specific and face a behaviour change.
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23202 [SPARK-26248][SQL] Infer date type from CSV ## What changes were proposed in this pull request? The `CSVInferSchema` class is extended to support inferring of `DateType` from CSV input. The attempt to infer `DateType` is performed after inferring `TimestampType`. ## How was this patch tested? Added new test for inferring date types from CSV . It was also tested by existing suites like `CSVInferSchemaSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 csv-date-inferring Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23202.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23202 commit a8d27d64d45ca4ea207f0d403dba472b2f606968 Author: Maxim Gekk Date: 2018-12-02T21:57:02Z Test for inferring the date type commit fa915fd3fcab37be2c53bc23a77b55e3024b3994 Author: Maxim Gekk Date: 2018-12-02T21:57:21Z Inferring date type --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23201 [SPARK-26246][SQL] Infer date and timestamp types from JSON ## What changes were proposed in this pull request? The `JsonInferSchema` class is extended to support `DateType` and `TimestampType` inferring from string fields in JSON input. It tries to infer `TimestampType` as the tightest type first. If timestamp parsing fails, `DateType` is inferred using the date pattern. If both fail, it falls back to `DateTimeUtils.stringToTime`. ## How was this patch tested? Added a new test suite - `JsonInferSchemaSuite` - to check date and timestamp type inference from JSON. These changes were tested by `JsonSuite`, `JsonExpressionsSuite` and `JsonFunctionsSuite` as well. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-infer-time Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23201.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23201 commit 2a26e2c680b517e9e89a0f4bc4cc31884020188d Author: Maxim Gekk Date: 2018-12-02T20:06:05Z Added a test for timestamp inferring commit bd472072a39dbec2e1eec1396196c6c5e6a659dd Author: Maxim Gekk Date: 2018-12-02T20:43:48Z Infer date and timestamp types commit 9dbdf0a764c998875932e50faf460f36216ef58d Author: Maxim Gekk Date: 2018-12-02T20:44:08Z Test for date type
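The tightest-type-first order described in the PR can be sketched as follows (a simplified assumption-laden illustration: the patterns are stand-ins, the type names are plain strings, and the `DateTimeUtils.stringToTime` fallback is omitted):

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Try the tightest type first: timestamp, then date, then fall back to string.
val timestampFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def inferStringType(value: String): String = {
  if (Try(LocalDateTime.parse(value, timestampFormat)).isSuccess) "TimestampType"
  else if (Try(LocalDate.parse(value, dateFormat)).isSuccess) "DateType"
  else "StringType"
}

assert(inferStringType("2018-12-02T20:06:05") == "TimestampType")
assert(inferStringType("2018-12-02") == "DateType")
assert(inferStringType("hello") == "StringType")
```

Ordering matters: every timestamp-like string also has a date-like prefix, so the timestamp attempt must come first and must require a full match.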
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238111358 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- The new implementation and the old one have slightly different pattern formats. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html . And the two Java APIs can have different behaviours. Besides that, the new one can parse timestamps with microsecond precision as a consequence of using the Java 8 java.time API.
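The precision difference can be demonstrated directly (a sketch with an arbitrary sample value; note in particular how the legacy API misreads the same pattern rather than rejecting it):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// java.time: "S" is fraction-of-second, so six digits are real microseconds.
val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
val parsed = LocalDateTime.parse("2018-12-02 11:22:33.123456", modern)
assert(parsed.getNano == 123456000)

// SimpleDateFormat: "S" is a plain millisecond count, so the same six digits
// are read as 123456 ms and the lenient calendar silently shifts the time
// by more than two minutes.
val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS")
val shifted = legacy.parse("2018-12-02 11:22:33.123456")
assert(new SimpleDateFormat("HH:mm:ss.SSS").format(shifted) == "11:24:36.456")
```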
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 @srowen I think this PR is ready. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r238097513 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -243,21 +243,27 @@ class UnivocityParser( () => getPartialResult(), new RuntimeException("Malformed CSV record")) } else { - try { -// When the length of the returned tokens is identical to the length of the parsed schema, -// we just need to convert the tokens that correspond to the required columns. -var i = 0 -while (i < requiredSchema.length) { + // When the length of the returned tokens is identical to the length of the parsed schema, + // we just need to convert the tokens that correspond to the required columns. + var badRecordException: Option[Throwable] = None + var i = 0 + while (i < requiredSchema.length) { +try { row(i) = valueConverters(i).apply(getToken(tokens, i)) - i += 1 +} catch { + case NonFatal(e) => +badRecordException = badRecordException.orElse(Some(e)) } +i += 1 + } + + if (badRecordException.isEmpty) { row - } catch { -case NonFatal(e) => - // For corrupted records with the number of tokens same as the schema, - // CSV reader doesn't support partial results. All fields other than the field - // configured by `columnNameOfCorruptRecord` are set to `null`. - throw BadRecordException(() => getCurrentInput, () => None, e) + } else { +// For corrupted records with the number of tokens same as the schema, +// CSV reader doesn't support partial results. All fields other than the field +// configured by `columnNameOfCorruptRecord` are set to `null`. --- End diff -- The comment is no longer accurate. I will remove it.
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23150#discussion_r238075711 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala --- @@ -86,62 +85,74 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper { // null. Seq(true, false).foreach { b => val options = new CSVOptions(Map("nullValue" -> "null"), false, "GMT") - val converter = -parser.makeConverter("_1", StringType, nullable = b, options = options) + val parser = new UnivocityParser(StructType(Seq.empty), options) + val converter = parser.makeConverter("_1", StringType, nullable = b) assert(converter.apply("") == UTF8String.fromString("")) } } test("Throws exception for empty string with non null type") { - val options = new CSVOptions(Map.empty[String, String], false, "GMT") +val options = new CSVOptions(Map.empty[String, String], false, "GMT") +val parser = new UnivocityParser(StructType(Seq.empty), options) val exception = intercept[RuntimeException]{ - parser.makeConverter("_1", IntegerType, nullable = false, options = options).apply("") + parser.makeConverter("_1", IntegerType, nullable = false).apply("") } assert(exception.getMessage.contains("null value found but field _1 is not nullable.")) } test("Types are cast correctly") { val options = new CSVOptions(Map.empty[String, String], false, "GMT") -assert(parser.makeConverter("_1", ByteType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", ShortType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", IntegerType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", LongType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", FloatType, options = options).apply("1.00") == 1.0) -assert(parser.makeConverter("_1", DoubleType, options = options).apply("1.00") == 1.0) -assert(parser.makeConverter("_1", BooleanType, options = options).apply("true") == true) - -val timestampsOptions = +var parser = new UnivocityParser(StructType(Seq.empty), options) +assert(parser.makeConverter("_1", ByteType).apply("10") == 10) +assert(parser.makeConverter("_1", ShortType).apply("10") == 10) +assert(parser.makeConverter("_1", IntegerType).apply("10") == 10) +assert(parser.makeConverter("_1", LongType).apply("10") == 10) +assert(parser.makeConverter("_1", FloatType).apply("1.00") == 1.0) +assert(parser.makeConverter("_1", DoubleType).apply("1.00") == 1.0) +assert(parser.makeConverter("_1", BooleanType).apply("true") == true) + +var timestampsOptions = new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy hh:mm"), false, "GMT") +parser = new UnivocityParser(StructType(Seq.empty), timestampsOptions) val customTimestamp = "31/01/2015 00:00" -val expectedTime = timestampsOptions.timestampFormat.parse(customTimestamp).getTime -val castedTimestamp = - parser.makeConverter("_1", TimestampType, nullable = true, options = timestampsOptions) +var format = FastDateFormat.getInstance( + timestampsOptions.timestampFormat, timestampsOptions.timeZone, timestampsOptions.locale) +val expectedTime = format.parse(customTimestamp).getTime +val castedTimestamp = parser.makeConverter("_1", TimestampType, nullable = true) .apply(customTimestamp) assert(castedTimestamp == expectedTime * 1000L) val customDate = "31/01/2015" val dateOptions = new CSVOptions(Map("dateFormat" -> "dd/MM/yyyy"), false, "GMT") -val expectedDate = dateOptions.dateFormat.parse(customDate).getTime -val castedDate = - parser.makeConverter("_1", DateType, nullable = true, options = dateOptions) -.apply(customTimestamp) -assert(castedDate == DateTimeUtils.millisToDays(expectedDate)) +parser = new UnivocityParser(StructType(Seq.empty), dateOptions) +format = FastDateFormat.getInstance( + dateOptions.dateFormat, dateOptions.timeZone, d
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23150#discussion_r238075664 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -622,10 +623,11 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te val options = Map( "header" -> "true", "inferSchema" -> "false", - "dateFormat" -> "dd/MM/yyyy hh:mm") + "dateFormat" -> "dd/MM/yyyy HH:mm") --- End diff -- According to the `DateTimeFormatter` pattern table: ``` h clock-hour-of-am-pm (1-12) number 12 H hour-of-day (0-23) number 0 ``` but the real data is not in the allowed range: ``` date 26/08/2015 18:00 27/10/2014 18:30 28/01/2016 20:00 ```
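The difference between the two pattern letters can be checked directly (a sketch assuming the year part of the pattern, which was lost in the archived diff, is `yyyy`; the default SMART resolver rejects out-of-range clock hours):

```scala
import java.time.format.DateTimeFormatter
import scala.util.Try

// "HH" is hour-of-day (0-23); "hh" is clock-hour-of-am-pm (1-12).
val hourOfDay = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm")
val clockHour = DateTimeFormatter.ofPattern("dd/MM/yyyy hh:mm")

// 18:00 is a valid hour-of-day...
assert(Try(hourOfDay.parse("26/08/2015 18:00")).isSuccess)
// ...but 18 is out of range for clock-hour-of-am-pm, so parsing fails.
assert(Try(clockHour.parse("26/08/2015 18:00")).isFailure)
```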
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23150#discussion_r238075585 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1107,7 +,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te test("SPARK-18699 put malformed records in a `columnNameOfCorruptRecord` field") { Seq(false, true).foreach { multiLine => - val schema = new StructType().add("a", IntegerType).add("b", TimestampType) + val schema = new StructType().add("a", IntegerType).add("b", DateType) --- End diff -- I changed the type because the supposedly valid date `"1983-08-04"` cannot be parsed with the default timestamp pattern.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 > they pass right? is there another test you were unable to add? For now, everything has passed. I ran all tests locally in different time zones (set via the JVM parameter `-Duser.timezone`). I updated the migration guide since I enabled the new parser by default.
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23196 [SPARK-26243][SQL] Use java.time API for parsing timestamps and dates from JSON ## What changes were proposed in this pull request? In the PR, I propose to switch to the **java.time API** for parsing timestamps and dates from JSON inputs with microsecond precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behavior of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates. ## How was this patch tested? It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-time-parser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23196.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23196 commit fb10b91502b67b98f2904a06b017a6e56dd6e39f Author: Maxim Gekk Date: 2018-12-01T11:01:47Z Adding DateTimeFormatter commit a9b39eccfdc3bd9da8fe52eda24ed240588b282f Author: Maxim Gekk Date: 2018-12-01T11:15:31Z Support DateTimeFormatter by JacksonParser and JacksonGenerator commit ff589f505f7aaa8c94ce304aaf77792505998c16 Author: Maxim Gekk Date: 2018-12-01T12:31:43Z Make test independent from current time zone commit 4646dededae832185a35a85244baab6507d28f0d Author: Maxim Gekk Date: 2018-12-01T18:19:31Z Fix a test by new fallback
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238070641 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter( private var univocityGenerator: Option[UnivocityGenerator] = None - override def write(row: InternalRow): Unit = { -val gen = univocityGenerator.getOrElse { - val charset = Charset.forName(params.charset) - val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) - val newGen = new UnivocityGenerator(dataSchema, os, params) - univocityGenerator = Some(newGen) - newGen -} + if (params.headerFlag) { +val gen = getGen() +gen.writeHeaders() + } + private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse { +val charset = Charset.forName(params.charset) +val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) +val newGen = new UnivocityGenerator(dataSchema, os, params) +univocityGenerator = Some(newGen) +newGen + } + + override def write(row: InternalRow): Unit = { +val gen = getGen() --- End diff -- Frankly speaking, I don't see any reason for this. For now we actually have two flags: `isGeneratorInitiated` and another one inside the lazy val. And two slightly different approaches: the `Option` type in the JSON and Text datasources, and a lazy val + flag in ORC and CSV.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 > ... if it's ready from your side @MaxGekk @srowen I am just thinking about how to reduce the number of changes in tests. In some cases, test behavior depends on the current time zone of my laptop. Without the changes in this PR, such tests passed because the expected and produced values were results of the same functions. I will try to fix that in a separate PR.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 @HyukjinKwon May I ask you to look at this PR?
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238059569 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with headers") { +withTempPath { path => + val df1 = spark.range(10).repartition(2).filter(_ < 0).map(_.toString).toDF + // we have 2 partitions but they are both empty and will be filtered out upon writing + // thanks to SPARK-23271 one new empty partition will be inserted + df1.write.format("csv").option("header", true).save(path.getAbsolutePath) + val df2 = spark.read.format("csv").option("header", true).option("inferSchema", false) +.load(path.getAbsolutePath) + assert(df1.rdd.getNumPartitions == 2) + assert(df2.rdd.getNumPartitions == 1) --- End diff -- nit: I wouldn't check the number of partitions here since it is implementation-specific and doesn't matter for the behavior checked in this test.
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238059480 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter( private var univocityGenerator: Option[UnivocityGenerator] = None - override def write(row: InternalRow): Unit = { -val gen = univocityGenerator.getOrElse { - val charset = Charset.forName(params.charset) - val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) - val newGen = new UnivocityGenerator(dataSchema, os, params) - univocityGenerator = Some(newGen) - newGen -} + if (params.headerFlag) { +val gen = getGen() +gen.writeHeaders() + } + private def getGen(): UnivocityParser = univocityGenerator.getOrElse { +val charset = Charset.forName(params.charset) +val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) +val newGen = new UnivocityGenerator(dataSchema, os, params) +univocityGenerator = Some(newGen) +newGen + } + + override def write(row: InternalRow): Unit = { +val gen = getGen() --- End diff -- @HyukjinKwon Do you mean creating the generator in a lazy val? ``` lazy val univocityGenerator = { val charset = Charset.forName(params.charset) val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) new UnivocityGenerator(dataSchema, os, params) } ``` The problem is in the close method: you will have to call `univocityGenerator.close()` there. If the lazy val wasn't instantiated before (an empty partition and the `header` option is `false`), the generator will be created and closed immediately. As a result, you will get an empty file for the empty partition. That's why I prefer the approach with `Option[UnivocityGenerator]` in https://github.com/apache/spark/pull/23052.
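The lifecycle argument above can be sketched in isolation (all names here are hypothetical; a plain `String` stands in for the real generator, and `openStream` stands in for opening the output stream, which is what creates the file):

```scala
// Option-based lifecycle: the generator (and hence the output file) is
// created lazily on first use, and close() touches it only if it was ever
// created, so an empty partition produces no file.
class LazyWriter(openStream: () => Unit) {
  private var generator: Option[String] = None

  private def getGen(): String = generator.getOrElse {
    openStream() // opening the output stream is what creates the file
    val g = "generator"
    generator = Some(g)
    g
  }

  def write(row: String): Unit = {
    val gen = getGen() // the first write triggers creation
    // gen would be used to convert and write the row in the real writer
  }

  def close(): Unit = if (generator.isDefined) {
    // flush and close the underlying stream here
  }
}

var filesCreated = 0
val emptyPartition = new LazyWriter(() => filesCreated += 1)
emptyPartition.close() // nothing was written, so no file is created
assert(filesCreated == 0)

val writer = new LazyWriter(() => filesCreated += 1)
writer.write("a"); writer.write("b") // the stream is opened exactly once
writer.close()
assert(filesCreated == 1)
```

With a bare lazy val, `close()` would have to dereference it, forcing creation even for the empty partition; the `Option` avoids that.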
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237873832 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -243,21 +243,27 @@ class UnivocityParser( () => getPartialResult(), new RuntimeException("Malformed CSV record")) } else { - try { -// When the length of the returned tokens is identical to the length of the parsed schema, -// we just need to convert the tokens that correspond to the required columns. -var i = 0 -while (i < requiredSchema.length) { + // When the length of the returned tokens is identical to the length of the parsed schema, + // we just need to convert the tokens that correspond to the required columns. + var badRecordException: Option[Throwable] = None + var i = 0 + while (i < requiredSchema.length) { --- End diff -- It depends on what kind of error we face. If the parser is still in a normal state and ready to continue, we can skip the current error. In the case of JSON, we parse input in a streaming fashion and convert values to the desired type on the fly. If JacksonParser is able to recognize the next token, why should we stop at the first error?
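The error-accumulation pattern from the diff can be sketched as a standalone function (a simplified stand-in for the `UnivocityParser` loop; the converter signatures are assumptions):

```scala
import scala.util.control.NonFatal

// Convert every field, remember only the first failure, and hand back the
// partially filled row together with that error.
def convertRow(
    tokens: Array[String],
    converters: Array[String => Any]): (Array[Any], Option[Throwable]) = {
  val row = new Array[Any](converters.length)
  var firstError: Option[Throwable] = None
  var i = 0
  while (i < converters.length) {
    try {
      row(i) = converters(i)(tokens(i))
    } catch {
      case NonFatal(e) => firstError = firstError.orElse(Some(e))
    }
    i += 1
  }
  (row, firstError)
}

val converters: Array[String => Any] =
  Array(s => s.toInt, s => s, s => s.toInt)
val (row, err) = convertRow(Array("1", "abc", "oops"), converters)

// The good fields survive, the bad one stays null, and the error is kept
// so the caller can still raise BadRecordException with a partial result.
assert(row(0) == 1)
assert(row(1) == "abc")
assert(row(2) == null)
assert(err.isDefined)
```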
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237679872 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- You can do something like this: ``` val df = spark.range(10).repartition(2).filter(_ < 0) ``` ``` scala> df.rdd.getNumPartitions res0: Int = 2 ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237633755 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") + df1.printSchema + df1.write.format("csv").option("header", true).save(path.getAbsolutePath) + val df2 = spark.read.format("csv").option("header", true).load(path.getAbsolutePath) + df2.printSchema --- End diff -- Please, remove this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237635061 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- This will create a dataframe with one empty partition. I would check the case where there are two or more empty partitions, and that each saved file has a header.
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237634432 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") + df1.printSchema + df1.write.format("csv").option("header", true).save(path.getAbsolutePath) + val df2 = spark.read.format("csv").option("header", true).load(path.getAbsolutePath) --- End diff -- I would set the `inferSchema` option to `false` explicitly because your test assumes that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237633638 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") + df1.printSchema --- End diff -- `printSchema` should be removed; it is not necessary for the test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 @cloud-fan I did the same for CSV: https://github.com/apache/spark/pull/22979 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22979 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23173 It seems this is similar to @HyukjinKwon's PR: https://github.com/apache/spark/pull/13252 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23052#discussion_r237282738 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter( context: TaskAttemptContext, params: CSVOptions) extends OutputWriter with Logging { - private val charset = Charset.forName(params.charset) + private var univocityGenerator: Option[UnivocityGenerator] = None - private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) - - private val gen = new UnivocityGenerator(dataSchema, writer, params) + override def write(row: InternalRow): Unit = { +val gen = univocityGenerator.getOrElse { + val charset = Charset.forName(params.charset) + val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) + new UnivocityGenerator(dataSchema, os, params) +} +univocityGenerator = Some(gen) --- End diff -- done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 @gatorsmile @cloud-fan Could you take a look at the changes? They were extracted from another PR: https://github.com/apache/spark/pull/22429 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23132#discussion_r237239829 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide ## Upgrading From Spark SQL 2.4 to 3.0 + - In Spark version 2.4 and earlier, accepted format of decimals parsed from JSON is an optional sign ('+' or '-'), followed by a sequence of zero or more decimal digits, optionally followed by a fraction, optionally followed by an exponent. Any commas were removed from the input before parsing. Since Spark 3.0, format varies and depends on locale which can be set via JSON option `locale`. The default locale is `en-US`. To switch back to previous behavior, set `spark.sql.legacy.decimalParsing.enabled` to `true`. --- End diff -- I removed the SQL config and the record in the migration guide. Also, I applied `DecimalFormat` only for non-`en-US` locales. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 Some tests didn't pass on the new changes until I set the time zone explicitly. The tests use the same functions for checking correctness as the code they are supposed to test. I think this needs more investigation; I guess the old (current) implementation doesn't take time zones into account in some cases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 I had to correct the timestamp/date patterns in a few tests to follow ISO 8601 (see [Patterns for Formatting and Parsing](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html)), e.g. `hh` -> `HH` and `mm` -> `MM`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
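For context, a small illustration (not from the PR itself) of why the `hh`/`HH` pattern letters matter in java.time: `HH` is hour-of-day (0-23), while `hh` is clock-hour-of-am-pm (1-12), so a 24-hour value such as 23 cannot be parsed with `hh`.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class PatternDemo {
    public static void main(String[] args) {
        // HH = hour-of-day (0-23), mm = minute; this is the ISO 8601 style.
        DateTimeFormatter iso = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");
        LocalDateTime t = LocalDateTime.parse("2018-11-26 23:15", iso);
        System.out.println(t.getHour()); // 23

        // hh = clock-hour-of-am-pm (1-12): the hour value 23 is rejected.
        DateTimeFormatter ampm = DateTimeFormatter.ofPattern("yyyy-MM-dd hh:mm");
        try {
            LocalDateTime.parse("2018-11-26 23:15", ampm);
        } catch (DateTimeParseException e) {
            System.out.println("hh rejects hour 23");
        }
    }
}
```

The same distinction holds for `MM` (month-of-year) versus `mm` (minute-of-hour), which is the other fix mentioned above.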
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 > Are there any other behavior changes with the new code, besides being able to parse microseconds? The main one is that the new parser doesn't have the fallback to `DateTimeUtils.stringToTime` if it cannot parse its input. Though I am not sure about this fallback ... if a user wants to parse input that doesn't fit the specified pattern (like `"-MM-dd'T'HH:mm:ss.SSSXXX"`), why should we guess and try another pattern instead of forcing them to set an appropriate one? I guess this fallback was added for backward compatibility with previous versions. If so, should we still keep compatibility with such versions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
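The fallback being questioned can be sketched roughly as follows; this is a hypothetical illustration of the legacy behaviour (the helper name and the default format are my own, not Spark's actual code):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class FallbackDemo {
    // Hypothetical sketch: try the user-specified pattern first and, if it
    // fails, silently guess an ISO-like default instead of raising an error,
    // mimicking the legacy stringToTime-style fallback.
    static LocalDateTime parseWithFallback(String s, DateTimeFormatter primary) {
        try {
            return LocalDateTime.parse(s, primary);
        } catch (DateTimeParseException e) {
            return LocalDateTime.parse(s, DateTimeFormatter.ISO_LOCAL_DATE_TIME);
        }
    }

    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");
        // The input doesn't match the user pattern, yet the fallback accepts it.
        LocalDateTime t = parseWithFallback("2018-11-26T10:30:00", f);
        System.out.println(t); // 2018-11-26T10:30
    }
}
```

A strict parser, by contrast, would simply let the `DateTimeParseException` propagate, forcing the user to fix the pattern.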
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 @srowen @HyukjinKwon @viirya @mgaido91 May I ask you to look at the PR. The changes are related to another PR which you have reviewed already. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237093726 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -243,21 +243,27 @@ class UnivocityParser( () => getPartialResult(), new RuntimeException("Malformed CSV record")) } else { - try { -// When the length of the returned tokens is identical to the length of the parsed schema, -// we just need to convert the tokens that correspond to the required columns. -var i = 0 -while (i < requiredSchema.length) { + // When the length of the returned tokens is identical to the length of the parsed schema, + // we just need to convert the tokens that correspond to the required columns. + var badRecordException: Option[Throwable] = None + var i = 0 + while (i < requiredSchema.length) { --- End diff -- but we lose field values that could be converted successfully after the exception. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
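The idea of keeping the field values that convert successfully after a failure, while remembering only the first exception, can be sketched generically like this (an illustration of the technique, not the Spark code; the converter here is plain `Integer.parseInt`):

```java
import java.util.Arrays;

public class PartialResultDemo {
    public static void main(String[] args) {
        String[] tokens = {"1", "oops", "3"};
        Object[] row = new Object[tokens.length];
        RuntimeException firstError = null;

        // Keep converting the remaining fields even after one of them fails,
        // so the partial result contains every value that parsed successfully.
        for (int i = 0; i < tokens.length; i++) {
            try {
                row[i] = Integer.parseInt(tokens[i]);
            } catch (NumberFormatException e) {
                if (firstError == null) {
                    firstError = e; // remember only the first failure
                }
            }
        }
        System.out.println(Arrays.toString(row)); // [1, null, 3]
        System.out.println(firstError != null);   // true
    }
}
```

Breaking out of the loop on the first exception would instead leave `row[2]` as `null`, which is exactly the lost-values problem described above.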
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23132#discussion_r237091555 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide ## Upgrading From Spark SQL 2.4 to 3.0 + - In Spark version 2.4 and earlier, accepted format of decimals parsed from JSON is an optional sign ('+' or '-'), followed by a sequence of zero or more decimal digits, optionally followed by a fraction, optionally followed by an exponent. Any commas were removed from the input before parsing. Since Spark 3.0, format varies and depends on locale which can be set via JSON option `locale`. The default locale is `en-US`. To switch back to previous behavior, set `spark.sql.legacy.decimalParsing.enabled` to `true`. --- End diff -- > Is there any performance regression with DecimalFormat? I haven't benchmarked the changes. Looking at `JSONBenchmarks`, we don't even check decimals there. > Do we need the DecimalFormat when locale is en-US? No, we don't need it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
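As a rough sketch of what locale-aware decimal parsing with `java.text.DecimalFormat` looks like (an illustration only, not the PR's code), note how the same separators mean different things in different locales:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleDecimalDemo {
    public static void main(String[] args) throws ParseException {
        // de-DE uses '.' for grouping and ',' as the decimal separator.
        DecimalFormat de = (DecimalFormat) NumberFormat.getInstance(Locale.GERMANY);
        de.setParseBigDecimal(true); // return BigDecimal instead of Double
        BigDecimal v = (BigDecimal) de.parse("1.000,5");
        System.out.println(v); // 1000.5

        // en-US swaps the roles: ',' groups, '.' separates the fraction.
        DecimalFormat us = (DecimalFormat) NumberFormat.getInstance(Locale.US);
        us.setParseBigDecimal(true);
        System.out.println(us.parse("1,000.5")); // 1000.5
    }
}
```

`setParseBigDecimal(true)` matters here: without it, `parse` returns a `Double` and precision beyond what a double can hold would be lost.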
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23132#discussion_r237091495 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1610,6 +1610,13 @@ object SQLConf { """ "... N more fields" placeholder.""") .intConf .createWithDefault(25) + + val LEGACY_DECIMAL_PARSING_ENABLED = buildConf("spark.sql.legacy.decimalParsing.enabled") +.doc("If it is set to false, it enables parsing decimals in locale specific formats. " + + "To switch back to previous behaviour when parsing was performed by java.math.BigDecimal " + + "and all commas were removed from the input, set the flag to true.") +.booleanConf +.createWithDefault(false) --- End diff -- > Can we have true by default? Yes, sure. > I usually recommend to keep the existing behavior during one next release for safety. Wouldn't it be better to switch to the new implementation in a major release (3.0) rather than in a minor one? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r237085148 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in non bucketed read") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8)) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions === 1) --- End diff -- I think so, `wholetext` makes files not splittable: https://github.com/apache/spark/blob/46110a589f4e91cd7605c5a2c34c3db6b2635830/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L57 This can guarantee ( in text datasources at least) one file -> one partition. > IIUC one partition can read multiple files. Do you mean this code? https://github.com/apache/spark/blob/8c6871828e3eb9fdb3bc665441a1aaf60b86b1e7/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L459-L464 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237065584 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala --- @@ -33,26 +33,21 @@ class FailureSafeParser[IN]( private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord) private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord)) private val resultRow = new GenericInternalRow(schema.length) - private val nullResult = new GenericInternalRow(schema.length) // This function takes 2 parameters: an optional partial result, and the bad record. If the given // schema doesn't contain a field for corrupted record, we just return the partial result or a // row with all fields null. If the given schema contains a field for corrupted record, we will // set the bad record to this field, and set other fields according to the partial result or null. private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = { -if (corruptFieldIndex.isDefined) { - (row, badRecord) => { -var i = 0 -while (i < actualSchema.length) { - val from = actualSchema(i) - resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull - i += 1 -} -resultRow(corruptFieldIndex.get) = badRecord() -resultRow +(row, badRecord) => { --- End diff -- For now, JSON does not support this. Additional changes are needed in `JacksonParser` to return partial results. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 Actually, it needs changes similar to those in https://github.com/apache/spark/pull/23130 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r237050065 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in non bucketed read") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8)) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions === 1) --- End diff -- > does this test fail without your change? Yes, it does due to the `wholetext`. > Is JSON the only data source that may return a row for empty file? We depend on underlying parser here. I will check CSV and Text. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 > seems like a real failure I am looking at it. It seems the test is not deterministic. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23120 @cloud-fan May I ask you to take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22429 Here is a PR that introduces the `maxFields` parameter to all functions involved in the creation of truncated strings of Spark plans: https://github.com/apache/spark/pull/23159 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 ping @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23159: [SPARK-26191][SQL] Control truncation of Spark pl...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23159 [SPARK-26191][SQL] Control truncation of Spark plans via maxFields parameter ## What changes were proposed in this pull request? In the PR, I propose to add a `maxFields` parameter to all functions involved in the creation of textual representations of Spark plans, such as `simpleString` and `verboseString`. The new parameter restricts the number of fields included in truncated strings. Any elements beyond the limit are dropped and replaced by a `"... N more fields"` placeholder. The threshold is bumped up to `Int.MaxValue` for `toFile()`. ## How was this patch tested? Added a test to `QueryExecutionSuite` which checks how `maxFields` impacts the number of truncated fields in `LocalRelation`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 to-file-max-fields Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23159.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23159 commit 756242132d665223738b177af0b838fc953c6a6b Author: Maxim Gekk Date: 2018-11-27T21:34:58Z Added a test commit e9e789311b998d961aa8fcf76463307a410969ea Author: Maxim Gekk Date: 2018-11-27T22:40:45Z Adding maxField parameter --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 @dongjoon-hyun @cloud-fan Please, take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23127 > jenkins retest this please @juliuszsompolski It won't help, you need to fix python tests it seems ``` Failed example: df.explain() Differences (ndiff with -expected +actual): == Physical Plan == - Scan ExistingRDD[age#0,name#1] + *(1) Scan ExistingRDD[age#0,name#1] ? + ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r236719477 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +143,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in load") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions == 1) --- End diff -- It seems the expected value should be on the right. I changed the order and got the following:

```scala
assert(123 === readback.rdd.getNumPartitions)
```

```
123 did not equal 1
ScalaTestFailureLocation: org.apache.spark.sql.sources.SaveLoadSuite at (SaveLoadSuite.scala:155)
Expected :1
Actual :123
```

The current assert triggers the correct message:

```scala
assert(readback.rdd.getNumPartitions == 123)
```

```
1 did not equal 123
ScalaTestFailureLocation: org.apache.spark.sql.sources.SaveLoadSuite at (SaveLoadSuite.scala:155)
Expected :123
Actual :1
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23052#discussion_r236659185 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter( context: TaskAttemptContext, params: CSVOptions) extends OutputWriter with Logging { - private val charset = Charset.forName(params.charset) + private var univocityGenerator: Option[UnivocityGenerator] = None --- End diff -- > ... but that it could create many generators and writers that aren't closed. Writers/generators are created inside tasks: https://github.com/apache/spark/blob/ab1650d2938db4901b8c28df945d6a0691a19d31/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L228-L256 where `dataWriter.commit()` and `dataWriter.abort()` close writers/generators. So, the number of unclosed generators is less than or equal to the size of the task thread pool on executors at any moment. > Unless we know writes will only happen in one thread ... According to the comments below, this is our assumption: https://github.com/apache/spark/blob/e8167768cfebfdb11acd8e0a06fe34ca43c14648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L33-L37 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23120 @HyukjinKwon Please review the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22979 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23052#discussion_r236584201 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter( context: TaskAttemptContext, params: CSVOptions) extends OutputWriter with Logging { - private val charset = Charset.forName(params.charset) + private var univocityGenerator: Option[UnivocityGenerator] = None --- End diff -- We have not observed any race conditions so far. Instances of `UnivocityGenerator` are created per task, as are `OutputStreamWriter`s. They share instances of the schema and `CSVOptions`, but we do not modify them while writing. Inside each `UnivocityGenerator`, we create an instance of `CsvWriter`, but I am almost absolutely sure they do not share anything internally. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21363 Here is the PR: https://github.com/apache/spark/pull/23150 which allows switching to the java.time API (in CSV so far). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23150 [SPARK-26178][SQL] Use java.time API for parsing timestamps and dates from CSV ## What changes were proposed in this pull request? In the PR, I propose to use the **java.time API** for parsing timestamps and dates from CSV content with microseconds precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behaviour of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates. ## How was this patch tested? It was tested by `UnivocityParserSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 time-parser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23150.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23150 commit 74a76c2f78ad139993f3bbe0f2ff8f1c81c3bd84 Author: Maxim Gekk Date: 2018-11-24T11:37:55Z New and legacy time parser commit 63cf6112085029c52e4aee6f9bb2e6b84ce18a96 Author: Maxim Gekk Date: 2018-11-24T12:00:10Z Add config spark.sql.legacy.timeParser.enabled commit 2a2ab83a5ecb251ce81e7f12a8c0d3067f88b2d5 Author: Maxim Gekk Date: 2018-11-24T13:24:07Z Fallback legacy parser commit 667bf9f65a90ac69b8cbbad77a17e21f9dd18733 Author: Maxim Gekk Date: 2018-11-24T15:54:19Z something commit 227a7bdc53bdd022e9c365b410810c58f56e8bea Author: Maxim Gekk Date: 2018-11-25T12:15:15Z Using instances commit 73ee56088bf4d2856c454a7bbd4171b61cfe4614 Author: Maxim Gekk Date: 2018-11-25T13:52:02Z Added generator commit f35f6e13270eb994ac97627da79497673b4fe686 Author: Maxim Gekk Date: 2018-11-25T18:03:17Z Refactoring of TimeFormatter commit 1c09b58e6fe3e0fd565c852dcb73dc012fa56819 Author: Maxim Gekk Date: 2018-11-25T18:06:22Z Renaming to 
DateTimeFormatter commit 7b213d5b2ae404c87f090da622a78d3d19fee6a9 Author: Maxim Gekk Date: 2018-11-25T18:32:54Z Added DateFormatter commit 242ba474dcf112b48bd286811daed86a66366c39 Author: Maxim Gekk Date: 2018-11-25T19:58:08Z Default values in parsing commit db48ee6918eef06e19c3bdf64e3c44f4541cc294 Author: Maxim Gekk Date: 2018-11-25T21:09:08Z Parse as date type because format for timestamp is not not matched to values commit e18841b38050ac411a507a2a2643584f2c8739ce Author: Maxim Gekk Date: 2018-11-25T21:53:11Z Fix tests commit 8db023834b680f336ff5a0e08253ba2cb3b6e3b7 Author: Maxim Gekk Date: 2018-11-25T23:09:20Z CSVSuite passed commit 0b9ed92a456d60db0934340f37e0bd428b2f6a42 Author: Maxim Gekk Date: 2018-11-26T22:00:10Z Fix imports commit 799ebb3432dec7fe1e1099d68a3f1c09e714aa8e Author: Maxim Gekk Date: 2018-11-26T22:03:19Z Revert test back commit 5a223919439e2d22814b92c0e1e572b3c318566f Author: Maxim Gekk Date: 2018-11-26T22:17:11Z Set timeZone commit f287b77d94de9e9f466c0ff2c2370f22a46b48f7 Author: Maxim Gekk Date: 2018-11-26T22:44:42Z Removing default for micros because it causes conflicts in parsing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
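The microseconds-precision parsing this PR enables can be illustrated with plain java.time (a sketch, not the PR's code; `SimpleDateFormat` tops out at millisecond precision, while `DateTimeFormatter` handles a six-digit fraction):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

public class MicrosDemo {
    public static void main(String[] args) {
        // SSSSSS parses a six-digit fraction-of-second, i.e. microseconds.
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSS");
        LocalDateTime t = LocalDateTime.parse("2018-11-24T11:37:55.123456", f);
        System.out.println(t.get(ChronoField.MICRO_OF_SECOND)); // 123456
    }
}
```

Internally the fraction resolves to `NANO_OF_SECOND`, so `MICRO_OF_SECOND` is simply that value divided by 1000.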
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23132 [SPARK-26163][SQL] Parsing decimals from JSON using locale ## What changes were proposed in this pull request? In the PR, I propose using the locale option to parse (and infer) decimals from JSON input. After the changes, `JacksonParser` converts the input string to `BigDecimal` and then to Spark's Decimal by using `java.text.DecimalFormat`. The new behaviour can be switched off via the SQL config `spark.sql.legacy.decimalParsing.enabled`. ## How was this patch tested? Added 2 tests to `JsonExpressionsSuite` for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales: - Inferring decimal type using locale from JSON field values - Converting JSON field values to specified decimal type using the locales. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-decimal-parsing-locale Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23132.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23132 commit 506417ed4b643298560a66c043f7b31beb489da3 Author: Maxim Gekk Date: 2018-11-10T19:49:12Z Test for parsing decimals using locale commit ac25fb6ed1d3d6689ad8841476c025848c87f2a3 Author: Maxim Gekk Date: 2018-11-10T19:51:48Z Parsing decimals using locale commit b784003078270a46aaf8aceb2d86dd9f13f3500c Author: Maxim Gekk Date: 2018-11-10T19:54:00Z Updating the migration guide commit d0522093dfe1823f91100cbec3ef5d6c8a372f27 Author: Maxim Gekk Date: 2018-11-24T19:32:45Z Merge remote-tracking branch 'origin/master' into json-decimal-parsing-locale # Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala commit 722e135cf3b94e22079fd9a0fe1d309a04e76a64 Author: Maxim Gekk Date: 2018-11-24T19:58:39Z Add SQL config spark.sql.legacy.decimalParsing.enabled commit 
dc6c0ac0a97b31441d99ff1dd71608ae5e2eca73 Author: Maxim Gekk Date: 2018-11-24T20:00:31Z Updating the migration guide commit f15b1817fb51d453487665122473855712214692 Author: Maxim Gekk Date: 2018-11-24T20:38:24Z Added a test for parsing commit ab781d54e4f7604d64c72c3c383c549abab0a9a9 Author: Maxim Gekk Date: 2018-11-24T20:52:03Z Fix test commit 163a8b9d7d017409ae4dfa40e492680bf0e4f935 Author: Maxim Gekk Date: 2018-11-24T20:52:26Z Create getDecimalParser commit 8fb65c0db85f4bd2f76d473c5e31e772ff0d4c1d Author: Maxim Gekk Date: 2018-11-24T22:11:42Z Add a test for inferring decimals commit 7e3a2906a96894cadc58771131d07d06ba265382 Author: Maxim Gekk Date: 2018-11-24T22:12:35Z Change JsonSuite to adopt it for JsonInferSchema class commit 83920b25f586dc242841ff0a73105ae9e43295ed Author: Maxim Gekk Date: 2018-11-24T22:13:01Z Inferring decimals from JSON --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23130 @cloud-fan @HyukjinKwon Please, take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r236049216 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { .text(path) val jsonDF = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path) - assert(jsonDF.count() === corruptRecordCount) + assert(jsonDF.count() === corruptRecordCount + 1) // null row for empty file --- End diff -- @cloud-fan @HyukjinKwon Here is a PR https://github.com/apache/spark/pull/23130 which does this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org