[jira] [Created] (SPARK-24269) Infer nullability rather than declaring all columns as nullable

2018-05-14 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24269:
--

 Summary: Infer nullability rather than declaring all columns as 
nullable
 Key: SPARK-24269
 URL: https://issues.apache.org/jira/browse/SPARK-24269
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Currently, the CSV and JSON data sources set the *nullable* flag to true for all
columns during schema inference, independently of the data itself.

JSON: 
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
CSV: 
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51

For example, a source dataset has the schema:
{code}
root
 |-- item_id: integer (nullable = false)
 |-- country: string (nullable = false)
 |-- state: string (nullable = false)
{code}

If we save it and read it back, the schema of the inferred dataset is
{code}
root
 |-- item_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)
{code}
The ticket aims to set the nullable flag more precisely during schema inference,
based on the data actually read.
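
For reference, a minimal Scala sketch of the round trip described above (the path, column values, and the use of CSV are illustrative):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nullability").master("local[*]").getOrCreate()
import spark.implicits._

// A dataset whose columns contain no null values; the path is illustrative.
val src = Seq((1, "NL", "NH"), (2, "US", "CA")).toDF("item_id", "country", "state")
src.write.mode("overwrite").option("header", "true").csv("/tmp/items")

// After the round trip every column is reported as nullable = true,
// even though no nulls were observed while inferring the schema.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/items")
  .printSchema()
{code}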






[jira] [Created] (SPARK-24276) semanticHash() returns different values for semantically the same IS IN

2018-05-15 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24276:
--

 Summary: semanticHash() returns different values for semantically 
the same IS IN
 Key: SPARK-24276
 URL: https://issues.apache.org/jira/browse/SPARK-24276
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


When a plan is canonicalized, any set-based operation, such as IS IN, should
have its expressions ordered, since the order of expressions does not matter for
the evaluation of the operator.

For instance:

{code:scala}
val df = spark.createDataFrame(Seq((1, 2)))
val p1 = df.where('_1.isin(1, 2)).queryExecution.logical.canonicalized
val p2 = df.where('_1.isin(2, 1)).queryExecution.logical.canonicalized
val h1 = p1.semanticHash
val h2 = p2.semanticHash
{code}

{code}
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
p1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Filter '_1 IN (1,2)
+- LocalRelation [_1#0, _2#1]

p2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Filter '_1 IN (2,1)
+- LocalRelation [_1#0, _2#1]

h1: Int = -1384236508
h2: Int = 939549189
{code}
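
Until canonicalization orders IN-lists, a possible user-side workaround (a sketch only, assuming the same df and spark.implicits._ scope as in the snippet above) is to sort the values before building the predicate, so that equivalent filters hash identically:
{code:scala}
def canon(values: Seq[Int]) =
  df.where('_1.isin(values.sorted: _*)).queryExecution.logical.canonicalized

// With the values sorted up front, the two semantically equal filters
// now produce the same semantic hash.
assert(canon(Seq(1, 2)).semanticHash == canon(Seq(2, 1)).semanticHash)
{code}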






[jira] [Created] (SPARK-24244) Parse only required columns of CSV file

2018-05-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24244:
--

 Summary: Parse only required columns of CSV file
 Key: SPARK-24244
 URL: https://issues.apache.org/jira/browse/SPARK-24244
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


The uniVocity parser allows selecting only the required column names or indexes
for parsing, for example:
{code}
// Here we select only the columns by their indexes.
// The parser just skips the values in other columns
parserSettings.selectIndexes(4, 0, 1);
CsvParser parser = new CsvParser(parserSettings);
{code}

*UnivocityParser* needs to be modified to extract only the columns listed in
requiredSchema.
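
A hedged sketch of the idea (names and wiring are illustrative, not the actual Spark change): map each required field to its index in the full data schema and hand those indexes to the parser settings.
{code:scala}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import org.apache.spark.sql.types.StructType

// Illustrative only: let uniVocity skip all columns that are not required.
def parserFor(settings: CsvParserSettings,
              dataSchema: StructType,
              requiredSchema: StructType): CsvParser = {
  val indexes = requiredSchema.map(f => dataSchema.fieldIndex(f.name))
  settings.selectIndexes(indexes.map(i => Integer.valueOf(i)): _*)
  new CsvParser(settings)
}
{code}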






[jira] [Created] (SPARK-24190) lineSep shouldn't be required in JSON write

2018-05-05 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24190:
--

 Summary: lineSep shouldn't be required in JSON write
 Key: SPARK-24190
 URL: https://issues.apache.org/jira/browse/SPARK-24190
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Currently, the JSON data source requires the lineSep option on write if the
encoding differs from UTF-8. For example, the code:
{code:scala}
df.write.option("encoding", "UTF-32BE").json(file)
{code}

throws the exception:
{code}
requirement failed: The lineSep option must be specified for the UTF-32BE 
encoding
java.lang.IllegalArgumentException: requirement failed: The lineSep option must 
be specified for the UTF-32BE encoding
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.catalyst.json.JSONOptions$$anonfun$32.apply(JSONOptions.scala:118)
at 
org.apache.spark.sql.catalyst.json.JSONOptions$$anonfun$32.apply(JSONOptions.scala:103)
at scala.Option.map(Option.scala:146)
{code}

The restriction should NOT be applied to writing.
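
Until that is fixed, a workaround (a sketch, reusing df and file from the example above) is to pass the lineSep explicitly together with the non-UTF-8 encoding:
{code:scala}
// Workaround sketch: specifying lineSep explicitly satisfies the current check.
df.write
  .option("encoding", "UTF-32BE")
  .option("lineSep", "\n")
  .json(file)
{code}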






[jira] [Created] (SPARK-24325) Tests for Hadoop's LinesReader

2018-05-20 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24325:
--

 Summary: Tests for Hadoop's LinesReader
 Key: SPARK-24325
 URL: https://issues.apache.org/jira/browse/SPARK-24325
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Currently, there are no tests for [Hadoop
LinesReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42].
Before refactoring or rewriting the class, tests need to be added that cover its
basic functionality, such as (a sketch of one such test follows below):

* Split boundaries that slice lines
* A split that slices a delimiter (user-specified or default)
* No duplicate lines when splits slice delimiters or lines
* Constant limits such as the maximum line length
* The case when the internal buffer size is less than the line size
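
A hedged sketch of one such test (constructor signatures are taken from Spark 2.3 and may differ; the file contents and the split point are illustrative): read a file through two adjacent splits whose boundary falls inside a line and check that every line is produced exactly once.
{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{HadoopFileLinesReader, PartitionedFile}

val path = Files.createTempFile("lines", ".txt")
Files.write(path, "line1\nline2\nline3\n".getBytes(StandardCharsets.UTF_8))
val len = Files.size(path)
val conf = new Configuration()

def readRange(start: Long, length: Long): Seq[String] = {
  val file = PartitionedFile(InternalRow.empty, path.toUri.toString, start, length)
  new HadoopFileLinesReader(file, conf).map(_.toString).toList
}

// The boundary at offset 8 falls in the middle of "line2": the first range reads
// one line past its end and the second range skips its partial first line,
// so together they must yield every line exactly once, with no duplicates.
assert(readRange(0, 8) ++ readRange(8, len - 8) == Seq("line1", "line2", "line3"))
{code}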






[jira] [Updated] (SPARK-24325) Tests for Hadoop's LinesReader

2018-05-20 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24325:
---
Description: 
Currently, there are no tests for [Hadoop
LinesReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42].
Before refactoring or rewriting the class, tests need to be added that cover its
basic functionality, such as:
 * Split boundaries that slice lines
 * A split that slices a delimiter (user-specified or default)
 * No duplicate lines when splits slice delimiters or lines
 * Constant limits such as the maximum line length
 * The case when the internal buffer size is less than the line size

  was:
Currently, there are no tests for [Hadoop 
LineReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42].
 For refactoring or rewriting of the class, need to add tests that cover basic 
functionality of the class like:

* Split's boundaries slice lines
* A split slices delimiters - user's specified or defaults
* No duplicates if splits slice delimiters or lines
* Checking constant limits like maximum line length
* Handling a case when internal buffers size is less than line size 


> Tests for Hadoop's LinesReader
> --
>
> Key: SPARK-24325
> URL: https://issues.apache.org/jira/browse/SPARK-24325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, there are no tests for [Hadoop 
> LinesReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42].
>  For refactoring or rewriting of the class, need to add tests that cover 
> basic functionality of the class like:
>  * Split's boundaries slice lines
>  * A split slices delimiters - user's specified or defaults
>  * No duplicates if splits slice delimiters or lines
>  * Checking constant limits like maximum line length
>  * Handling a case when internal buffers size is less than line size






[jira] [Reopened] (SPARK-24244) Parse only required columns of CSV file

2018-05-23 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk reopened SPARK-24244:


The previous PR was reverted due to a flaky UnivocityParserSuite.

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> uniVocity parser allows to specify only required column names or indexes for 
> parsing like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from 
> requiredSchema






[jira] [Updated] (SPARK-24366) Improve error message for Catalyst type converters

2018-05-23 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24366:
---
Summary: Improve error message for Catalyst type converters  (was: Improve 
error message for type converting)

> Improve error message for Catalyst type converters
> --
>
> Key: SPARK-24366
> URL: https://issues.apache.org/jira/browse/SPARK-24366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> User have no way to drill down to understand which of the hundreds of fields 
> in millions records feeding into the job are causing the problem. We should 
> to show where in the schema the error is happening.
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: 
> start (of class java.lang.String)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> 

[jira] [Updated] (SPARK-24366) Improve error message for type converting

2018-05-23 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24366:
---
Summary: Improve error message for type converting  (was: Improve error 
message for type conversions)

> Improve error message for type converting
> -
>
> Key: SPARK-24366
> URL: https://issues.apache.org/jira/browse/SPARK-24366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> User have no way to drill down to understand which of the hundreds of fields 
> in millions records feeding into the job are causing the problem. We should 
> to show where in the schema the error is happening.
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
> stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: 
> start (of class java.lang.String)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
>   at 
> 

[jira] [Created] (SPARK-24366) Improve error message for type conversions

2018-05-23 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24366:
--

 Summary: Improve error message for type conversions
 Key: SPARK-24366
 URL: https://issues.apache.org/jira/browse/SPARK-24366
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0, 1.6.3
Reporter: Maxim Gekk


Users have no way to drill down and understand which of the hundreds of fields
in the millions of records feeding into the job is causing the problem. We
should show where in the schema the error is happening.
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 4 in stage 344.0 failed 4 times, most recent failure: Lost task 4.3 in 
stage 344.0 (TID 2673, ip-10-31-237-248.ec2.internal): scala.MatchError: start 
(of class java.lang.String)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter$$anonfun$toCatalystImpl$1.apply(CatalystTypeConverters.scala:161)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:161)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 

[jira] [Resolved] (SPARK-15125) CSV data source recognizes empty quoted strings in the input as null.

2018-05-25 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-15125.

   Resolution: Fixed
Fix Version/s: 2.4.0

The issue has been fixed by 
https://github.com/apache/spark/commit/7a2d4895c75d4c232c377876b61c05a083eab3c8

> CSV data source recognizes empty quoted strings in the input as null. 
> --
>
> Key: SPARK-15125
> URL: https://issues.apache.org/jira/browse/SPARK-15125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Suresh Thalamati
>Priority: Major
> Fix For: 2.4.0
>
>
> The CSV data source does not differentiate between empty quoted strings and
> empty fields; both are read as null. In some scenarios users would want to
> differentiate between these values, especially in the context of SQL, where
> NULL and the empty string have different meanings. If the input data happens
> to be a dump from a traditional relational data source, users will see
> different results for their SQL queries.
> {code}
> Repro:
> Test Data: (test.csv)
> year,make,model,comment,price
> 2017,Tesla,Mode 3,looks nice.,35000.99
> 2016,Chevy,Bolt,"",29000.00
> 2015,Porsche,"",,
> scala> val df= sqlContext.read.format("csv").option("header", 
> "true").option("inferSchema", "true").option("nullValue", 
> null).load("/tmp/test.csv")
> df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more 
> fields]
> scala> df.show
> ++---+--+---++
> |year|   make| model|comment|   price|
> ++---+--+---++
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|   null| 29000.0|
> |2015|Porsche|  null|   null|null|
> ++---+--+---++
> Expected:
> ++---+--+---++
> |year|   make| model|comment|   price|
> ++---+--+---++
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|   | 29000.0|
> |2015|Porsche|  |   null|null|
> ++---+--+---++
> {code}
> Testing a fix for this issue. I will give a shot at submitting a PR for
> this soon.






[jira] [Resolved] (SPARK-24004) Tests of from_json for MapType

2018-05-25 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-24004.

Resolution: Won't Fix

> Tests of from_json for MapType
> --
>
> Key: SPARK-24004
> URL: https://issues.apache.org/jira/browse/SPARK-24004
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> There are no tests for *from_json* that check *MapType* as a value type of
> struct fields. MapType should be supported as a non-root type according to the
> current implementation of JacksonParser, but this functionality is not checked.






[jira] [Created] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-21 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24329:
--

 Summary: Remove comments filtering before parsing of CSV files
 Key: SPARK-24329
 URL: https://issues.apache.org/jira/browse/SPARK-24329
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Comment and whitespace filtering is already performed by the uniVocity parser,
according to the parser settings:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180

It is not necessary to do the same before parsing. All places where the
filterCommentAndEmpty method is called need to be inspected, and the call
removed where it duplicates the filtering done by the uniVocity parser.
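
For reference, a minimal sketch of the relevant uniVocity settings (illustrative only; the settings Spark actually applies live in CSVOptions at the link above):
{code:scala}
import com.univocity.parsers.csv.CsvParserSettings

// With these settings the parser itself drops comment lines and empty lines,
// so pre-filtering the input before parsing is redundant.
val settings = new CsvParserSettings()
settings.getFormat.setComment('#')  // lines starting with '#' are treated as comments
settings.setSkipEmptyLines(true)    // blank lines are dropped by the parser
{code}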






[jira] [Created] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24571:
--

 Summary: Support literals with values of the Char type
 Key: SPARK-24571
 URL: https://issues.apache.org/jira/browse/SPARK-24571
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently, Spark doesn't support literals with the Char (java.lang.Character) 
type. For example, the following code throws an exception:
{code:scala}
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
{code}
It fails with the exception:
{code}
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character o
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
{code}
One possible solution is automatic conversion of a Char literal to a String
literal of length 1.
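
Until that is supported, a user-side workaround (a sketch, not the proposed fix) is to do the conversion manually:
{code:scala}
// Workaround sketch: convert the Char to a one-character String before
// building the expression, so that a supported String literal is created.
df.where($"city".contains('o'.toString)).show(false)
{code}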






[jira] [Commented] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514665#comment-16514665
 ] 

Maxim Gekk commented on SPARK-24571:


I am working on the improvement.

> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character) 
> type. For example, the following code throws an exception:
> {code:scala}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class 
> java.lang.Character p
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to 
> String literal of length 1.






[jira] [Updated] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24571:
---
Description: 
Currently, Spark doesn't support literals with the Char (java.lang.Character) 
type. For example, the following code throws an exception:
{code}
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
{code}
It fails with the exception:
{code:java}
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character o
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
{code}
One possible solution is automatic conversion of a Char literal to a String
literal of length 1.

  was:
Currently, Spark doesn't support literals with the Char (java.lang.Character) 
type. For example, the following code throws an exception:
{code:scala}
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
{code}
It fails with the exception:
{code}
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character p
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
{code}
One of the possible solutions can be automatic conversion of Char literal to 
String literal of length 1.


> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character) 
> type. For example, the following code throws an exception:
> {code}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code:java}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class 
> java.lang.Character o
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to 
> String literal of length 1.






[jira] [Created] (SPARK-24591) Number of cores and executors in the cluster

2018-06-18 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24591:
--

 Summary: Number of cores and executors in the cluster
 Key: SPARK-24591
 URL: https://issues.apache.org/jira/browse/SPARK-24591
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Two new methods need to be added. The first one should return the total number
of CPU cores across all executors in the cluster. The second one should give the
current number of executors registered in the cluster.

The main motivations for adding these methods:

1. It is best practice to manage job parallelism relative to the available
cores, e.g., df.repartition(5 * sc.coreCount). In particular, it is an
anti-pattern to leave a bunch of cores on large clusters twiddling their thumbs
and doing nothing. Usually users pass predefined constants to _repartition()_
and _coalesce()_, chosen based on the current cluster size. If the code runs on
another cluster and/or on a resized cluster, they have to modify the constant
each time. This happens frequently when a job that normally runs on, say, an
hour of data on a small cluster needs to run on a week of data on a much larger
cluster.

2. *spark.default.parallelism* can be used to get the total number of cores in
the cluster, but it can be redefined by the user. The info can also be obtained
by registering a listener, but repeating that everywhere looks ugly. We should
follow the DRY principle.

3. Regarding executorsCount(): some jobs, e.g., local-node ML training, use a
lot of parallelism. It is common practice to distribute such jobs so that there
is one partition per executor.
 
4. In some places users collect this info, together with other settings and job
timings (at the app level), for analysis. E.g., you can use ML to determine the
optimal cluster size given different objectives, such as fastest throughput vs.
lowest cost per unit of processing.

5. The simpler argument is that basic cluster properties should be easily 
discoverable via APIs.
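
Until such methods exist, a rough workaround (a sketch with the caveats noted above; both values are approximations and df stands for an arbitrary DataFrame) is:
{code:scala}
// Approximate executor count: block managers registered with the driver,
// minus one for the driver itself.
val executorCount = sc.getExecutorMemoryStatus.size - 1

// Approximate total core count: defaultParallelism usually equals the total
// number of cores, but only if the user has not overridden it.
val totalCores = sc.defaultParallelism

df.repartition(5 * totalCores)
{code}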






[jira] [Created] (SPARK-24543) Support any DataType as DDL string for from_json's schema

2018-06-13 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24543:
--

 Summary: Support any DataType as DDL string for from_json's schema
 Key: SPARK-24543
 URL: https://issues.apache.org/jira/browse/SPARK-24543
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently, the schema for from_json can be specified as a DataType or as a
string in the following formats:
* in SQL, as a sequence of fields like _INT a, STRING b
* in Scala, Python, etc., in JSON format or as in SQL

The ticket aims to support an arbitrary DataType as a DDL string for from_json.
For example:
{code:sql}
select from_json('{"a":1, "b":2}', 'map<string, int>')
{code}






[jira] [Commented] (SPARK-24543) Support any DataType as DDL string for from_json's schema

2018-06-13 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510753#comment-16510753
 ] 

Maxim Gekk commented on SPARK-24543:


I am working on the feature at the moment.

> Support any DataType as DDL string for from_json's schema
> -
>
> Key: SPARK-24543
> URL: https://issues.apache.org/jira/browse/SPARK-24543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, schema for from_json can be specified as DataType or a string in 
> the following formats:
> * in SQL, as sequence of fields like _INT a, STRING b
> * in Scala, Python and etc, in JSON format or as in SQL 
> The ticket aims to support arbitrary DataType as DDL string for from_json. 
> For example:
> {code:sql}
> select from_json('{"a":1, "b":2}', 'map')
> {code}






[jira] [Commented] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-06-12 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509667#comment-16509667
 ] 

Maxim Gekk commented on SPARK-24005:


[~smilegator] I am trying to reproduce the issue but have had no luck so far.
The following test passes successfully:
{code:scala}
  test("canceling of parallel collections") {
val conf = new SparkConf()
sc = new SparkContext("local[1]", "par col", conf)

val f = sc.parallelize(0 to 1, 1).map { i =>
  val par = (1 to 100).par
  val pool = ThreadUtils.newForkJoinPool("test pool", 2)
  par.tasksupport = new ForkJoinTaskSupport(pool)
  try {
par.flatMap { j =>
  Thread.sleep(1000)
  1 to 100
}.seq
  } finally {
pool.shutdown()
  }
}.takeAsync(100)

val sem = new Semaphore(0)
sc.addSparkListener(new SparkListener {
  override def onTaskStart(taskStart: SparkListenerTaskStart) {
sem.release()
  }
})

// Wait until some tasks were launched before we cancel the job.
sem.acquire()
// Wait until a task executes parallel collection.
Thread.sleep(1)
f.cancel()

val e = intercept[SparkException] { f.get() }.getCause
assert(e.getMessage.contains("cancelled") || 
e.getMessage.contains("killed"))
  }
{code}

> Remove usage of Scala’s parallel collection
> ---
>
> Key: SPARK-24005
> URL: https://issues.apache.org/jira/browse/SPARK-24005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
> val par = (1 to 100).par.flatMap { i =>
>   Thread.sleep(1000)
>   1 to 1000
> }.toSeq
> {noformat}
> We are unable to interrupt the execution of parallel collections. We need to 
> create a common utility function to do it, instead of using Scala parallel 
> collections






[jira] [Resolved] (SPARK-14034) Converting to Dataset causes wrong order and values in nested array of documents

2018-05-27 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-14034.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Converting to Dataset causes wrong order and values in nested array of 
> documents
> 
>
> Key: SPARK-14034
> URL: https://issues.apache.org/jira/browse/SPARK-14034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Steven She
>Priority: Major
> Fix For: 2.3.0
>
>
> I'm deserializing the following JSON document into a Dataset with Spark 1.6.1 
> in the console:
> {noformat}
> {"arr": [{"c": 1, "b": 2, "a": 3}]}
> {noformat}
> I have the following case classes:
> {noformat}
> case class X(arr: Seq[Y])
> case class Y(c: Int, b: Int, a: Int)
> {noformat}
> I run the following in the console to retrieve the value of `c` in the array, 
> which should have a value of 1 in the data file, but I get the value 3 
> instead:
> {noformat}
> scala> sqlContext.read.json("../test.json").as[X].collect().head.arr.head.c
> res19: Int = 3
> {noformat}






[jira] [Commented] (SPARK-24445) Schema in json format for from_json in SQL

2018-05-31 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497075#comment-16497075
 ] 

Maxim Gekk commented on SPARK-24445:


I am working on the ticket at the moment.

> Schema in json format for from_json in SQL
> --
>
> Key: SPARK-24445
> URL: https://issues.apache.org/jira/browse/SPARK-24445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> In Spark 2.3, schema for the from_json function can be specified in JSON 
> format in Scala and Python but in SQL. In SQL it is impossible to specify map 
> type for example because SQL DDL parser can handle struct type only. Need to 
> support schemas in JSON format as it has been already implemented 
> [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
> {code:scala}
> val dataType = try {
>   DataType.fromJson(schema)
> } catch {
>   case NonFatal(_) => StructType.fromDDL(schema)
> }
> {code}






[jira] [Created] (SPARK-24445) Schema in json format for from_json in SQL

2018-05-31 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24445:
--

 Summary: Schema in json format for from_json in SQL
 Key: SPARK-24445
 URL: https://issues.apache.org/jira/browse/SPARK-24445
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


In Spark 2.3, the schema for the from_json function can be specified in JSON
format in Scala and Python, but not in SQL. In SQL it is impossible to specify a
map type, for example, because the SQL DDL parser can handle struct types only.
Schemas in JSON format need to be supported as well, as has already been
implemented
[there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
{code:scala}
val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}
{code}







[jira] [Commented] (SPARK-14034) Converting to Dataset causes wrong order and values in nested array of documents

2018-05-27 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492078#comment-16492078
 ] 

Maxim Gekk commented on SPARK-14034:


I checked on Spark 2.3:
{code:scala}
case class Y(c: Long, b: Long, a: Long)
case class X(arr: Seq[Y])
spark.read.json("test.json").as[X].collect().head.arr.head.c
{code}
{code}
res0: Long = 1
{code}
Changing the order of parameters in class Y doesn't impact the result. It seems
the issue doesn't exist any more.

> Converting to Dataset causes wrong order and values in nested array of 
> documents
> 
>
> Key: SPARK-14034
> URL: https://issues.apache.org/jira/browse/SPARK-14034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Steven She
>Priority: Major
>
> I'm deserializing the following JSON document into a Dataset with Spark 1.6.1 
> in the console:
> {noformat}
> {"arr": [{"c": 1, "b": 2, "a": 3}]}
> {noformat}
> I have the following case classes:
> {noformat}
> case class X(arr: Seq[Y])
> case class Y(c: Int, b: Int, a: Int)
> {noformat}
> I run the following in the console to retrieve the value of `c` in the array, 
> which should have a value of 1 in the data file, but I get the value 3 
> instead:
> {noformat}
> scala> sqlContext.read.json("../test.json").as[X].collect().head.arr.head.c
> res19: Int = 3
> {noformat}






[jira] [Commented] (SPARK-23725) Improve Hadoop's LineReader to support charsets different from UTF-8

2018-06-30 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528691#comment-16528691
 ] 

Maxim Gekk commented on SPARK-23725:


[~hyukjin.kwon] I am working on the implementation and have run into the
problem that I cannot identify the lineSep uniquely if the encoding is not
specified. For example, if a partitioned file contains:
{code}
65  00 31 00 0a 00 6c 00 69
{code}
I cannot say for certain what the lineSep is here. It could be *0x0a 0x00* if
the encoding is UTF-16LE, as in:
{code}
  6c 00 69 00 6e 00 65 00  31 00 0a 00 6c 00 69 00  |l.i.n.e.1...l.i.|
0010  6e 00 65 00 32 00 |n.e.2.|
0016
{code}

or *0x00 0x0a* if the encoding is UTF-16BE, as in:
{code}
  00 6c 00 69 00 6e 00 65  00 31 00 0a 00 6c 00 69  |.l.i.n.e.1...l.i|
0010  00 6e 00 65 00 32 |.n.e.2|
0016
{code}

So, to detect the lineSep automatically, we should require the encoding to be
specified.
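
To illustrate why, the byte sequence of the line delimiter depends on the charset (a small self-contained sketch):
{code:scala}
import java.nio.charset.Charset

// '\n' encoded in different charsets yields different byte sequences, so the
// delimiter cannot be recognized without knowing the encoding.
Seq("UTF-8", "UTF-16LE", "UTF-16BE").foreach { name =>
  val bytes = "\n".getBytes(Charset.forName(name))
  println(s"$name: " + bytes.map(b => f"0x$b%02x").mkString(" "))
}
// UTF-8: 0x0a
// UTF-16LE: 0x0a 0x00
// UTF-16BE: 0x00 0x0a
{code}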

> Improve Hadoop's LineReader to support charsets different from UTF-8
> 
>
> Key: SPARK-23725
> URL: https://issues.apache.org/jira/browse/SPARK-23725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If the record delimiter is not specified, Hadoop LineReader splits 
> lines/records by '\n', '\r' or/and '\r\n' in UTF-8 encoding: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L173-L177]
>  . The implementation should be improved to support any charset.






[jira] [Resolved] (SPARK-24642) Add a function which infers schema from a JSON column

2018-07-01 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-24642.

Resolution: Won't Fix

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add new aggregate function - *infer_schema()*. The function should 
> infer schema for set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer schema programmatically in Scala/Python and 
> pass it as the second argument but in SQL it is not possible. An user has to 
> pass schema as string literal in SQL. The new function should allow to use it 
> in SQL like in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}






[jira] [Created] (SPARK-24709) Inferring schema from JSON string literal

2018-07-01 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24709:
--

 Summary: Inferring schema from JSON string literal
 Key: SPARK-24709
 URL: https://issues.apache.org/jira/browse/SPARK-24709
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


A new function - *schema_of_json()* - needs to be added. The function should
infer the schema of a JSON string literal. The result of the function is a
schema in DDL format.

One of the use cases is passing the output of _schema_of_json()_ to
*from_json()*. Currently, the _from_json()_ function requires a schema as a
mandatory argument. A user has to pass the schema as a string literal in SQL.
The new function should allow inferring the schema from an example. Let's say
json_col is a column containing JSON strings that all share the same schema. It
should be possible to pass one JSON string with that schema to
_schema_of_json()_, which infers the schema from that particular example.

{code:sql}
select from_json(json_col, schema_of_json('{"f1": 0, "f2": [0], "f2": "a"}'))
from json_table;
{code}
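
For completeness, a hedged sketch of the equivalent usage from Scala (the Scala-side signatures are assumptions about how the proposed function might be exposed, not part of this ticket; jsonTable stands for a DataFrame over json_table):
{code:scala}
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

// Assumed Scala API: schema_of_json takes an example JSON literal and returns
// a column holding the inferred schema, which from_json can then consume.
jsonTable.select(
  from_json(jsonTable("json_col"), schema_of_json(lit("""{"f1": 0, "f2": [0], "f2": "a"}""")))
)
{code}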






[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column

2018-07-01 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529030#comment-16529030
 ] 

Maxim Gekk commented on SPARK-24642:


I created a new ticket, SPARK-24709, which aims to add a simpler function.

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add new aggregate function - *infer_schema()*. The function should 
> infer schema for set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer schema programmatically in Scala/Python and 
> pass it as the second argument but in SQL it is not possible. An user has to 
> pass schema as string literal in SQL. The new function should allow to use it 
> in SQL like in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}






[jira] [Comment Edited] (SPARK-24642) Add a function which infers schema from a JSON column

2018-07-01 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529030#comment-16529030
 ] 

Maxim Gekk edited comment on SPARK-24642 at 7/1/18 10:05 AM:
-

[~rxin] I created a new ticket, SPARK-24709, which aims to add a simpler
function. Here is the PR for the ticket: https://github.com/apache/spark/pull/21686


was (Author: maxgekk):
I created new ticket SPARK-24709 which aims to add simpler function.

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add new aggregate function - *infer_schema()*. The function should 
> infer schema for set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer schema programmatically in Scala/Python and 
> pass it as the second argument but in SQL it is not possible. An user has to 
> pass schema as string literal in SQL. The new function should allow to use it 
> in SQL like in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}






[jira] [Resolved] (SPARK-24643) from_json should accept an aggregate function as schema

2018-06-29 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-24643.

Resolution: Won't Fix

> from_json should accept an aggregate function as schema
> ---
>
> Key: SPARK-24643
> URL: https://issues.apache.org/jira/browse/SPARK-24643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the *from_json()* function accepts only string literals as schema:
>  - Checking of schema argument inside of JsonToStructs: 
> [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L530]
>  - Accepting only string literal: 
> [https://github.com/apache/spark/blob/b8f27ae3b34134a01998b77db4b7935e7f82a4fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L749-L752]
> JsonToStructs should be modified to accept results of aggregate functions 
> like *infer_schema* (see SPARK-24642). It should be possible to write SQL 
> like:
> {code:sql}
> select from_json(json_col, infer_schema(json_col)) from json_table
> {code}
> Here is a test case with existing aggregate function - *first()*:
> {code:sql}
> create temporary view schemas(schema) as select * from values
>   ('struct'),
>   ('map');
> select from_json('{"a":1}', first(schema)) from schemas;
> {code}






[jira] [Resolved] (SPARK-24445) Schema in json format for from_json in SQL

2018-06-24 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-24445.

Resolution: Won't Fix

> Schema in json format for from_json in SQL
> --
>
> Key: SPARK-24445
> URL: https://issues.apache.org/jira/browse/SPARK-24445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> In Spark 2.3, schema for the from_json function can be specified in JSON 
> format in Scala and Python but in SQL. In SQL it is impossible to specify map 
> type for example because SQL DDL parser can handle struct type only. Need to 
> support schemas in JSON format as it has been already implemented 
> [there|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3225-L3229]:
> {code:scala}
> val dataType = try {
>   DataType.fromJson(schema)
> } catch {
>   case NonFatal(_) => StructType.fromDDL(schema)
> }
> {code}
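For illustration, a minimal Scala sketch of the gap (the map-type JSON string below is an assumed example, not taken from the ticket):
{code:scala}
import org.apache.spark.sql.types._

// A MapType can be expressed as a JSON schema string and parsed programmatically,
// while the DDL path used from SQL handles struct fields only.
val jsonSchema =
  """{"type":"map","keyType":"string","valueType":"integer","valueContainsNull":true}"""
val mapType = DataType.fromJson(jsonSchema)   // works in Scala/Python
// StructType.fromDDL cannot express a bare map type, which is why SQL users are stuck.
{code}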



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9775) Query Mesos for number of CPUs to set default parallelism

2018-06-25 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522064#comment-16522064
 ] 

Maxim Gekk commented on SPARK-9775:
---

Please change the other related methods as proposed in the PR: 
https://github.com/apache/spark/pull/21589

> Query Mesos for number of CPUs to set default parallelism
> -
>
> Key: SPARK-9775
> URL: https://issues.apache.org/jira/browse/SPARK-9775
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.4.1
>Reporter: Peder Ås
>Priority: Minor
>
> As highlighted in a TODO on line 400 of MesosSchedulerBackend.scala (at least 
> on 3ca995b7) we should query the Mesos master and set the default parallelism 
> based on the number of CPUs available in the cluster (and multiply by two or 
> three?)
> See code in question [here 
> (gitweb)|https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala;h=3f63ec1c;hb=HEAD#l400].
> This task should also update the documentation [here 
> (gitweb)|https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=docs/configuration.md;h=c60dd1;hb=HEAD#l789]
>  to highlight the fact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24642) Add a function which infers schema from a JSON column

2018-06-27 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524802#comment-16524802
 ] 

Maxim Gekk commented on SPARK-24642:


> Do we want this as an aggregate function?

I thought of something similar to the inferSchema flag of the CSV datasource, 
which triggers a separate job to infer the schema, but for JSON files.

> I'm thinking it's better to just take a string and infers the schema on the 
> string.

In general that looks much cheaper than scanning the full input with an aggregate 
function, but we have the opportunity to minimize the number of rows touched by the 
aggregate function via sampling, or by using just the first few rows in each partition.

And what happens if some json strings are not complete like:
{code}
{"a": 1}
{"b": [1,2,3]}
{"a": 3, "b": [10, 11, 12]}
{code} 
in that case each parsed json string will have a different inferred schema, 
right? Which schema should we assign to the parsed json column?

> How would the query you provide compile if it is an aggregate function?

I am going to assign the from_json name to the FromJson case class, and write 
the following rule to trigger a job that replaces the aggregate with a string literal, 
as in the code snippet (thank you [~hvanhovell] for the code):
{code}
case class FromJson(child: Expression) extends Expression {
  ...
}

class SchemaInferringRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    plan transform {
      case node =>
        node.transformExpressions {
          case FromJson(e) =>
            // Kick off inference
            val query = new QueryExecution(
              session,
              Project(Seq(Alias(InferSchema(e), "schema")()), node))
            val Array(row) = query.executedPlan.executeCollect()
            val schema = Literal(row.getUTF8String(0), StringType)
            new JsonToStructs(e, schema)
        }
    }
  }
}
{code}

> Add a function which infers schema from a JSON column
> -
>
> Key: SPARK-24642
> URL: https://issues.apache.org/jira/browse/SPARK-24642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add a new aggregate function - *infer_schema()*. The function should 
> infer a schema for a set of JSON strings. The result of the function is a schema 
> in DDL format (or JSON format).
> One of the use cases is passing the output of *infer_schema()* to *from_json()*. 
> Currently, the from_json() function requires a schema as a mandatory 
> argument. It is possible to infer the schema programmatically in Scala/Python and 
> pass it as the second argument, but in SQL it is not possible. A user has to 
> pass the schema as a string literal in SQL. The new function should allow using it 
> in SQL as in the example:
> {code:sql}
> select from_json(json_col, infer_schema(json_col))
> from json_table;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24164) Support column list as the pivot column in Pivot

2018-07-02 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530448#comment-16530448
 ] 

Maxim Gekk commented on SPARK-24164:


[~maryannxue] Are you working on the feature, or do you plan to work on it?

> Support column list as the pivot column in Pivot
> 
>
> Key: SPARK-24164
> URL: https://issues.apache.org/jira/browse/SPARK-24164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
>
> This is part of a functionality extension to Pivot SQL support as SPARK-24035.
> Currently, we only support a single column as the pivot column, while a 
> column list as the pivot column would look like:
> {code:java}
> SELECT * FROM (
>   SELECT year, course, earnings FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR (course, year) IN (('dotNET', 2012), ('Java', 2013))
> );{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24722) Column-based API for pivoting

2018-07-02 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24722:
--

 Summary: Column-based API for pivoting
 Key: SPARK-24722
 URL: https://issues.apache.org/jira/browse/SPARK-24722
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the pivot() function accepts the pivot column as a string. It is not 
consistent with the groupBy API and causes an additional problem when using nested 
columns as the pivot column.

`Column` support is needed for (a) API consistency, (b) user productivity and 
(c) performance. In general, we should follow the POLA - 
https://en.wikipedia.org/wiki/Principle_of_least_astonishment - in designing 
the API.
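A minimal sketch of how a Column-based overload could look (the exact signature is an assumption of this example, and the trainingSales data below is just a tiny stand-in for the dataset used in related tickets):
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

// A tiny stand-in for the trainingSales dataset used elsewhere in these tickets.
val trainingSales = Seq(
  ("Experts", 2012, "dotNET", 10000.0),
  ("Dummies", 2012, "Java",    5000.0)
).toDF("training", "year", "course", "earnings")
  .select($"training", struct($"year", $"course", $"earnings").as("sales"))

// Hypothetical Column-based overload: the nested column is passed directly,
// instead of being encoded as a string column name.
trainingSales
  .groupBy($"sales.year")
  .pivot($"sales.course")          // Column argument instead of a String (assumed signature)
  .agg(sum($"sales.earnings"))
{code}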



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24118) Support lineSep format independent from encoding

2018-04-29 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24118:
--

 Summary: Support lineSep format independent from encoding
 Key: SPARK-24118
 URL: https://issues.apache.org/jira/browse/SPARK-24118
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Currently, the lineSep option of the JSON datasource depends on the encoding. It is 
impossible to define a correct lineSep for JSON files with a BOM in the UTF-16 and 
UTF-32 encodings, for example. Need to propose a format of lineSep which will 
represent a sequence of octets (bytes) and will be independent of the encoding.
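A sketch of the current limitation (the path and options below are illustrative; the byte-level lineSep format itself is what still needs to be proposed):
{code:scala}
// Today lineSep is a string interpreted in the file's encoding, so for UTF-16/UTF-32
// files with a BOM it is unclear which octet sequence the separator maps to.
spark.read
  .option("encoding", "UTF-16")
  .option("lineSep", "\n")   // ambiguous at the byte level for multi-byte encodings
  .json("path/to/utf16.json")
{code}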



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24171) Update comments for non-deterministic functions

2018-05-03 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24171:
--

 Summary: Update comments for non-deterministic functions
 Key: SPARK-24171
 URL: https://issues.apache.org/jira/browse/SPARK-24171
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


The descriptions of non-deterministic functions like _collect_list()_ and 
_first()_ don't mention their non-determinism. Need to add a notice about it 
so the behavior is visible in user-facing docs.
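For example (a tiny sketch; the dataset is illustrative), the value picked by first() depends on the row order within each group, which is not guaranteed after a shuffle:
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
// first() is non-deterministic: which "value" is seen first in a group depends on
// partitioning and task scheduling, so repeated runs may return different results.
df.groupBy($"key").agg(first($"value"))
{code}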



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-13 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23410:
--

 Summary: Unable to read jsons in charset different from UTF-8
 Key: SPARK-23410
 URL: https://issues.apache.org/jira/browse/SPARK-23410
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Currently the Json Parser is forced to read json files in UTF-8. Such behavior 
breaks backward compatibility with Spark 2.2.1 and previous versions, which can 
read json files in UTF-16, UTF-32 and other encodings thanks to the auto-detection 
mechanism of the jackson library. Need to give users back the possibility to read 
json files in a specified charset and/or to detect the charset automatically as before.
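A sketch of the requested behavior (the option name "charset" is an assumption of this example, not a settled API):
{code:scala}
import org.apache.spark.sql.types._

val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
// Explicitly telling the reader which charset the files use, instead of forcing UTF-8.
spark.read
  .option("charset", "UTF-16")   // assumed option name
  .schema(schema)
  .json("utf16WithBOM.json")
{code}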



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-13 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23410:
---
Shepherd: Herman van Hovell

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364849#comment-16364849
 ] 

Maxim Gekk commented on SPARK-23410:


[~bersprockets] does your json contain BOM in the first 2 bytes?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364849#comment-16364849
 ] 

Maxim Gekk edited comment on SPARK-23410 at 2/14/18 10:20 PM:
--

[~bersprockets] does your json contain BOM in the first 2 bytes? By using the 
BOM, jackson detects encoding: 
https://github.com/FasterXML/jackson-core/blob/2.6/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L110-L173


was (Author: maxgekk):
[~bersprockets] does your json contain BOM in the first 2 bytes?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23410:
---
Attachment: utf16WithBOM.json

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364889#comment-16364889
 ] 

Maxim Gekk commented on SPARK-23410:


I am working on a fix, just in case

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-14 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364875#comment-16364875
 ] 

Maxim Gekk commented on SPARK-23410:


I attached the file on which I tested on 2.2.1:

{code:scala}
import org.apache.spark.sql.types._
val schema = new StructType().add("firstName", StringType).add("lastName", 
StringType)
spark.read.schema(schema).json("utf16WithBOM.json").show
{code}

{code}
+---------+--------+
|firstName|lastName|
+---------+--------+
|    Chris|   Baird|
|     null|    null|
|     Doug|    Rood|
|     null|    null|
|     null|    null|
+---------+--------+
{code}

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-15 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366547#comment-16366547
 ] 

Maxim Gekk commented on SPARK-23410:


[~sameerag] It is not a blocker anymore. I unset the blocker flag.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-15 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23410:
---
Priority: Major  (was: Blocker)

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. Need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24777) Refactor AVRO read/write benchmark

2018-07-28 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560896#comment-16560896
 ] 

Maxim Gekk commented on SPARK-24777:


[~Gengliang.Wang] Which benchmarks are you going to add? Just in case, I can 
gather typical use cases from our users.

> Refactor AVRO read/write benchmark
> --
>
> Key: SPARK-24777
> URL: https://issues.apache.org/jira/browse/SPARK-24777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema

2018-07-28 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24959:
--

 Summary: Do not invoke the CSV/JSON parser for empty schema
 Key: SPARK-24959
 URL: https://issues.apache.org/jira/browse/SPARK-24959
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently the JSON and CSV parsers are called even if the required schema is empty. 
Invoking the parser for each line has some non-zero overhead, and it can be 
skipped. Such an optimization should speed up count(), for example.
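A small sketch of the case this targets (the path is illustrative): for a query like count() the required schema is empty, so per-line parsing produces no column values and only adds overhead:
{code:scala}
// Only the row count is needed, no columns, so invoking the CSV parser
// for every line can be skipped entirely.
spark.read
  .option("header", "true")
  .csv("path/to/data.csv")
  .count()
{code}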



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24945) Switch to unoVocity 2.7.2

2018-07-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24945:
--

 Summary: Switch to unoVocity 2.7.2
 Key: SPARK-24945
 URL: https://issues.apache.org/jira/browse/SPARK-24945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


The recent version 2.7.2 of uniVocity parser includes the fix: 
https://github.com/uniVocity/univocity-parsers/issues/250 . We don't need the 
fix https://github.com/apache/spark/pull/21631 anymore



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24945) Switch to uniVocity 2.7.2

2018-07-27 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24945:
---
Summary: Switch to uniVocity 2.7.2  (was: Switch to unoVocity 2.7.2)

> Switch to uniVocity 2.7.2
> -
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The recent version 2.7.2 of uniVocity parser includes the fix: 
> https://github.com/uniVocity/univocity-parsers/issues/250 . We don't need the 
> fix https://github.com/apache/spark/pull/21631 anymore



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24945) Switch to uniVocity >= 2.7.2

2018-08-02 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24945:
---
Summary: Switch to uniVocity >= 2.7.2  (was: Switch to uniVocity 2.7.2)

> Switch to uniVocity >= 2.7.2
> 
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The recent version 2.7.2 of uniVocity parser includes the fix: 
> https://github.com/uniVocity/univocity-parsers/issues/250 . We don't need the 
> fix https://github.com/apache/spark/pull/21631 anymore



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24945) Switch to uniVocity >= 2.7.2

2018-08-02 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24945:
---
Description: The recent version 2.7.2 of uniVocity parser includes the fix: 
https://github.com/uniVocity/univocity-parsers/issues/250 . And the recent 
version has better performance.  (was: The recent version 2.7.2 of uniVocity 
parser includes the fix: 
https://github.com/uniVocity/univocity-parsers/issues/250 . We don't need the 
fix https://github.com/apache/spark/pull/21631 anymore)

> Switch to uniVocity >= 2.7.2
> 
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The recent version 2.7.2 of uniVocity parser includes the fix: 
> https://github.com/uniVocity/univocity-parsers/issues/250 . And the recent 
> version has better performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24952) Support LZMA2 compression by Avro datasource

2018-07-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24952:
--

 Summary: Support LZMA2 compression by Avro datasource
 Key: SPARK-24952
 URL: https://issues.apache.org/jira/browse/SPARK-24952
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


LZMA2 (XZ) has a much better compression ratio compared to the currently 
supported snappy and deflate codecs. The underlying Avro library already supports 
this compression codec. Need to set parameters for the codec and allow users to 
specify "xz" compression via AvroOptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25048) Pivoting by multiple columns

2018-08-07 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25048:
--

 Summary: Pivoting by multiple columns
 Key: SPARK-25048
 URL: https://issues.apache.org/jira/browse/SPARK-25048
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Need to change or extend the existing API to make pivoting by multiple columns 
possible. Users should be able to use multiple columns and values as in the 
example:
{code:scala}
trainingSales
  .groupBy($"sales.year")
  .pivot(struct(lower($"sales.course"), $"training"), Seq(
struct(lit("dotnet"), lit("Experts")),
struct(lit("java"), lit("Dummies")))
  ).agg(sum($"sales.earnings"))
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25048) Pivoting by multiple columns in Scala/Java

2018-08-07 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-25048:
---
Summary: Pivoting by multiple columns in Scala/Java  (was: Pivoting by 
multiple columns)

> Pivoting by multiple columns in Scala/Java
> --
>
> Key: SPARK-25048
> URL: https://issues.apache.org/jira/browse/SPARK-25048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to change or extend the existing API to make pivoting by multiple columns 
> possible. Users should be able to use multiple columns and values as in the 
> example:
> {code:scala}
> trainingSales
>   .groupBy($"sales.year")
>   .pivot(struct(lower($"sales.course"), $"training"), Seq(
> struct(lit("dotnet"), lit("Experts")),
> struct(lit("java"), lit("Dummies")))
>   ).agg(sum($"sales.earnings"))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-22 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589335#comment-16589335
 ] 

Maxim Gekk commented on SPARK-25195:


> Problem number 1: The from_json function accepts as a schema only StructType 
> or ArrayType(StructType), but not an ArrayType of primitives.

This was fixed recently: https://github.com/apache/spark/pull/21439

> Extending from_json function
> 
>
> Key: SPARK-25195
> URL: https://issues.apache.org/jira/browse/SPARK-25195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
>   Dear Spark and PySpark maintainers,
>   I hope, that opening a JIRA issue is the correct way to request an 
> improvement. If it's not, please forgive me and kindly instruct me on how to 
> do it instead.
>   At our company, we are currently rewriting a lot of old MapReduce code with 
> SPARK, and the following use-case is quite frequent: Some string-valued 
> dataframe columns are JSON-arrays, and we want to parse them into array-typed 
> columns.
>   Problem number 1: The from_json function accepts as a schema only 
> StructType or ArrayType(StructType), but not an ArrayType of primitives. 
> Submitting the schema in a string form like 
> {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}
>  does not work either, the error message says, among other things, 
> {noformat}data type mismatch: Input schema array must be a struct or 
> an array of structs.{noformat}
>   Problem number 2: Sometimes, in our JSON arrays we have elements of 
> different types. For example, we might have some JSON array like 
> {noformat}["string_value", 0, true, null]{noformat} which is JSON-valid with 
> schema 
> {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat}
>  (and, for instance the Python json.loads function has no problem parsing 
> this), but such a schema is not recognized, at all. The error message gets 
> quite unreadable after the words {noformat}ParseException: u'\nmismatched 
> input{noformat}
>   Here is some simple Python code to reproduce the problems (using pyspark 
> 2.3.1 and pandas 0.23.4):
>   {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
> spark = SparkSession.builder.appName('test').getOrCreate()
> data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", 
> false, null]', '["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> ArrayType(StringType( # Does not work, because not a struct of array 
> of structs
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> 
> '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
>  # Does not work at all
>   {noformat}
>   For now, we have to use a UDF function, which calls python's json.loads, 
> but this is, for obvious reasons, suboptimal. If you could extend the 
> functionality of the Spark from_json function in the next release, this would 
> be really helpful. Thank you in advance!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS

2018-08-25 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-25240:
---
Summary: A deadlock in ALTER TABLE RECOVER PARTITIONS  (was: Dead-lock in 
ALTER TABLE RECOVER PARTITIONS)

> A deadlock in ALTER TABLE RECOVER PARTITIONS
> 
>
> Key: SPARK-25240
> URL: https://issues.apache.org/jira/browse/SPARK-25240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Recover Partitions in ALTER TABLE is performed in a recursive way by calling 
> the scanPartitions() method. scanPartitions() lists files sequentially or in 
> parallel if the 
> [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685]
>  is true:
> {code:scala}
> partitionNames.length > 1 && statuses.length > threshold || 
> partitionNames.length > 2
> {code}
> Parallel listing is executed on [the fixed thread 
> pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622]
>  which can have 8 threads in total. A deadlock occurs when all 8 threads have 
> been already occupied and a recursive call of scanPartitions() submits new 
> parallel file listing tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25240) Dead-lock in ALTER TABLE RECOVER PARTITIONS

2018-08-25 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25240:
--

 Summary: Dead-lock in ALTER TABLE RECOVER PARTITIONS
 Key: SPARK-25240
 URL: https://issues.apache.org/jira/browse/SPARK-25240
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Recover Partitions in ALTER TABLE is performed in a recursive way by calling the 
scanPartitions() method. scanPartitions() lists files sequentially or in 
parallel if the 
[condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685]
 is true:
{code:scala}
partitionNames.length > 1 && statuses.length > threshold || 
partitionNames.length > 2
{code}
Parallel listing is executed on [the fixed thread 
pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622]
 which can have 8 threads in total. A deadlock occurs when all 8 threads have been 
already occupied and a recursive call of scanPartitions() submits new parallel 
file listing tasks.
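A minimal standalone sketch of the deadlock pattern (not Spark's actual code): each level blocks on child tasks submitted to the same fixed-size pool, so once all 8 threads are blocked waiting, no thread is left to run the children:
{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

def scan(depth: Int): Seq[String] =
  if (depth == 0) {
    Seq("leaf")
  } else {
    // Submit children to the same 8-thread pool and block until they finish.
    val children = (1 to 8).map(_ => Future(scan(depth - 1)))
    Await.result(Future.sequence(children), Duration.Inf).flatten
  }

scan(2)  // enough nesting to fill the pool with parents blocked in Await
{code}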



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty

2018-08-25 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592704#comment-16592704
 ] 

Maxim Gekk commented on SPARK-25199:


I wasn't able to reproduce the issue on the current master:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Python version 2.7.15 (default, Aug 22 2018 16:36:18)
>>> df = spark.read.format("csv").option("header", 
>>> "true").option("inferSchema", "true").load("tmp/csv/*.csv")
>>> df.printSchema()
root
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)
{code}
for two csv files, one of which is empty:
{code:java}
tree -h ./csv
./csv
├── [   8]  1.csv
└── [   0]  2.csv
{code}

> InferSchema "all Strings" if one of many CSVs is empty
> --
>
> Key: SPARK-25199
> URL: https://issues.apache.org/jira/browse/SPARK-25199
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
> Environment: I discovered this on AWS Glue, which uses Spark 2.2.1
>Reporter: Neil McGuigan
>Priority: Minor
>  Labels: newbie
>
> Spark can load multiple CSV files in one read:
> df = spark.read.format("csv").option("header", "true").option("inferSchema", 
> "true").load("/*.csv")
> However, if one of these files is empty (though it has a header), Spark will 
> set all column types to "String"
> Spark should skip a file for inference if it contains no (non-header) rows



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty

2018-08-25 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-25199.

Resolution: Cannot Reproduce

> InferSchema "all Strings" if one of many CSVs is empty
> --
>
> Key: SPARK-25199
> URL: https://issues.apache.org/jira/browse/SPARK-25199
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
> Environment: I discovered this on AWS Glue, which uses Spark 2.2.1
>Reporter: Neil McGuigan
>Priority: Minor
>  Labels: newbie
>
> Spark can load multiple CSV files in one read:
> df = spark.read.format("csv").option("header", "true").option("inferSchema", 
> "true").load("/*.csv")
> However, if one of these files is empty (though it has a header), Spark will 
> set all column types to "String"
> Spark should skip a file for inference if it contains no (non-header) rows



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25243) Use FailureSafeParser in from_json

2018-08-26 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25243:
--

 Summary: Use FailureSafeParser in from_json
 Key: SPARK-25243
 URL: https://issues.apache.org/jira/browse/SPARK-25243
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


The 
[FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
 is used in parsing JSON and CSV files and datasets of strings. It supports the 
[PERMISSIVE, DROPMALFORMED and 
FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
 modes. The ticket aims to make the from_json function compatible with regular 
parsing via FailureSafeParser and to support the above modes.
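A sketch of the intended usage (from_json already accepts an options map; honoring the parse mode there is what this ticket adds, so the "mode" key below shows the requested behavior rather than the current one):
{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq("""{"a":1}""", "not json").toDF("json")
val schema = new StructType().add("a", IntegerType)
// With FailureSafeParser underneath, the malformed record would follow the chosen mode
// (here FAILFAST) instead of silently turning into a null row.
df.select(from_json($"json", schema, Map("mode" -> "FAILFAST")).as("parsed"))
{code}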



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-08-17 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584357#comment-16584357
 ] 

Maxim Gekk commented on SPARK-17916:


> The default behavior in 2.3.x for csv format is that when i write out null 
> value, it comes back in as null. when i write out empty string, it also comes 
> back in as null.

[~koert] Please, have a look at the added test: 
[https://github.com/apache/spark/pull/21273/files#diff-219ac8201e443435499123f96e94d29fR1355]
 . It checks exactly what you described. If you have something different, 
please, leave the code here.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25227) Extend functionality of to_json to support arrays of differently-typed elements

2018-08-27 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593368#comment-16593368
 ] 

Maxim Gekk commented on SPARK-25227:


> I don't know about to_json. Maybe Maxim Gekk can comment more on that.
Here is the PR for that: https://github.com/apache/spark/pull/6 . Please, 
review it.

> Extend functionality of to_json to support arrays of differently-typed 
> elements
> ---
>
> Key: SPARK-25227
> URL: https://issues.apache.org/jira/browse/SPARK-25227
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'to_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for different-typed element 
> arrays in the 'to_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25252) Support arrays of any types in to_json

2018-08-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25252:
--

 Summary: Support arrays of any types in to_json
 Key: SPARK-25252
 URL: https://issues.apache.org/jira/browse/SPARK-25252
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Need to improve the to_json function and make it more consistent with from_json 
by supporting arrays of any type (as root types). For now, it supports only 
arrays of structs and arrays of maps. After the changes the following code 
should work:
{code:scala}
select to_json(array('1','2','3'))
> ["1","2","3"]
select to_json(array(array(1,2,3),array(4)))
> [[1,2,3],[4]]
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-23 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589850#comment-16589850
 ] 

Maxim Gekk commented on SPARK-25195:


> 1. Does this patch also solve problem 2, as described above?
No, it doesn't.

> 2. Do you know when it will be released?
It should be in the upcoming release 2.4.

> Extending from_json function
> 
>
> Key: SPARK-25195
> URL: https://issues.apache.org/jira/browse/SPARK-25195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
>   Dear Spark and PySpark maintainers,
>   I hope, that opening a JIRA issue is the correct way to request an 
> improvement. If it's not, please forgive me and kindly instruct me on how to 
> do it instead.
>   At our company, we are currently rewriting a lot of old MapReduce code with 
> SPARK, and the following use-case is quite frequent: Some string-valued 
> dataframe columns are JSON-arrays, and we want to parse them into array-typed 
> columns.
>   Problem number 1: The from_json function accepts as a schema only 
> StructType or ArrayType(StructType), but not an ArrayType of primitives. 
> Submitting the schema in a string form like 
> {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}
>  does not work either, the error message says, among other things, 
> {noformat}data type mismatch: Input schema array must be a struct or 
> an array of structs.{noformat}
>   Problem number 2: Sometimes, in our JSON arrays we have elements of 
> different types. For example, we might have some JSON array like 
> {noformat}["string_value", 0, true, null]{noformat} which is JSON-valid with 
> schema 
> {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat}
>  (and, for instance the Python json.loads function has no problem parsing 
> this), but such a schema is not recognized, at all. The error message gets 
> quite unreadable after the words {noformat}ParseException: u'\nmismatched 
> input{noformat}
>   Here is some simple Python code to reproduce the problems (using pyspark 
> 2.3.1 and pandas 0.23.4):
>   {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
> spark = SparkSession.builder.appName('test').getOrCreate()
> data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", 
> false, null]', '["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> ArrayType(StringType( # Does not work, because not a struct of array 
> of structs
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> 
> '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
>  # Does not work at all
>   {noformat}
>   For now, we have to use a UDF function, which calls python's json.loads, 
> but this is, for obvious reasons, suboptimal. If you could extend the 
> functionality of the Spark from_json function in the next release, this would 
> be really helpful. Thank you in advance!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-23 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590140#comment-16590140
 ] 

Maxim Gekk commented on SPARK-25195:


This is the ticket which combines both from_json/to_json: 
https://issues.apache.org/jira/browse/SPARK-24391 . It was closed with the PR 
[https://github.com/apache/spark/pull/21439]. It would be nice to have a 
separate ticket specifically for to_json.

> Extending from_json function
> 
>
> Key: SPARK-25195
> URL: https://issues.apache.org/jira/browse/SPARK-25195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
>   Dear Spark and PySpark maintainers,
>   I hope, that opening a JIRA issue is the correct way to request an 
> improvement. If it's not, please forgive me and kindly instruct me on how to 
> do it instead.
>   At our company, we are currently rewriting a lot of old MapReduce code with 
> SPARK, and the following use-case is quite frequent: Some string-valued 
> dataframe columns are JSON-arrays, and we want to parse them into array-typed 
> columns.
>   Problem number 1: The from_json function accepts as a schema only 
> StructType or ArrayType(StructType), but not an ArrayType of primitives. 
> Submitting the schema in a string form like 
> {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}
>  does not work either, the error message says, among other things, 
> {noformat}data type mismatch: Input schema array must be a struct or 
> an array of structs.{noformat}
>   Problem number 2: Sometimes, in our JSON arrays we have elements of 
> different types. For example, we might have some JSON array like 
> {noformat}["string_value", 0, true, null]{noformat} which is JSON-valid with 
> schema 
> {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat}
>  (and, for instance the Python json.loads function has no problem parsing 
> this), but such a schema is not recognized, at all. The error message gets 
> quite unreadable after the words {noformat}ParseException: u'\nmismatched 
> input{noformat}
>   Here is some simple Python code to reproduce the problems (using pyspark 
> 2.3.1 and pandas 0.23.4):
>   {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
> spark = SparkSession.builder.appName('test').getOrCreate()
> data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", 
> false, null]', '["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> ArrayType(StringType( # Does not work, because not a struct of array 
> of structs
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> 
> '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
>  # Does not work at all
>   {noformat}
>   For now, we have to use a UDF function, which calls python's json.loads, 
> but this is, for obvious reasons, suboptimal. If you could extend the 
> functionality of the Spark from_json function in the next release, this would 
> be really helpful. Thank you in advance!
> ==
> UPDATE: By the way, apparently the to_json function has the same problems: it 
> cannot convert an array-typed column to a JSON-string. It would be nice for 
> it to support arrays, as well. And, speaking of problem 2, an array column of 
> different types cannot be even created in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24854) Gather all options into AvroOptions

2018-07-18 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24854:
--

 Summary: Gather all options into AvroOptions
 Key: SPARK-24854
 URL: https://issues.apache.org/jira/browse/SPARK-24854
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Need to gather all Avro options into a class, as in other datasources 
(JSONOptions and CSVOptions). The map inside the class should be case 
insensitive.
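A minimal sketch of the requested shape, modeled on JSONOptions/CSVOptions (the class and field names here are assumptions, not the final implementation):
{code:scala}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// All Avro options live in one place and are looked up case-insensitively.
class AvroOptions(@transient val parameters: CaseInsensitiveMap[String]) extends Serializable {

  def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters))

  // Example option: "ignoreExtension", "IGNOREEXTENSION" and "ignoreextension" all match.
  val ignoreExtension: Boolean =
    parameters.get("ignoreExtension").map(_.toBoolean).getOrElse(true)
}
{code}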



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24836) New option - ignoreExtension

2018-07-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24836:
--

 Summary: New option - ignoreExtension
 Key: SPARK-24836
 URL: https://issues.apache.org/jira/browse/SPARK-24836
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Need to add a new option for the Avro datasource - *ignoreExtension*. It should 
control ignoring of the .avro extension. If it is set to *true* (the default), 
files both with and without the .avro extension should be loaded. Example of usage:
{code:scala}
spark
  .read
  .option("ignoreExtension", false)
  .avro("path to avro files")
{code}

The option duplicates Hadoop's config 
avro.mapred.ignore.inputs.without.extension which is taken into account by Avro 
datasource now and can be set like:
{code:scala}
spark
  .sqlContext
  .sparkContext
  .hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "true")
{code}

The ignoreExtension option must override 
avro.mapred.ignore.inputs.without.extension.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24849) Convert StructType to DDL string

2018-07-18 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547876#comment-16547876
 ] 

Maxim Gekk commented on SPARK-24849:


I am working on the ticket.

> Convert StructType to DDL string
> 
>
> Key: SPARK-24849
> URL: https://issues.apache.org/jira/browse/SPARK-24849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add new methods which should convert a value of StructType to a 
> schema in DDL format. It should be possible to use the resulting string in new 
> table creation by just copy-pasting the new method's result. The existing 
> methods simpleString(), catalogString() and sql() put ':' between a top-level 
> field name and its type, and wrap the result in the *struct* word:
> {code}
> ds.schema.catalogString
> struct
> {code}
> The output of the new method should be
> {code}
> metaData struct
> {code}
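A small sketch of the expected contrast (the method name toDDL and the exact rendering are assumptions of this example):
{code:scala}
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("metaData", new StructType().add("eventId", StringType))

schema.catalogString   // struct<metaData:struct<eventId:string>>
schema.toDDL           // assumed new method; expected output like: metaData STRUCT<eventId: STRING>
{code}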



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24849) Convert StructType to DDL string

2018-07-18 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24849:
--

 Summary: Convert StructType to DDL string
 Key: SPARK-24849
 URL: https://issues.apache.org/jira/browse/SPARK-24849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Need to add new methods which should convert a value of StructType to a schema 
in DDL format. It should be possible to use the resulting string in new table 
creation by just copy-pasting the new method's result. The existing methods 
simpleString(), catalogString() and sql() put ':' between a top-level field name 
and its type, and wrap the result in the *struct* word:

{code}
ds.schema.catalogString
struct

[jira] [Created] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-15 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24810:
--

 Summary: Fix paths to resource files in AvroSuite
 Key: SPARK-24810
 URL: https://issues.apache.org/jira/browse/SPARK-24810
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently, paths to test files from the resource folder are relative in AvroSuite. 
It causes problems such as the impossibility of running the tests from an IDE. Need 
to wrap test files with:
{code:scala}
def testFile(fileName: String): String = {
  Thread.currentThread().getContextClassLoader.getResource(fileName).toString
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-15 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24810:
---
Attachment: Screen Shot 2018-07-15 at 15.28.13.png

> Fix paths to resource files in AvroSuite
> 
>
> Key: SPARK-24810
> URL: https://issues.apache.org/jira/browse/SPARK-24810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: Screen Shot 2018-07-15 at 15.28.13.png
>
>
> Currently, paths to test files from the resource folder are relative in 
> AvroSuite. This causes problems, such as the impossibility of running the 
> tests from an IDE. We need to wrap test files with:
> {code:scala}
> def testFile(fileName: String): String = {
>   Thread.currentThread().getContextClassLoader.getResource(fileName).toString
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-07-24 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24911:
--

 Summary: SHOW CREATE TABLE drops escaping of nested column names
 Key: SPARK-24911
 URL: https://issues.apache.org/jira/browse/SPARK-24911
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Create a table with a quoted nested column - *`b`*:
{code:sql}
create table `test` (`a` STRUCT<`b`:STRING>);
{code}
and show how the table was created:
{code:sql}
SHOW CREATE TABLE `test`
{code}
{code}
CREATE TABLE `test`(`a` struct<b:string>)
{code}

The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24881) New options - compression and compressionLevel

2018-07-22 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24881:
--

 Summary: New options - compression and compressionLevel
 Key: SPARK-24881
 URL: https://issues.apache.org/jira/browse/SPARK-24881
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently the Avro datasource takes the compression codec name from a SQL config 
(the config key is hard coded in AvroFileFormat): 
https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125
 . The obvious con of this is that modifying the global config can impact 
multiple writes.

The purpose of the ticket is to add a new Avro option - "compression" - the same 
as we already have for other datasources like JSON, CSV, etc. If the new option is 
not set by the user, we take the setting from the SQL config 
spark.sql.avro.compression.codec. If that is not set either, the default 
compression codec will be snappy (this is the current behavior in master).

Besides the compression option, we need to add another option - 
compressionLevel - which should reflect another SQL config in Avro: 
https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122
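
A sketch of the intended user-facing API (option names follow the ticket text; 
this is a proposal, not the final implementation):
{code:scala}
// Per-write options instead of the global SQL config
// spark.sql.avro.compression.codec and the hard-coded level.
df.write
  .format("avro")
  .option("compression", "deflate")
  .option("compressionLevel", "5")
  .save("/path/to/output")
{code}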



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24849) Convert StructType to DDL string

2018-07-19 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548878#comment-16548878
 ] 

Maxim Gekk commented on SPARK-24849:


[~maropu] This is part of my work on a customer's issue. There are multiple 
folders of Avro files with pretty wide and nested schemas. I need to 
programmatically create tables on top of each folder. To do that, I read a file 
in a folder via the Scala API, take its schema, convert it to a DDL string (here 
I need the changes) and put the string into SQL CREATE TABLE.
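
A rough sketch of that workflow, assuming the new method is exposed on StructType 
under a name like toDDL (the name is illustrative, not decided by this ticket):
{code:scala}
// Read one Avro file per folder to get the schema, render it as a DDL string
// and reuse it in CREATE TABLE. toDDL stands for the proposed method.
val schema = spark.read.format("avro").load("/data/folder1/part-0.avro").schema
val ddl = schema.toDDL
spark.sql(s"CREATE TABLE folder1_table ($ddl) USING avro LOCATION '/data/folder1'")
{code}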

> Convert StructType to DDL string
> 
>
> Key: SPARK-24849
> URL: https://issues.apache.org/jira/browse/SPARK-24849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to add new methods that convert a value of StructType to a 
> schema in DDL format. It should be possible to use the resulting string in new 
> table creation by simply copy-pasting the method's output. The existing 
> methods simpleString(), catalogString() and sql() put ':' between a top-level 
> field name and its type, and wrap the result in the *struct* keyword
> {code}
> ds.schema.catalogString
> struct {code}
> Output of new method should be
> {code}
> metaData struct {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24807) Adding files/jars twice: output a warning and add a note

2018-07-14 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24807:
--

 Summary: Adding files/jars twice: output a warning and add a note
 Key: SPARK-24807
 URL: https://issues.apache.org/jira/browse/SPARK-24807
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Maxim Gekk


In the current version of Spark (2.3.x), a file/jar can be added only once. 
Subsequent additions of the same path are silently ignored. This behavior is not 
properly documented: 
https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.SparkContext

This confuses users and support teams in our company. The ticket aims to 
output a warning which clearly states that a second addition of the same 
path is not supported at the moment.
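
For illustration, a minimal snippet of the behavior in question (paths are made 
up; the warning text is left to the implementation):
{code:scala}
sc.addFile("/tmp/lookup.txt") // first addition registers the file
sc.addFile("/tmp/lookup.txt") // currently a silent no-op; the ticket proposes a warning here
{code}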



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24805) Don't ignore files without .avro extension by default

2018-07-14 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24805:
--

 Summary: Don't ignore files without .avro extension by default
 Key: SPARK-24805
 URL: https://issues.apache.org/jira/browse/SPARK-24805
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently to read files without .avro extension, users have to set the flag 
*avro.mapred.ignore.inputs.without.extension* to *false* (by default it is 
*true*). The ticket aims to change the default value to *false*. The reasons to 
do that are:

- Other systems can create avro files without extensions. When users try to 
read such files, they silently get only partial results. This behaviour may 
confuse users.

- The current behavior differs from other supported datasources such as CSV 
and JSON. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25286) Remove dangerous parmap

2018-08-30 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25286:
--

 Summary: Remove dangerous parmap
 Key: SPARK-25286
 URL: https://issues.apache.org/jira/browse/SPARK-25286
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Maxim Gekk


One of the parmap methods accepts an execution context created outside of parmap. 
If the parmap method is called recursively on a thread pool of limited size, it 
can lead to deadlocks. See the JIRA tickets SPARK-25240 and SPARK-25283. To 
eliminate such problems in the future, we need to remove the parmap() with the 
signature:
{code:scala}
def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]]
  (in: Col[I])
  (f: I => O)
  (implicit
cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map
cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]], // for Future.sequence
ec: ExecutionContext
  ): Col[O]
{code}
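
For context, a minimal sketch (not Spark's ThreadUtils code) of the safer shape: 
a parmap that owns its thread pool cannot deadlock on recursive calls, because 
inner calls never block the threads of an outer, size-limited pool.
{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Each invocation creates and tears down its own pool, so a recursive call
// blocks on fresh threads instead of the caller's exhausted pool.
def parmapOwnPool[I, O](in: Seq[I], numThreads: Int)(f: I => O): Seq[O] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    Await.result(Future.sequence(in.map(i => Future(f(i)))), Duration.Inf)
  } finally {
    pool.shutdownNow()
  }
}
{code}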



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25283) A deadlock in UnionRDD

2018-08-31 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-25283.

   Resolution: Fixed
Fix Version/s: 2.4.0

It is fixed by the PR: https://github.com/apache/spark/pull/22292

> A deadlock in UnionRDD
> --
>
> Key: SPARK-25283
> URL: https://issues.apache.org/jira/browse/SPARK-25283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> The PR https://github.com/apache/spark/pull/21913 replaced Scala parallel 
> collections in UnionRDD with the new parmap function. This change causes a 
> deadlock in the partitions method. The following code demonstrates the problem:
> {code:scala}
> val wide = 20
> def unionRDD(num: Int): UnionRDD[Int] = {
>   val rdds = (0 until num).map(_ => sc.parallelize(1 to 10, 1))
>   new UnionRDD(sc, rdds)
> }
> val level0 = (0 until wide).map { _ =>
>   val level1 = (0 until wide).map(_ => unionRDD(wide))
>   new UnionRDD(sc, level1)
> }
> val rdd = new UnionRDD(sc, level0)
> rdd.partitions.length
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25384) Removing spark.sql.fromJsonForceNullableSchema

2018-09-08 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25384:
--

 Summary: Removing spark.sql.fromJsonForceNullableSchema
 Key: SPARK-25384
 URL: https://issues.apache.org/jira/browse/SPARK-25384
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Disabling the spark.sql.fromJsonForceNullableSchema flag is error prone. We 
should not allow users to do that since it can lead to corrupted output. The 
flag should also be removed for the sake of simplicity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25381) Stratified sampling by Column argument

2018-09-08 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25381:
--

 Summary: Stratified sampling by Column argument
 Key: SPARK-25381
 URL: https://issues.apache.org/jira/browse/SPARK-25381
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently the sampleBy method accepts a first argument of string type only. 
We need to provide an overloaded method which also accepts the Column type; this 
will allow sampling by multiple columns, for example:
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct
import spark.implicits._

val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8),
  ("Bob", 17), ("Alice", 10))).toDF("name", "age")
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
{code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25387) Malformed CSV causes NPE

2018-09-09 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25387:
--

 Summary: Malformed CSV causes NPE
 Key: SPARK-25387
 URL: https://issues.apache.org/jira/browse/SPARK-25387
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Loading a malformed CSV file or dataset can cause a NullPointerException, for 
example the code:
{code:scala}
import org.apache.spark.sql.types._
import spark.implicits._

val schema = StructType(StructField("a", IntegerType) :: Nil)
val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
spark.read.schema(schema).csv(input).collect()
{code} 
crashes with the exception:
{code:java}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
at 
org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
at 
org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
{code}

If the schema is not specified, the following exception is thrown:
{code:java}
java.lang.NullPointerException
at 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
at 
scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
at 
scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
at 
org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
at 
org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25393) Parsing CSV strings in a column

2018-09-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25393:
--

 Summary: Parsing CSV strings in a column
 Key: SPARK-25393
 URL: https://issues.apache.org/jira/browse/SPARK-25393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


There are use cases when content in CSV format is dumped into external 
storage as one of the columns. For example, CSV records are stored together with 
other meta-info in Kafka. The current Spark API doesn't allow parsing such columns 
directly. The existing method 
[csv()|https://github.com/apache/spark/blob/e754887182304ad0d622754e33192ebcdd515965/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L487]
 requires a dataset with a single string column. The API is inconvenient for parsing 
a CSV column in a dataset with many columns. The ticket aims to add a new function 
similar to 
[from_json()|https://github.com/apache/spark/blob/d749d034a80f528932f613ac97f13cfb99acd207/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3456]
 with the following signature in Scala:
{code:scala}
def from_csv(e: Column, schema: StructType, options: Map[String, String]): 
Column
{code}
and, for use from Python, R and Java:
{code:scala}
def from_csv(e: Column, schema: String, options: java.util.Map[String, 
String]): Column
{code} 
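
A sketch of the intended usage of the proposed function (from_csv does not exist 
yet; the call below just mirrors the first signature above):
{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val df = Seq(("k1", "123,United")).toDF("key", "payload")
val schema = new StructType().add("code", IntegerType).add("channel", StringType)
// Parse the CSV payload column in place, keeping the other columns untouched.
df.select(col("key"), from_csv(col("payload"), schema, Map.empty[String, String])).show()
{code}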



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-25396:
---
Description: 
If a JSON file has a structure like below:
{code}
[
  {
 "time":"2018-08-13T18:00:44.086Z",
 "resourceId":"some-text",
 "category":"A",
 "level":2,
 "operationName":"Error",
 "properties":{...}
 },
{
 "time":"2018-08-14T18:00:44.086Z",
 "resourceId":"some-text2",
 "category":"B",
 "level":3,
 "properties":{...}
 },
  ...
]
{code}
it should be read in `multiLine` mode. In this mode, Spark reads the whole array 
into memory in both cases, whether the schema is `ArrayType` or `StructType`. This 
can lead to unnecessary memory consumption and even to OOM for big JSON files.

In general, there is no need to materialize all parsed JSON records in memory 
there: 
https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
 . So, the JSON objects of an array can be read via an Iterator. 

  was:
If a JSON file has a structure like below:
{code}
[
  {
 "time":"2018-08-13T18:00:44.086Z",
 "resourceId":"some-text",
 "category":"A",
 "level":2,
 "operationName":"Error",
 "properties":{...}
 },
{
 "time":"2018-08-14T18:00:44.086Z",
 "resourceId":"some-text2",
 "category":"B",
 "level":3,
 "properties":{...}
 },
]
{code}
it should be read in `multiLine` mode. In this mode, Spark reads the whole array 
into memory in both cases, whether the schema is `ArrayType` or `StructType`. This 
can lead to unnecessary memory consumption and even to OOM for big JSON files.

In general, there is no need to materialize all parsed JSON records in memory 
there: 
https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
 . So, the JSON objects of an array can be read via an Iterator. 


> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
>   ...
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory in both cases, whether the schema is `ArrayType` or `StructType`. 
> This can lead to unnecessary memory consumption and even to OOM for big JSON 
> files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25396:
--

 Summary: Read array of JSON objects via an Iterator
 Key: SPARK-25396
 URL: https://issues.apache.org/jira/browse/SPARK-25396
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


If a JSON file has a structure like below:
{code}
[
  {
 "time":"2018-08-13T18:00:44.086Z",
 "resourceId":"some-text",
 "category":"A",
 "level":2,
 "operationName":"Error",
 "properties":{...}
 },
{
 "time":"2018-08-14T18:00:44.086Z",
 "resourceId":"some-text2",
 "category":"B",
 "level":3,
 "properties":{...}
 },
]
{code}
it should be read in `multiLine` mode. In this mode, Spark reads the whole array 
into memory in both cases, whether the schema is `ArrayType` or `StructType`. This 
can lead to unnecessary memory consumption and even to OOM for big JSON files.

In general, there is no need to materialize all parsed JSON records in memory 
there: 
https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
 . So, the JSON objects of an array can be read via an Iterator. 
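
A minimal, self-contained sketch (plain Jackson, not Spark's JacksonParser) of 
how a top-level JSON array can be walked lazily, one object at a time:
{code:scala}
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

val mapper = new ObjectMapper()
val parser = mapper.getFactory.createParser(new java.io.File("big-array.json"))
require(parser.nextToken() == JsonToken.START_ARRAY)

// Only one element of the array is materialized at a time.
val records: Iterator[JsonNode] = new Iterator[JsonNode] {
  private var token = parser.nextToken()
  def hasNext: Boolean = token == JsonToken.START_OBJECT
  def next(): JsonNode = {
    val node = mapper.readTree[JsonNode](parser)
    token = parser.nextToken()
    node
  }
}
records.foreach(rec => println(rec.get("time")))
parser.close()
{code}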



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609469#comment-16609469
 ] 

Maxim Gekk commented on SPARK-25396:


[~hyukjin.kwon] WDYT

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory in both cases, whether the schema is `ArrayType` or `StructType`. 
> This can lead to unnecessary memory consumption and even to OOM for big JSON 
> files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609605#comment-16609605
 ] 

Maxim Gekk commented on SPARK-25396:


I have a concern regarding when I should close the Jackson parser. For now it is 
closed before returning the result from the parse method there: 
[https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L394-L404]
 . If I return an *Iterator[InternalRow]* instead of *Seq[InternalRow]*, I 
have to postpone closing the Jackson parser at least until the end of the current 
task, right? ... but that is bad for per-line mode because it could produce a 
lot of open JSON parsers. It seems the implementations for multiLine and 
per-line mode should be different.

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
>   ...
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory in both cases, whether the schema is `ArrayType` or `StructType`. 
> This can lead to unnecessary memory consumption and even to OOM for big JSON 
> files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25273) How to install testthat v1.0.2

2018-08-29 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-25273:
---
Summary: How to install testthat v1.0.2  (was: How to install testthat = 
1.0.2)

> How to install testthat v1.0.2
> --
>
> Key: SPARK-25273
> URL: https://issues.apache.org/jira/browse/SPARK-25273
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> The command installs testthat v2.0.x:
> {code:R}
> R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
> 'survival'), repos='http://cran.us.r-project.org')"
> {code}
> which prevents running the R tests. Need to update the section 
> http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
> according to https://github.com/apache/spark/pull/20003



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25273) How to install testthat = 1.0.2

2018-08-29 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25273:
--

 Summary: How to install testthat = 1.0.2
 Key: SPARK-25273
 URL: https://issues.apache.org/jira/browse/SPARK-25273
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.3.1
Reporter: Maxim Gekk


The command installs testthat v2.0.x:
{code:R}
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
'survival'), repos='http://cran.us.r-project.org')"
{code}
which prevents running the R tests. Need to update the section 
http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
according to https://github.com/apache/spark/pull/20003



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24757) Improve error message for broadcast timeouts

2018-07-07 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24757:
--

 Summary: Improve error message for broadcast timeouts
 Key: SPARK-24757
 URL: https://issues.apache.org/jira/browse/SPARK-24757
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently, the TimeoutException that is thrown on broadcast joins doesn't give 
the user any clues about how to resolve the issue. We need to provide such help 
by pointing out two config parameters: *spark.sql.broadcastTimeout* and 
*spark.sql.autoBroadcastJoinThreshold*.

The ticket aims to handle the TimeoutException there: 
https://github.com/apache/spark/blob/b7a036b75b8a1d287ac014b85e90d555753064c9/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L143
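
A sketch, not the actual BroadcastExchangeExec change, of the kind of handling 
the ticket asks for:
{code:scala}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

def awaitBroadcast[T](fut: Future[T], timeout: Duration): T =
  try {
    Await.result(fut, timeout)
  } catch {
    case e: TimeoutException =>
      // Point users at the two configs instead of a bare timeout.
      val ex = new TimeoutException(
        s"Could not execute broadcast in ${timeout.toSeconds} secs. Consider increasing " +
          "spark.sql.broadcastTimeout or disabling broadcast joins by setting " +
          "spark.sql.autoBroadcastJoinThreshold to -1")
      ex.initCause(e)
      throw ex
  }
{code}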



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24761) Check modifiability of config parameters

2018-07-08 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24761:
--

 Summary: Check modifiability of config parameters
 Key: SPARK-24761
 URL: https://issues.apache.org/jira/browse/SPARK-24761
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Our customers and support team continuously face the situation when setting 
a config parameter via *spark.conf.set()* does not have any effect. It is not 
clear from a parameter's name whether it is a static parameter or one that 
can be set at runtime for the current session state. It would be useful to have a 
method on *RuntimeConfig* which could tell a user whether changing the given 
parameter may affect current behavior when he/she changes it in the 
spark-shell or a running notebook. The method can have the following signature:

{code:scala}
def isModifiable(key: String): Boolean
{code}

Any config parameter can be checked by using the syntax like this:

{code:scala}
scala> spark.conf.isModifiable("spark.sql.sources.schemaStringLengthThreshold")
res0: Boolean = false
{code}
or for Spark Core parameter:
{code:scala}
scala> spark.conf.isModifiable("spark.task.cpus")
res1: Boolean = false
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23620) Split thread dump lines by using the br tag

2018-03-07 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23620:
--

 Summary: Split thread dump lines by using the br tag
 Key: SPARK-23620
 URL: https://issues.apache.org/jira/browse/SPARK-23620
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Maxim Gekk


The '\n' line separator should be replaced by the <br/> tag in the generated 
HTML of the thread dump UI to guarantee that each class name is on a separate line. 
There are some cases when the HTML is proxied and the '\n' could be replaced 
by other whitespace (see the screenshot - 
https://drive.google.com/file/d/18t6yf-jnr072b-hPq4LhMZeRrw1PjpT5/view?usp=sharing).
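
A one-line sketch of the idea (assumed helper name, not the actual UI code):
{code:scala}
// Join stack-trace lines with an explicit <br/> so proxies cannot collapse them.
def stackTraceToHtml(stackTrace: String): String =
  stackTrace.split("\n").mkString("<br/>")
{code}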



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23649:
---
Attachment: utf8xFF.csv

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inferring of CSV files fails if the file contains a char that starts 
> with *0xFF*. 
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
>   63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 0020  2c 34 35 36 0d|,456.|
> 0025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +---++
> |channel|code|
> +---++
> | United| 123|
> | ABGUN�| 456|
> +---++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23649:
--

 Summary: CSV schema inferring fails on some UTF-8 chars
 Key: SPARK-23649
 URL: https://issues.apache.org/jira/browse/SPARK-23649
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Schema inferring of CSV files fails if the file contains a char that starts with 
*0xFF*. 
{code:java}
spark.read.option("header", "true").csv("utf8xFF.csv")
{code}
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 63
  at 
org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
  at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
{code}
Here is content of the file:
{code:java}
hexdump -C ~/tmp/utf8xFF.csv
  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
0020  2c 34 35 36 0d|,456.|
0025
{code}
Schema inferring doesn't fail in multiline mode:
{code}
spark.read.option("header", "true").option("multiline", 
"true").csv("utf8xFF.csv")
{code}
{code:java}
+---+-+
|channel|code
+---+-+
| United| 123
| ABGUN�| 456
+---+-+
{code}
and Spark is able to read the csv file if the schema is specified:
{code}
import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
{code}
{code:java}
+---++
|channel|code|
+---++
| United| 123|
| ABGUN�| 456|
+---++
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23649:
---
Shepherd: Herman van Hovell

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inferring of CSV files fails if the file contains a char that starts 
> with *0xFF*. 
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
>   63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 0020  2c 34 35 36 0d|,456.|
> 0025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +---++
> |channel|code|
> +---++
> | United| 123|
> | ABGUN�| 456|
> +---++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory

2018-03-10 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23643:
---
Summary: XORShiftRandom.hashSeed allocates unnecessary memory  (was: 
XORShiftRandom.setSeed allocates unnecessary memory)

> XORShiftRandom.hashSeed allocates unnecessary memory
> 
>
> Key: SPARK-23643
> URL: https://issues.apache.org/jira/browse/SPARK-23643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> The setSeed method allocates a 64-byte buffer and puts only 8 bytes of the 
> seed parameter into it. The other bytes are always zero and could easily be 
> excluded from the hash calculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23643) XORShiftRandom.setSeed allocates unnecessary memory

2018-03-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23643:
--

 Summary: XORShiftRandom.setSeed allocates unnecessary memory
 Key: SPARK-23643
 URL: https://issues.apache.org/jira/browse/SPARK-23643
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Maxim Gekk


The setSeed method allocates a 64-byte buffer and puts only 8 bytes of the seed 
parameter into it. The other bytes are always zero and could easily be excluded 
from the hash calculation.
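
A minimal sketch of the idea, with no claims about the exact implementation: 
hash just the 8 seed bytes instead of a 64-byte, mostly-zero buffer.
{code:scala}
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Only the 8 meaningful bytes of the seed take part in the hash.
def hashSeed(seed: Long): Int = {
  val bytes = ByteBuffer.allocate(8).putLong(seed).array()
  MurmurHash3.bytesHash(bytes)
}
{code}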



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory

2018-03-10 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23643:
---
Description: The hashSeed method allocates a 64-byte buffer and puts only 8 
bytes of the seed parameter into it. The other bytes are always zero and could 
easily be excluded from the hash calculation.  (was: The setSeed method allocates 
a 64-byte buffer and puts only 8 bytes of the seed parameter into it. The other 
bytes are always zero and could easily be excluded from the hash calculation.)

> XORShiftRandom.hashSeed allocates unnecessary memory
> 
>
> Key: SPARK-23643
> URL: https://issues.apache.org/jira/browse/SPARK-23643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> The hashSeed method allocates a 64-byte buffer and puts only 8 bytes of the 
> seed parameter into it. The other bytes are always zero and could easily be 
> excluded from the hash calculation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-24 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24068:
--

 Summary: CSV schema inferring doesn't work for compressed files
 Key: SPARK-24068
 URL: https://issues.apache.org/jira/browse/SPARK-24068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Here is a simple csv file compressed by lzo
{code}
$ cat ./test.csv
col1,col2
a,1
$ lzop ./test.csv
$ ls
test.csv test.csv.lzo
{code}

Reading test.csv.lzo with LZO codec (see https://github.com/twitter/hadoop-lzo, 
for example):
{code:scala}
scala> val ds = spark.read.option("header", true).option("inferSchema", 
true).option("io.compression.codecs", 
"com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
ds: org.apache.spark.sql.DataFrame = [�LZO?: string]

scala> ds.printSchema
root
 |-- �LZO: string (nullable = true)


scala> ds.show
+-+
|�LZO|
+-+
|a|
+-+
{code}
but the file can be read if the schema is specified:
{code}
scala> import org.apache.spark.sql.types._
scala> val schema = new StructType().add("col1", StringType).add("col2", 
IntegerType)
scala> val ds = spark.read.schema(schema).option("header", 
true).option("io.compression.codecs", 
"com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
scala> ds.show
+++
|col1|col2|
+++
|   a|   1|
+++
{code}

Just in case, schema inferring works for the original uncompressed file:
{code:scala}
scala> spark.read.option("header", true).option("inferSchema", 
true).csv("test.csv").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24004) Tests of from_json for MapType

2018-04-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24004:
--

 Summary: Tests of from_json for MapType
 Key: SPARK-24004
 URL: https://issues.apache.org/jira/browse/SPARK-24004
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


There are no tests for *from_json* that check *MapType* as a value type of 
struct fields. MapType should be supported as a non-root type according to the 
current implementation of JacksonParser, but this functionality is not checked.
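
A sketch of the kind of check the ticket asks for, written as a spark-shell 
snippet (where exactly the test should live is not prescribed here):
{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", StringType)
  .add("props", MapType(StringType, IntegerType)) // MapType as a field value type

val df = Seq("""{"id": "a", "props": {"x": 1, "y": 2}}""").toDF("json")
df.select(from_json($"json", schema)).show(false)
{code}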



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


