[jira] [Updated] (SPARK-23619) Document the column names created by explode and posexplode functions

2019-02-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23619:
--
Affects Version/s: (was: 2.3.0)
   3.0.0

> Document the column names created by explode and posexplode functions
> -
>
> Key: SPARK-23619
> URL: https://issues.apache.org/jira/browse/SPARK-23619
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Joe Pallas
>Priority: Minor
>  Labels: documentation
>
> The documentation for {{explode}} and {{posexplode}} neglects to mention the 
> default column names for the new columns: {{col}} and {{pos}}.
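
For illustration, a minimal Scala sketch (spark-shell style, with hypothetical data and column names) showing the defaults described above:

{code:scala}
import org.apache.spark.sql.functions.{explode, posexplode}
import spark.implicits._  // available automatically in spark-shell

val df = Seq((1, Seq("a", "b"))).toDF("id", "items")

// explode yields a single output column named "col" by default
df.select(explode($"items")).columns      // Array(col)

// posexplode yields two output columns named "pos" and "col" by default
df.select(posexplode($"items")).columns   // Array(pos, col)
{code}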



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26855) SparkSubmitSuite fails on a clean build

2019-02-11 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765759#comment-16765759
 ] 

Felix Cheung commented on SPARK-26855:
--

IMO we have two options:
 # document that tests only pass after a clean build with skipTests
 # re-order tests: suppose test A depends on module B being built; we could move 
test A to run after B (or rather, simply make it the test of B)

> SparkSubmitSuite fails on a clean build
> ---
>
> Key: SPARK-26855
> URL: https://issues.apache.org/jira/browse/SPARK-26855
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Felix Cheung
>Priority: Major
>
> The SparkSubmitSuite test
> "include an external JAR in SparkR"
> fails consistently, but the test before it, "correctly builds R packages 
> included in a jar with --packages", passes.
> The workaround is to build once with skipTests first; then everything passes.
> Ran into this while testing 2.3.3 RC2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26853) Enhance expression descriptions for commonly used aggregate function functions.

2019-02-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26853.
---
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23756

> Enhance expression descriptions for commonly used aggregate function 
> functions.
> ---
>
> Key: SPARK-26853
> URL: https://issues.apache.org/jira/browse/SPARK-26853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add the @since field and provide usage examples for commonly used aggregate 
> functions such as Max, Min etc.
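
As a hedged illustration (not the actual patch): once usage examples and a since version are added to these functions' ExpressionDescription metadata, they surface to SQL users, e.g.:

{code:scala}
// spark-shell style sketch: the extended description (usage, examples, since)
// of built-in aggregates is what DESCRIBE FUNCTION EXTENDED prints.
spark.sql("DESCRIBE FUNCTION EXTENDED max").show(truncate = false)
spark.sql("DESCRIBE FUNCTION EXTENDED min").show(truncate = false)
{code}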



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26853) Add example and version for commonly used aggregate function descriptions

2019-02-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26853:
--
Summary: Add example and version for commonly used aggregate function 
descriptions  (was: Enhance expression descriptions for commonly used aggregate 
function functions.)

> Add example and version for commonly used aggregate function descriptions
> -
>
> Key: SPARK-26853
> URL: https://issues.apache.org/jira/browse/SPARK-26853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add the @since field and provide usage examples for commonly used aggregate 
> functions such as Max, Min etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26853) Enhance expression descriptions for commonly used aggregate function functions.

2019-02-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26853:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> Enhance expression descriptions for commonly used aggregate function 
> functions.
> ---
>
> Key: SPARK-26853
> URL: https://issues.apache.org/jira/browse/SPARK-26853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> Add the @since field and provide usage examples for commonly used aggregate 
> functions such as Max, Min etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26857) Return UnsafeArrayData for date/timestamp type in ColumnarArray.copy()

2019-02-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26857:


Assignee: Apache Spark

> Return UnsafeArrayData for date/timestamp type in ColumnarArray.copy()
> --
>
> Key: SPARK-26857
> URL: https://issues.apache.org/jira/browse/SPARK-26857
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In https://github.com/apache/spark/issues/23569, the copy method of 
> ColumnarArray is implemented. 
> To further improve it, we can return UnsafeArrayData for date/timestamp type 
> in ColumnarArray.copy().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26857) Return UnsafeArrayData for date/timestamp type in ColumnarArray.copy()

2019-02-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26857:


Assignee: (was: Apache Spark)

> Return UnsafeArrayData for date/timestamp type in ColumnarArray.copy()
> --
>
> Key: SPARK-26857
> URL: https://issues.apache.org/jira/browse/SPARK-26857
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In https://github.com/apache/spark/issues/23569, the copy method of 
> ColumnarArray is implemented. 
> To further improve it, we can return UnsafeArrayData for date/timestamp type 
> in ColumnarArray.copy().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26857) Return UnsafeArrayData for date/timestamp type in ColumnarArray.copy()

2019-02-11 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26857:
--

 Summary: Return UnsafeArrayData for date/timestamp type in 
ColumnarArray.copy()
 Key: SPARK-26857
 URL: https://issues.apache.org/jira/browse/SPARK-26857
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


In https://github.com/apache/spark/issues/23569, the copy method of 
ColumnarArray is implemented. 
To further improve it, we can return UnsafeArrayData for date/timestamp type in 
ColumnarArray.copy().
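
As a hedged illustration of why this is feasible (not the actual patch): date and timestamp values are physically stored as Int (days since the epoch) and Long (microseconds since the epoch), so an array copy can reuse the same primitive fast path already used for IntegerType/LongType:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData

val days: Array[Int] = Array(17897, 17898)          // DateType payload (days since epoch)
val micros: Array[Long] = Array(1549843200000000L)  // TimestampType payload (microseconds since epoch)

// The primitive-array conversion used for int/long arrays applies directly.
val dateArray = UnsafeArrayData.fromPrimitiveArray(days)
val tsArray = UnsafeArrayData.fromPrimitiveArray(micros)
{code}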



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26509) Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader

2019-02-11 Thread Jialin Qiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765719#comment-16765719
 ] 

Jialin Qiao commented on SPARK-26509:
-

You can add this conf. It works for me

 

{{spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")}}

 

https://stackoverflow.com/questions/52588408/unsupported-encoding-delta-byte-array-while-writing-parquet-data-to-csv-using

> Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader
> --
>
> Key: SPARK-26509
> URL: https://issues.apache.org/jira/browse/SPARK-26509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Filipe Gonzaga Miranda
>Priority: Major
>   Original Estimate: 40h
>  Remaining Estimate: 40h
>
> I get the exception below with Spark 2.4 when reading parquet files where some 
> columns are DELTA_BYTE_ARRAY encoded.
>  
> {code:java}
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
>  
> {code}
>  
> If the property spark.sql.parquet.enableVectorizedReader is set to false, it works.
> The parquet files were written with Parquet V2, and as far as I understand, 
> V2 is the version used in Spark 2.x.
> I did not find any property to change which Parquet version Spark uses (V1, 
> V2).
> Is there any way to benefit from the Vectorized Reader? Or would this require 
> something like a new implementation to support this version? I would propose so.
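
For reference, a minimal sketch of applying the workaround mentioned in the comments (the file path is hypothetical):

{code:scala}
// Fall back to the non-vectorized Parquet reader so DELTA_BYTE_ARRAY-encoded
// columns can be read, at the cost of the vectorized reader's performance.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet("/path/to/parquet-v2-files")  // hypothetical path
df.show()
{code}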



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26509) Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader

2019-02-11 Thread Jialin Qiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765719#comment-16765719
 ] 

Jialin Qiao edited comment on SPARK-26509 at 2/12/19 6:24 AM:
--

You can  try this

 

{{spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")}}

 

[https://stackoverflow.com/questions/52588408/unsupported-encoding-delta-byte-array-while-writing-parquet-data-to-csv-using]


was (Author: qiaojialin):
You can add this conf. It works for me

 

{{spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")}}

 

https://stackoverflow.com/questions/52588408/unsupported-encoding-delta-byte-array-while-writing-parquet-data-to-csv-using

> Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader
> --
>
> Key: SPARK-26509
> URL: https://issues.apache.org/jira/browse/SPARK-26509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Filipe Gonzaga Miranda
>Priority: Major
>   Original Estimate: 40h
>  Remaining Estimate: 40h
>
> I get the exception below with Spark 2.4 when reading parquet files where some 
> columns are DELTA_BYTE_ARRAY encoded.
>  
> {code:java}
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
>  
> {code}
>  
> If the property spark.sql.parquet.enableVectorizedReader is set to false, it works.
> The parquet files were written with Parquet V2, and as far as I understand, 
> V2 is the version used in Spark 2.x.
> I did not find any property to change which Parquet version Spark uses (V1, 
> V2).
> Is there any way to benefit from the Vectorized Reader? Or would this require 
> something like a new implementation to support this version? I would propose so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2019-02-11 Thread luzengxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765694#comment-16765694
 ] 

luzengxiang commented on SPARK-24374:
-

Hi [~mengxr], I am using Scala API. 

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
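
For orientation, a minimal sketch of the barrier execution API that came out of this SPIP (Spark 2.4+); the partition-level logic here is only a placeholder:

{code:scala}
import org.apache.spark.BarrierTaskContext

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// All tasks of this stage are launched together; barrier() blocks until every
// task in the stage has reached this point, mirroring an MPI-style rendezvous.
val result = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // ... set up communication with peer tasks (e.g. an all-reduce worker) ...
  ctx.barrier()
  iter
}.collect()
{code}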



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26762) Arrow optimization for conversion from Spark DataFrame to R DataFrame

2019-02-11 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26762:


Assignee: Hyukjin Kwon

> Arrow optimization for conversion from Spark DataFrame to R DataFrame
> -
>
> Key: SPARK-26762
> URL: https://issues.apache.org/jira/browse/SPARK-26762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Like SPARK-25981, {{collect(rdf)}} can be optimized via Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26762) Arrow optimization for conversion from Spark DataFrame to R DataFrame

2019-02-11 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765668#comment-16765668
 ] 

Hyukjin Kwon commented on SPARK-26762:
--

I happened to make a PR for this one first :). dapply needs to wait for the 
gapply PR to be merged.

> Arrow optimization for conversion from Spark DataFrame to R DataFrame
> -
>
> Key: SPARK-26762
> URL: https://issues.apache.org/jira/browse/SPARK-26762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Like SPARK-25981, {{collect(rdf)}} can be optimized via Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25158) Executor accidentally exit because ScriptTransformationWriterThread throws TaskKilledException.

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25158:
---

Assignee: Yang Jie

> Executor accidentally exit because ScriptTransformationWriterThread throws 
> TaskKilledException.
> ---
>
> Key: SPARK-25158
> URL: https://issues.apache.org/jira/browse/SPARK-25158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.0.0
>
>
> In a production environment, users run Spark SQL with the transform feature and 
> the config 'spark.speculation = true'. Sometimes the job fails, and we found many 
> dead executors through the `Executor` tab of the Spark UI. Here are some relevant 
> sample logs:
> Driver Side  Log:
> {code:java}
> 18/08/14 16:17:52 INFO TaskSetManager: Starting task 2909.1 in stage 2.0 (TID 
> 3929, executor.330, executor 7, partition 2909, PROCESS_LOCAL, 6791 bytes)
> 18/08/14 16:17:53 INFO TaskSetManager: Killing attempt 1 for task 2909.1 in 
> stage 2.0 (TID 3929) on executor.330 as the attempt 0 succeeded on executor.58
> 18/08/14 16:17:53 WARN TaskSetManager: Lost task 2909.1 in stage 2.0 (TID 
> 3929, executor.330, executor 7): TaskKilled (killed intentionally)
> 18/08/14 16:17:53 INFO TaskSetManager: Task 2909.1 in stage 2.0 (TID 3929) 
> failed, but another instance of the task has already succeeded, so not 
> re-queuing the task to be re-executed.
> {code}
>  
> Executor Side Log: 
> {code:java}
> 18/08/14 16:17:52 INFO Executor: Running task 2909.1 in stage 2.0 (TID 3929)
> 18/08/14 16:17:53 INFO Executor: Executor is trying to kill task 2909.1 in 
> stage 2.0 (TID 3929)
> 18/08/14 16:17:53 ERROR ScriptTransformationWriterThread:
> 18/08/14 16:17:53 ERROR Utils: Uncaught exception in thread 
> Thread-ScriptTransformation-Feed
> org.apache.spark.TaskKilledException
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:295)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:573)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:161)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:148)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:380)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformation.scala:289)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:278)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:278)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformation.scala:278)
> 18/08/14 16:17:53 INFO Executor: Executor killed task 2909.1 in stage 2.0 
> (TID 3929)
> 18/08/14 16:17:53 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Thread-ScriptTransformation-Feed,5,main]
> org.apache.spark.TaskKilledException
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:295)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:573)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:161)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:148)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:380)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at 

[jira] [Resolved] (SPARK-25158) Executor accidentally exit because ScriptTransformationWriterThread throws TaskKilledException.

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25158.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22149
[https://github.com/apache/spark/pull/22149]

> Executor accidentally exit because ScriptTransformationWriterThread throws 
> TaskKilledException.
> ---
>
> Key: SPARK-25158
> URL: https://issues.apache.org/jira/browse/SPARK-25158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Yang Jie
>Priority: Major
> Fix For: 3.0.0
>
>
> In a production environment, users run Spark SQL with the transform feature and 
> the config 'spark.speculation = true'. Sometimes the job fails, and we found many 
> dead executors through the `Executor` tab of the Spark UI. Here are some relevant 
> sample logs:
> Driver Side  Log:
> {code:java}
> 18/08/14 16:17:52 INFO TaskSetManager: Starting task 2909.1 in stage 2.0 (TID 
> 3929, executor.330, executor 7, partition 2909, PROCESS_LOCAL, 6791 bytes)
> 18/08/14 16:17:53 INFO TaskSetManager: Killing attempt 1 for task 2909.1 in 
> stage 2.0 (TID 3929) on executor.330 as the attempt 0 succeeded on executor.58
> 18/08/14 16:17:53 WARN TaskSetManager: Lost task 2909.1 in stage 2.0 (TID 
> 3929, executor.330, executor 7): TaskKilled (killed intentionally)
> 18/08/14 16:17:53 INFO TaskSetManager: Task 2909.1 in stage 2.0 (TID 3929) 
> failed, but another instance of the task has already succeeded, so not 
> re-queuing the task to be re-executed.
> {code}
>  
> Executor Side Log: 
> {code:java}
> 18/08/14 16:17:52 INFO Executor: Running task 2909.1 in stage 2.0 (TID 3929)
> 18/08/14 16:17:53 INFO Executor: Executor is trying to kill task 2909.1 in 
> stage 2.0 (TID 3929)
> 18/08/14 16:17:53 ERROR ScriptTransformationWriterThread:
> 18/08/14 16:17:53 ERROR Utils: Uncaught exception in thread 
> Thread-ScriptTransformation-Feed
> org.apache.spark.TaskKilledException
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:295)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:573)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:161)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:148)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:380)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformation.scala:289)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:278)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:278)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformation.scala:278)
> 18/08/14 16:17:53 INFO Executor: Executor killed task 2909.1 in stage 2.0 
> (TID 3929)
> 18/08/14 16:17:53 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Thread-ScriptTransformation-Feed,5,main]
> org.apache.spark.TaskKilledException
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:295)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:573)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:161)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:148)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:380)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 

[jira] [Assigned] (SPARK-26762) Arrow optimization for conversion from Spark DataFrame to R DataFrame

2019-02-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26762:


Assignee: Apache Spark

> Arrow optimization for conversion from Spark DataFrame to R DataFrame
> -
>
> Key: SPARK-26762
> URL: https://issues.apache.org/jira/browse/SPARK-26762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Like SPARK-25981, {{collect(rdf)}} can be optimized via Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26762) Arrow optimization for conversion from Spark DataFrame to R DataFrame

2019-02-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26762:


Assignee: (was: Apache Spark)

> Arrow optimization for conversion from Spark DataFrame to R DataFrame
> -
>
> Key: SPARK-26762
> URL: https://issues.apache.org/jira/browse/SPARK-26762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Like SPARK-25981, {{collect(rdf)}} can be optimized via Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-25823) map_filter can generate incorrect data

2019-02-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-25823.
-

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression because this occurs in new higher-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> duplicate keys. If we want to allow this difference in the new higher-order 
> functions, we had better at least add some warning about this difference to these 
> functions after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25823) map_filter can generate incorrect data

2019-02-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25823.
---
Resolution: Duplicate

Since this is resolved by SPARK-25829, I close this as a `Duplicate`.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: correctness
>
> This is not a regression because this occurs in new higher-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> duplicate keys. If we want to allow this difference in the new higher-order 
> functions, we had better at least add some warning about this difference to these 
> functions after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26696) Dataset encoder should be publicly accessible

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26696:
---

Assignee: Simeon Simeonov

> Dataset encoder should be publicly accessible
> -
>
> Key: SPARK-26696
> URL: https://issues.apache.org/jira/browse/SPARK-26696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Simeon Simeonov
>Assignee: Simeon Simeonov
>Priority: Major
>  Labels: dataset, encoding
>
> As a platform, Spark should enable framework developers to accomplish outside 
> of the Spark codebase much of what can be accomplished inside the Spark 
> codebase. One of the obstacles to this is a historical pattern of excessive 
> data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
> issue is an example of this pattern when it comes to {{Dataset}}.
> Consider a transformation with the signature `def foo[A](ds: Dataset[A]): 
> Dataset[A]`, which requires the use of {{toDF()}}. To get back to 
> {{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
> {{Encoder[A]}}. A naive approach would change the function signature to 
> `foo[A : Encoder]` but this is poor API design that requires unnecessarily 
> carrying of implicits from user code into framework code. We know 
> `Encoder[A]` exists because we have access to an instance of `Dataset[A]`... 
> but its `encoder` is not accessible.
> The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
> case with {{queryExecution}}.
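
A hedged sketch of the pattern described above, assuming {{encoder}} is exposed as proposed (the function name is illustrative):

{code:scala}
import org.apache.spark.sql.{Dataset, Encoder}

// Framework-side transformation: no Encoder context bound leaks into the API,
// because the encoder is recovered from the Dataset itself.
def foo[A](ds: Dataset[A]): Dataset[A] = {
  implicit val enc: Encoder[A] = ds.encoder
  ds.toDF()
    // ... generic DataFrame-level rewrites ...
    .as[A]
}
{code}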



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26696) Dataset encoder should be publicly accessible

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26696.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23620
[https://github.com/apache/spark/pull/23620]

> Dataset encoder should be publicly accessible
> -
>
> Key: SPARK-26696
> URL: https://issues.apache.org/jira/browse/SPARK-26696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Simeon Simeonov
>Assignee: Simeon Simeonov
>Priority: Major
>  Labels: dataset, encoding
> Fix For: 3.0.0
>
>
> As a platform, Spark should enable framework developers to accomplish outside 
> of the Spark codebase much of what can be accomplished inside the Spark 
> codebase. One of the obstacles to this is a historical pattern of excessive 
> data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
> issue is an example of this pattern when it comes to {{Dataset}}.
> Consider a transformation with the signature `def foo[A](ds: Dataset[A]): 
> Dataset[A]`, which requires the use of {{toDF()}}. To get back to 
> {{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
> {{Encoder[A]}}. A naive approach would change the function signature to 
> `foo[A : Encoder]` but this is poor API design that requires unnecessarily 
> carrying of implicits from user code into framework code. We know 
> `Encoder[A]` exists because we have access to an instance of `Dataset[A]`... 
> but its `encoder` is not accessible.
> The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
> case with {{queryExecution}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26654:
---

Assignee: Maxim Gekk

> Use Timestamp/DateFormatter in CatalogColumnStat
> 
>
> Key: SPARK-26654
> URL: https://issues.apache.org/jira/browse/SPARK-26654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Need to switch fromExternalString on Timestamp/DateFormatters, in particular:
> https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26654.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23662
[https://github.com/apache/spark/pull/23662]

> Use Timestamp/DateFormatter in CatalogColumnStat
> 
>
> Key: SPARK-26654
> URL: https://issues.apache.org/jira/browse/SPARK-26654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Need to switch fromExternalString on Timestamp/DateFormatters, in particular:
> https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26740:
---

Assignee: Maxim Gekk

> Statistics for date and timestamp columns depend on system time zone
> 
>
> Key: SPARK-26740
> URL: https://issues.apache.org/jira/browse/SPARK-26740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> While saving statistics for timestamp/date columns, the default time zone is used 
> in the conversion of the internal type (microseconds or days since the epoch) to a 
> textual representation. The textual representation doesn't contain a time zone. So, 
> when it is converted back to the internal types (Long for TimestampType or 
> DateType), Timestamp.valueOf and Date.valueOf are used in the conversions. 
> These methods use the current system time zone.
> If the system time zone differs between saving and retrieving statistics for 
> timestamp/date columns, the restored microseconds/days since the epoch will be 
> different.
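
A minimal sketch of the underlying mechanism (plain JVM calls, not the Spark code path): Timestamp.valueOf interprets the text in the current default time zone, so the restored epoch value shifts if the zone differs between write and read.

{code:scala}
import java.sql.Timestamp
import java.util.TimeZone

// "Write" side: render an internal epoch value to text in one zone.
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
val text = new Timestamp(1549843200000L).toString

// "Read" side: parse the same text back in a different zone.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
val restored = Timestamp.valueOf(text).getTime

// restored != 1549843200000L because the textual form carries no time zone.
{code}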



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone

2019-02-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26740.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23662
[https://github.com/apache/spark/pull/23662]

> Statistics for date and timestamp columns depend on system time zone
> 
>
> Key: SPARK-26740
> URL: https://issues.apache.org/jira/browse/SPARK-26740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> While saving statistics for timestamp/date columns, the default time zone is used 
> in the conversion of the internal type (microseconds or days since the epoch) to a 
> textual representation. The textual representation doesn't contain a time zone. So, 
> when it is converted back to the internal types (Long for TimestampType or 
> DateType), Timestamp.valueOf and Date.valueOf are used in the conversions. 
> These methods use the current system time zone.
> If the system time zone differs between saving and retrieving statistics for 
> timestamp/date columns, the restored microseconds/days since the epoch will be 
> different.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-02-11 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang resolved SPARK-26795.
-
Resolution: Not A Problem

> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
>
> There is a parameter, spark.maxRemoteBlockSizeFetchToMem, which means a 
> remote block will be fetched to disk when the size of the block is above this 
> threshold in bytes.
> So during the shuffle read phase, a managedBuffer that throws an IOException may 
> be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-02-11 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang closed SPARK-26795.
---

> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
>
> There is a parameter, spark.maxRemoteBlockSizeFetchToMem, which means a 
> remote block will be fetched to disk when the size of the block is above this 
> threshold in bytes.
> So during the shuffle read phase, a managedBuffer that throws an IOException may 
> be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2019-02-11 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765487#comment-16765487
 ] 

t oo commented on SPARK-24437:
--

[~dvogelbacher] any luck on the test?

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png, Screen Shot 2018-11-01 at 10.38.30 AM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS.
> With each query execution requiring a broadcast join, an UnsafeHashedRelation is 
> added for cleanup in ContextCleaner. This reference to the 
> UnsafeHashedRelation is being held by some other collection and never becomes 
> eligible for GC, and because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-02-11 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765482#comment-16765482
 ] 

t oo commented on SPARK-22860:
--

gentle ping, fix waiting to be committed

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword passed via the CLI to the executor processes. The 
> ExecutorRunner should redact the passwords so they do not appear in the worker's 
> log files at INFO level. In this example, you can see my 'SuperSecretPassword' in 
> a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8659) Spark SQL Thrift Server does NOT honour hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory

2019-02-11 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765481#comment-16765481
 ] 

t oo commented on SPARK-8659:
-

bump

> Spark SQL Thrift Server does NOT honour 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
>  
> ---
>
> Key: SPARK-8659
> URL: https://issues.apache.org/jira/browse/SPARK-8659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: Premchandra Preetham Kukillaya
>Priority: Major
>
> It seems that when pointing a JDBC/ODBC driver to the Spark SQL Thrift Server, 
> Hive's security feature SQL-based authorization is not working. It ignores 
> the security settings passed through the command line. The command-line 
> arguments are given below for reference.
> The problem is that user X can do a select on a table belonging to user Y, even 
> though permission for the table is explicitly defined, and it's a data security risk.
> I am using Hive 0.13.1 and Spark 1.3.1, and here is the list of arguments passed 
> to the Spark SQL Thrift Server.
> ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
> hostname.compute.amazonaws.com --hiveconf 
> hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
>  --hiveconf 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
>  --hiveconf hive.server2.enable.doAs=false --hiveconf 
> hive.security.authorization.enabled=true --hiveconf 
> javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
>  --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
> --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
> javax.jdo.option.ConnectionPassword=hive



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2019-02-11 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765392#comment-16765392
 ] 

Jungtaek Lim commented on SPARK-10892:
--

[~jashgala] Just to clarify, did you also try applying the workaround? If so, did 
it work for Spark 2.4.0?

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-+——+
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  

[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin

2019-02-11 Thread Tao Luo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765378#comment-16765378
 ] 

Tao Luo commented on SPARK-21492:
-

I'll take a stab at this jira, should have something to review today or 
tomorrow. 

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0
>Reporter: Zhan Zhang
>Priority: Major
>
> In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak 
> caused by the sort. The memory is not released until the task ends, and cannot 
> be used by other operators, causing a performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26045) Error in the spark 2.4 release package with the spark-avro_2.11 depdency

2019-02-11 Thread Pushpendra Jaiswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765102#comment-16765102
 ] 

Pushpendra Jaiswal commented on SPARK-26045:


It's happening with Spark 2.3.1, Spark 2.4.0, and Hadoop 3.1.1. Tried with 
avro-1.8.2.

> Error in the spark 2.4 release package with the spark-avro_2.11 depdency
> 
>
> Key: SPARK-26045
> URL: https://issues.apache.org/jira/browse/SPARK-26045
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
> Environment: 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Oscar garcía 
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Hello, I have been having problems with the latest Spark 2.4 release: the 
> read-Avro-file feature does not seem to be working. I have fixed it locally by 
> building the source code and updating the *avro-1.8.2.jar* in the *$SPARK_HOME*/jars/ 
> dependencies.
> With the default Spark 2.4 release, when I try to read an Avro file Spark 
> raises the following exception.
> {code:java}
> spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
> scala> spark.read.format("avro").load("file.avro")
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:51)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105
> {code}
> Checksum:  spark-2.4.0-bin-without-hadoop.tgz: 7670E29B 59EAE7A8 5DBC9350 
> 085DD1E0 F056CA13 11365306 7A6A32E9 B607C68E A8DAA666 EF053350 008D0254 
> 318B70FB DE8A8B97 6586CA19 D65BA2B3 FD7F919E
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic

2019-02-11 Thread Valeria Vasylieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valeria Vasylieva updated SPARK-20597:
--
Attachment: Jacek Laskowski.url

> KafkaSourceProvider falls back on path as synonym for topic
> ---
>
> Key: SPARK-20597
> URL: https://issues.apache.org/jira/browse/SPARK-20597
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> # {{KafkaSourceProvider}} supports a {{topic}} option that sets the Kafka topic 
> to save a DataFrame's rows to
> # {{KafkaSourceProvider}} can use a {{topic}} column to assign rows to Kafka 
> topics for writing
> What seems quite an interesting option is to support {{start(path: String)}} 
> as the lowest-precedence option, in which {{path}} would designate the default 
> topic when no other options are used.
> {code}
> df.writeStream.format("kafka").start("topic")
> {code}
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html
>  for discussion
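
A sketch of what the precedence could look like in practice, assuming {{df}} is a streaming DataFrame with a {{value}} column; the servers, topic names, and checkpoint paths are illustrative. Only the first two forms exist today; the third is the proposal:
{code:java}
import org.apache.spark.sql.functions.lit

// 1) Existing: an explicit "topic" option applies to every row.
df.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("checkpointLocation", "/tmp/cp1")
  .option("topic", "events")
  .start()

// 2) Existing: without the option, a "topic" column routes each row.
df.withColumn("topic", lit("events"))
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("checkpointLocation", "/tmp/cp2")
  .start()

// 3) Proposed: fall back on the start() path argument when neither of the above is given.
df.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("checkpointLocation", "/tmp/cp3")
  .start("events")
{code}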



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic

2019-02-11 Thread Valeria Vasylieva (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765060#comment-16765060
 ] 

Valeria Vasylieva commented on SPARK-20597:
---

Hi! 
[~Satyajit], are you still working on this task? If not, [~jlaskowski], can I give 
it a try?

> KafkaSourceProvider falls back on path as synonym for topic
> ---
>
> Key: SPARK-20597
> URL: https://issues.apache.org/jira/browse/SPARK-20597
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> # {{KafkaSourceProvider}} supports a {{topic}} option that sets the Kafka topic 
> to save a DataFrame's rows to
> # {{KafkaSourceProvider}} can use a {{topic}} column to assign rows to Kafka 
> topics for writing
> What seems quite an interesting option is to support {{start(path: String)}} 
> as the lowest-precedence option, in which {{path}} would designate the default 
> topic when no other options are used.
> {code}
> df.writeStream.format("kafka").start("topic")
> {code}
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html
>  for discussion



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22826) [SQL] findWiderTypeForTwo Fails over StructField of Array

2019-02-11 Thread Aleksander Eskilson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765055#comment-16765055
 ] 

Aleksander Eskilson commented on SPARK-22826:
-

Yeah, I believe I saw in the source code that this was resolved either in 2.4.0 or 
sometime shortly after.

> [SQL] findWiderTypeForTwo Fails over StructField of Array
> -
>
> Key: SPARK-22826
> URL: https://issues.apache.org/jira/browse/SPARK-22826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> The {{findWiderTypeForTwo}} codepath in Catalyst {{TypeCoercion}} fails when 
> applied to two {{StructType}}s having the following fields:
> {noformat}
>   StructType(StructField("a", ArrayType(StringType, containsNull=true)) 
> :: Nil),
>   StructType(StructField("a", ArrayType(StringType, containsNull=false)) 
> :: Nil)
> {noformat}
> When in {{findTightestCommonType}}, the function attempts to recursively find 
> the tightest common type of two arrays. These two arrays are not equal types 
> (since one would admit null elements and the other would not), but 
> {{findTightestCommonType}} has no match case for {{ArrayType}} (or 
> {{MapType}}), so the 
> [get|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L108]
>  operation on the dataType of the {{StructField}} throws a 
> {{NoSuchElementException}}.
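
A hypothetical way to reach that coercion path from user code (the column name and data are illustrative, and whether it actually throws depends on the Spark version): union two DataFrames whose struct column differs only in the nested array's {{containsNull}} flag.
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Two struct types whose only difference is the containsNull flag of the nested array.
val nullableInner = StructType(StructField("a", ArrayType(StringType, containsNull = true)) :: Nil)
val nonNullInner  = StructType(StructField("a", ArrayType(StringType, containsNull = false)) :: Nil)

val df1 = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Row(Seq("x", null))))),
  StructType(StructField("s", nullableInner) :: Nil))
val df2 = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Row(Seq("y"))))),
  StructType(StructField("s", nonNullInner) :: Nil))

// Widening column "s" must reconcile the two struct types; on affected versions that
// walks into findTightestCommonType, which has no ArrayType case, and fails.
df1.union(df2).printSchema()
{code}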



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic

2019-02-11 Thread Valeria Vasylieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valeria Vasylieva updated SPARK-20597:
--
Attachment: (was: Jacek Laskowski.url)

> KafkaSourceProvider falls back on path as synonym for topic
> ---
>
> Key: SPARK-20597
> URL: https://issues.apache.org/jira/browse/SPARK-20597
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> # {{KafkaSourceProvider}} supports a {{topic}} option that sets the Kafka topic 
> to save a DataFrame's rows to
> # {{KafkaSourceProvider}} can use a {{topic}} column to assign rows to Kafka 
> topics for writing
> What seems quite an interesting option is to support {{start(path: String)}} 
> as the lowest-precedence option, in which {{path}} would designate the default 
> topic when no other options are used.
> {code}
> df.writeStream.format("kafka").start("topic")
> {code}
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html
>  for discussion



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2019-02-11 Thread Tamas Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764965#comment-16764965
 ] 

Tamas Nemeth commented on SPARK-26836:
--

And one more thing.

I also got this warning, which I did not get when I ran on the table where the 
partitions do not contain the avro.schema.url property.
{code:java}
19/02/11 14:39:40 WARN AvroDeserializer: Received different schemas. Have to 
re-encode: 
{"type":"record","name":"doctors","namespace":"testing.hive.avro.serde","fields":[{"name":"number","type":"int","doc":"Order
 of playing the 
role"},{"name":"extra_field","type":"string","default":"fishfingers and 
custard","doc:":"an extra field not in the original 
file"},{"name":"first_name","type":"string","doc":"first name of actor playing 
role"},{"name":"last_name","type":"string","doc":"last name of actor playing 
role"}]}
SIZE{-3ac2eea4:168dcc8e145:-8000=org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder@1429dfec}
 ID -3ac2eea4:168dcc8e145:-8000{code}

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamas Nemeth
>Priority: Major
>  Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a Hive Avro table where the Avro schema is stored on S3 next to the 
> Avro files. 
> In the table definition the avro.schema.url always points to the latest 
> partition's _schema.avsc file, which is always the latest schema. (Avro schemas 
> are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition where the 
> avro.schema.url property is also set to the _schema.avsc which was used when 
> it was added, and of course I always update the table avro.schema.url property 
> to the latest one.
> Querying this table works fine until the schema evolves in a way that a new 
> optional property is added in the middle. 
> When this happens, then after the Spark SQL query the columns in the old 
> partition get mixed up and show the wrong data.
> If I query the table with Hive then everything is perfectly fine and it gives 
> me back the correct columns both for the partitions which were created with the 
> old schema and for the ones which were created with the evolved schema.
>  
> Here is how I could reproduce with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in sql test suite.
>  # I have created two partition folders:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data which was created with the schema before 
> evolving, and the second one had the evolved one. (The evolved schema is the 
> same as in your test case, except that I moved the extra_field column from the 
> second position to the last, and I generated two lines of Avro data with the 
> evolved schema.)
>  # I have created a Hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here, as you can see, the table schema url points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query from here then everything is fine and no 
> column switching is happening.
> 4. Then I changed the first partition's avro.schema.url to point to the 
> schema which is under the partition folder (non-evolved one -> 
> s3://somelocation/doctors/
> 

[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2019-02-11 Thread Tamas Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764933#comment-16764933
 ] 

Tamas Nemeth commented on SPARK-26836:
--

On Hive directly, the query returns the correct result set regardless of whether 
_avro.schema.url_ is set for the partitions or not:
{code:java}
hive> select * from spark_test;
OK
6 fishfingers and custard Colin Baker 2019-02-05
3 fishfingers and custard Jon Pertwee 2019-02-05
4 fishfingers and custard Tom Baker 2019-02-05
5 fishfingers and custard Peter Davison 2019-02-05
11 fishfingers and custard Matt Smith 2019-02-05
1 fishfingers and custard William Hartnell 2019-02-05
7 fishfingers and custard Sylvester McCoy 2019-02-05
8 fishfingers and custard Paul McGann 2019-02-05
2 fishfingers and custard Patrick Troughton 2019-02-05
9 fishfingers and custard Christopher Eccleston 2019-02-05
10 fishfingers and custard David Tennant 2019-02-05
21 fishfinger Jim Baker 2019-02-06
24 fishfinger Bean Pertwee 2019-02-06
Time taken: 4.291 seconds, Fetched: 13 row(s){code}
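
For contrast, the Spark SQL side of the same check was presumably along these lines (the table name is taken from the Hive session above, otherwise illustrative); on affected versions the rows from the old partition come back with shifted column values:
{code:java}
// Same table read through Spark SQL instead of Hive.
spark.sql("SELECT * FROM spark_test").show(false)
{code}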

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamas Nemeth
>Priority: Major
>  Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a Hive Avro table where the Avro schema is stored on S3 next to the 
> Avro files. 
> In the table definition the avro.schema.url always points to the latest 
> partition's _schema.avsc file, which is always the latest schema. (Avro schemas 
> are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition where the 
> avro.schema.url property is also set to the _schema.avsc which was used when 
> it was added, and of course I always update the table avro.schema.url property 
> to the latest one.
> Querying this table works fine until the schema evolves in a way that a new 
> optional property is added in the middle. 
> When this happens, then after the Spark SQL query the columns in the old 
> partition get mixed up and show the wrong data.
> If I query the table with Hive then everything is perfectly fine and it gives 
> me back the correct columns both for the partitions which were created with the 
> old schema and for the ones which were created with the evolved schema.
>  
> Here is how I could reproduce with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in sql test suite.
>  # I have created two partition folders:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data which was created with the schema before 
> evolving, and the second one had the evolved one. (The evolved schema is the 
> same as in your test case, except that I moved the extra_field column from the 
> second position to the last, and I generated two lines of Avro data with the 
> evolved schema.)
>  # I have created a Hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here, as you can see, the table schema url points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query from here then everything is fine and no 
> column switching is happening.
> 4. Then I changed the first partition's avro.schema.url to point to the 
> schema which is under the partition folder (non-evolved one -> 
> 

[jira] [Commented] (SPARK-26856) Python support for "from_avro" and "to_avro" APIs

2019-02-11 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764819#comment-16764819
 ] 

Gabor Somogyi commented on SPARK-26856:
---

Until now nothing fancy: I just added a direct call to the API and a reference to the 
data source guide:
{code:java}
Avro is built-in but external data source module since Spark 2.4. Please 
deploy the application
as per the deployment section of "Apache Avro Data Source Guide".
{code}
similar to 
[DataSource.scala|https://github.com/apache/spark/blob/af4c59c0fb18c33e171258a28eefd5fbcf5a8487/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L648]
 but not throwing an exception, just describing things in the API.


> Python support for "from_avro" and "to_avro" APIs
> -
>
> Key: SPARK-26856
> URL: https://issues.apache.org/jira/browse/SPARK-26856
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL, Structured Streaming
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Built-in Avro support was added in 2.4, but the from_avro and to_avro 
> functions are only available in Scala and Java. It would be good to add 
> Python support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26856) Python support for "from_avro" and "to_avro" APIs

2019-02-11 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764839#comment-16764839
 ] 

Gabor Somogyi commented on SPARK-26856:
---

What can be enhanced here is to add a check for whether the JVM class is available or 
not and throw an exception with a similar message.

> Python support for "from_avro" and "to_avro" APIs
> -
>
> Key: SPARK-26856
> URL: https://issues.apache.org/jira/browse/SPARK-26856
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL, Structured Streaming
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Built-in Avro support was added in 2.4, but the from_avro and to_avro 
> functions are only available in Scala and Java. It would be good to add 
> Python support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26760) [Spark Incorrect display in SPARK UI Executor Tab when number of cores is 4 and Active Task display as 5 in Executor Tab of SPARK UI]

2019-02-11 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764803#comment-16764803
 ] 

shahid commented on SPARK-26760:


Yes. Writing to the store too frequently may be a costly operation, so task info 
is written to the store only after a particular time interval. Therefore, for a 
running job, task info may not be updated immediately in the UI. 

 !Screenshot from 2019-02-11 15-09-09.png! 

Regarding how many tasks are actually running for a running job, we need to look at 
the logs or console, as the UI may show slightly more or fewer depending on when the 
store was last updated.



> [Spark Incorrect display in SPARK UI Executor Tab when number of cores is 4 
> and Active Task display as 5 in Executor Tab of SPARK UI]
> -
>
> Key: SPARK-26760
> URL: https://issues.apache.org/jira/browse/SPARK-26760
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: SPARK-26760.png, Screenshot from 2019-02-11 15-09-09.png
>
>
> Steps:
>  # Launch Spark Shell 
>  # bin/spark-shell --master yarn  --conf spark.dynamicAllocation.enabled=true 
> --conf spark.dynamicAllocation.initialExecutors=3 --conf 
> spark.dynamicAllocation.minExecutors=1 --conf 
> spark.dynamicAllocation.executorIdleTimeout=60s --conf 
> spark.dynamicAllocation.maxExecutors=5
>  # Submit a Job sc.parallelize(1 to 1,116000).count()
>  # Check the YARN UI Executor Tab for the RUNNING application
>  # UI display as Number of cores 4 and Active Tasks column shows as 5
> Expected:
> The number of Active Tasks should be the same as the number of cores.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26760) [Spark Incorrect display in SPARK UI Executor Tab when number of cores is 4 and Active Task display as 5 in Executor Tab of SPARK UI]

2019-02-11 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26760:
---
Attachment: Screenshot from 2019-02-11 15-09-09.png

> [Spark Incorrect display in SPARK UI Executor Tab when number of cores is 4 
> and Active Task display as 5 in Executor Tab of SPARK UI]
> -
>
> Key: SPARK-26760
> URL: https://issues.apache.org/jira/browse/SPARK-26760
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: SPARK-26760.png, Screenshot from 2019-02-11 15-09-09.png
>
>
> Steps:
>  # Launch Spark Shell 
>  # bin/spark-shell --master yarn  --conf spark.dynamicAllocation.enabled=true 
> --conf spark.dynamicAllocation.initialExecutors=3 --conf 
> spark.dynamicAllocation.minExecutors=1 --conf 
> spark.dynamicAllocation.executorIdleTimeout=60s --conf 
> spark.dynamicAllocation.maxExecutors=5
>  # Submit a Job sc.parallelize(1 to 1,116000).count()
>  # Check the YARN UI Executor Tab for the RUNNING application
>  # UI display as Number of cores 4 and Active Tasks column shows as 5
> Expected:
> The number of Active Tasks should be the same as the number of cores.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26856) Python support for "from_avro" and "to_avro" APIs

2019-02-11 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764801#comment-16764801
 ] 

Hyukjin Kwon commented on SPARK-26856:
--

Yea, actually I was thinking about that. I guess it's good to have, but the thing is 
Avro is optional, as you said. What kind of approach are you currently 
thinking of? I need to take a look as well, but I was wondering if you have an idea 
already.

> Python support for "from_avro" and "to_avro" APIs
> -
>
> Key: SPARK-26856
> URL: https://issues.apache.org/jira/browse/SPARK-26856
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL, Structured Streaming
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Built-in Avro support was added in 2.4, but the from_avro and to_avro 
> functions are only available in Scala and Java. It would be good to add 
> Python support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26856) Python support for "from_avro" and "to_avro" APIs

2019-02-11 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764790#comment-16764790
 ] 

Gabor Somogyi commented on SPARK-26856:
---

I've created a patch but am hesitant to create a PR because Avro is an external 
package.
As an external package it requires --packages ...jar command-line parameters to 
make the Scala/Java API available.
[~Gengliang.Wang] did you have any discussion related to this area before, or is it 
a green area?
cc [~hyukjin.kwon] what do you think?


> Python support for "from_avro" and "to_avro" APIs
> -
>
> Key: SPARK-26856
> URL: https://issues.apache.org/jira/browse/SPARK-26856
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL, Structured Streaming
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Built-in Avro support was added in 2.4, but the from_avro and to_avro 
> functions are only available in Scala and Java. It would be good to add 
> Python support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26856) Python support for "from_avro" and "to_avro" APIs

2019-02-11 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26856:
-

 Summary: Python support for "from_avro" and "to_avro" APIs
 Key: SPARK-26856
 URL: https://issues.apache.org/jira/browse/SPARK-26856
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL, Structured Streaming
Affects Versions: 2.4.0, 3.0.0
Reporter: Gabor Somogyi


Built-in Avro support was added in 2.4, but the from_avro and to_avro 
functions are only available in Scala and Java. It would be good to add 
Python support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26760) [Spark Incorrect display in SPARK UI Executor Tab when number of cores is 4 and Active Task display as 5 in Executor Tab of SPARK UI]

2019-02-11 Thread ABHISHEK KUMAR GUPTA (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764777#comment-16764777
 ] 

ABHISHEK KUMAR GUPTA commented on SPARK-26760:
--

So you mean this is an issue with small, short-running jobs?

Also, my question is: if the Jobs page and the Stages page all show different task 
counts, why is that so?

And most importantly, if tasks are queued, why does Active Tasks show only one more 
than the number of cores? Why not 7 or 8, since many tasks are queued?

What should the Active Tasks column display, according to the code?

 

 

 

> [Spark Incorrect display in SPARK UI Executor Tab when number of cores is 4 
> and Active Task display as 5 in Executor Tab of SPARK UI]
> -
>
> Key: SPARK-26760
> URL: https://issues.apache.org/jira/browse/SPARK-26760
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: SPARK-26760.png
>
>
> Steps:
>  # Launch Spark Shell 
>  # bin/spark-shell --master yarn  --conf spark.dynamicAllocation.enabled=true 
> --conf spark.dynamicAllocation.initialExecutors=3 --conf 
> spark.dynamicAllocation.minExecutors=1 --conf 
> spark.dynamicAllocation.executorIdleTimeout=60s --conf 
> spark.dynamicAllocation.maxExecutors=5
>  # Submit a Job sc.parallelize(1 to 1,116000).count()
>  # Check the YARN UI Executor Tab for the RUNNING application
> UI display as Number of cores 4 and Active Tasks column shows as 5
> Expected:
> The number of Active Tasks should be the same as the number of cores.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)

2019-02-11 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764772#comment-16764772
 ] 

Gabor Somogyi commented on SPARK-26783:
---

[~dongjoon] at the moment I'm waiting on [~sindiri] to report back on what he has done 
in SPARK-23685. If he does not report back in a week or so, I'm going to try the 
repro once again...

> Kafka parameter documentation doesn't match with the reality (upper/lowercase)
> --
>
> Key: SPARK-26783
> URL: https://issues.apache.org/jira/browse/SPARK-26783
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> A good example for this is "failOnDataLoss" which is reported in SPARK-23685. 
> I've just checked and there are several other parameters which suffer from 
> the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-11 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764769#comment-16764769
 ] 

Gabor Somogyi edited comment on SPARK-26845 at 2/11/19 8:50 AM:


[~Gengliang.Wang] Thanks for the confirmation! Hope you're refreshed :) I've 
asked things in mail (yeah, mail because not a bug not a feature).


was (Author: gsomogyi):
[~Gengliang.Wang] Thanks for the confirmation! Hope you're refreshed :) I've 
asked things in mail (yeah, mail because not a bug no a feature).

> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where a test 
> fails which I believe shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-11 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764769#comment-16764769
 ] 

Gabor Somogyi commented on SPARK-26845:
---

[~Gengliang.Wang] Thanks for the confirmation! Hope you're refreshed :) I've 
asked things in mail (yeah, mail because not a bug no a feature).

> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where a test 
> fails which I believe shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-11 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764747#comment-16764747
 ] 

Gengliang Wang commented on SPARK-26845:


[~attilapiros] Thanks for the help!
[~gsomogyi] Sorry for the late reply. I was on vacation. You can see the Avro 
schema by 

{code:java}
SchemaConverters.toAvroType(df.schema).toString(true)
{code}
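
For instance, applied to the DataFrame from the failing test above (a sketch; it assumes the spark-avro module is on the classpath so that {{SchemaConverters}} is importable):
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import spark.implicits._

val df = spark.createDataset(Seq("1", "2", "3")).select('value.cast("string").as("str"))

// Prints the Avro schema Spark itself derives for df's Catalyst schema; comparing it
// with the hand-written avroTypeStr shows where the two disagree.
println(SchemaConverters.toAvroType(df.schema).toString(true))
{code}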


> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where a test 
> fails which I believe shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org