[jira] [Created] (SPARK-26242) Leading slash breaks proxying

2018-11-30 Thread Ryan Lovett (JIRA)
Ryan Lovett created SPARK-26242:
---

 Summary: Leading slash breaks proxying
 Key: SPARK-26242
 URL: https://issues.apache.org/jira/browse/SPARK-26242
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Ryan Lovett


The WebUI prefixes each link path with "/" (e.g. /jobs), which breaks 
navigation when the app is proxied at another URL. In my case, a pyspark user 
creates a SparkContext within a JupyterHub-hosted notebook and attempts to 
proxy it with nbserverproxy off of 
address.of.jupyter.hub/user/proxy/4040/. Since the WebUI sets the URLs of its 
pages to begin with "/", the browser sends the user to 
address.of.jupyter.hub/jobs, address.of.jupyter.hub/stages, etc.

 

Similar: 
https://github.com/mesosphere/spark/commit/ada99f1b3801e81db2e367f219377e93f5d32655 
(see also https://github.com/apache/spark/pull/11369)
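
As context, one workaround available today is the existing spark.ui.proxyBase 
setting, which tells the UI to prefix the links it generates. A minimal sketch, 
assuming a JupyterHub proxy path (the prefix below is illustrative, not taken 
from the report):

{code:python}
from pyspark.sql import SparkSession

# A minimal sketch: spark.ui.proxyBase prefixes the UI's generated links, so
# /jobs is served as /user/<name>/proxy/4040/jobs behind nbserverproxy.
# The prefix is illustrative.
spark = (
    SparkSession.builder
    .appName("proxied-ui")
    .config("spark.ui.proxyBase", "/user/<name>/proxy/4040")
    .getOrCreate()
)
{code}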






[jira] [Resolved] (SPARK-26189) Fix the doc of unionAll in SparkR

2018-11-30 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-26189.
--
  Resolution: Fixed
Assignee: Huaxin Gao
   Fix Version/s: 3.0.0
Target Version/s: 3.0.0

> Fix the doc of unionAll in SparkR
> -
>
> Key: SPARK-26189
> URL: https://issues.apache.org/jira/browse/SPARK-26189
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> We should fix the doc of unionAll in SparkR. See the discussion: 
> https://github.com/apache/spark/pull/23131/files#r236760822






[jira] [Commented] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705626#comment-16705626
 ] 

Apache Spark commented on SPARK-26221:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23192

> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plan that’s writing data out, vs returning data to the 
> driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the 
> query plan we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client 
> fetching result (e.g. using a result iterator). Record the time blocked on 
> client so we can remove it in measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query, e.g. dump execution id in task logs?
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id in driver logs.
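
As a small illustration of the query/job correlation item above (a sketch only, 
using the existing "spark.sql.execution.id" local property rather than anything 
proposed in this ticket):

{code:python}
from pyspark import TaskContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.appName("execution-id-demo").getOrCreate()

def current_execution_id(_):
    # Spark SQL sets this local property while a query runs, and local
    # properties are propagated to tasks, so a task can report which SQL
    # execution it belongs to.
    return TaskContext.get().getLocalProperty("spark.sql.execution.id")

execution_id_udf = udf(current_execution_id, "string")

spark.range(4).select(execution_id_udf(col("id")).alias("sql_execution_id")).show()
{code}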






[jira] [Commented] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705627#comment-16705627
 ] 

Apache Spark commented on SPARK-26221:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23192

> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plan that’s writing data out, vs returning data to the 
> driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the 
> query plan we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client 
> fetching result (e.g. using a result iterator). Record the time blocked on 
> client so we can remove it in measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query, e.g. dump execution id in task logs?
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id in driver logs.






[jira] [Commented] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705629#comment-16705629
 ] 

Apache Spark commented on SPARK-26226:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23193

> Update query tracker to report timeline for phases, rather than duration
> 
>
> Key: SPARK-26226
> URL: https://issues.apache.org/jira/browse/SPARK-26226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> It'd be more useful to report start and end time for each phase, rather than 
> only a single duration. This way we can look at timelines and figure out if 
> there is any unaccounted time.
>  
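
To make the motivation concrete, a toy sketch in plain Python (not Spark's 
actual tracker API): with (start, end) pairs the gap between phases becomes 
visible, while durations alone hide it.

{code:python}
import time

phases = {}

def run_phase(name, fn):
    # Record a (start, end) pair per phase instead of a single duration.
    start = time.time()
    result = fn()
    phases[name] = (start, time.time())
    return result

run_phase("parsing", lambda: time.sleep(0.01))
time.sleep(0.02)  # work that no phase accounts for
run_phase("analysis", lambda: time.sleep(0.01))

gap = phases["analysis"][0] - phases["parsing"][1]
print("unaccounted time between parsing and analysis: %.3fs" % gap)
{code}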






[jira] [Assigned] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26221:


Assignee: Apache Spark  (was: Reynold Xin)

> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plan that’s writing data out, vs returning data to the 
> driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the 
> query plan we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client 
> fetching result (e.g. using a result iterator). Record the time blocked on 
> client so we can remove it in measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query, e.g. dump execution id in task logs?
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id in driver logs.






[jira] [Assigned] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26221:


Assignee: Reynold Xin  (was: Apache Spark)

> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plan that’s writing data out, vs returning data to the 
> driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the 
> query plan we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client 
> fetching result (e.g. using a result iterator). Record the time blocked on 
> client so we can remove it in measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query, e.g. dump execution id in task logs?
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id in driver logs.






[jira] [Resolved] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-30 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-26226.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Update query tracker to report timeline for phases, rather than duration
> 
>
> Key: SPARK-26226
> URL: https://issues.apache.org/jira/browse/SPARK-26226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> It'd be more useful to report start and end time for each phase, rather than 
> only a single duration. This way we can look at timelines and figure out if 
> there is any unaccounted time.
>  






[jira] [Created] (SPARK-26241) Add queryId to IncrementalExecution

2018-11-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26241:
---

 Summary: Add queryId to IncrementalExecution
 Key: SPARK-26241
 URL: https://issues.apache.org/jira/browse/SPARK-26241
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


It'd be useful to have the streaming query uuid in IncrementalExecution, so 
that when we look at the QueryExecution in isolation we can trace it back to 
the query.
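
For reference, a minimal sketch of the existing user-facing side of this (not 
the proposed change itself): the uuid in question is the one a streaming query 
already exposes on the driver.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-query-id-demo").getOrCreate()

stream = spark.readStream.format("rate").load()
query = stream.writeStream.format("console").start()

print("query id:", query.id)     # stable uuid for the query across restarts
print("run id:  ", query.runId)  # uuid for this particular run

query.stop()
{code}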






[jira] [Updated] (SPARK-26241) Add queryId to IncrementalExecution

2018-11-30 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26241:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-26221

> Add queryId to IncrementalExecution
> ---
>
> Key: SPARK-26241
> URL: https://issues.apache.org/jira/browse/SPARK-26241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> It'd be useful to have the streaming query uuid in IncrementalExecution, so 
> that when we look at the QueryExecution in isolation we can trace it back to 
> the query.






[jira] [Commented] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705620#comment-16705620
 ] 

Apache Spark commented on SPARK-26219:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23191

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1) Open the application from the History UI
> 2) Go to the executor tab
> From History UI:
> !Screenshot from 2018-11-29 22-13-34.png! 
> From Live UI:
>  !Screenshot from 2018-11-29 22-13-44.png! 






[jira] [Updated] (SPARK-23647) extend hint syntax to support any expression for Python

2018-11-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23647:
-
Fix Version/s: 3.0.0

> extend hint syntax to support any expression for Python
> ---
>
> Key: SPARK-23647
> URL: https://issues.apache.org/jira/browse/SPARK-23647
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Dylan Guedes
>Assignee: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> Relax checks in
> [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422]
>  






[jira] [Assigned] (SPARK-23647) extend hint syntax to support any expression for Python

2018-11-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23647:


Assignee: Dylan Guedes

> extend hint syntax to support any expression for Python
> ---
>
> Key: SPARK-23647
> URL: https://issues.apache.org/jira/browse/SPARK-23647
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Dylan Guedes
>Assignee: Dylan Guedes
>Priority: Major
>
> Relax checks in
> [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422]
>  






[jira] [Resolved] (SPARK-23647) extend hint syntax to support any expression for Python

2018-11-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23647.
--
Resolution: Fixed

Issue resolved by pull request 20788
[https://github.com/apache/spark/pull/20788]

> extend hint syntax to support any expression for Python
> ---
>
> Key: SPARK-23647
> URL: https://issues.apache.org/jira/browse/SPARK-23647
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Dylan Guedes
>Assignee: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> Relax checks in
> [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422]
>  






[jira] [Resolved] (SPARK-21030) extend hint syntax to support any expression for Python and R

2018-11-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21030.
--
Resolution: Done

> extend hint syntax to support any expression for Python and R
> -
>
> Key: SPARK-21030
> URL: https://issues.apache.org/jira/browse/SPARK-21030
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Priority: Major
>
> See SPARK-20854
> we need to relax checks in 
> https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422
> and
> https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746






[jira] [Resolved] (SPARK-25876) Simplify configuration types in k8s backend

2018-11-30 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-25876.

   Resolution: Fixed
Fix Version/s: 3.0.0

> Simplify configuration types in k8s backend
> ---
>
> Key: SPARK-25876
> URL: https://issues.apache.org/jira/browse/SPARK-25876
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a child of SPARK-25874 to deal with the current issues with the 
> different configuration objects used in the k8s backend. Please refer to the 
> parent for further discussion of what this means.






[jira] [Resolved] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26219.

   Resolution: Fixed
 Assignee: shahid
Fix Version/s: 3.0.0

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1) Open the application from the History UI
> 2) Go to the executor tab
> From History UI:
> !Screenshot from 2018-11-29 22-13-34.png! 
> From Live UI:
>  !Screenshot from 2018-11-29 22-13-44.png! 






[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources

2018-11-30 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705373#comment-16705373
 ] 

Thincrs commented on SPARK-26225:
-

A user of thincrs has selected this issue. Deadline: Fri, Dec 7, 2018 10:42 PM

> Scan: track decoding time for row-based data sources
> 
>
> Key: SPARK-26225
> URL: https://issues.apache.org/jira/browse/SPARK-26225
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>
> Scan node should report decoding time for each record, if it is not too much 
> overhead.
>  






[jira] [Created] (SPARK-26240) [pyspark] Updating illegal column names with withColumnRenamed does not change the schema, causing pyspark.sql.utils.AnalysisException

2018-11-30 Thread Ying Wang (JIRA)
Ying Wang created SPARK-26240:
-

 Summary: [pyspark] Updating illegal column names with 
withColumnRenamed does not change the schema, causing 
pyspark.sql.utils.AnalysisException
 Key: SPARK-26240
 URL: https://issues.apache.org/jira/browse/SPARK-26240
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
 Environment: Ubuntu 16.04 LTS (x86_64/deb)

 
Reporter: Ying Wang


I am unfamiliar with the internals of Spark, but I tried to ingest a Parquet 
file with illegal column headers. After calling df = 
df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME) and then df.show(), 
pyspark errored out, and the failing attribute in the error was the old column 
name.

Steps to reproduce:

- Create a Parquet file from Pandas using this dataframe schema:

```python

In [10]: df.info()

Int64Index: 1000 entries, 0 to 999
Data columns (total 16 columns):
Record_ID 1000 non-null int64
registration_dttm 1000 non-null object
id 1000 non-null int64
first_name 984 non-null object
last_name 1000 non-null object
email 984 non-null object
gender 933 non-null object
ip_address 1000 non-null object
cc 709 non-null float64
country 1000 non-null object
birthdate 803 non-null object
salary 932 non-null float64
title 803 non-null object
comments 179 non-null object
Unnamed: 14 10 non-null object
Unnamed: 15 9 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 132.8+ KB

```
- Open a pyspark shell with `pyspark` and read in the Parquet file with 
`spark.read.format('parquet').load('/path/to/file.parquet')`.

Call `spark_df.show()` and note the error with column 'Unnamed: 14'.

Rename column, replacing illegal space character with underscore character: 
`spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`

Call `spark_df.show()` again, and note that the error still shows attribute 
'Unnamed: 14' in the error message:

```python

>>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
>>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
>>> newdf.show()
Traceback (most recent call last):
 File 
"/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py",
 line 63, in deco
 return f(*a, **kw)
 File 
"/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
: org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" contains 
invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

...

```

I would have thought there would be a way to read in Parquet files such that 
illegal column names can be changed after the Spark dataframe is generated, so 
this looks like unintended behavior. Please let me know if I am wrong.






[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-30 Thread Damien Doucet-Girard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705307#comment-16705307
 ] 

Damien Doucet-Girard commented on SPARK-26188:
--

[~ste...@apache.org] thanks for the tip, I brought it up with the team and 
we're going to check out a solution for our existing jobs
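
For readers hitting the same behavior, a sketch of one mitigation using the 
long-standing partition type inference switch (this is not necessarily the tip 
referenced above, and the S3 path is illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-inference-off").getOrCreate()

# With inference disabled, partition values such as "00" or "4d" are kept as
# strings instead of being read back as 0 or 4.0.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

df = spark.read.parquet("s3a://bucket/prefix")  # partitioned by suffix=00..ff
df.select("suffix").distinct().show()
{code}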

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 2.4.1, 3.0.0
>
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> 

[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode

2018-11-30 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705282#comment-16705282
 ] 

Marcelo Vanzin commented on SPARK-23078:


Is this actually needed now that k8s supports client mode and also arbitrary 
commands to run in the image's entry point?

You could set up your STS pod with a YAML file containing everything you need, 
running the thrift server from there directly in client mode.

> Allow Submitting Spark Thrift Server in Cluster Mode
> 
>
> Key: SPARK-23078
> URL: https://issues.apache.org/jira/browse/SPARK-23078
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>
> Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running 
> in cluster mode, because at the time it was not able to do so successfully. I 
> have confirmed that the Spark Thrift Server can run in cluster mode on 
> Kubernetes, by commenting out 
> [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.]
>  I have not had a chance to test against YARN. Since Kubernetes does not have 
> client mode, this change is necessary to run the Spark Thrift Server in 
> Kubernetes.
> [~foxish] [~coderanger]






[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-11-30 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705277#comment-16705277
 ] 

Marcelo Vanzin commented on SPARK-26239:


That can work but it doesn't address the 3rd bullet.

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.






[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-11-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705269#comment-16705269
 ] 

Dongjoon Hyun commented on SPARK-25145:
---

I used your `zerocopy` configuration as follows in Apache Spark 2.4.0 and 
2.3.2 with Python 3, but it does not seem to be reproducible. Could you reopen 
this with a reproducible example next time?

*Spark 2.4.0*
{code}
~/A/s/spark-2.4.0-bin-hadoop2.7$ bin/pyspark --conf 
spark.hadoop.hive.exec.orc.zerocopy=true
Python 3.7.1 (default, Nov  6 2018, 18:46:03)
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
  /_/

Using Python version 3.7.1 (default, Nov  6 2018 18:46:03)
SparkSession available as 'spark'.
>>> import numpy as np
>>> import pandas as pd
>>> spark.conf.set('spark.sql.orc.impl', 'native')
>>> spark.conf.set('spark.sql.orc.filterPushdown', True)
>>> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
>>> sdf = spark.createDataFrame(df)
>>> sdf.write.saveAsTable(format='orc', mode='overwrite', name='t1', 
>>> compression='zlib')
>>> spark.sql('SELECT * FROM t1 WHERE a > 5').show()
+---+---+
|  a|  b|
+---+---+
|  8|4.0|
|  9|4.5|
|  6|3.0|
|  7|3.5|
+---+---+
{code}

*Spark 2.3.2*
{code}
~/A/s/spark-2.3.2-bin-hadoop2.7$ bin/pyspark --conf 
spark.hadoop.hive.exec.orc.zerocopy=true
Python 3.7.1 (default, Nov  6 2018, 18:46:03)
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
  /_/

Using Python version 3.7.1 (default, Nov  6 2018 18:46:03)
SparkSession available as 'spark'.
>>> import numpy as np
>>> import pandas as pd
>>> spark.conf.set('spark.sql.orc.impl', 'native')
>>> spark.conf.set('spark.sql.orc.filterPushdown', True)
>>> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
>>> sdf = spark.createDataFrame(df)
>>> sdf.write.saveAsTable(format='orc', mode='overwrite', name='t1', 
>>> compression='zlib')
>>> spark.sql('SELECT * FROM t1 WHERE a > 5').show()
+---+---+
|  a|  b|
+---+---+
|  8|4.0|
|  9|4.5|
|  6|3.0|
|  7|3.5|
+---+---+
{code}

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
>  # 
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is 

[jira] [Commented] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark client

2018-11-30 Thread Qi Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705275#comment-16705275
 ] 

Qi Shao commented on SPARK-26237:
-

Figured out that the pyspark configuration needs to be set before launching 
pyspark. Creating a new SparkSession after logging into the console won't work.
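
A sketch of what "before running pyspark" means in practice for client mode 
from a notebook (the interpreter names, master URL, and image are 
illustrative): the Python selection has to be in place before the SparkSession 
is created; changing it on an existing session has no effect.

{code:python}
import os

# Must be set before the session (and its executors) exist; the values here
# are illustrative.
os.environ["PYSPARK_PYTHON"] = "python3"          # Python used by executors
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"   # Python used by the driver

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc")  # illustrative master URL
    .config("spark.kubernetes.container.image", "example/pyspark:2.4")  # illustrative
    .config("spark.kubernetes.pyspark.pythonVersion", "3")
    .getOrCreate()
)
{code}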

> [K8s] Unable to switch python version in executor when running pyspark client
> -
>
> Key: SPARK-26237
> URL: https://issues.apache.org/jira/browse/SPARK-26237
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
> Google Kubernetes Engines
>Reporter: Qi Shao
>Priority: Major
>
> Error message:
> {code:java}
> Exception: Python in worker has different version 2.7 than that in driver 
> 3.6, PySpark cannot run with different minor versions.Please check 
> environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly 
> set.{code}
> Neither
> {code:java}
> spark.kubernetes.pyspark.pythonVersion{code}
> nor 
> {code:java}
> spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code}
> works.
> This happens when I'm running a Notebook with pyspark+python3 and also in a 
> pod which has pyspark+python3.
> For notebook, the code is:
> {code:java}
> from __future__ import print_function
> import sys
> from random import random
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder\
>  .master("k8s://https://kubernetes.default.svc;)\
>  .appName("PySpark Testout")\
>  .config("spark.submit.deployMode","client")\
>  .config("spark.executor.instances", "2")\
>  .config("spark.kubernetes.container.image","azureq/pantheon:pyspark-2.4")\
>  .config("spark.driver.host","jupyter-notebook-headless")\
>  .config("spark.driver.pod.name","jupyter-notebook-headless")\
>  .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark")\
>  .config("spark.kubernetes.pyspark.pythonVersion","3")\
>  .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION","3")\
>  .getOrCreate()
> n = 10
> def f(_):
> x = random() * 2 - 1
> y = random() * 2 - 1
> return 1 if x ** 2 + y ** 2 <= 1 else 0
> count = spark.sparkContext.parallelize(range(1, n + 1), 
> partitions).map(f).reduce(add)
> print("Pi is roughly %f" % (4.0 * count / n))
> {code}
>  For pyspark shell, the command is:
>  
> {code:java}
> $SPARK_HOME/bin/pyspark --master \ 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
>  --deploy-mode client \
>  --conf spark.executor.instances=5 \
>  --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.driver.host=spark-client-mode-headless \
>  --conf spark.kubernetes.pyspark.pythonVersion=3 \
>  --conf spark.driver.pod.name=spark-client-mode-headless{code}
>  






[jira] [Resolved] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark client

2018-11-30 Thread Qi Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Shao resolved SPARK-26237.
-
Resolution: Invalid

> [K8s] Unable to switch python version in executor when running pyspark client
> -
>
> Key: SPARK-26237
> URL: https://issues.apache.org/jira/browse/SPARK-26237
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
> Google Kubernetes Engines
>Reporter: Qi Shao
>Priority: Major
>
> Error message:
> {code:java}
> Exception: Python in worker has different version 2.7 than that in driver 
> 3.6, PySpark cannot run with different minor versions.Please check 
> environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly 
> set.{code}
> Neither
> {code:java}
> spark.kubernetes.pyspark.pythonVersion{code}
> nor 
> {code:java}
> spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code}
> works.
> This happens when I'm running a Notebook with pyspark+python3 and also in a 
> pod which has pyspark+python3.
> For notebook, the code is:
> {code:java}
> from __future__ import print_function
> import sys
> from random import random
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder\
>  .master("k8s://https://kubernetes.default.svc;)\
>  .appName("PySpark Testout")\
>  .config("spark.submit.deployMode","client")\
>  .config("spark.executor.instances", "2")\
>  .config("spark.kubernetes.container.image","azureq/pantheon:pyspark-2.4")\
>  .config("spark.driver.host","jupyter-notebook-headless")\
>  .config("spark.driver.pod.name","jupyter-notebook-headless")\
>  .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark")\
>  .config("spark.kubernetes.pyspark.pythonVersion","3")\
>  .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION","3")\
>  .getOrCreate()
> n = 10
> def f(_):
> x = random() * 2 - 1
> y = random() * 2 - 1
> return 1 if x ** 2 + y ** 2 <= 1 else 0
> count = spark.sparkContext.parallelize(range(1, n + 1), 
> partitions).map(f).reduce(add)
> print("Pi is roughly %f" % (4.0 * count / n))
> {code}
>  For pyspark shell, the command is:
>  
> {code:java}
> $SPARK_HOME/bin/pyspark --master \ 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
>  --deploy-mode client \
>  --conf spark.executor.instances=5 \
>  --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.driver.host=spark-client-mode-headless \
>  --conf spark.kubernetes.pyspark.pythonVersion=3 \
>  --conf spark.driver.pod.name=spark-client-mode-headless{code}
>  






[jira] [Comment Edited] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-11-30 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705273#comment-16705273
 ] 

Matt Cheah edited comment on SPARK-26239 at 11/30/18 8:59 PM:
--

Could we add a simple version that just points to file paths for the executor 
and driver to load, with the secret contents being inside? The user can decide 
how those files are mounted into the containers.


was (Author: mcheah):
Would a simple addition just to point to file paths for the executor and driver 
to load, with the secret contents being inside? The user can decide how those 
files are mounted into the containers.

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.






[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-11-30 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705273#comment-16705273
 ] 

Matt Cheah commented on SPARK-26239:


Would a simple addition just to point to file paths for the executor and driver 
to load, with the secret contents being inside? The user can decide how those 
files are mounted into the containers.
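
A sketch of the file-based shape being discussed; spark.authenticate is an 
existing setting, while the *.file property names below are hypothetical 
placeholders for whatever configuration this ticket ends up exposing, and the 
mount path is likewise illustrative:

{code:python}
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.authenticate", "true")
    # Hypothetical placeholders: driver and executors read the shared secret
    # from files the user mounts into the pods (e.g. from a k8s Secret or Vault).
    .set("spark.authenticate.secret.driver.file", "/etc/spark-secrets/auth")
    .set("spark.authenticate.secret.executor.file", "/etc/spark-secrets/auth")
)
{code}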

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.






[jira] [Resolved] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-11-30 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25145.
---
Resolution: Cannot Reproduce

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
>  # 
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter console 5.1.0
> Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
> In [1]: %run -i create_bug.py
> Welcome to
>  __
> / __/__ ___ _/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /__ / .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT
> /_/
> Using Python version 3.6.3 (default, May 4 2018 04:22:28)
> SparkSession available as 'spark'.
> Created spark dataframe:
> +---+---+
> | a| b|
> +---+---+
> | 0|0.0|
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 4|2.0|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> +---+---+
> Read entire table with "filterPushdown"=True
> +---+---+
> | a| b|
> +---+---+
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> | 4|2.0|
> | 0|0.0|
> 

[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-11-30 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705236#comment-16705236
 ] 

Thincrs commented on SPARK-26239:
-

A user of thincrs has selected this issue. Deadline: Fri, Dec 7, 2018 8:18 PM

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
> similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these 
> auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the 
> secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver 
> is not created by the k8s backend in Spark.






[jira] [Resolved] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-30 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-26200.
--
Resolution: Duplicate

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified columns.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-30 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705224#comment-16705224
 ] 

Bryan Cutler commented on SPARK-26200:
--

Thanks [~davidlyness], I'll mark this as a duplicate since the root cause is
the same. I've been tracking this and similar issues with PySpark Rows, but
since addressing them will cause a behavior change, we can only make the fix in
Spark 3.0. If you didn't see the workaround in the other JIRA: when you use a
schema and care about field ordering, I would recommend creating your Rows this
way, or using a list as you pointed out:

{code}
In [10]: MyRow = Row("field2", "field1")

In [11]: data = [
...: MyRow(Row(sub_field='world'), "hello")
...: ]

In [12]: df = spark.createDataFrame(data, schema=schema)

In [13]: df.show()
+---+--+
| field2|field1|
+---+--+
|[world]| hello|
+---+--+
{code}
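
For completeness, a self-contained sketch of that workaround (the schema is not
shown above, so the one used here is assumed): declaring the field names on the
Row class fixes the ordering, instead of relying on keyword arguments, which Row
sorts alphabetically.

{code}
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Assumed schema matching the snippet above: a struct column plus a string column
schema = StructType([
    StructField("field2", StructType([StructField("sub_field", StringType())])),
    StructField("field1", StringType()),
])

# Declare the field order explicitly; positional values then follow that order
MyRow = Row("field2", "field1")
data = [MyRow(Row(sub_field="world"), "hello")]

spark.createDataFrame(data, schema=schema).show()
# +-------+------+
# | field2|field1|
# +-------+------+
# |[world]| hello|
# +-------+------+
{code}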

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified columns.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids 

[jira] [Commented] (SPARK-21453) Cached Kafka consumer may be closed too early

2018-11-30 Thread Julian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705222#comment-16705222
 ] 

Julian commented on SPARK-21453:


I also get the following error around the same time (shortly after the one
above). Maybe it is related?

18/11/30 18:04:17 ERROR Executor: Exception in task 2.0 in stage 115.0 (TID 
2377)
java.lang.NullPointerException
    at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:436)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:395)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:238)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:205)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:137)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:307)
    at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1138)
    at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1103)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
    at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
    at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
    at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at 
org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52)
    at 
org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
    at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)

> Cached Kafka consumer may be closed too early
> -
>
> Key: SPARK-21453
> URL: https://issues.apache.org/jira/browse/SPARK-21453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 and kafka 0.10.2.0
>Reporter: Pablo Panero
>Priority: Minor
>
> On a streaming job using the built-in Kafka source and sink (over SSL), I am
> getting the following exception:
> Config of the source:
> {code:java}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", config.bootstrapServers)
>   .option("failOnDataLoss", value = false)
>   .option("kafka.connections.max.idle.ms", 360)
>   //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
>   .option("kafka.security.protocol", "SASL_SSL")
>   .option("kafka.sasl.mechanism", "GSSAPI")
>   .option("kafka.sasl.kerberos.service.name", "kafka")
>   .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
>   .option("kafka.ssl.truststore.password", "changeit")
>   .option("subscribe", config.topicConfigList.keys.mkString(","))
>   .load()
> {code}
> Config of the sink:
> {code:java}
> .writeStream
> .option("checkpointLocation", 
> s"${config.checkpointDir}/${topicConfig._1}/")
> .format("kafka")
> .option("kafka.bootstrap.servers", config.bootstrapServers)
>  

[jira] [Commented] (SPARK-21453) Cached Kafka consumer may be closed too early

2018-11-30 Thread Julian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705215#comment-16705215
 ] 

Julian commented on SPARK-21453:


I see this issue is still open (although no one has posted here for a while).
I get exactly the same error periodically on Spark 2.2 in our Spark Structured
Streaming job running on our HDP 2.6 cluster with Kafka (HDF). It is a pain, as
it causes tasks to fail; with a load currently of 1.5 GB per 5 minutes (to be
scaled up 5+ times in the coming weeks/months), it increases the overall time
for a micro-batch to complete. Any ideas would be good (I'll check the link
above also). Note, we are not using the experimental Continuous Mode in this
version...

18/11/30 18:04:02 WARN SslTransportLayer: Failed to send SSL Close message
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at 
org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:212)
    at 
org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:170)
    at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:703)
    at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:61)
    at org.apache.kafka.common.network.Selector.doClose(Selector.java:717)
    at org.apache.kafka.common.network.Selector.close(Selector.java:708)
    at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:500)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:238)
    at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214)
    at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1156)
    at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1103)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
    at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
    at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
    at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
    at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
    …….

 

> Cached Kafka consumer may be closed too early
> -
>
> Key: SPARK-21453
> URL: https://issues.apache.org/jira/browse/SPARK-21453
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 and kafka 0.10.2.0
>Reporter: Pablo Panero
>Priority: Minor
>
> On a streaming job using the built-in Kafka source and sink (over SSL), I am
> getting the following exception:
> Config of the source:
> {code:java}
> val df = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", config.bootstrapServers)
>   .option("failOnDataLoss", value = false)
>   .option("kafka.connections.max.idle.ms", 360)
>   //SSL: this only applies to communication between Spark and Kafka 
> brokers; you are still responsible for separately securing Spark inter-node 
> communication.
>   .option("kafka.security.protocol", "SASL_SSL")
>   .option("kafka.sasl.mechanism", "GSSAPI")
>   .option("kafka.sasl.kerberos.service.name", "kafka")
>   .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts")
>   .option("kafka.ssl.truststore.password", "changeit")
>   .option("subscribe", config.topicConfigList.keys.mkString(","))
>   .load()
> {code}
> Config of the sink:
> {code:java}
> .writeStream
> 

[jira] [Created] (SPARK-26238) Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S

2018-11-30 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-26238:
--

 Summary: Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S
 Key: SPARK-26238
 URL: https://issues.apache.org/jira/browse/SPARK-26238
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0, 3.0.0
Reporter: Ilan Filonenko


Set SPARK_CONF_DIR to point to ${SPARK_HOME}/conf, as opposed to /opt/spark/conf,
which is hard-coded into the Constants. This is the expected behavior according to
the Spark docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26239) Add configurable auth secret source in k8s backend

2018-11-30 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26239:
--

 Summary: Add configurable auth secret source in k8s backend
 Key: SPARK-26239
 URL: https://issues.apache.org/jira/browse/SPARK-26239
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


This is a follow up to SPARK-26194, which aims to add auto-generated secrets 
similar to the YARN backend.

There's a desire to support different ways to generate and propagate these auth 
secrets (e.g. using things like Vault). Need to investigate:

- exposing configuration to support that
- changing SecurityManager so that it can delegate some of the secret-handling 
logic to custom implementations
- figuring out whether this can also be used in client-mode, where the driver 
is not created by the k8s backend in Spark.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-30 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-26201.
---
   Resolution: Fixed
 Assignee: Sanket Chintapalli
Fix Version/s: 3.0.0
   2.4.1
   2.3.3

> python broadcast.value on driver fails with disk encryption enabled
> ---
>
> Key: SPARK-26201
> URL: https://issues.apache.org/jira/browse/SPARK-26201
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Thomas Graves
>Assignee: Sanket Chintapalli
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> I was trying Python with RPC and disk encryption enabled; when I tried a
> Python broadcast variable and just read the value back on the driver side, the
> job failed with:
>  
> Traceback (most recent call last):
>   File "broadcast.py", line 37, in <module>
>     words_new.value
>   File "/pyspark.zip/pyspark/broadcast.py", line 137, in value
>   File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path
>   File "pyspark.zip/pyspark/broadcast.py", line 128, in load
> EOFError: Ran out of input
> To reproduce use configs: --conf spark.network.crypto.enabled=true --conf 
> spark.io.encryption.enabled=true
>  
> Code:
> words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
>  words_new.value
>  print(words_new.value)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark client

2018-11-30 Thread Qi Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Shao updated SPARK-26237:

Summary: [K8s] Unable to switch python version in executor when running 
pyspark client  (was: [K8s] Unable to switch python version in executor when 
running pyspark shell.)

> [K8s] Unable to switch python version in executor when running pyspark client
> -
>
> Key: SPARK-26237
> URL: https://issues.apache.org/jira/browse/SPARK-26237
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
> Google Kubernetes Engines
>Reporter: Qi Shao
>Priority: Major
>
> Error message:
> {code:java}
> Exception: Python in worker has different version 2.7 than that in driver 
> 3.6, PySpark cannot run with different minor versions.Please check 
> environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly 
> set.{code}
> Neither
> {code:java}
> spark.kubernetes.pyspark.pythonVersion{code}
> nor 
> {code:java}
> spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code}
> works.
> This happens when I'm running a Notebook with pyspark+python3 and also in a 
> pod which has pyspark+python3.
> For notebook, the code is:
> {code:java}
> from __future__ import print_function
> import sys
> from random import random
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder\
>  .master("k8s://https://kubernetes.default.svc")\
>  .appName("PySpark Testout")\
>  .config("spark.submit.deployMode","client")\
>  .config("spark.executor.instances", "2")\
>  .config("spark.kubernetes.container.image","azureq/pantheon:pyspark-2.4")\
>  .config("spark.driver.host","jupyter-notebook-headless")\
>  .config("spark.driver.pod.name","jupyter-notebook-headless")\
>  .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark")\
>  .config("spark.kubernetes.pyspark.pythonVersion","3")\
>  .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION","3")\
>  .getOrCreate()
> n = 10
> partitions = 2  # assumed value; not defined in the original snippet
> def f(_):
>     x = random() * 2 - 1
>     y = random() * 2 - 1
>     return 1 if x ** 2 + y ** 2 <= 1 else 0
> count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
> print("Pi is roughly %f" % (4.0 * count / n))
> {code}
>  For pyspark shell, the command is:
>  
> {code:java}
> $SPARK_HOME/bin/pyspark --master \ 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
>  --deploy-mode client \
>  --conf spark.executor.instances=5 \
>  --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \
>  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>  --conf spark.driver.host=spark-client-mode-headless \
>  --conf spark.kubernetes.pyspark.pythonVersion=3 \
>  --conf spark.driver.pod.name=spark-client-mode-headless{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark shell.

2018-11-30 Thread Qi Shao (JIRA)
Qi Shao created SPARK-26237:
---

 Summary: [K8s] Unable to switch python version in executor when 
running pyspark shell.
 Key: SPARK-26237
 URL: https://issues.apache.org/jira/browse/SPARK-26237
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
 Environment: Spark 2.4.0

Google Kubernetes Engines
Reporter: Qi Shao


Error message:
{code:java}
Exception: Python in worker has different version 2.7 than that in driver 3.6, 
PySpark cannot run with different minor versions.Please check environment 
variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.{code}
Neither
{code:java}
spark.kubernetes.pyspark.pythonVersion{code}
nor 
{code:java}
spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code}
works.

This happens when I'm running a Notebook with pyspark+python3 and also in a pod 
which has pyspark+python3.

For notebook, the code is:
{code:java}
from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession
spark = SparkSession.builder\
 .master("k8s://https://kubernetes.default.svc")\
 .appName("PySpark Testout")\
 .config("spark.submit.deployMode","client")\
 .config("spark.executor.instances", "2")\
 .config("spark.kubernetes.container.image","azureq/pantheon:pyspark-2.4")\
 .config("spark.driver.host","jupyter-notebook-headless")\
 .config("spark.driver.pod.name","jupyter-notebook-headless")\
 .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark")\
 .config("spark.kubernetes.pyspark.pythonVersion","3")\
 .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION","3")\
 .getOrCreate()
n = 10
partitions = 2  # assumed value; not defined in the original snippet
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
{code}
 For pyspark shell, the command is:

 
{code:java}
$SPARK_HOME/bin/pyspark --master \ 
k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
 --deploy-mode client \
 --conf spark.executor.instances=5 \
 --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
 --conf spark.driver.host=spark-client-mode-headless \
 --conf spark.kubernetes.pyspark.pythonVersion=3 \
 --conf spark.driver.pod.name=spark-client-mode-headless{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705045#comment-16705045
 ] 

Marcelo Vanzin commented on SPARK-26043:


It would be better if you could show what you mean with an example. In general,
it's good to avoid having this kind of logic in executors; if it is needed, have
the driver provide the needed info (e.g. by broadcasting the config, which is
done internally by Spark).

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-11-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705005#comment-16705005
 ] 

Dongjoon Hyun commented on SPARK-25145:
---

Thank you for enriching the information. I'll try to check with those options,
[~bjensen].

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
>  # 
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter console 5.1.0
> Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
> In [1]: %run -i create_bug.py
> Welcome to
>  __
> / __/__ ___ _/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /__ / .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT
> /_/
> Using Python version 3.6.3 (default, May 4 2018 04:22:28)
> SparkSession available as 'spark'.
> Created spark dataframe:
> +---+---+
> | a| b|
> +---+---+
> | 0|0.0|
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 4|2.0|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> +---+---+
> Read entire table with "filterPushdown"=True
> +---+---+
> | a| b|
> +---+---+
> | 1|0.5|
> | 

[jira] [Comment Edited] (SPARK-26214) Add "broadcast" method to DataFrame

2018-11-30 Thread Thomas Decaux (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704978#comment-16704978
 ] 

Thomas Decaux edited comment on SPARK-26214 at 11/30/18 4:24 PM:
-

 

Well, not so different :
{code:java}
public void broadcast() {
this.sqlContext().sparkContext().broadcast(this);
}
{code}
 

{{VERSUS}}
{code:java}
public void registerTempTable(String tableName)
{ this.sqlContext().registerDataFrameAsTable(this, tableName); }
{code}
 

 


was (Author: ebuildy):
 

Well, not so different :
{code:java}
public void broadcast() {
this.sqlContext().sparkContext().broadcast(this);
}
{code}
 

{{VERSUS}}
public void registerTempTable(String tableName) { 
this.sqlContext().registerDataFrameAsTable(this, tableName);
}
 

> Add "broadcast" method to DataFrame
> ---
>
> Key: SPARK-26214
> URL: https://issues.apache.org/jira/browse/SPARK-26214
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas Decaux
>Priority: Trivial
>  Labels: broadcast, dataframe
>
> As discussed at 
> [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,]
>  it's possible to force a broadcast of a DataFrame, even if its total size is
> greater than {{spark.sql.autoBroadcastJoinThreshold}}.
> But this is not trivial for a beginner, because there is no "broadcast" method
> (I know, I am lazy ...).
> We could add this method, with a WARN if size is greater than the threshold.
> (if it's an easy one, I could do it?)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26214) Add "broadcast" method to DataFrame

2018-11-30 Thread Thomas Decaux (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704978#comment-16704978
 ] 

Thomas Decaux commented on SPARK-26214:
---

 

Well, not so different :
{code:java}
public void broadcast() {
this.sqlContext().sparkContext().broadcast(this);
}
{code}
 

{{VERSUS}}
public void registerTempTable(String tableName) { 
this.sqlContext().registerDataFrameAsTable(this, tableName);
}
 

> Add "broadcast" method to DataFrame
> ---
>
> Key: SPARK-26214
> URL: https://issues.apache.org/jira/browse/SPARK-26214
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas Decaux
>Priority: Trivial
>  Labels: broadcast, dataframe
>
> As discussed at 
> [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,]
>  it's possible to force a broadcast of a DataFrame, even if its total size is
> greater than {{spark.sql.autoBroadcastJoinThreshold}}.
> But this is not trivial for a beginner, because there is no "broadcast" method
> (I know, I am lazy ...).
> We could add this method, with a WARN if size is greater than the threshold.
> (if it's an easy one, I could do it?)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26214) Add "broadcast" method to DataFrame

2018-11-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704958#comment-16704958
 ] 

Marco Gaido commented on SPARK-26214:
-

I don't think it is really the same. I don't think it is a good idea to add 
this new API.
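
For reference, a broadcast hint can already be requested on an existing DataFrame
at the call site via the {{broadcast}} function in {{pyspark.sql.functions}} (or
{{org.apache.spark.sql.functions}} in Scala/Java), which may be part of why adding
a new method is seen as unnecessary. A minimal sketch (the tables here are made up):

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical small dimension table and larger fact table
dim = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
fact = spark.range(1000)

# Hint the planner to broadcast `dim`, regardless of autoBroadcastJoinThreshold
joined = fact.join(broadcast(dim), "id")
joined.explain()  # the plan should show a broadcast join
{code}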

> Add "broadcast" method to DataFrame
> ---
>
> Key: SPARK-26214
> URL: https://issues.apache.org/jira/browse/SPARK-26214
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas Decaux
>Priority: Trivial
>  Labels: broadcast, dataframe
>
> As discussed at 
> [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,]
>  it's possible to force a broadcast of a DataFrame, even if its total size is
> greater than {{spark.sql.autoBroadcastJoinThreshold}}.
> But this is not trivial for a beginner, because there is no "broadcast" method
> (I know, I am lazy ...).
> We could add this method, with a WARN if size is greater than the threshold.
> (if it's an easy one, I could do it?)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26236) Kafka delegation token support documentation

2018-11-30 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26236:
-

 Summary: Kafka delegation token support documentation
 Key: SPARK-26236
 URL: https://issues.apache.org/jira/browse/SPARK-26236
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


Because SPARK-25501 merged to master now it's time to update the docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Ian O Connell (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704926#comment-16704926
 ] 

Ian O Connell commented on SPARK-26043:
---

Dumping the values into a map, though, would leak it into the API of the methods;
nothing in the user-land API/submitter code needs to be aware of the configuration
right now. SparkEnv/HadoopConf coming from the 'ether' magically has just been a
useful side channel. I can try to pull it out of the SparkConf instead, though
that makes the code less portable between Scalding and Spark, unfortunately.

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26236) Kafka delegation token support documentation

2018-11-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704903#comment-16704903
 ] 

Gabor Somogyi commented on SPARK-26236:
---

Started to work on this.

> Kafka delegation token support documentation
> 
>
> Key: SPARK-26236
> URL: https://issues.apache.org/jira/browse/SPARK-26236
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Because SPARK-25501 merged to master now it's time to update the docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704877#comment-16704877
 ] 

Sean Owen commented on SPARK-26043:
---

I see. Typically you'd just instantiate Configuration and it would pick up all 
the Hadoop config from your cluster env. I see you're setting the props locally 
on the command line though. Configuration is "Writable" but not directly 
"Serializable" unfortunately, or else you could just use it in the closure you 
send. You could dump the Configuration key-values into a Map and use that I 
suppose? Probably a couple lines of code.
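
Roughly, in PySpark terms (a sketch only: the s3guard key below is just an example,
and {{sc._jsc}} is an internal handle, so treat this as illustrative rather than a
stable recipe):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# On the driver, read the resolved Hadoop conf and keep only the keys of interest
hadoop_conf = sc._jsc.hadoopConfiguration()  # internal handle to the JVM Configuration
wanted = ["fs.s3a.metadatastore.impl"]       # example key; pick whatever the job needs
conf_map = {k: hadoop_conf.get(k) for k in wanted}

# Ship the plain dict to executors explicitly instead of rebuilding a Configuration there
conf_bc = sc.broadcast(conf_map)

def uses_conf(x):
    # Executor-side code reads the broadcast map rather than SparkHadoopUtil
    return (x, conf_bc.value.get("fs.s3a.metadatastore.impl"))

print(sc.parallelize(range(3)).map(uses_conf).collect())
{code}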

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25857) Document delegation token code in Spark

2018-11-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704894#comment-16704894
 ] 

Gabor Somogyi commented on SPARK-25857:
---

+1, it's not an easy topic.

> Document delegation token code in Spark
> ---
>
> Key: SPARK-25857
> URL: https://issues.apache.org/jira/browse/SPARK-25857
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> By this I mean not user documentation, but documenting the functionality 
> provided in the {{org.apache.spark.deploy.security}} and related packages, so 
> that other developers making changes there can refer to it.
> It seems to be a source of confusion every time somebody needs touch that 
> code, so it would be good to have a document explaining how it all works, 
> including how it's hooked up to different resource managers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Ian O Connell (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704869#comment-16704869
 ] 

Ian O Connell commented on SPARK-26043:
---

I can't use the hadoopConfiguration from the SparkContext on the executors, I
didn't think?

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Ian O Connell (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704864#comment-16704864
 ] 

Ian O Connell commented on SPARK-26043:
---

[~srowen] I just edited the comment right this second to add that context, sorry,
race condition, but:

 
(Context: via the command line / in Spark I had been setting Hadoop configuration
options, but I need to pick those up in some libraries on the executors to see
what was set (whether s3guard is enabled, in my case). I need some means to hook
into what the submitter thought the Hadoop conf should be, in order to turn
on/off reporting to DynamoDB for s3guard info.)
 

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Ian O Connell (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704829#comment-16704829
 ] 

Ian O Connell edited comment on SPARK-26043 at 11/30/18 3:02 PM:
-

This change makes it difficult to get a fully populated Hadoop configuration on
executor hosts. If Spark properties were applied to the Hadoop conf in the driver,
those don't show up in a `new Configuration`. Previously one could go SparkEnv ->
SparkHadoopUtil -> Configuration and get one that's fully populated. Is there a
nicer way to achieve this, possibly?

 

(Context: via the command line / in Spark I had been setting Hadoop configuration
options, but I need to pick those up in some libraries on the executors to see
what was set (whether s3guard is enabled, in my case). I need some means to hook
into what the submitter thought the Hadoop conf should be, in order to turn
on/off reporting to DynamoDB for s3guard info.)


was (Author: ianoc):
With this change it makes it difficult to get a fully populated hadoop 
configuration on executor hosts. If spark properties were applied to the hadoop 
conf in the driver those don't show up in a `new Configuration`. This let one 
go SparkEnv -> SparkHadoopUtil -> Configuration thats fully populated. Is there 
a nicer way to achieve this possibly?

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704856#comment-16704856
 ] 

Sean Owen commented on SPARK-26043:
---

What is the use case for that? I think this class was intended to be internal 
to Spark. Does SparkContext.hadoopConfiguration() give you what you need?

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-30 Thread David Lyness (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704839#comment-16704839
 ] 

David Lyness commented on SPARK-26200:
--

Yep, I believe the root cause for this and SPARK-24915 are the same. (You see 
the behaviour in SPARK-24915 if the mis-ordered columns are of 
incompatible/different types; you see the behaviour in this ticket if the 
mis-ordered columns are of the same type.)

Note that the impact is more severe in the case of this ticket - rather than a 
function call failing during development which can then be worked around, this 
has the potential to be silently causing data correctness issues for users of 
PySpark.

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified columns.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Closed] (SPARK-26234) Column list specification in INSERT statement

2018-11-30 Thread Joby Joje (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joby Joje closed SPARK-26234.
-

A card SPARK-20845 is already in place to resolve this, hence closing the 
ticket.

> Column list specification in INSERT statement
> -
>
> Key: SPARK-26234
> URL: https://issues.apache.org/jira/browse/SPARK-26234
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joby Joje
>Priority: Major
>
> While trying to overwrite a Hive table with specific columns from Spark
> (PySpark) using a dataframe, I get the below error:
> {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
> {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}
>  (line 1, pos 36)\n\n== SQL ==\ninsert into table DB.TableName (Col1, Col2,
> Col3) select Col1, Col2, Col3 FROM
> dataframe\n^^^\n"
> {quote}
> {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select
> Col1, Col2, Col3 FROM dataframe")}}
> But trying the same via the Hive terminal goes through fine.
> Please check the below link to get more info on the same.
> [https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]
>  
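
Until SPARK-20845 lands, one possible workaround sketch (column names are taken
from the snippet above and this is untested against the original table): select
the columns in the table's declared order and use {{insertInto}}, which takes no
column list and matches columns by position.

{code}
# Sketch only: assumes DB.TableName already exists with columns (Col1, Col2, Col3)
# in that order, and `dataframe` is the DataFrame from the report above.
dataframe.select("Col1", "Col2", "Col3") \
    .write.insertInto("DB.TableName", overwrite=False)  # overwrite=True for INSERT OVERWRITE
{code}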



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-30 Thread Ian O Connell (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704829#comment-16704829
 ] 

Ian O Connell commented on SPARK-26043:
---

This change makes it difficult to get a fully populated Hadoop configuration on 
executor hosts. If Spark properties were applied to the Hadoop conf in the 
driver, those don't show up in a `new Configuration()`. Previously one could go 
SparkEnv -> SparkHadoopUtil -> Configuration and get one that was fully 
populated. Is there a nicer way to achieve this?

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26235:


Assignee: Apache Spark

> Change log level for ClassNotFoundException/NoClassDefFoundError in 
> SparkSubmit to Error
> 
>
> Key: SPARK-26235
> URL: https://issues.apache.org/jira/browse/SPARK-26235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Trivial
>
> In my local setup, I set log4j root category as ERROR 
> (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console
>  , first item show up if we google search "set spark log level".)
> When I run such command
> ```
> spark-submit --class foo bar.jar
> ```
> Nothing shows up, and the script exits.
> After quick investigation, I think the log level for 
> ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR 
> instead of WARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704821#comment-16704821
 ] 

Apache Spark commented on SPARK-26235:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23189

> Change log level for ClassNotFoundException/NoClassDefFoundError in 
> SparkSubmit to Error
> 
>
> Key: SPARK-26235
> URL: https://issues.apache.org/jira/browse/SPARK-26235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Trivial
>
> In my local setup, I set log4j root category as ERROR 
> (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console
>  , first item show up if we google search "set spark log level".)
> When I run such command
> ```
> spark-submit --class foo bar.jar
> ```
> Nothing shows up, and the script exits.
> After quick investigation, I think the log level for 
> ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR 
> instead of WARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26235:


Assignee: (was: Apache Spark)

> Change log level for ClassNotFoundException/NoClassDefFoundError in 
> SparkSubmit to Error
> 
>
> Key: SPARK-26235
> URL: https://issues.apache.org/jira/browse/SPARK-26235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Trivial
>
> In my local setup, I set log4j root category as ERROR 
> (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console
>  , first item show up if we google search "set spark log level".)
> When I run such command
> ```
> spark-submit --class foo bar.jar
> ```
> Nothing shows up, and the script exits.
> After quick investigation, I think the log level for 
> ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR 
> instead of WARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-30 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704804#comment-16704804
 ] 

Steve Loughran commented on SPARK-26188:


bq. > My team uses spark to partition and output parquet files to amazon S3. We 
typically use 256 partitions, from 00 to ff.

Independent of the patch, you'd better be using something to deliver the 
consistency which commit-via-rename requires for directory listing (on S3A: 
S3Guard), or better, an output committer designed from the ground up for S3 
(S3A committers), or you are at risk of data loss due to inconsistent listings.
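
Separately from the committer question: if the goal is simply to keep partition values such as {{00}} or {{4d}} as strings, partition column type inference can be switched off (or an explicit schema supplied when reading). A minimal sketch, reusing the bucket path from the logs below:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep partition values like suffix=00 or suffix=4d as strings instead of
# inferring numeric types for them.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

df = spark.read.parquet("s3a://bucketnamereadacted/ddgirard/")
df.select("suffix").distinct().show()
{code}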

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 2.4.1, 3.0.0
>
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 

[jira] [Created] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error

2018-11-30 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26235:
--

 Summary: Change log level for 
ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
 Key: SPARK-26235
 URL: https://issues.apache.org/jira/browse/SPARK-26235
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Gengliang Wang


In my local setup, I set the log4j root category to ERROR 
(https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console
 , the first item that shows up if you google "set spark log level").
When I run a command such as
```
spark-submit --class foo bar.jar
```
Nothing shows up, and the script exits.

After quick investigation, I think the log level for 
ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR 
instead of WARN.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26234) Column list specification in INSERT statement

2018-11-30 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-26234.
-
Resolution: Duplicate

> Column list specification in INSERT statement
> -
>
> Key: SPARK-26234
> URL: https://issues.apache.org/jira/browse/SPARK-26234
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joby Joje
>Priority: Major
>
> While trying to OVERWRITE the Hive table with specific columns from 
> Spark(Pyspark) using a dataframe getting the below error
> {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
>  Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
> 'REDUCE'}
>  (line 1, pos 36)\n\n== SQL ==\ninsert into table DB.TableName (Col1, Col2, 
> Col3) select Col1, Col2, Col3 FROM 
> dataframe\n^^^\n"
> {quote}
> {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
> Col1, Col2, Col3 FROM dataframe")}}
> {{But on trying the same via _Hive Terminal_ goes through fine.}}
> Please check the below link to get more info on the same.
> [https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE

2018-11-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26198:
--
Priority: Minor  (was: Major)

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>
> How to reproduce this issue:
> {code}
> scala> val meta = new 
> org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
>   at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-26232.

Resolution: Won't Fix

One of the third-party tools uses Netty3, so Netty3 is really needed as a 
transitive dependency.

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Old Netty artifact (3.9.9.Final) is unused.
> The reason it is not collide with Netty4 is they are using different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26234) Column list specification in INSERT statement

2018-11-30 Thread Joby Joje (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joby Joje updated SPARK-26234:
--
Description: 
While trying to OVERWRITE the Hive table with specific columns from 
Spark(Pyspark) using a dataframe getting the below error
{quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
 Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
'REDUCE'}
 (line 1, pos 36)\n\n== SQL ==\ninsert table DB.TableName (Col1, Col2, Col3) 
select Col1, Col2, Col3 FROM 
dataframe\n^^^\n"
{quote}
{{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
Col1, Col2, Col3 FROM dataframe")}}

{{But on trying the same via _Hive Terminal_ goes through fine.}}

Please check the below link to get more info on the same.

[https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]

 

  was:
While trying to OVERWRITE the Hive table with specific columns from 
Spark(Pyspark) using a dataframe getting the below error
{quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
'REDUCE'}
(line 1, pos 36)\n\n== SQL ==\ninsert OVERWRITE table DB.TableName (Col1, Col2, 
Col3) select Col1, Col2, Col3 FROM 
dataframe\n^^^\n"
{quote}
{{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
Col1, Col2, Col3 FROM dataframe")}}

{{But on trying the same via _Hive Terminal_ goes through fine.}}

Please check the below link to get more info on the same.

https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement

 


> Column list specification in INSERT statement
> -
>
> Key: SPARK-26234
> URL: https://issues.apache.org/jira/browse/SPARK-26234
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joby Joje
>Priority: Major
>
> While trying to OVERWRITE the Hive table with specific columns from 
> Spark(Pyspark) using a dataframe getting the below error
> {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
>  Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
> 'REDUCE'}
>  (line 1, pos 36)\n\n== SQL ==\ninsert table DB.TableName (Col1, Col2, Col3) 
> select Col1, Col2, Col3 FROM 
> dataframe\n^^^\n"
> {quote}
> {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
> Col1, Col2, Col3 FROM dataframe")}}
> {{But on trying the same via _Hive Terminal_ goes through fine.}}
> Please check the below link to get more info on the same.
> [https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26234) Column list specification in INSERT statement

2018-11-30 Thread Joby Joje (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joby Joje updated SPARK-26234:
--
Description: 
While trying to OVERWRITE the Hive table with specific columns from 
Spark(Pyspark) using a dataframe getting the below error
{quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
 Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
'REDUCE'}
 (line 1, pos 36)\n\n== SQL ==\ninsert into table DB.TableName (Col1, Col2, 
Col3) select Col1, Col2, Col3 FROM 
dataframe\n^^^\n"
{quote}
{{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
Col1, Col2, Col3 FROM dataframe")}}

{{But on trying the same via _Hive Terminal_ goes through fine.}}

Please check the below link to get more info on the same.

[https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]

 

  was:
While trying to OVERWRITE the Hive table with specific columns from 
Spark(Pyspark) using a dataframe getting the below error
{quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
 Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
'REDUCE'}
 (line 1, pos 36)\n\n== SQL ==\ninsert table DB.TableName (Col1, Col2, Col3) 
select Col1, Col2, Col3 FROM 
dataframe\n^^^\n"
{quote}
{{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
Col1, Col2, Col3 FROM dataframe")}}

{{But on trying the same via _Hive Terminal_ goes through fine.}}

Please check the below link to get more info on the same.

[https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]

 


> Column list specification in INSERT statement
> -
>
> Key: SPARK-26234
> URL: https://issues.apache.org/jira/browse/SPARK-26234
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joby Joje
>Priority: Major
>
> While trying to OVERWRITE the Hive table with specific columns from 
> Spark(Pyspark) using a dataframe getting the below error
> {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
>  Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
> 'REDUCE'}
>  (line 1, pos 36)\n\n== SQL ==\ninsert into table DB.TableName (Col1, Col2, 
> Col3) select Col1, Col2, Col3 FROM 
> dataframe\n^^^\n"
> {quote}
> {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
> Col1, Col2, Col3 FROM dataframe")}}
> {{But on trying the same via _Hive Terminal_ goes through fine.}}
> Please check the below link to get more info on the same.
> [https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26234) Column list specification in INSERT statement

2018-11-30 Thread Joby Joje (JIRA)
Joby Joje created SPARK-26234:
-

 Summary: Column list specification in INSERT statement
 Key: SPARK-26234
 URL: https://issues.apache.org/jira/browse/SPARK-26234
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Joby Joje


While trying to OVERWRITE a Hive table with specific columns from Spark 
(PySpark) using a dataframe, I get the below error
{quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting
Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 
'REDUCE'}
(line 1, pos 36)\n\n== SQL ==\ninsert OVERWRITE table DB.TableName (Col1, Col2, 
Col3) select Col1, Col2, Col3 FROM 
dataframe\n^^^\n"
{quote}
{{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select 
Col1, Col2, Col3 FROM dataframe")}}

{{But on trying the same via _Hive Terminal_ goes through fine.}}

Please check the below link to get more info on the same.

https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26232:


Assignee: Apache Spark

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> Old Netty artifact (3.9.9.Final) is unused.
> The reason it is not collide with Netty4 is they are using different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions

2018-11-30 Thread Miquel (JIRA)
Miquel created SPARK-26233:
--

 Summary: Incorrect decimal value with java beans and 
first/last/max... functions
 Key: SPARK-26233
 URL: https://issues.apache.org/jira/browse/SPARK-26233
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.0, 2.3.1
Reporter: Miquel


Decimal values from Java beans are incorrectly scaled when used with functions 
like first/last/max...

This problem arises because Encoders.bean always sets Decimal values as 
_DecimalType(this.MAX_PRECISION(), 18)._

Usually this is not a problem if you use numeric functions like *sum*, but for 
functions like *first*/*last*/*max*... it is a problem.

How to reproduce this error:

Using this class as an example:
{code:java}
public class Foo implements Serializable {

  private String group;
  private BigDecimal var;

  public BigDecimal getVar() {
return var;
  }

  public void setVar(BigDecimal var) {
this.var = var;
  }

  public String getGroup() {
return group;
  }

  public void setGroup(String group) {
this.group = group;
  }
}
{code}
 

And a dummy code to create some objects:
{code:java}
Dataset<Foo> ds = spark.range(5)
.map(l -> {
  Foo foo = new Foo();
  foo.setGroup("" + l);
  foo.setVar(BigDecimal.valueOf(l + 0.));
  return foo;
}, Encoders.bean(Foo.class));
ds.printSchema();
ds.show();

+-+--+
|group| var|
+-+--+
| 0|0.|
| 1|1.|
| 2|2.|
| 3|3.|
| 4|4.|
+-+--+
{code}
We can see that the DecimalType is precision 38 and 18 scale and all values are 
show correctly.

But if we use a first function, they are scaled incorrectly:
{code:java}
ds.groupBy(col("group"))
.agg(
first("var")
)
.show();


+-+-+
|group|first(var, false)|
+-+-+
| 3| 3.E-14|
| 0| 1.111E-15|
| 1| 1.E-14|
| 4| 4.E-14|
| 2| 2.E-14|
+-+-+
{code}
This incorrect behavior cannot be reproduced if we use "numerical" functions 
like sum, or if the column is cast to a new DecimalType.
{code:java}
ds.groupBy(col("group"))
.agg(
sum("var")
)
.show();

+-++
|group| sum(var)|
+-++
| 3|3.00|
| 0|0.00|
| 1|1.00|
| 4|4.00|
| 2|2.00|
+-++

ds.groupBy(col("group"))
.agg(
first(col("var").cast(new DecimalType(38, 8)))
)
.show();

+-++
|group|first(CAST(var AS DECIMAL(38,8)), false)|
+-++
| 3| 3.|
| 0| 0.|
| 1| 1.|
| 4| 4.|
| 2| 2.|
+-++
{code}
   

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704700#comment-16704700
 ] 

Apache Spark commented on SPARK-26232:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23188

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Old Netty artifact (3.9.9.Final) is unused.
> The reason it is not collide with Netty4 is they are using different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26232:


Assignee: (was: Apache Spark)

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Old Netty artifact (3.9.9.Final) is unused.
> The reason it is not collide with Netty4 is they are using different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704699#comment-16704699
 ] 

Apache Spark commented on SPARK-26232:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23188

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Old Netty artifact (3.9.9.Final) is unused.
> The reason it is not collide with Netty4 is they are using different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26232:
---
Summary: Remove old Netty3 dependency from the build   (was: Remove old 
Netty3 artifact from the build )

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Old Netty artifact (3.9.9.Final) is unused.
> The reason it is not collide with Netty4 is they are using different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26232) Remove old Netty3 artifact from the build

2018-11-30 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26232:
--

 Summary: Remove old Netty3 artifact from the build 
 Key: SPARK-26232
 URL: https://issues.apache.org/jira/browse/SPARK-26232
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


Old Netty artifact (3.9.9.Final) is unused.

The reason it does not collide with Netty4 is that they use different package 
names:
 * Netty3: org.jboss.netty.*
 * Netty4: io.netty.*

Still, Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24661) Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeExceptio

2018-11-30 Thread Oleg V Korchagin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704668#comment-16704668
 ] 

Oleg V Korchagin commented on SPARK-24661:
--

Why PySpark in components list?

> Window API - using multiple fields for partitioning with WindowSpec API and 
> dataset that is cached causes 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException
> 
>
> Key: SPARK-24661
> URL: https://issues.apache.org/jira/browse/SPARK-24661
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Java API, PySpark
>Affects Versions: 2.3.0
>Reporter: David Mavashev
>Priority: Major
>
> Steps to reproduce:
> Creating a data set:
>  
> {code:java}
> List<String> simpleWindowColumns = new ArrayList<>();
> simpleWindowColumns.add("column1");
> simpleWindowColumns.add("column2");
> Map<String, String> expressionsWithAliasesEntrySet = new HashMap<>();
> expressionsWithAliasesEntrySet.put("count(id)", "count_column");
> DataFrameReader reader = sparkSession.read().format("csv");
> // Invoking cache(); kept in a single (effectively final) variable so the
> // lambdas below can capture it.
> Dataset<Row> sparkDataSet = reader.option("header", "true")
>     .load("/path/to/data/data.csv")
>     .cache();
> // Creating window spec with 2 columns:
> WindowSpec window = Window.partitionBy(
>     JavaConverters.asScalaIteratorConverter(
>         simpleWindowColumns.stream().map(item -> sparkDataSet.col(item)).iterator())
>     .asScala().toSeq());
> Dataset<Row> withWindowColumns = sparkDataSet.withColumns(
>     JavaConverters.asScalaIteratorConverter(
>         expressionsWithAliasesEntrySet.entrySet().stream()
>             .map(item -> item.getKey())
>             .collect(Collectors.toList()).iterator())
>     .asScala().toSeq(),
>     JavaConverters.asScalaIteratorConverter(
>         expressionsWithAliasesEntrySet.entrySet().stream()
>             .map(item -> new Column(item.getValue()).over(window))
>             .collect(Collectors.toList()).iterator())
>     .asScala().toSeq());
> withWindowColumns.show();{code}
> Expected:
>  
> Results are shown
>  
>  
> Actual: the following exception is thrown
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: windowspecdefinition(O003#3, O006#6, specifiedwindowframe(RowFrame, 
> unboundedpreceding$(), unboundedfollowing$())) at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189)
>  at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$.normalizeExprId(QueryPlan.scala:288)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:232)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:226)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
> 

[jira] [Commented] (SPARK-26211) Fix InSet for binary, and struct and array with null.

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704591#comment-16704591
 ] 

Apache Spark commented on SPARK-26211:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23187

> Fix InSet for binary, and struct and array with null.
> -
>
> Key: SPARK-26211
> URL: https://issues.apache.org/jira/browse/SPARK-26211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> Currently {{InSet}} doesn't work properly for binary type, or struct and 
> array type with null value in the set.
>  Because, as for binary type, the {{HashSet}} doesn't work properly for 
> {{Array[Byte]}}, and as for struct and array type with null value in the set, 
> the {{ordering}} will throw a {{NPE}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26211) Fix InSet for binary, and struct and array with null.

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704590#comment-16704590
 ] 

Apache Spark commented on SPARK-26211:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23187

> Fix InSet for binary, and struct and array with null.
> -
>
> Key: SPARK-26211
> URL: https://issues.apache.org/jira/browse/SPARK-26211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> Currently {{InSet}} doesn't work properly for binary type, or struct and 
> array type with null value in the set.
>  Because, as for binary type, the {{HashSet}} doesn't work properly for 
> {{Array[Byte]}}, and as for struct and array type with null value in the set, 
> the {{ordering}} will throw a {{NPE}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-11-30 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørnar Jensen reopened SPARK-25145:


New information that I believe will make it reproducible:

The "zerocopy" option is what triggers this crash in our system. With 
"zerocopy" set to "false" the reads produces results. spark.sql().explain() 
show that it pushes the filters down.

Turning on "zerocopy" causes the query with filter pushdown to crash with 
buffer size too small. Furthermore, orc.compress.size and 
orc.buffer.size.enforce does not seem to have any effect/stick when tried.
{code:java}
pyspark --conf 'spark.hadoop.hive.exec.orc.zerocopy=true" => Crashes

pyspark --conf 'spark.hadoop.hive.exec.orc.zerocopy=false" => Succeeds
{code}
 

I do not know why, though.

 

Best regards,

Bjørnar.
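
For reference, a minimal way to pin the settings described above when launching a session (a sketch built from the reporter's own configuration and table name, not a fix for the underlying issue):
{code:python}
from pyspark.sql import SparkSession

# Disabling ORC zerocopy avoids the "Buffer size too small" failure while
# keeping filter pushdown enabled.
spark = (SparkSession.builder
         .config("spark.hadoop.hive.exec.orc.zerocopy", "false")
         .config("spark.sql.orc.filterPushdown", "true")
         .getOrCreate())

spark.sql("SELECT * FROM bjornj.spark_buffer_size_too_small_on_filter_pushdown "
          "WHERE a > 5").show()
{code}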

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
> {code:python}
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter console 5.1.0
> Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.

[jira] [Created] (SPARK-26231) Dataframes inner join on double datatype columns resulting in Cartesian product

2018-11-30 Thread Shrikant (JIRA)
Shrikant created SPARK-26231:


 Summary: Dataframes inner join on double datatype columns 
resulting in Cartesian product
 Key: SPARK-26231
 URL: https://issues.apache.org/jira/browse/SPARK-26231
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1, 1.6.0
Reporter: Shrikant


The following code snippet demonstrates the bug. The join on the Double columns 
results in a Cartesian product; when both columns are typecast to String, it 
works.

Please see the explain plans below.

Error: scala> cartesianJoinErr.explain()
== Physical Plan ==
CartesianProduct
:- ConvertToSafe
:  +- Project [name#143,group#144,data#145,name#143 AS name1#146]
: +- Filter (name#143 = name#143)
:    +- Scan ExistingRDD[name#143,group#144,data#145]
+- Scan ExistingRDD[name#147,group#148,data#149]

---

After conversion to String explain plan

stringColJoinWorks.explain()
== Physical Plan ==
SortMergeJoin [name1String#151], [name2String#152]
:- Sort [name1String#151 ASC], false, 0
:  +- TungstenExchange hashpartitioning(name1String#151,200), None
: +- Project [name#143,group#144,data#145,cast(name#143 as string) AS 
name1String#151]
:    +- Scan ExistingRDD[name#143,group#144,data#145]
+- Sort [name2String#152 ASC], false, 0
   +- TungstenExchange hashpartitioning(name2String#152,200), None
  +- Project [name#153,group#154,data#155,cast(name#153 as string) AS 
name2String#152]
 +- Scan ExistingRDD[name#153,group#154,data#155]

 

 

import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions

val doubleRDD = sc.parallelize(Seq(
    Row(1.0, 2, 1),
    Row(2.0, 8, 2),
    Row(3.0, 10, 3),
    Row(4.0, 10, 4)))
    
val testSchema = StructType(Seq(
    StructField("name", DoubleType, nullable = true),
    StructField("group", IntegerType, nullable = true),
    StructField("data", IntegerType, nullable = true)))
    
val doubleRDDCartesian = sqlContext.createDataFrame(doubleRDD, testSchema)

val cartNewCol = doubleRDDCartesian.select($"name" , $"group", $"data")

val newColName1DF = cartNewCol.withColumn("name1", $"name")
val cartesianJoinErr = newColName1DF.join(doubleRDDCartesian, 
newColName1DF("name1")===(doubleRDDCartesian("name")))
cartesianJoinErr.show
cartesianJoinErr.explain()

//Convert both into StringType
val stringColDF1 = 
doubleRDDCartesian.withColumn("name1String",$"name".cast("String"))
val stringColDF2 = cartNewCol.withColumn("name2String", $"name".cast("String"))

val stringColJoinWorks = stringColDF1.join(stringColDF2, 
stringColDF1("name1String")===(stringColDF2("name2String")))
stringColJoinWorks.show
stringColJoinWorks.explain()
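
For completeness, the usual workaround for this kind of degenerate self-join is to give the two sides distinct aliases so the join condition references two different column lineages. A PySpark sketch against a current API (the report itself is against Spark 1.6, so this is illustrative rather than a fix):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2, 1), (2.0, 8, 2), (3.0, 10, 3), (4.0, 10, 4)],
    ["name", "group", "data"])

# Aliasing both sides keeps the condition as an equi-join between two
# distinct attribute sets, instead of collapsing into a trivially-true
# filter plus a Cartesian product.
left = df.alias("l")
right = df.alias("r")

joined = left.join(right, col("l.name") == col("r.name"))
joined.explain()
joined.show()
{code}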

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26214) Add "broadcast" method to DataFrame

2018-11-30 Thread Thomas Decaux (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704475#comment-16704475
 ] 

Thomas Decaux edited comment on SPARK-26214 at 11/30/18 9:32 AM:
-

Sure, it's already possible to do it (like I said).

The same thing for "cache" / "persist" DataFrame method:
{code:java}
public DataFrame persist(StorageLevel newLevel) {
 this.sqlContext().cacheManager().cacheQuery(this, scala.None..MODULE$, 
newLevel);
 return this;
}{code}
This is more a "short-cut" method as you can see, it's possible to use 
cacheManager *OR* the DataFrame method.

Same thing for registerTempTable:
{code:java}
public void registerTempTable(String tableName) {
 this.sqlContext().registerDataFrameAsTable(this, tableName);
}{code}
You can see here, this is a short-cut method.

I propose to do the same thing for broadcast.


was (Author: ebuildy):
Sure, it's already to do this (like I said).

The same thing for "cache" / "persist" DataFrame method:
{code:java}
public DataFrame persist(StorageLevel newLevel) {
 this.sqlContext().cacheManager().cacheQuery(this, scala.None..MODULE$, 
newLevel);
 return this;
}{code}
This is more a "short-cut" method as you can see, it's possible to use 
cacheManager *OR* the DataFrame method.

Same thing for registerTempTable:
{code:java}
public void registerTempTable(String tableName) {
 this.sqlContext().registerDataFrameAsTable(this, tableName);
}{code}
You can see here, this is a short-cut method.

I propose to do the same thing for broadcast.

> Add "broadcast" method to DataFrame
> ---
>
> Key: SPARK-26214
> URL: https://issues.apache.org/jira/browse/SPARK-26214
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas Decaux
>Priority: Trivial
>  Labels: broadcast, dataframe
>
> As discussed at 
> [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,]
>  it's possible to force broadcast of DataFrame, even if total size is greater 
> than ``*spark.sql.autoBroadcastJoinThreshold``.*
> But this not trivial for beginner, because there is no "broadcast" method (I 
> know, I am lazy ...).
> We could add this method, with a WARN if size is greater than the threshold.
> (if it's an easy one, I could do it?)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26214) Add "broadcast" method to DataFrame

2018-11-30 Thread Thomas Decaux (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704475#comment-16704475
 ] 

Thomas Decaux commented on SPARK-26214:
---

Sure, it's already possible to do this (like I said).

The same thing for "cache" / "persist" DataFrame method:
{code:java}
public DataFrame persist(StorageLevel newLevel) {
 this.sqlContext().cacheManager().cacheQuery(this, scala.None..MODULE$, 
newLevel);
 return this;
}{code}
This is more a "short-cut" method as you can see, it's possible to use 
cacheManager *OR* the DataFrame method.

Same thing for registerTempTable:
{code:java}
public void registerTempTable(String tableName) {
 this.sqlContext().registerDataFrameAsTable(this, tableName);
}{code}
You can see here, this is a short-cut method.

I propose to do the same thing for broadcast.

> Add "broadcast" method to DataFrame
> ---
>
> Key: SPARK-26214
> URL: https://issues.apache.org/jira/browse/SPARK-26214
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas Decaux
>Priority: Trivial
>  Labels: broadcast, dataframe
>
> As discussed at 
> [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,]
>  it's possible to force broadcast of DataFrame, even if total size is greater 
> than ``*spark.sql.autoBroadcastJoinThreshold``.*
> But this not trivial for beginner, because there is no "broadcast" method (I 
> know, I am lazy ...).
> We could add this method, with a WARN if size is greater than the threshold.
> (if it's an easy one, I could do it?)
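
For what it's worth, a broadcast hint already exists at the functions level in both Scala ({{org.apache.spark.sql.functions.broadcast}}) and Python ({{pyspark.sql.functions.broadcast}}); the ticket essentially asks for the same thing as a DataFrame method. A small PySpark sketch with made-up data:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

small = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
large = spark.range(1000000)

# broadcast() marks `small` for a broadcast join regardless of
# spark.sql.autoBroadcastJoinThreshold.
joined = large.join(broadcast(small), "id")
joined.explain()
{code}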



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names

2018-11-30 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26230:
--

 Summary: FileIndex: if case sensitive, validate partitions with 
original column names
 Key: SPARK-26230
 URL: https://issues.apache.org/jira/browse/SPARK-26230
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Partition column name is required to be unique under the same directory. The 
following paths are invalid partitioned directory:
```
hdfs://host:9000/path/a=1
hdfs://host:9000/path/b=2
```

If case sensitive, the following paths should be invalid too:
```
hdfs://host:9000/path/a=1
hdfs://host:9000/path/A=2
```
Since columns 'a' and 'A' are different, it would be wrong to pick either one 
as the column name in the partition schema.

Also, there is a `TODO` in the code. This PR is to resolve the problem.
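
A small PySpark sketch of the layout in question under case-sensitive analysis (the local path is illustrative, and the pre-fix behaviour is an assumption: depending on the version the conflicting names may be silently folded together rather than rejected, which is what the added validation addresses):
{code:python}
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")

base = "/tmp/case_sensitive_partitions"
# Build a partitioned layout with two directories that differ only by case.
spark.range(1).write.mode("overwrite").parquet(os.path.join(base, "a=1"))
spark.range(1).write.mode("overwrite").parquet(os.path.join(base, "A=2"))

# With case sensitivity enabled, 'a' and 'A' are different column names, so
# this directory should be reported as an invalid partitioned layout.
spark.read.parquet(base).show()
{code}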



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26230:


Assignee: (was: Apache Spark)

> FileIndex: if case sensitive, validate partitions with original column names
> 
>
> Key: SPARK-26230
> URL: https://issues.apache.org/jira/browse/SPARK-26230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Partition column name is required to be unique under the same directory. The 
> following paths are invalid partitioned directory:
> ```
> hdfs://host:9000/path/a=1
> hdfs://host:9000/path/b=2
> ```
> If case sensitive, the following paths should be invalid too:
> ```
> hdfs://host:9000/path/a=1
> hdfs://host:9000/path/A=2
> ```
> Since column 'a' and 'A' are different, and it is wrong to use either one as 
> the column name in partition schema.
> Also, there is a `TODO` in the code. This PR is to resolve the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names

2018-11-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704470#comment-16704470
 ] 

Apache Spark commented on SPARK-26230:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23186

> FileIndex: if case sensitive, validate partitions with original column names
> 
>
> Key: SPARK-26230
> URL: https://issues.apache.org/jira/browse/SPARK-26230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Partition column name is required to be unique under the same directory. The 
> following paths are invalid partitioned directory:
> ```
> hdfs://host:9000/path/a=1
> hdfs://host:9000/path/b=2
> ```
> If case sensitive, the following paths should be invalid too:
> ```
> hdfs://host:9000/path/a=1
> hdfs://host:9000/path/A=2
> ```
> Since column 'a' and 'A' are different, and it is wrong to use either one as 
> the column name in partition schema.
> Also, there is a `TODO` in the code. This PR is to resolve the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names

2018-11-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26230:


Assignee: Apache Spark

> FileIndex: if case sensitive, validate partitions with original column names
> 
>
> Key: SPARK-26230
> URL: https://issues.apache.org/jira/browse/SPARK-26230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Partition column name is required to be unique under the same directory. The 
> following paths are invalid partitioned directory:
> ```
> hdfs://host:9000/path/a=1
> hdfs://host:9000/path/b=2
> ```
> If case sensitive, the following paths should be invalid too:
> ```
> hdfs://host:9000/path/a=1
> hdfs://host:9000/path/A=2
> ```
> Since column 'a' and 'A' are different, and it is wrong to use either one as 
> the column name in partition schema.
> Also, there is a `TODO` in the code. This PR is to resolve the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark

2018-11-30 Thread Ranjith Pulluru (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ranjith Pulluru updated SPARK-26229:

Component/s: (was: Spark Core)

> Expose SizeEstimator as a developer API in pyspark
> --
>
> Key: SPARK-26229
> URL: https://issues.apache.org/jira/browse/SPARK-26229
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ranjith Pulluru
>Priority: Minor
>
> Expose SizeEstimator as a developer API in pyspark.
> SizeEstimator is not available in pyspark.
> This API will be helpful for understanding the memory footprint of an object.
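
For context, the JVM-side utility this ticket asks to expose already exists in Scala as `org.apache.spark.util.SizeEstimator`. A minimal Scala usage sketch (assuming Spark is on the classpath; the object name `SizeEstimatorExample` is just for illustration):

```
import org.apache.spark.util.SizeEstimator

object SizeEstimatorExample {
  def main(args: Array[String]): Unit = {
    // Estimate the in-memory footprint, in bytes, of an arbitrary JVM object.
    val data: Array[Int] = Array.fill(1000)(42)
    val estimatedBytes: Long = SizeEstimator.estimate(data)
    println(s"Estimated size of the array: $estimatedBytes bytes")
  }
}
```

From PySpark today, one unofficial workaround is to reach the same utility through Py4J (e.g. via `sc._jvm`), but that only measures JVM-side objects rather than Python objects, which is part of the motivation for a proper developer API.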



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark

2018-11-30 Thread Ranjith Pulluru (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ranjith Pulluru updated SPARK-26229:

Description: 
SizeEstimator is not available in pyspark.
This API will be helpful for understanding the memory footprint of an object.

  was:
Expose SizeEstimator as a developer API in pyspark.

SizeEstimator is not available in pyspark.
This API will be helpful for understanding the memory footprint of an object.


> Expose SizeEstimator as a developer API in pyspark
> --
>
> Key: SPARK-26229
> URL: https://issues.apache.org/jira/browse/SPARK-26229
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ranjith Pulluru
>Priority: Minor
>
> SizeEstimator is not available in pyspark.
> This API will be helpful for understanding the memory footprint of an object.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark

2018-11-30 Thread Ranjith Pulluru (JIRA)
Ranjith Pulluru created SPARK-26229:
---

 Summary: Expose SizeEstimator as a developer API in pyspark
 Key: SPARK-26229
 URL: https://issues.apache.org/jira/browse/SPARK-26229
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 2.3.0
Reporter: Ranjith Pulluru


Expose SizeEstimator as a developer API in pyspark.

SizeEstimator is not available in pyspark.
This API will be helpful for understanding the memory footprint of an object.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25528) data source V2 API refactoring (batch read)

2018-11-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25528.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> data source V2 API refactoring (batch read)
> ---
>
> Key: SPARK-25528
> URL: https://issues.apache.org/jira/browse/SPARK-25528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Refactor the read-side API according to this abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
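
As a rough illustration of that abstraction, a schematic Scala sketch follows; the trait and method names are illustrative only and do not match the real data source V2 interfaces:

{code}
// Schematic sketch of the read-side abstraction; names are illustrative.
trait Catalog {
  def loadTable(name: String): Table
}

trait Table {
  // Batch: the table directly produces a scan.
  def newScan(): Scan
  // Streaming: the table first produces a stream, which yields a scan per
  // micro-batch or epoch.
  def newStream(): ReadStream
}

trait ReadStream {
  def nextScan(): Scan
}

trait Scan {
  // Split the data into partitions that executors can read in parallel.
  def planPartitions(): Seq[ReadPartition]
}

trait ReadPartition {
  def read(): Iterator[Seq[Any]]
}
{code}

The point of the refactoring is that the batch and streaming paths share the catalog -> table prefix and only diverge at the stream step.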



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org