[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739306#comment-15739306
 ] 

yuhao yang commented on SPARK-18813:


Thanks for the response. Those are great metrics.

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. For contributors
> Getting started
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time contributor, please always start with a small 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a larger feature.
> Coordinating on JIRA
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start work. This is to avoid duplicate work. For small patches, you do 
> not need to get the JIRA assigned to you to begin work.
> * For medium/large features or features with dependencies, please get 
> assigned first before coding and keep the ETA updated on the JIRA. If there 
> is no activity on the JIRA page for a certain amount of time, the JIRA should 
> be released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Do not set these fields: Target Version, Fix Version, or Shepherd.  Only 
> Committers should set those.
> Writing and reviewing PRs
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.*
> h2. For Committers
> Adding to this roadmap
> * You can update the roadmap by (a) adding issues to this list and (b) 
> setting Target Versions.  Only Committers may make these changes.
> * *If you add an issue to this roadmap or set a Target Version, you _must_ 
> assign yourself or another Committer as Shepherd.*
> * This list should be actively managed during the release.
> * If you 

[jira] [Assigned] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18810:


Assignee: Apache Spark  (was: Felix Cheung)

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> We now publish source archives of the SparkR package in RCs and in nightly 
> snapshot builds. One problem that still remains is that `install.spark` does 
> not work for these, because it looks for the final Spark version to be 
> present in the Apache download mirrors.






[jira] [Commented] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739257#comment-15739257
 ] 

Apache Spark commented on SPARK-18810:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16248

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>
> We now publish source archives of the SparkR package in RCs and in nightly 
> snapshot builds. One problem that still remains is that `install.spark` does 
> not work for these, because it looks for the final Spark version to be 
> present in the Apache download mirrors.






[jira] [Assigned] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18810:


Assignee: Felix Cheung  (was: Apache Spark)

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>
> We now publish source archives of the SparkR package in RCs and in nightly 
> snapshot builds. One problem that still remains is that `install.spark` does 
> not work for these, because it looks for the final Spark version to be 
> present in the Apache download mirrors.






[jira] [Commented] (SPARK-18799) Spark SQL expose interface for pluggable parser extension

2016-12-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739210#comment-15739210
 ] 

Hyukjin Kwon commented on SPARK-18799:
--

Could we resolve this as a duplicate if it is one?

> Spark SQL expose interface for pluggable parser extension 
> ---
>
> Key: SPARK-18799
> URL: https://issues.apache.org/jira/browse/SPARK-18799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jihong MA
>
> In all Spark 1.x versions there was an interface for plugging in a parser 
> extension through ParserDialect in HiveContext. Since the Spark 2.x releases 
> moved to the new ANTLR4-based parser, there is no longer a way to extend the 
> default SQL parser through the SparkSession interface. This is painful and 
> hard to work around when integrating other data sources with Spark with 
> extended support for Insert, Update, Delete, or other data management 
> statements. 
> It would be very nice to continue to expose an interface for parser 
> extensions, to make data source integration easier and smoother. 






[jira] [Updated] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18815:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-16026

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Resolved] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18815.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.2.0
   2.1.1

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates

2016-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739095#comment-15739095
 ] 

Joseph K. Bradley commented on SPARK-18332:
---

[~felixcheung], [~yanboliang] I noticed that the SparkR API docs contain many 
duplicate links with suffixes {{-method}} or {{-class}} in the [index | 
http://spark.apache.org/docs/latest/api/R/index.html].  E.g., {{atan}} and 
{{atan-method}} link to the same doc.  Is there a way to fix that?

> SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].






[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739090#comment-15739090
 ] 

Joseph K. Bradley commented on SPARK-18813:
---

How do these look?

MLlib, sorted by: [Votes | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
 or [Watchers | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]

SparkR, sorted by: [Votes | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
 or [Watchers | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]

These are definitely rough metrics, but perhaps they will become more useful 
and meaningful if more people vote and watch issues.
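For anyone who would rather pull these rankings programmatically than through the 
web UI, here is a minimal sketch against JIRA's public REST search endpoint; it 
assumes the third-party {{requests}} library and is illustrative only:

{code}
# Rank open ML/MLlib issues by votes via the JIRA REST API (sketch only).
import requests

JQL = ('project = SPARK AND status in (Open, "In Progress", Reopened) '
       'AND component in (ML, MLlib) ORDER BY votes DESC')

resp = requests.get(
    'https://issues.apache.org/jira/rest/api/2/search',
    params={'jql': JQL, 'fields': 'summary,votes,watches', 'maxResults': 20})
resp.raise_for_status()

for issue in resp.json()['issues']:
    fields = issue['fields']
    print(issue['key'],
          fields['votes']['votes'],         # vote count
          fields['watches']['watchCount'],  # watcher count
          fields['summary'])
{code}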

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. For contributors
> Getting started
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time contributor, please always start with a small 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a larger feature.
> Coordinating on JIRA
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start work. This is to avoid duplicate work. For small patches, you do 
> not need to get the JIRA assigned to you to begin work.
> * For medium/large features or features with dependencies, please get 
> assigned first before coding 

[jira] [Resolved] (SPARK-4587) Model export/import

2016-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-4587.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

I'm going to mark this complete.  This was for the RDD-based API, and the last 
item we will likely add was completed in 2.0.  Thanks a lot everyone!

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.0.0
>
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization, but 
> that doesn't work for models stored in a distributed fashion, e.g., ALS and 
> LDA. Ideally, we should provide save/load methods for every model. PMML is an 
> option, but it has its limitations. There are a couple of things we need to 
> discuss: 1) data format, 2) how to preserve partitioning, 3) data 
> compatibility between versions and language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> UPDATE: As in the design doc, we plan to support:
> * Our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding the complexity of 
> PMML).
> * PMML
> ** This is needed since it is the most commonly used format in industry.
> This JIRA will be for the internal Spark-specific format described in the 
> design doc. Parallel JIRAs will cover PMML.
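As a usage illustration of the save/load API this umbrella produced for the 
RDD-based models, here is a minimal PySpark sketch with one of the distributed 
models mentioned above (ALS); the toy ratings and output path are illustrative 
only and assume a running {{pyspark}} shell:

{code}
# Train a tiny ALS model, save it in the Spark-specific format, and reload it.
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel

ratings = sc.parallelize([(0, 0, 4.0), (0, 1, 2.0), (1, 0, 1.0), (1, 1, 3.0)])
model = ALS.train(ratings, rank=2, iterations=5)

model.save(sc, "/tmp/als-model-demo")
same_model = MatrixFactorizationModel.load(sc, "/tmp/als-model-demo")
{code}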






[jira] [Closed] (SPARK-6722) Model import/export for StreamingKMeansModel

2016-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-6722.

Resolution: Won't Fix

I'm going to close this.  While it would be nice to have, it may take a back 
seat because of the shift to DataFrames and Structured Streaming.  Now to add 
this algorithm to the DataFrame-based API...

> Model import/export for StreamingKMeansModel
> 
>
> Key: SPARK-6722
> URL: https://issues.apache.org/jira/browse/SPARK-6722
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> CC: [~freeman-lab] Is this API stable enough to merit adding import/export 
> (which will require supporting the model format version from now on)?






[jira] [Resolved] (SPARK-5991) Python API for ML model import/export

2016-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5991.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

I'm going to mark this complete.  We finished the last subtask in 2.0.  Thanks 
everyone!

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.0.0
>
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).






[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739071#comment-15739071
 ] 

Joseph K. Bradley commented on SPARK-18813:
---

I definitely agree that it's becoming increasingly difficult to aggregate all 
of the feedback from mailing lists, JIRA, Github, events, and other sources.  I 
actually don't think that this needs to live outside of current ASF 
infrastructure.  My wish would be that we could take better advantage of JIRA 
as a form of voting: the number of Watchers or Votes is often a good metric of 
interest.  The main thing lacking is a good way to search based on interest, as 
well as encouragement for the community to use those mechanisms.

I'll try to add some more search links above for ordering based on interest.

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. For contributors
> Getting started
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time contributor, please always start with a small 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a larger feature.
> Coordinating on JIRA
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start work. This is to avoid duplicate work. For small patches, you do 
> not need to get the JIRA assigned to you to begin work.
> * For medium/large features or features with dependencies, please get 
> assigned first before coding and keep the ETA updated on the JIRA. If there 
> is no activity on the JIRA page for a certain amount of time, the JIRA should 
> be released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Do not set these fields: Target Version, Fix Version, or Shepherd.  Only 
> Committers should set those.
> Writing and reviewing PRs
> * Remember to add the 

[jira] [Assigned] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-10 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-18810:


Assignee: Felix Cheung

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>
> We now publish source archives of the SparkR package in RCs and in nightly 
> snapshot builds. One problem that still remains is that `install.spark` does 
> not work for these, because it looks for the final Spark version to be 
> present in the Apache download mirrors.






[jira] [Updated] (SPARK-18819) Failure to read single-row Parquet files

2016-12-10 Thread Michael Kamprath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Kamprath updated SPARK-18819:
-
Description: 
When I create a DataFrame in PySpark with a small row count (less than the 
number of executors), write it to a Parquet file, load that Parquet file into a 
new DataFrame, and then do any sort of read against the loaded DataFrame, Spark 
fails with an {{ExecutorLostFailure}}.

Example code to replicate this issue:

{code}
from pyspark.sql.types import *

rdd = sc.parallelize([('row1', 1, 4.33, 'name'), ('row2', 2, 3.14, 'string')])
my_schema = StructType([
    StructField("id", StringType(), True),
    StructField("value1", IntegerType(), True),
    StructField("value2", DoubleType(), True),
    StructField("name", StringType(), True)
])
df = spark.createDataFrame(rdd, schema=my_schema)
df.write.parquet('hdfs://master:9000/user/michael/test_data', mode='overwrite')

newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
newdf.take(1)
{code}

The error I get when the {{take}} step runs is:

{code}
Py4JJavaError: An error occurred while calling o54.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
8, 10.10.10.4): ExecutorLostFailure (executor 0 exited caused by one of the 
running tasks) Reason: Remote RPC client disassociated. Likely due to 
containers exceeding thresholds, or network issues. Check driver logs for WARN 
messages.
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
{code}

I have tested this against HDFS 2.7 and QFS 1.2, with the same results in both 
cases. However, it doesn't break when running Spark locally and reading/writing 
to the local file system.
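
A quick way to confirm the "fewer rows than executors" condition described above 
(sketch only, reusing the paths from the example) is to compare the partition 
count of the reloaded DataFrame with the cluster's default parallelism before 
calling an action on it:

{code}
newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
print("partitions:", newdf.rdd.getNumPartitions())
print("default parallelism:", spark.sparkContext.defaultParallelism)
{code}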


[jira] [Updated] (SPARK-18819) Failure to read single-row Parquet files

2016-12-10 Thread Michael Kamprath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Kamprath updated SPARK-18819:
-
Description: 
When I create a DataFrame in PySpark with a small row count (less than the 
number of executors), write it to a Parquet file, load that Parquet file into a 
new DataFrame, and then do any sort of read against the loaded DataFrame, Spark 
fails with an {{ExecutorLostFailure}}.

Example code to replicate this issue:

{code}
from pyspark.sql.types import *

rdd = sc.parallelize([('row1', 1, 4.33, 'name'), ('row2', 2, 3.14, 'string')])
my_schema = StructType([
    StructField("id", StringType(), True),
    StructField("value1", IntegerType(), True),
    StructField("value2", DoubleType(), True),
    StructField("name", StringType(), True)
])
df = spark.createDataFrame(rdd, schema=my_schema)
df.write.parquet('hdfs://master:9000/user/michael/test_data', mode='overwrite')

newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
newdf.take(1)
{code}

The error I get when the {{take}} step runs is:

{code}
Py4JJavaError: An error occurred while calling o54.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
8, 10.10.10.4): ExecutorLostFailure (executor 0 exited caused by one of the 
running tasks) Reason: Remote RPC client disassociated. Likely due to 
containers exceeding thresholds, or network issues. Check driver logs for WARN 
messages.
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
{code}

I have tested this against HDFS 2.7, the local file system, and QFS 1.2. All 
have the same results.

I generally discovered this when processing larger files that have individual 

[jira] [Created] (SPARK-18819) Failure to read single-row Parquet files

2016-12-10 Thread Michael Kamprath (JIRA)
Michael Kamprath created SPARK-18819:


 Summary: Failure to read single-row Parquet files
 Key: SPARK-18819
 URL: https://issues.apache.org/jira/browse/SPARK-18819
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, PySpark
Affects Versions: 2.0.2
 Environment: Ubuntu 14.04 LTS on ARM 7.1
Reporter: Michael Kamprath
Priority: Critical


When I create a DataFrame in PySpark with a small row count (less than the 
number of executors), write it to a Parquet file, load that Parquet file into a 
new DataFrame, and then do any sort of read against the loaded DataFrame, Spark 
fails with an {{ExecutorLostFailure}}.

Example code to replicate this issue:

{code}
from pyspark.sql.types import *

rdd = sc.parallelize([('row1', 1, 4.33, 'name'), ('row2', 2, 3.14, 'string')])
my_schema = StructType([
    StructField("id", StringType(), True),
    StructField("value1", IntegerType(), True),
    StructField("value2", DoubleType(), True),
    StructField("name", StringType(), True)
])
df = spark.createDataFrame(rdd, schema=my_schema)
df.write.parquet('hdfs://master:9000/user/michael/test_data', mode='overwrite')

newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
newdf.take(1)
{code}

The error I get is:

{code}
Py4JJavaError: An error occurred while calling o54.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
8, 10.10.10.4): ExecutorLostFailure (executor 0 exited caused by one of the 
running tasks) Reason: Remote RPC client disassociated. Likely due to 
containers exceeding thresholds, or network issues. Check driver logs for WARN 
messages.
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 

[jira] [Created] (SPARK-18818) Window...orderBy() should accept an 'ascending' parameter just like DataFrame.orderBy()

2016-12-10 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18818:


 Summary: Window...orderBy() should accept an 'ascending' parameter 
just like DataFrame.orderBy()
 Key: SPARK-18818
 URL: https://issues.apache.org/jira/browse/SPARK-18818
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor


It seems inconsistent that {{Window...orderBy()}} does not accept an 
{{ascending}} parameter, when {{DataFrame.orderBy()}} does.

It's also slightly inconvenient since to specify a descending sort order you 
have to build a column object, whereas with the {{ascending}} parameter you 
don't.

For example:

{code}
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df.select(
    row_number()
    .over(
        Window
        .partitionBy(...)
        .orderBy('timestamp', ascending=False)))
{code}

vs.

{code}
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

df.select(
    row_number()
    .over(
        Window
        .partitionBy(...)
        .orderBy(col('timestamp').desc())))
{code}

It would be better if {{Window...orderBy()}} supported an {{ascending}} 
parameter just like {{DataFrame.orderBy()}}.
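
In the meantime, a small hypothetical wrapper (not an existing PySpark API; the 
{{user_id}} column name is just for illustration) shows how the proposed 
{{ascending}} flag maps onto the column expressions {{Window...orderBy()}} 
already accepts:

{code}
from pyspark.sql.functions import col
from pyspark.sql.window import Window


def window_order_by(window_spec, cols, ascending=True):
    """Order window_spec by the given column names, descending if ascending=False."""
    exprs = [col(c) if ascending else col(c).desc() for c in cols]
    return window_spec.orderBy(*exprs)


w = window_order_by(Window.partitionBy('user_id'), ['timestamp'], ascending=False)
{code}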






[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738558#comment-15738558
 ] 

Apache Spark commented on SPARK-18817:
--

User 'bdwyer2' has created a pull request for this issue:
https://github.com/apache/spark/pull/16247

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> "Packages should not write in the users’ home filespace, nor anywhere else on 
> the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace)."
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.






[jira] [Assigned] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18817:


Assignee: (was: Apache Spark)

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> "Packages should not write in the users’ home filespace, nor anywhere else on 
> the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace)."
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.






[jira] [Assigned] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18817:


Assignee: Apache Spark

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Assignee: Apache Spark
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> "Packages should not write in the users’ home filespace, nor anywhere else on 
> the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace)."
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.






[jira] [Created] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-10 Thread Brendan Dwyer (JIRA)
Brendan Dwyer created SPARK-18817:
-

 Summary: Ensure nothing is written outside R's tempdir() by default
 Key: SPARK-18817
 URL: https://issues.apache.org/jira/browse/SPARK-18817
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Brendan Dwyer


Per CRAN policies
https://cran.r-project.org/web/packages/policies.html
"Packages should not write in the users’ home filespace, nor anywhere else on 
the file system apart from the R session’s temporary directory (or during 
installation in the location pointed to by TMPDIR: and such usage should be 
cleaned up). Installing into the system’s R installation (e.g., scripts to its 
bin directory) is not allowed.
Limited exceptions may be allowed in interactive sessions if the package 
obtains confirmation from the user.

- Packages should not modify the global environment (user’s workspace)."

Currently "spark-warehouse" gets created in the working directory when 
sparkR.session() is called.
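
The directory in question is controlled by the {{spark.sql.warehouse.dir}} 
configuration. A minimal sketch of redirecting it to a temporary directory is 
shown below in PySpark for illustration; the SparkR fix would presumably pass 
the equivalent setting, e.g. through the {{sparkConfig}} argument of 
{{sparkR.session()}}:

{code}
# Sketch only: point the warehouse directory at a temp dir so nothing is
# created in the user's working directory.
import tempfile

from pyspark.sql import SparkSession

warehouse_dir = tempfile.mkdtemp(prefix="spark-warehouse-")
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", warehouse_dir)
         .getOrCreate())
{code}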






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738477#comment-15738477
 ] 

Saikat Kanjilal commented on SPARK-9487:


Then I would suggest keeping it open, focusing on a particular module, and 
making the unit tests in that module robust. Is there a specific module that is 
in dire need of more robust unit tests? I was thinking of picking the SQL 
module and making its unit tests more robust as a first goal. Thoughts?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
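
A minimal sketch of the partition-ID dependence described above (a toy example, 
not taken from the actual test suites): with a random number generator seeded 
per partition, {{local[2]}} and {{local[4]}} produce different values for the 
same input, because the default number of partitions follows the number of 
local worker threads.

{code}
import random

from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-dependence-demo")  # vs. "local[2]"


def seeded_random_per_partition(index, iterator):
    rng = random.Random(index)  # seed with the partition ID
    return [(index, x, rng.random()) for x in iterator]


values = (sc.parallelize(range(8))
            .mapPartitionsWithIndex(seeded_random_per_partition)
            .collect())
print(values)  # differs between local[2] and local[4]
{code}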






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738424#comment-15738424
 ] 

Sean Owen commented on SPARK-9487:
--

Well, this JIRA is implicitly about making a test or two more robust in order 
to effect this change. I don't see what opening another JIRA does. 

This isn't a must-have JIRA anyway. I think it's solvable: we've discussed 
general strategies for debugging failures here, and on the PR I suggested 
specific fixes to specific failures. I don't think anything is blocking this 
other than just doing it. It's not trivial, but if it isn't something you or 
anyone else can get working with reasonable effort, I think it's better to just 
abandon this.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738419#comment-15738419
 ] 

Saikat Kanjilal commented on SPARK-9487:


I'm OK with closing it, actually, but it does outline issues with the 
robustness of the unit tests. Should we open another JIRA, or reframe this 
effort as making the unit tests more robust? That may require some more 
thought/redesign to produce identical results locally as well as in Jenkins. My 
vote would be to close this out and create another JIRA that I can take on to 
make the unit tests more robust for one specific component, with very narrowly 
defined goals. What do you think?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Updated] (SPARK-18803) Fix path-related and JarEntry-related test failures and skip some tests that fail on Windows due to the path length limitation

2016-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18803:
--
Assignee: Hyukjin Kwon

> Fix path-related and JarEntry-related test failures and skip some tests that 
> fail on Windows due to the path length limitation
> ---
>
> Key: SPARK-18803
> URL: https://issues.apache.org/jira/browse/SPARK-18803
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Some tests fail on Windows, as shown below, for several reasons.
> *Incorrect path handling*
> - {{FileSuite}}
> {code}
> [info] - binary file input as byte array *** FAILED *** (500 milliseconds)
> [info]   
> "file:/C:/projects/spark/target/tmp/spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624/record-bytestream-0.bin"
>  did not contain 
> "C:\projects\spark\target\tmp\spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624\record-bytestream-0.bin"
>  (FileSuite.scala:258)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.FileSuite$$anonfun$14.apply$mcV$sp(FileSuite.scala:258)
> [info]   at org.apache.spark.FileSuite$$anonfun$14.apply(FileSuite.scala:239)
> [info]   at org.apache.spark.FileSuite$$anonfun$14.apply(FileSuite.scala:239)
> ...
> {code}
> {code}
> [info] - Get input files via old Hadoop API *** FAILED *** (1 second, 94 
> milliseconds)
> [info]   
> Set("/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-0",
>  
> "/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-1")
>  did not equal 
> Set("C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-0",
>  
> "C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-1")
>  (FileSuite.scala:535)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.FileSuite$$anonfun$29.apply$mcV$sp(FileSuite.scala:535)
> [info]   at org.apache.spark.FileSuite$$anonfun$29.apply(FileSuite.scala:524)
> [info]   at org.apache.spark.FileSuite$$anonfun$29.apply(FileSuite.scala:524)
> ...
> {code}
> {code}
> [info] - Get input files via new Hadoop API *** FAILED *** (313 milliseconds)
> [info]   
> Set("/C:/projects/spark/target/tmp/spark-12bc1540--4df6-9c4d-79e0e614407c/output/part-0",
>  
> "/C:/projects/spark/target/tmp/spark-12bc1540--4df6-9c4d-79e0e614407c/output/part-1")
>  did not equal 
> Set("C:\projects\spark\target\tmp\spark-12bc1540--4df6-9c4d-79e0e614407c\output/part-0",
>  
> "C:\projects\spark\target\tmp\spark-12bc1540--4df6-9c4d-79e0e614407c\output/part-1")
>  (FileSuite.scala:549)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.FileSuite$$anonfun$30.apply$mcV$sp(FileSuite.scala:549)
> [info]   at org.apache.spark.FileSuite$$anonfun$30.apply(FileSuite.scala:538)
> [info]   at org.apache.spark.FileSuite$$anonfun$30.apply(FileSuite.scala:538)
> ...
> {code}
> - {{TaskResultGetterSuite}}
> {code}
> [info] - handling results larger than max RPC message size *** FAILED *** (1 
> second, 579 milliseconds)
> [info]   1 did not equal 0 Expect result to be removed from the block 
> manager. (TaskResultGetterSuite.scala:129)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.scheduler.TaskResultGetterSuite$$anonfun$4.apply$mcV$sp(TaskResultGetterSuite.scala:129)
> [info]   at 
> 
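The failures above boil down to comparing a file: URI produced by Spark against a 
platform-native path string. A minimal sketch of the kind of normalization such an 
assertion needs (an illustration only, not the actual fix applied for this issue):

{code}
import java.io.File
import java.net.URI

// Canonicalize both sides so separators and drive-letter casing no longer matter
// when a "file:" URI is compared with a native Windows path.
def samePath(actualFileUri: String, expectedNativePath: String): Boolean =
  new File(new URI(actualFileUri)).getCanonicalPath ==
    new File(expectedNativePath).getCanonicalPath
{code}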

[jira] [Resolved] (SPARK-18803) Fix path-related and JarEntry-related test failures and skip some tests failed on Windows due to path length limitation

2016-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18803.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16234
[https://github.com/apache/spark/pull/16234]

> Fix path-related and JarEntry-related test failures and skip some tests 
> failed on Windows due to path length limitation
> ---
>
> Key: SPARK-18803
> URL: https://issues.apache.org/jira/browse/SPARK-18803
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Some tests fail on Windows, as shown below, due to several problems.
> *Incorrect path handling*
> - {{FileSuite}}
> {code}
> [info] - binary file input as byte array *** FAILED *** (500 milliseconds)
> [info]   
> "file:/C:/projects/spark/target/tmp/spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624/record-bytestream-0.bin"
>  did not contain 
> "C:\projects\spark\target\tmp\spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624\record-bytestream-0.bin"
>  (FileSuite.scala:258)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.FileSuite$$anonfun$14.apply$mcV$sp(FileSuite.scala:258)
> [info]   at org.apache.spark.FileSuite$$anonfun$14.apply(FileSuite.scala:239)
> [info]   at org.apache.spark.FileSuite$$anonfun$14.apply(FileSuite.scala:239)
> ...
> {code}
> {code}
> [info] - Get input files via old Hadoop API *** FAILED *** (1 second, 94 
> milliseconds)
> [info]   
> Set("/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-0",
>  
> "/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-1")
>  did not equal 
> Set("C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-0",
>  
> "C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-1")
>  (FileSuite.scala:535)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.FileSuite$$anonfun$29.apply$mcV$sp(FileSuite.scala:535)
> [info]   at org.apache.spark.FileSuite$$anonfun$29.apply(FileSuite.scala:524)
> [info]   at org.apache.spark.FileSuite$$anonfun$29.apply(FileSuite.scala:524)
> ...
> {code}
> {code}
> [info] - Get input files via new Hadoop API *** FAILED *** (313 milliseconds)
> [info]   
> Set("/C:/projects/spark/target/tmp/spark-12bc1540--4df6-9c4d-79e0e614407c/output/part-0",
>  
> "/C:/projects/spark/target/tmp/spark-12bc1540--4df6-9c4d-79e0e614407c/output/part-1")
>  did not equal 
> Set("C:\projects\spark\target\tmp\spark-12bc1540--4df6-9c4d-79e0e614407c\output/part-0",
>  
> "C:\projects\spark\target\tmp\spark-12bc1540--4df6-9c4d-79e0e614407c\output/part-1")
>  (FileSuite.scala:549)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.FileSuite$$anonfun$30.apply$mcV$sp(FileSuite.scala:549)
> [info]   at org.apache.spark.FileSuite$$anonfun$30.apply(FileSuite.scala:538)
> [info]   at org.apache.spark.FileSuite$$anonfun$30.apply(FileSuite.scala:538)
> ...
> {code}
> - {{TaskResultGetterSuite}}
> {code}
> [info] - handling results larger than max RPC message size *** FAILED *** (1 
> second, 579 milliseconds)
> [info]   1 did not equal 0 Expect result to be removed from the block 
> manager. (TaskResultGetterSuite.scala:129)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
> [info]   at 
> org.apache.spark.scheduler.TaskResultGetterSuite$$anonfun$4.apply$mcV$sp(TaskResultGetterSuite.scala:129)
> [info]   at 
> 

[jira] [Commented] (SPARK-18806) driverwrapper and executor doesn't exit when worker killed

2016-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738389#comment-15738389
 ] 

Sean Owen commented on SPARK-18806:
---

Are you saying it's not a problem?

> driverwrapper and executor doesn't exit when worker killed
> --
>
> Key: SPARK-18806
> URL: https://issues.apache.org/jira/browse/SPARK-18806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
> Environment: java1.8
>Reporter: liujianhui
>
> When an application is submitted in standalone-cluster mode, the master 
> launches an executor and a DriverWrapper on the worker. Both start a 
> WorkerWatcher to watch the worker; as a result, when the worker is killed 
> manually, the DriverWrapper and executor sometimes do not exit.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738383#comment-15738383
 ] 

Sean Owen commented on SPARK-9487:
--

Yes, that probably means the test changes aren't quite robust in their new 
form. Getting them to pass locally and on Jenkins indicates they're at least 
general enough to pass across both environments, and of course we have to get 
them to pass on Jenkins. It can be hard to debug: try a different machine? Try 
loosening conditions? You can push changes to a WIP PR to see how Jenkins 
treats them. I think we need to bring this to a conclusion, though. Right now 
I'm not clear this solves enough of a problem to bother with, so I'm inclined 
to close it.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this leads to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738309#comment-15738309
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] I think the above plan is great, minus one fundamental flaw: I already 
have tests passing uniformly across multiple components locally; the issue I am 
running into is getting the tests to pass on Jenkins, even though every change 
I've made passes the unit tests locally. Until the discrepancy between my local 
environment and Jenkins is resolved, I don't see a clever way to get the tests 
to pass, so let me know your thoughts on a good way to get past this. After we 
figure this out, I can pick a set of components to move to a uniform number of 
threads.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this leads to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Assigned] (SPARK-18766) Push Down Filter Through BatchEvalPython

2016-12-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18766:
---

Assignee: Xiao Li

> Push Down Filter Through BatchEvalPython
> 
>
> Key: SPARK-18766
> URL: https://issues.apache.org/jira/browse/SPARK-18766
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Currently, when users use a Python UDF in a Filter, {{BatchEvalPython}} is 
> always generated below {{FilterExec}}. However, not all of the predicates need 
> to be evaluated after the Python UDF runs, so we can push those predicates 
> down through {{BatchEvalPython}}.
> {noformat}
> >>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
> >>> ["key", "value"])
> >>> from pyspark.sql.functions import udf, col
> >>> from pyspark.sql.types import BooleanType
> >>> my_filter = udf(lambda a: a < 2, BooleanType())
> >>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) 
> >>> & (df.value < "2"))
> >>> sel.explain(True)
> {noformat}
> {noformat}
> == Physical Plan ==
> *Project [key#0L, value#1]
> +- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
>+- BatchEvalPython [(key#0L)], [key#0L, value#1, pythonUDF0#9]
>   +- Scan ExistingRDD[key#0L,value#1]
> {noformat}
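A simplified, self-contained sketch of the idea (a toy model of the optimization, 
not the actual Catalyst rule): split the conjuncts of the Filter by whether they 
reference the UDF's output column, keep those above {{BatchEvalPython}}, and push 
the rest below it.

{code}
// Toy model of the conjuncts in the plan above.
final case class Predicate(sql: String, referencesUdfOutput: Boolean)

// Conjuncts that depend on pythonUDF0#9 must stay above BatchEvalPython;
// the rest can be evaluated before the Python UDF runs.
def split(conjuncts: Seq[Predicate]): (Seq[Predicate], Seq[Predicate]) =
  conjuncts.partition(_.referencesUdfOutput)

val conjuncts = Seq(
  Predicate("pythonUDF0#9", referencesUdfOutput = true),       // my_filter(key)
  Predicate("isnotnull(value#1)", referencesUdfOutput = false),
  Predicate("value#1 < 2", referencesUdfOutput = false)
)
val (stayAbove, pushBelow) = split(conjuncts)
println("stay above BatchEvalPython: " + stayAbove.map(_.sql).mkString(" && "))
println("push below BatchEvalPython: " + pushBelow.map(_.sql).mkString(" && "))
{code}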






[jira] [Resolved] (SPARK-18766) Push Down Filter Through BatchEvalPython

2016-12-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18766.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16193
[https://github.com/apache/spark/pull/16193]

> Push Down Filter Through BatchEvalPython
> 
>
> Key: SPARK-18766
> URL: https://issues.apache.org/jira/browse/SPARK-18766
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
> Fix For: 2.2.0
>
>
> Currently, when users use a Python UDF in a Filter, {{BatchEvalPython}} is 
> always generated below {{FilterExec}}. However, not all of the predicates need 
> to be evaluated after the Python UDF runs, so we can push those predicates 
> down through {{BatchEvalPython}}.
> {noformat}
> >>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
> >>> ["key", "value"])
> >>> from pyspark.sql.functions import udf, col
> >>> from pyspark.sql.types import BooleanType
> >>> my_filter = udf(lambda a: a < 2, BooleanType())
> >>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) 
> >>> & (df.value < "2"))
> >>> sel.explain(True)
> {noformat}
> {noformat}
> == Physical Plan ==
> *Project [key#0L, value#1]
> +- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
>+- BatchEvalPython [(key#0L)], [key#0L, value#1, pythonUDF0#9]
>   +- Scan ExistingRDD[key#0L,value#1]
> {noformat}






[jira] [Updated] (SPARK-18606) [HISTORYSERVER]It will check html elems while searching HistoryServer

2016-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18606:
--
  Assignee: Tao Wang
Issue Type: Improvement  (was: Bug)

> [HISTORYSERVER]It will check html elems while searching HistoryServer
> -
>
> Key: SPARK-18606
> URL: https://issues.apache.org/jira/browse/SPARK-18606
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Tao Wang
>Assignee: Tao Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> When we search applications in the HistoryServer, the search includes all of 
> the content between the tags, which pulls in useless elements such as "href" 
> and makes the results confusing. We should remove those to make the results 
> clear.
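A hedged sketch (not the actual patch) of the kind of cleanup involved: strip 
markup from a table cell before matching the user's search term, so attributes 
such as "href" no longer leak into the results.

{code}
// Return only the human-visible text of an HTML fragment for search matching.
def searchableText(cellHtml: String): String =
  cellHtml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim

// e.g. searchableText("""<a href="/history/app-123/jobs/">app-123</a>""") == "app-123"
{code}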






[jira] [Resolved] (SPARK-18606) [HISTORYSERVER]It will check html elems while searching HistoryServer

2016-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18606.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16031
[https://github.com/apache/spark/pull/16031]

> [HISTORYSERVER]It will check html elems while searching HistoryServer
> -
>
> Key: SPARK-18606
> URL: https://issues.apache.org/jira/browse/SPARK-18606
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Tao Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> When we search applications in the HistoryServer, the search includes all of 
> the content between the tags, which pulls in useless elements such as "href" 
> and makes the results confusing. We should remove those to make the results 
> clear.






[jira] [Assigned] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18814:


Assignee: Herman van Hovell  (was: Apache Spark)

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Herman van Hovell
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738058#comment-15738058
 ] 

Apache Spark commented on SPARK-18814:
--

User 'nsyca' has created a pull request for this issue:
https://github.com/apache/spark/pull/16246

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Herman van Hovell
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Assigned] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18814:


Assignee: Apache Spark  (was: Herman van Hovell)

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Assigned] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-18814:
-

Assignee: Herman van Hovell

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Herman van Hovell
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Updated] (SPARK-17460) Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception

2016-12-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17460:

Assignee: Huaxin Gao

> Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
> --
>
> Key: SPARK-17460
> URL: https://issues.apache.org/jira/browse/SPARK-17460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark 2.0 in local mode as well as on GoogleDataproc
>Reporter: Chris Perluss
>Assignee: Huaxin Gao
> Fix For: 2.1.0
>
>
> Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes 
> in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0.
> The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
> datatype Int.  In my dataset, there is an Array column whose data size 
> exceeds the limits of an Int and so the data size becomes negative.
> The issue can be repeated by running this code in REPL:
> val ds = (0 to 1).map(i => (i, Seq((i, Seq((i, "This is really not that 
> long of a string")))))).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this 
> to work
> val stats = ds.logicalPlan.statistics
> yields
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
> Statistics(-1890686892,false)
> This causes joinWith to perform a broadcast join even though my data is 
> gigabytes in size, which of course causes the executors to run out of 
> memory.
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the 
> logicalPlan.statistics.sizeInBytes is a large negative number and thus it is 
> less than the join threshold of -1.
> I've been able to work around this issue by setting 
> autoBroadcastJoinThreshold to a very large negative number.
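As a concrete rendering of the workaround described above (assuming a `spark` 
session as in the shell; the exact value is illustrative, the report only says 
"a very large negative number"): make the threshold even more negative than the 
overflowed size estimate so the broadcast is never chosen.

{code}
// sizeInBytes has overflowed to a large negative number, so the default threshold
// of -1 still "exceeds" it; push the threshold below the overflowed estimate instead.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", Int.MinValue.toString)
{code}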






[jira] [Resolved] (SPARK-17460) Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception

2016-12-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17460.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16175
[https://github.com/apache/spark/pull/16175]

> Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
> --
>
> Key: SPARK-17460
> URL: https://issues.apache.org/jira/browse/SPARK-17460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark 2.0 in local mode as well as on GoogleDataproc
>Reporter: Chris Perluss
> Fix For: 2.1.0
>
>
> Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes 
> in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0.
> The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
> datatype Int.  In my dataset, there is an Array column whose data size 
> exceeds the limits of an Int and so the data size becomes negative.
> The issue can be repeated by running this code in REPL:
> val ds = (0 to 1).map(i => (i, Seq((i, Seq((i, "This is really not that 
> long of a string")))))).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this 
> to work
> val stats = ds.logicalPlan.statistics
> yields
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
> Statistics(-1890686892,false)
> This causes joinWith to perform a broadcast join even though my data is 
> gigabytes in size, which of course causes the executors to run out of 
> memory.
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the 
> logicalPlan.statistics.sizeInBytes is a large negative number and thus it is 
> less than the join threshold of -1.
> I've been able to work around this issue by setting 
> autoBroadcastJoinThreshold to a very large negative number.






[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737865#comment-15737865
 ] 

Nattavut Sutyanyong commented on SPARK-18814:
-

I have a potential fix; it works but it's not pretty. I want to step back and 
think about it more but if a fix is urgently needed, I can submit a PR for it.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737710#comment-15737710
 ] 

Nattavut Sutyanyong commented on SPARK-18814:
-

[~cloud_fan] FYI, part of the problem here is the deDuplicate logic that 
rewires the group-by column to a new ExprId in the middle of the check; a 
similar problem is tracked by SPARK-17154.

{code}
// Make sure the inner and the outer query attributes do not collide.
val outputSet = outer.map(_.outputSet).reduce(_ ++ _)
val duplicates = basePlan.outputSet.intersect(outputSet)
val (plan, deDuplicatedConditions) = if (duplicates.nonEmpty) {
  val aliasMap = AttributeMap(duplicates.map { dup =>
    dup -> Alias(dup, dup.toString)()
  }.toSeq)
  val aliasedExpressions = basePlan.output.map { ref =>
    aliasMap.getOrElse(ref, ref)
  }
  val aliasedProjection = Project(aliasedExpressions, basePlan)
  val aliasedConditions = baseConditions.map(_.transform {
    case ref: Attribute => aliasMap.getOrElse(ref, ref).toAttribute
  })
  (aliasedProjection, aliasedConditions)
} else {
  (basePlan, baseConditions)
}
{code}

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737686#comment-15737686
 ] 

Nattavut Sutyanyong commented on SPARK-18814:
-

I can reproduce with a simple script now.

{code}
Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 
where c1.ck = p.pk)").show
{code}

The requirements are:
1. We need to reference the same table twice, in both the parent and the 
subquery. Here it is table c.
2. We need a correlated predicate, but to a different table. Here it goes from 
c (as c1) in the subquery to p in the parent.
3. We then "deduplicate" c1.ck in the subquery to {{ck##}} at the {{Project}} 
above the {{Aggregate}} of {{avg}}. When we then compare {{ck##}} with the 
original group-by column {{ck#}} in their canonicalized forms, they differ 
(# != #), and that is how we trigger the exception I added.

I will continue working on a fix.
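A simplified model of the mismatch in step 3 (plain Scala, not Catalyst code; 
the names and ids are made up): de-duplication re-aliases the group-by column 
under a fresh ExprId, so a comparison by canonicalized id no longer matches the 
original attribute.

{code}
// Attributes compared by id, mimicking canonicalized ExprId comparison.
final case class Attr(name: String, id: Long)

var nextId = 0L
def freshId(): Long = { nextId += 1; nextId }

val groupByCol   = Attr("ck", freshId())             // group-by column in the subquery
val deduplicated = Attr(groupByCol.name, freshId())  // same column re-aliased with a new id

// The check compares ids, so the re-aliased column no longer looks like the
// group-by column, and the "non-correlated columns" error fires even though
// both attributes refer to c1.ck.
assert(groupByCol.id != deduplicated.id)
{code}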

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-10 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737640#comment-15737640
 ] 

Nattavut Sutyanyong commented on SPARK-18814:
-

To get the extra `#111` in `cs_item_sk#39#111`, we need to reference the same 
table on both the parent side and the subquery side (as with catalog_sales in 
Q35), so the query runs through the deduplication logic, which adds `#111` to 
distinguish the same column in the two contexts.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>