[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-09 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737454#comment-15737454
 ] 

Alex Bozarth commented on SPARK-18816:
--

That's odd, I tested my code on Safari, Chrome and FF when I made that change. 
I can look into it Monday if you don't have a fix.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).






[jira] [Comment Edited] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-09 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737454#comment-15737454
 ] 

Alex Bozarth edited comment on SPARK-18816 at 12/10/16 7:57 AM:


That's odd, I tested my code on Safari, Chrome and FF when I made that change. 
I can look into it Monday if you don't have a fix.

Also this is not a blocker.


was (Author: ajbozarth):
That's odd, I tested my code on Safari, Chrome and FF when I made that change. 
I can look into it Monday if you don't have a fix.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).






[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-09 Thread Alex Bozarth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Bozarth updated SPARK-18816:
-
Priority: Major  (was: Blocker)

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).






[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-12-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737368#comment-15737368
 ] 

Wenchen Fan commented on SPARK-18209:
-

Hmm, `AnalysisContext` may be overkill. Can we just transform the parsed 
logical plan for a view, find each `UnresolvedRelation`, and fill in the 
database part with the database hint if it's not set?
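
A rough sketch of that idea (hypothetical code, not what Spark does today; class shapes assume the Catalyst API around Spark 2.x, where `UnresolvedRelation` carries a `TableIdentifier` with an optional database):

{code}
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Walk the parsed view plan and qualify every unresolved relation that has no
// database set, using the database hint stored with the view definition.
def qualifyWithViewDb(parsedViewPlan: LogicalPlan, viewDb: String): LogicalPlan =
  parsedViewPlan transform {
    case r: UnresolvedRelation if r.tableIdentifier.database.isEmpty =>
      r.copy(tableIdentifier = r.tableIdentifier.copy(database = Some(viewDb)))
  }
{code}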

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context and star expansion, I think we can do this through a 
> simpler approach: take the user-given SQL, analyze it, wrap the original SQL 
> in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint; I'm merely illustrating the 
> high-level approach here.)
> Note that there is a chance the underlying base table(s)' schema changes, so 
> the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a subquery. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).






[jira] [Resolved] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread

2016-12-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18811.
--
   Resolution: Fixed
     Assignee: Burak Yavuz
Fix Version/s: 2.2.0
               2.1.1

> Stream Source resolution should happen in StreamExecution thread, not main 
> thread
> -
>
> Key: SPARK-18811
> URL: https://issues.apache.org/jira/browse/SPARK-18811
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> When you start a stream, resolving the source of the stream, for example 
> resolving partition columns, could take a long time. This long-running work 
> should not block the main thread on which `query.start()` was called. It 
> should happen in the stream execution thread, possibly before starting any 
> triggers.
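
A minimal sketch of the threading idea only (illustrative code, not Spark's actual `StreamExecution` implementation; all names below are made up): `start()` returns immediately, and the potentially slow source resolution runs on the query's own execution thread before any triggers fire.

{code}
// Illustrative only: defer slow source resolution to the query's own thread.
class StreamingQuerySketch(resolveSources: () => Unit, runTriggers: () => Unit) {
  private val executionThread = new Thread("stream-execution-thread") {
    override def run(): Unit = {
      resolveSources()   // may be slow, e.g. partition discovery
      runTriggers()      // only then start processing triggers
    }
  }

  def start(): Unit = {
    executionThread.setDaemon(true)
    executionThread.start()  // the caller is not blocked by source resolution
  }
}
{code}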






[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737254#comment-15737254
 ] 

Yin Huai commented on SPARK-18816:
--

btw, my testing was done with Chrome.

I then terminated the cluster and started a new one, launching the workers 
first this time. I still could not see the log links on the page in Chrome, 
but I could see them from Safari.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).






[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18816:
-
Attachment: screenshot-1.png

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page.






[jira] [Created] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-09 Thread Yin Huai (JIRA)
Yin Huai created SPARK-18816:


 Summary: executor page fails to show log links if executors are 
added after an app is launched
 Key: SPARK-18816
 URL: https://issues.apache.org/jira/browse/SPARK-18816
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Yin Huai
Priority: Blocker
 Attachments: screenshot-1.png

How to reproduce with standalone mode:
1. Launch a spark master
2. Launch a spark shell. At this point, there is no executor associated with 
this application. 
3. Launch a slave. Now, there is an executor assigned to the spark shell. 
However, there is no link to stdout/stderr on the executor page.








[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18816:
-
Description: 
How to reproduce with standalone mode:
1. Launch a spark master
2. Launch a spark shell. At this point, there is no executor associated with 
this application. 
3. Launch a slave. Now, there is an executor assigned to the spark shell. 
However, there is no link to stdout/stderr on the executor page (please see 
https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



  was:
How to reproduce with standalone mode:
1. Launch a spark master
2. Launch a spark shell. At this point, there is no executor associated with 
this application. 
3. Launch a slave. Now, there is an executor assigned to the spark shell. 
However, there is no link to stdout/stderr on the executor page.




> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).






[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736953#comment-15736953
 ] 

yuhao yang edited comment on SPARK-18813 at 12/10/16 5:06 AM:
--

The plan is definitely solid and practical. I understand that, for efficiency 
and operability, we need to rely on committers for release management and 
feature review.

The only thing I would add is that we should still find a way to *take in the 
suggestions and feedback from real-world Spark users*, who will ultimately 
decide the popularity of Apache Spark. In the long term, we should find a 
mechanism to collect and respond to users' requirements and complaints. One 
idea is a voting website that serves as a wish list from Spark users: users can 
create or vote for the features or improvements they need in Spark. This helps 
committers collect requirements and also gives everybody a channel to express 
their priorities. I'd like to hear other ideas. The main goal is to improve the 
transparency and diversity of the community and make everyone feel more 
involved rather than isolated. 


was (Author: yuhaoyan):
The plan is definitely solid and practical. I understand for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should however find a way to *take in the 
suggestions and feedback from real world Spark users*, who will ultimately 
decide the popularity of Apache Spark. In the long term, we should find a 
mechanism to collect and respond to users' requirements and complaints. One 
idea is to have a voting website as a wish list from Spark users. Users can 
create or vote for the features or improvements they need in Spark. This helps 
committers collect the requirements and also give everybody a channel to 
express their priorities. Hopefully it will improve the transparency and 
diversity in the community and make everyone feel more involved but not 
isolated. 

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737148#comment-15737148
 ] 

Eric Liang commented on SPARK-18814:


It seems that the references of an Alias expression should include the
referenced attribute, so I would expect #39 to still show up. I could be
misunderstanding the behavior of Alias though.
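
A quick illustration of that expectation (a hypothetical snippet against Catalyst's internal expression API, shapes as of Spark 2.x; it only shows that an Alias keeps its child attribute in its references):

{code}
import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference}
import org.apache.spark.sql.types.IntegerType

// Something like cs_item_sk#39 AS cs_item_sk#39#111: the Alias wraps the attribute...
val csItemSk = AttributeReference("cs_item_sk", IntegerType)()
val aliased  = Alias(csItemSk, "cs_item_sk")()

// ...so its references should still contain the wrapped attribute.
assert(aliased.references.contains(csItemSk))
{code}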




> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> 

[jira] [Comment Edited] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737142#comment-15737142
 ] 

Nattavut Sutyanyong edited comment on SPARK-18814 at 12/10/16 3:50 AM:
---

It looks like the {{Project}} between {{Aggregate}} and {{Filter 
scalar-subquery}} maps {{cs_item_sk#39}} to {{cs_item_sk#39#111}}. The logic in 
the code is not robust enough to recognize that the two symbols are equivalent. 
I tried to simplify the problem to

{code}
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k","v").createOrReplaceTempView("P")
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k1","v1").createOrReplaceTempView("C")

sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1=p.k)").explain(true)
{code}

This should have all the elements required to reproduce the problem but somehow 
I could not get the required `Project` operator so there is no mapping of the 
column {{p.k}} as it is in the TPCDS-Q32.

I will keep trying.


was (Author: nsyca):
It looks like the {{Project}} between {{Aggregate}} and {{Filter 
scalar-subquery}} maps {{cs_item_sk#39}} to {{cs_item_sk#39#111}}. The logic in 
the code is not robust enough to recognize that the two symbols are equivalent. 
I tried to simplify the problem to

{code}
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k","v").createOrReplaceTempView("P")
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k1","v1").createOrReplaceTempView("C")

sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1=p.k)").explain(true)
{code}

This should have all the elements required to reproduce the problem but somehow 
I could not get the required `Project` operator so there is no mapping of the 
column p.k as it is in the TPCDS-Q32.

I will keep trying.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : 

[jira] [Comment Edited] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737142#comment-15737142
 ] 

Nattavut Sutyanyong edited comment on SPARK-18814 at 12/10/16 3:50 AM:
---

It looks like the {{Project}} between {{Aggregate}} and {{Filter 
scalar-subquery}} maps {{cs_item_sk#39}} to {{cs_item_sk#39#111}}. The logic in 
the code is not robust enough to recognize that the two symbols are equivalent. 
I tried to simplify the problem to

{code}
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k","v").createOrReplaceTempView("P")
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k1","v1").createOrReplaceTempView("C")

sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1=p.k)").explain(true)
{code}

This should have all the elements required to reproduce the problem but somehow 
I could not get the required `Project` operator so there is no mapping of the 
column p.k as it is in the TPCDS-Q32.

I will keep trying.


was (Author: nsyca):
It looks like the `Project` between `Aggregate` and `Filter scalar-subquery` 
maps `cs_item_sk#39` to `cs_item_sk#39#111`. The logic in the code is not 
robust enough to recognize that the two symbols are equivalent. I tried to 
simplify the problem to

{code}
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k","v").createOrReplaceTempView("P")
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k1","v1").createOrReplaceTempView("C")

sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1=p.k)").explain(true)
{code}

This should have all the elements required to reproduce the problem but somehow 
I could not get the required `Project` operator so there is no mapping of the 
column p.k as it is in the TPCDS-Q32.

I will keep trying.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737142#comment-15737142
 ] 

Nattavut Sutyanyong commented on SPARK-18814:
-

It looks like the `Project` between `Aggregate` and `Filter scalar-subquery` 
maps `cs_item_sk#39` to `cs_item_sk#39#111`. The logic in the code is not 
robust enough to recognize that the two symbols are equivalent. I tried to 
simplify the problem to

{code}
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k","v").createOrReplaceTempView("P")
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1,BigDecimal(1.0))).toDF("k1","v1").createOrReplaceTempView("C")

sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1=p.k)").explain(true)
{code}

This should have all the elements required to reproduce the problem but somehow 
I could not get the required `Project` operator so there is no mapping of the 
column p.k as it is in the TPCDS-Q32.

I will keep trying.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> 

[jira] [Assigned] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18815:


Assignee: (was: Apache Spark)

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>







[jira] [Commented] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737134#comment-15737134
 ] 

Apache Spark commented on SPARK-18815:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/16243

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>







[jira] [Assigned] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18815:


Assignee: Apache Spark

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>







[jira] [Updated] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-09 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-18815:
-
Summary: NPE when collecting column stats for string/binary column having 
only null values  (was: NPE when collecting column stats for column having only 
null values)

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>







[jira] [Updated] (SPARK-18815) NPE when collecting column stats for column having only null values

2016-12-09 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-18815:
-
Affects Version/s: 2.1.1

> NPE when collecting column stats for column having only null values
> ---
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>







[jira] [Created] (SPARK-18815) NPE when collecting column stats for column having only null values

2016-12-09 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-18815:


 Summary: NPE when collecting column stats for column having only 
null values
 Key: SPARK-18815
 URL: https://issues.apache.org/jira/browse/SPARK-18815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhenhua Wang









[jira] [Resolved] (SPARK-18807) Should suppress output print for calls to JVM methods with void return values

2016-12-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18807.
---
   Resolution: Fixed
Fix Version/s: 2.1.1

Resolved by https://github.com/apache/spark/pull/16237

> Should suppress output print for calls to JVM methods with void return values
> -
>
> Key: SPARK-18807
> URL: https://issues.apache.org/jira/browse/SPARK-18807
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.1.1
>
>
> Several SparkR APIs that call into JVM methods with void return values get 
> their results printed out, especially when running in a REPL or IDE.
> Example:
> > setLogLevel("WARN")
> NULL
> We should fix this to make the output clearer.






[jira] [Commented] (SPARK-18788) Add getNumPartitions() to SparkR

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737084#comment-15737084
 ] 

Shivaram Venkataraman commented on SPARK-18788:
---

Hmm, I am not sure, but I think the reason could be that the number of 
partitions isn't known statically and is determined by the query planner during 
execution. Or, to put it another way, it might not be cheap to get this.
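
For comparison, a hypothetical Scala-side illustration of why this is not a free metadata lookup (assumes a spark-shell session where `spark` is in scope; this uses the existing RDD method, not a proposed SparkR API):

{code}
// Asking for the partition count goes through Dataset.rdd, which forces the
// query to be planned and an RDD to be produced before the count is known.
val df = spark.range(0, 1000).toDF("id")
val numPartitions = df.rdd.getNumPartitions
{code}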

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)






[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737079#comment-15737079
 ] 

Nattavut Sutyanyong commented on SPARK-18814:
-

I am looking at this.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737074#comment-15737074
 ] 

Reynold Xin commented on SPARK-18814:
-

cc [~hvanhovell] and [~nsyca]

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737068#comment-15737068
 ] 

Eric Liang commented on SPARK-18814:


[~rxin]

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Created] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18814:
--

 Summary: CheckAnalysis rejects TPCDS query 32
 Key: SPARK-18814
 URL: https://issues.apache.org/jira/browse/SPARK-18814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Blocker


It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem to 
be any obvious error in the query or the check rule, though: in the plan below, 
the scalar subquery's condition field is "scalar-subquery#24 
[(cs_item_sk#39#111 = i_item_sk#59)]", which should reference cs_item_sk#39. 
Nonetheless, CheckAnalysis complains that cs_item_sk#39 is not referenced by the 
scalar subquery predicates.

analysis error:
{code}
== Query: q32-v1.4 ==
 Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
in a scalar correlated subquery cannot contain non-correlated columns: 
cs_item_sk#39;;
GlobalLimit 100
+- LocalLimit 100
   +- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
  +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = cs_item_sk#39)) 
&& ((d_date#83 >= 2000-01-27) && (d_date#83 <= cast(cast(cast(cast(2000-01-27 
as date) as timestamp) + interval 12 weeks 6 days as date) as string && 
((d_date_sk#81 = cs_sold_date_sk#58) && (cast(cs_ext_discount_amt#46 as 
decimal(14,7)) > cast(scalar-subquery#24 [(cs_item_sk#39#111 = i_item_sk#59)] 
as decimal(14,7)
 :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
cs_item_sk#39#111]
 : +- Aggregate [cs_item_sk#39], 
[CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
 :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
 :   +- Join Inner
 :  :- SubqueryAlias catalog_sales
 :  :  +- 
Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
 10 more fields] parquet
 :  +- SubqueryAlias date_dim
 : +- 
Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
 4 more fields] parquet
 +- Join Inner
:- Join Inner
:  :- SubqueryAlias catalog_sales
:  :  +- 
Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
 10 more fields] parquet
:  +- SubqueryAlias item
: +- 
Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
 parquet
+- SubqueryAlias date_dim
   +- 
Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
 4 more fields] parquet

{code}

query text:
{code}
select sum(cs_ext_discount_amt) as `excess discount amount`
 from
catalog_sales, item, date_dim
 where
   

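For reference, a minimal sketch of the query shape that trips this check: a 
scalar subquery correlated on the outer item key, with an aggregate inside. 
This is a condensed, hypothetical stand-in for q32 (it assumes a spark-shell 
session with the TPCDS tables catalog_sales and item already registered), not 
the full query.

{code}
// sketch only: same correlated-scalar-subquery pattern as q32, minus date_dim
spark.sql("""
  SELECT sum(cs_ext_discount_amt) AS `excess discount amount`
  FROM catalog_sales, item
  WHERE i_manufact_id = 977
    AND i_item_sk = cs_item_sk
    AND cs_ext_discount_amt > (
      SELECT 1.3 * avg(cs_ext_discount_amt)
      FROM catalog_sales
      WHERE cs_item_sk = i_item_sk)
""").show()
{code}
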
[jira] [Commented] (SPARK-18788) Add getNumPartitions() to SparkR

2016-12-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737022#comment-15737022
 ] 

Felix Cheung commented on SPARK-18788:
--

I looked and didn't see that on DataFrame? I'm not sure what the reason is 
though.



> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736953#comment-15736953
 ] 

yuhao yang edited comment on SPARK-18813 at 12/10/16 1:55 AM:
--

The plan is definitely solid and practical. I understand for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should however find a way to *take in the 
suggestions and feedback from real world Spark users*, who will ultimately 
decide the popularity of Apache Spark. In the long term, we should find a 
mechanism to collect and respond to users' requirements and complaints. One 
idea is to have a voting website as a wish list from Spark users. Users can 
create or vote for the features or improvements they need in Spark. This helps 
committers collect the requirements and also give everybody a channel to 
express their priorities. Hopefully it will improve the transparency and 
diversity in the community and make everyone feel more involved but not 
isolated. 


was (Author: yuhaoyan):
The plan is definitely solid and practical. I understand for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should however find a way to *take in the 
suggestions and feedback from real world Spark users*, who will ultimately 
decide the popularity of Apache Spark. We should find a mechanism to collect 
and respond to users' requirements and complaints. One idea is to have a voting 
website as a wish list from Spark users. Users can create or vote for the 
features or improvements they need in Spark. This helps committers collect the 
requirements and also give everybody a channel to express their priorities. 
Hopefully it will improve the transparency and diversity in the community and 
make everyone feel more involved but not isolated. 

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. 

[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-12-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736957#comment-15736957
 ] 

Saisai Shao commented on SPARK-13955:
-

Yes, I forgot to mention that this zip file doesn't support nested directories.

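For context, a sketch of how the archive is typically pointed at (the HDFS 
path is hypothetical); the zip is assumed to contain the jars at its top 
level, with no nested directories, as noted above:

{code}
import org.apache.spark.SparkConf

// sketch only: spark-jars.zip is assumed to have been built from
// $SPARK_HOME/jars with the jars at the zip root, then uploaded to HDFS
val conf = new SparkConf()
  .set("spark.yarn.archive", "hdfs:///user/jzhang/spark-jars.zip")
// alternatively, spark.yarn.jars can list the jars themselves (globs allowed):
//   .set("spark.yarn.jars", "hdfs:///user/jzhang/spark-jars/*.jar")
{code}
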
> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark 
> assembly jar is not uploaded to HDFS. This may be a known issue from the 
> SPARK-11157 work; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736953#comment-15736953
 ] 

yuhao yang commented on SPARK-18813:


The plan is definitely solid and practical. I understand for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should however find a way to *take in the 
suggestions and feedback from real world Spark users*, who will ultimately 
decide the popularity of Apache Spark. We should find a mechanism to collect 
and respond to users' requirements and complaints. One idea is to have a voting 
website as a wish list from Spark users. Users can create or vote for the 
features or improvements they need in Spark. This helps committers collect the 
requirements and also give everybody a channel to express their priorities. 
Hopefully it will improve the transparency and diversity in the community and 
make everyone feel more involved but not isolated. 

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. For contributors
> Getting started
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time contributor, please always start with a small 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a larger feature.
> Coordinating on JIRA
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start work. This is to avoid duplicate work. For small patches, you do 
> not need to get the JIRA assigned to you to begin work.
> * For medium/large features or features with dependencies, please get 
> assigned first before coding and keep the ETA updated on the JIRA. If there 
> is no activity on the JIRA page for a certain amount of time, the JIRA should 
> be released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same 

[jira] [Resolved] (SPARK-18812) Clarify "Spark ML"

2016-12-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-18812.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16241
[https://github.com/apache/spark/pull/16241]

> Clarify "Spark ML"
> --
>
> Key: SPARK-18812
> URL: https://issues.apache.org/jira/browse/SPARK-18812
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.1.1, 2.2.0
>
>
> It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18806) driverwrapper and executor doesn't exit when worker killed

2016-12-09 Thread liujianhui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736943#comment-15736943
 ] 

liujianhui commented on SPARK-18806:


When you stop a worker via ssh from the client machine, if the DriverWrapper 
belonging to that worker does not exit after the worker is stopped, the master 
will relaunch the driver on another worker; eventually there are two drivers 
running on different workers. I have not found the root cause of this issue 
yet. For now, when I stop a worker, I have to kill all the executors and 
drivers on it to avoid this issue.

> driverwrapper and executor doesn't exit when worker killed
> --
>
> Key: SPARK-18806
> URL: https://issues.apache.org/jira/browse/SPARK-18806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
> Environment: java1.8
>Reporter: liujianhui
>
> Submit an application in standalone-cluster mode; the master will then launch 
> an executor and a DriverWrapper on a worker. Both start a WorkerWatcher to 
> watch the worker. As a result, when the worker is killed manually, the 
> DriverWrapper and executor sometimes do not exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18806) driverwrapper and executor doesn't exit when worker killed

2016-12-09 Thread liujianhui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736941#comment-15736941
 ] 

liujianhui commented on SPARK-18806:


When you stop a worker via ssh from the client machine, if the DriverWrapper 
belonging to that worker does not exit after the worker is stopped, the master 
will relaunch the driver on another worker; eventually there are two drivers 
running on different workers. I have not found the root cause of this issue 
yet. For now, when I stop a worker, I have to kill all the executors and 
drivers on it to avoid this issue.

> driverwrapper and executor doesn't exit when worker killed
> --
>
> Key: SPARK-18806
> URL: https://issues.apache.org/jira/browse/SPARK-18806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
> Environment: java1.8
>Reporter: liujianhui
>
> Submit an application in standalone-cluster mode; the master will then launch 
> an executor and a DriverWrapper on a worker. Both start a WorkerWatcher to 
> watch the worker. As a result, when the worker is killed manually, the 
> DriverWrapper and executor sometimes do not exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18806) driverwrapper and executor doesn't exit when worker killed

2016-12-09 Thread liujianhui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujianhui updated SPARK-18806:
---
Comment: was deleted

(was: when you stop worker by ssh from the client machine, if the driverwrapper 
belong to the worker not exit after worker stopped, the master will relaunch 
the driver on another worker,  eventually, there will exist two driver on 
different worker, but now i have not found the root cause of this issue.  now 
when i stop the worker, i must kill all the executor and driver on this worker 
to avoid this issue )

> driverwrapper and executor doesn't exit when worker killed
> --
>
> Key: SPARK-18806
> URL: https://issues.apache.org/jira/browse/SPARK-18806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
> Environment: java1.8
>Reporter: liujianhui
>
> Submit an application in standalone-cluster mode; the master will then launch 
> an executor and a DriverWrapper on a worker. Both start a WorkerWatcher to 
> watch the worker. As a result, when the worker is killed manually, the 
> DriverWrapper and executor sometimes do not exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18788) Add getNumPartitions() to SparkR

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736911#comment-15736911
 ] 

Shivaram Venkataraman commented on SPARK-18788:
---

Sorry, I didn't complete the question - the natural thing to do would be to add 
this to the DataFrame API, but I wanted to check whether we support something 
like this in the Scala Dataset API.

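For reference, a sketch of how this is reached from Scala today: as far as I 
can tell there is no Dataset-level accessor, so it goes through the underlying 
RDD (assumes a spark-shell session where spark is the SparkSession):

{code}
// sketch: build a DataFrame with a known partition count, then read it back
val df = spark.range(0L, 100L, 1L, numPartitions = 8).toDF("id")
val n = df.rdd.getNumPartitions   // 8
{code}
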
> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18788) Add getNumPartitions() to SparkR

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736907#comment-15736907
 ] 

Shivaram Venkataraman commented on SPARK-18788:
---

Is there an equivalent Scala or Python method in Spark ?

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18120) QueryExecutionListener method doesnt' get executed for DataFrameWriter methods

2016-12-09 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-18120:

Description: QueryExecutionListener is a class that has methods named 
onSuccess() and onFailure() that get called when a query is executed. Each of 
those methods takes a QueryExecution object as a parameter, which can be used 
for metrics analysis. It gets called for several of the Dataset methods like 
take, head, first, and collect, but doesn't get called for any of the 
DataFrameWriter methods like saveAsTable, save, etc.   (was: 
QueryExecutionListener is a class that has methods named onSuccess() and 
onFailure() that gets called when a query is executed. Each of those methods 
takes a QueryExecution object as a parameter which can be used for metrics 
analysis. It gets called for several of the DataSet methods like take, head, 
first, collect etc. but doesn't get called for any of hte DataFrameWriter 
methods like saveAsTable, save etc. )

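For context, a sketch of how such a listener is registered (the println sink 
and the output path are just illustrative); per the report, onSuccess fires 
for actions like collect() but, reportedly, not for DataFrameWriter calls like 
save() or parquet():

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").appName("listener-sketch").getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"success: $funcName took ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"failure: $funcName -> ${exception.getMessage}")
})

spark.range(10).collect()                                      // listener fires
spark.range(10).write.mode("overwrite").parquet("/tmp/demo")   // reportedly does not
{code}
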
> QueryExecutionListener method doesnt' get executed for DataFrameWriter methods
> --
>
> Key: SPARK-18120
> URL: https://issues.apache.org/jira/browse/SPARK-18120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Salil Surendran
>
> QueryExecutionListener is a class that has methods named onSuccess() and 
> onFailure() that get called when a query is executed. Each of those methods 
> takes a QueryExecution object as a parameter, which can be used for metrics 
> analysis. It gets called for several of the Dataset methods like take, head, 
> first, and collect, but doesn't get called for any of the DataFrameWriter 
> methods like saveAsTable, save, etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18628) Update handle invalid documentation string

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18628:


Assignee: Apache Spark

> Update handle invalid documentation string
> --
>
> Key: SPARK-18628
> URL: https://issues.apache.org/jira/browse/SPARK-18628
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> The handleInvalid parameter documentation string currently doesn't have 
> quotes around the options. After SPARK-18366 is in, it would be good to 
> update both the Scala param and the Python param to put quotes around the 
> options, making them easier for users to read.

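For context only, a sketch of where this parameter surfaces, e.g. on 
StringIndexer (the column names below are made up); the ticket itself is just 
about putting quotes around the option names in the doc string:

{code}
import org.apache.spark.ml.feature.StringIndexer

// sketch: "skip" drops rows with labels unseen during fitting instead of failing
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")
{code}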


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18628) Update handle invalid documentation string

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18628:


Assignee: (was: Apache Spark)

> Update handle invalid documentation string
> --
>
> Key: SPARK-18628
> URL: https://issues.apache.org/jira/browse/SPARK-18628
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> The handleInvalid parameter documentation string currently doesn't have 
> quotes around the options. After SPARK-18366 is in, it would be good to 
> update both the Scala param and the Python param to put quotes around the 
> options, making them easier for users to read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18628) Update handle invalid documentation string

2016-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736843#comment-15736843
 ] 

Apache Spark commented on SPARK-18628:
--

User 'krishnakalyan3' has created a pull request for this issue:
https://github.com/apache/spark/pull/16242

> Update handle invalid documentation string
> --
>
> Key: SPARK-18628
> URL: https://issues.apache.org/jira/browse/SPARK-18628
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> The handleInvalid parameter documentation string currently doesn't have 
> quotes around the options. After SPARK-18366 is in, it would be good to 
> update both the Scala param and the Python param to put quotes around the 
> options, making them easier for users to read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18788) Add getNumPartitions() to SparkR

2016-12-09 Thread Raela Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736836#comment-15736836
 ] 

Raela Wang commented on SPARK-18788:


Yes, RDD support in SparkR has been removed, which is why I think it is worth 
it to wrap this and add it to the current SparkR API. This would be really 
useful to users who are trying out the new UDFs too (dapply - apply a function 
to each partition of a SparkDataFrame...but how many partitions do I have?).

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736807#comment-15736807
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

Hmm I think I read it wrong. It does the reverse. We might need to make some 
Scala side change as well I guess

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736801#comment-15736801
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

There is already some code which considers the Hive conf value first and, if 
that is missing, uses the SparkConf value. At least that's my reading of 
https://github.com/apache/spark/blob/cf33a86285629abe72c1acf235b8bfa6057220a8/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18812) Clarify "Spark ML"

2016-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736799#comment-15736799
 ] 

Apache Spark commented on SPARK-18812:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16241

> Clarify "Spark ML"
> --
>
> Key: SPARK-18812
> URL: https://issues.apache.org/jira/browse/SPARK-18812
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16815) Dataset[List[T]] leads to ArrayStoreException

2016-12-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736804#comment-15736804
 ] 

Michal Šenkýř commented on SPARK-16815:
---

I made a PR that should fix this issue 
[#16240|https://github.com/apache/spark/pull/16240]

> Dataset[List[T]] leads to ArrayStoreException
> -
>
> Key: SPARK-16815
> URL: https://issues.apache.org/jira/browse/SPARK-16815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>Priority: Minor
>
> {noformat}
> scala> spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect
> java.lang.ArrayStoreException: scala.collection.mutable.WrappedArray$ofRef
>   
>   at 
> scala.collection.mutable.ArrayBuilder$ofRef.$plus$eq(ArrayBuilder.scala:87)
>   at 
> scala.collection.mutable.ArrayBuilder$ofRef.$plus$eq(ArrayBuilder.scala:56)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2218)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2568)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2217)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2581)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2198)
>   ... 48 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736790#comment-15736790
 ] 

Felix Cheung edited comment on SPARK-15799 at 12/10/16 12:23 AM:
-

That certainly would be the option.


Typically, when hive-site.xml is present, this value would be coming from that 
file, correct? It won't be coming from a Spark property in that case. In that 
case, should we also not override the value?



was (Author: felixcheung):
That certainly would be the option. Typically when hive-site.xml this would be 
coming from that, correct? In such case we should not override the value?


> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736790#comment-15736790
 ] 

Felix Cheung commented on SPARK-15799:
--

That certainly would be the option. Typically when hive-site.xml this would be 
coming from that, correct? In such case we should not override the value?


> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16792) Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16792:


Assignee: (was: Apache Spark)

> Dataset containing a Case Class with a List type causes a CompileException 
> (converting sequence to list)
> 
>
> Key: SPARK-16792
> URL: https://issues.apache.org/jira/browse/SPARK-16792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Critical
>
> The issue occurs when we run a .map over a Dataset containing a case class 
> with a List in it. A self-contained test case is below:
> case class TestCC(key: Int, letters: List[String]) //List causes the issue - 
> a Seq/Array works fine
> /*simple test data*/
> val ds1 = sc.makeRDD(Seq(
> (List("D")),
> (List("S","H")),
> (List("F","H")),
> (List("D","L","L"))
> )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC]
> //This will fail
> val test1=ds1.map{_.key}
> test1.show
> Error: 
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 72, Column 70: No applicable constructor/method found 
> for actual parameters "int, scala.collection.Seq"; candidates are: 
> "TestCC(int, scala.collection.immutable.List)"
> It seems to be internally converting the List to a sequence, and then it 
> can't convert it back.
> If you change the List[String] to Seq[String] or Array[String], the issue 
> doesn't appear.

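A sketch of the workaround the reporter mentions, with the field declared as 
Seq[String] (the class name TestCCSeq is made up here); assumes a spark-shell 
session with sc available and spark.implicits._ in scope:

{code}
// same data as above, but the case class field is Seq[String] rather than List[String]
case class TestCCSeq(key: Int, letters: Seq[String])

val ok = sc.makeRDD(Seq(List("D"), List("S", "H"), List("F", "H")))
  .map(x => (x.length, x))
  .toDF("key", "letters")
  .as[TestCCSeq]

ok.map(_.key).show()   // no CompileException when the field is a Seq
{code}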


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16792) Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)

2016-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736773#comment-15736773
 ] 

Apache Spark commented on SPARK-16792:
--

User 'michalsenkyr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16240

> Dataset containing a Case Class with a List type causes a CompileException 
> (converting sequence to list)
> 
>
> Key: SPARK-16792
> URL: https://issues.apache.org/jira/browse/SPARK-16792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Critical
>
> The issue occurs when we run a .map over a Dataset containing a case class 
> with a List in it. A self-contained test case is below:
> case class TestCC(key: Int, letters: List[String]) //List causes the issue - 
> a Seq/Array works fine
> /*simple test data*/
> val ds1 = sc.makeRDD(Seq(
> (List("D")),
> (List("S","H")),
> (List("F","H")),
> (List("D","L","L"))
> )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC]
> //This will fail
> val test1=ds1.map{_.key}
> test1.show
> Error: 
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 72, Column 70: No applicable constructor/method found 
> for actual parameters "int, scala.collection.Seq"; candidates are: 
> "TestCC(int, scala.collection.immutable.List)"
> It seems to be internally converting the List to a sequence, and then it 
> can't convert it back.
> If you change the List[String] to Seq[String] or Array[String], the issue 
> doesn't appear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16792) Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16792:


Assignee: Apache Spark

> Dataset containing a Case Class with a List type causes a CompileException 
> (converting sequence to list)
> 
>
> Key: SPARK-16792
> URL: https://issues.apache.org/jira/browse/SPARK-16792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Assignee: Apache Spark
>Priority: Critical
>
> The issue occurs when we run a .map over a Dataset containing a case class 
> with a List in it. A self-contained test case is below:
> case class TestCC(key: Int, letters: List[String]) //List causes the issue - 
> a Seq/Array works fine
> /*simple test data*/
> val ds1 = sc.makeRDD(Seq(
> (List("D")),
> (List("S","H")),
> (List("F","H")),
> (List("D","L","L"))
> )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC]
> //This will fail
> val test1=ds1.map{_.key}
> test1.show
> Error: 
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 72, Column 70: No applicable constructor/method found 
> for actual parameters "int, scala.collection.Seq"; candidates are: 
> "TestCC(int, scala.collection.immutable.List)"
> It seems to be internally converting the List to a sequence, and then it 
> can't convert it back.
> If you change the List[String] to Seq[String] or Array[String], the issue 
> doesn't appear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736753#comment-15736753
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

Sure

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736748#comment-15736748
 ] 

Joseph K. Bradley commented on SPARK-15799:
---

[~shivaram]  Could I list you as the shepherd for this feature?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15573) Backwards-compatible persistence for spark.ml

2016-12-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15573:
--
Shepherd: Joseph K. Bradley

> Backwards-compatible persistence for spark.ml
> -
>
> Key: SPARK-15573
> URL: https://issues.apache.org/jira/browse/SPARK-15573
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> This JIRA is for imposing backwards-compatible persistence for the 
> DataFrames-based API for MLlib.  I.e., we want to be able to load models 
> saved in previous versions of Spark.  We will not require loading models 
> saved in later versions of Spark.
> This requires:
> * Putting unit tests in place to check loading models from previous versions
> * Notifying all committers active on MLlib to be aware of this requirement in 
> the future
> The unit tests could be written as in spark.mllib, where we essentially 
> copied and pasted the save() code every time it changed.  This happens 
> rarely, so it should be acceptable, though other designs are fine.
> Subtasks of this JIRA should cover checking and adding tests for existing 
> cases, such as KMeansModel (whose format changed between 1.6 and 2.0).

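A rough sketch of what such a compatibility test might look like (the resource 
path and suite name are hypothetical; it assumes a KMeansModel directory saved 
by an older Spark release is checked into the test resources and that a 
SparkSession with a local master is already running):

{code}
import org.apache.spark.ml.clustering.KMeansModel
import org.scalatest.FunSuite

class KMeansBackwardCompatSuite extends FunSuite {
  test("load KMeansModel saved by Spark 1.6") {
    // path below is hypothetical; the directory would be produced by an older release
    val model = KMeansModel.load("src/test/resources/ml-models/kmeans-1.6")
    assert(model.clusterCenters.nonEmpty)
  }
}
{code}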


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15572) MLlib in R format: compatibility with other languages

2016-12-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736744#comment-15736744
 ] 

Joseph K. Bradley commented on SPARK-15572:
---

Would you be interested in shepherding this feature for the 2.2 release?

> MLlib in R format: compatibility with other languages
> -
>
> Key: SPARK-15572
> URL: https://issues.apache.org/jira/browse/SPARK-15572
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>
> Currently, models saved in R cannot be loaded easily into other languages.  
> This is because R saves extra metadata (feature names) alongside the model.  
> We should fix this issue so that models can be transferred seamlessly between 
> languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2016-12-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4591:
-
Shepherd: Joseph K. Bradley

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve feature 
> parity for the next release.
> Subtasks cover major algorithm groups.  To pick up a review subtask, please:
> * Comment that you are working on it.
> * Compare the public APIs of spark.ml vs. spark.mllib.
> * Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> * Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * single-Row prediction: [SPARK-10413]
> Also, this does not include the following items (but will eventually):
> * User-facing:
> ** Streaming ML
> ** evaluation
> ** pmml
> ** stat
> ** linalg [SPARK-13944]
> * Developer-facing:
> ** optimization
> ** random, rdd
> ** util



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18813:
--
Description: 
*PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
The roadmap process described below is significantly updated since the 2.1 
roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
the basis for this proposal, and comment in this JIRA if you have suggestions 
for improvements.

h1. Roadmap process

This roadmap is a master list for MLlib improvements we are working on during 
this release.  This includes ML-related changes in PySpark and SparkR.

*What is planned for the next release?*
* This roadmap lists issues which at least one Committer has prioritized.  See 
details below in "Instructions for committers."
* This roadmap only lists larger or more critical issues.

*How can contributors influence this roadmap?*
* If you believe an issue should be in this roadmap, please discuss the issue 
on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
least one must agree to shepherd the issue.
* For general discussions, use this JIRA or the dev mailing list.  For specific 
issues, please comment on those issues or the mailing list.

h2. Target Version and Priority

This section describes the meaning of Target Version and Priority.  _These 
meanings have been updated in this proposal for the 2.2 process._

|| Category | Target Version | Priority | Shepherd | Put on roadmap? | In next 
release? ||
| 1 | next release | Blocker | *must* | *must* | *must* |
| 2 | next release | Critical | *must* | yes, unless small | *best effort* |
| 3 | next release | Major | *must* | optional | *best effort* |
| 4 | next release | Minor | optional | no | maybe |
| 5 | next release | Trivial | optional | no | maybe |
| 6 | (empty) | (any) | yes | no | maybe |
| 7 | (empty) | (any) | no | no | maybe |

The *Category* in the table above has the following meaning:

1. A committer has promised to see this issue to completion for the next 
release.  Contributions *will* receive attention.
2-3. A committer has promised to see this issue to completion for the next 
release.  Contributions *will* receive attention.  The issue may slip to the 
next release if development is slower than expected.
4-5. A committer has promised interest in this issue.  Contributions *will* 
receive attention.  The issue may slip to another release.
6. A committer has promised interest in this issue and should respond, but no 
promises are made about priorities or releases.
7. This issue is open for discussion, but it needs a committer to promise 
interest to proceed.

h1. Instructions

h2. For contributors

Getting started
* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time contributor, please always start with a small 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a larger feature.

Coordinating on JIRA
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start work. This is to avoid duplicate work. For small patches, you do not 
need to get the JIRA assigned to you to begin work.
* For medium/large features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Do not set these fields: Target Version, Fix Version, or Shepherd.  Only 
Committers should set those.

Writing and reviewing PRs
* Remember to add the `@Since("VERSION")` annotation to new public APIs (see the short sketch after this list).
* *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
review greatly helps to improve others' code as well as yours.*
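
For reference, a minimal sketch of the `@Since` usage mentioned in the first item above. The package, class, and method names are made up, and the snippet is assumed to live inside the Spark source tree, since the annotation in org.apache.spark.annotation is intended for Spark's own public APIs:

{code}
// Hypothetical file inside the Spark repo, for illustration only.
package org.apache.spark.ml.example

import org.apache.spark.annotation.Since

class ExampleParams {
  /** Hypothetical new public setter, tagged with the first release that ships it. */
  @Since("2.2.0")
  def setExampleParam(value: Double): this.type = this
}
{code}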

h2. For Committers

Adding to this roadmap
* You can update the roadmap by (a) adding issues to this list and (b) setting 
Target Versions.  Only Committers may make these changes.
* *If you add an issue to this roadmap or set a Target Version, you _must_ 
assign yourself or another Committer as Shepherd.*
* This list should be actively managed during the release.
* If you target a significant item for the next release, please list the item 
on this roadmap.
* If you commit to shepherding a new public API, you implicitly commit to 
shepherding the follow-up issues as well (Python/R APIs, docs).

Creating JIRA issues
* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add a "starter" label to starter tasks.
* Put a rough time estimate for medium/big features and track the progress.
* Set Priority carefully.  Priority should not be mixed with size of effort for 

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-12-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736731#comment-15736731
 ] 

Joseph K. Bradley commented on SPARK-15581:
---

Everyone, I just posted a proposal for the 2.2 process here: [SPARK-18813].
* I'd really like feedback on it since it is a significant change, in that it 
will require more rigor in setting expectations from committers & contributors.
* I left a few items from the current roadmap but very few.  Once we agree on 
the process, let's audit items with Target Version or Shepherd already set.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence 

[jira] [Created] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-09 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18813:
-

 Summary: MLlib 2.2 Roadmap
 Key: SPARK-18813
 URL: https://issues.apache.org/jira/browse/SPARK-18813
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Reporter: Joseph K. Bradley
Priority: Blocker


*PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
The roadmap process described below is significantly updated since the 2.1 
roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
the basis for this proposal, and comment in this JIRA if you have suggestions 
for improvements.

h1. Roadmap process

This roadmap is a master list for MLlib improvements we are working on during 
this release.  This includes ML-related changes in PySpark and SparkR.

*What is planned for the next release?*
* This roadmap lists issues which at least one Committer has prioritized.  See 
details below in "Instructions for committers."
* This roadmap only lists larger or more critical issues.

*How can contributors influence this roadmap?*
* If you believe an issue should be in this roadmap, please discuss the issue 
on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
least one must agree to shepherd the issue.
* For general discussions, use this JIRA or the dev mailing list.  For specific 
issues, please comment on those issues or the mailing list.

h2. Target Version and Priority

This section describes the meaning of Target Version and Priority.  _These 
meanings have been updated in this proposal for the 2.2 process._

|| Category | Target Version | Priority | Shepherd | Put on roadmap? | In next release? ||
| 1 | next release | Blocker | *must* | *must* | *must* |
| 2 | next release | Critical | *must* | yes, unless small | *best effort* |
| 3 | next release | Major | *must* | optional | *best effort* |
| 4 | next release | Minor | optional | no | maybe |
| 5 | next release | Trivial | optional | no | maybe |
| 6 | (empty) | (any) | yes | no | maybe |
| 7 | (empty) | (any) | no | no | maybe |

The *Category* in the table above has the following meaning:

1. A committer has promised to see this issue to completion for the next 
release.  Contributions *will* receive attention.
2-3. A committer has promised to see this issue to completion for the next 
release.  Contributions *will* receive attention.  The issue may slip to the 
next release if development is slower than expected.
4-5. A committer has promised interest in this issue.  Contributions *will* 
receive attention.  The issue may slip to another release.
6. A committer has promised interest in this issue and should respond, but no 
promises are made about priorities or releases.
7. This issue is open for discussion, but it needs a committer to promise 
interest to proceed.

h1. Instructions

h2. For contributors

Getting started
* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time contributor, please always start with a small 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a larger feature.

Coordinating on JIRA
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start work. This is to avoid duplicate work. For small patches, you do not 
need to get the JIRA assigned to you to begin work.
* For medium/large features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Do not set these fields: Target Version, Fix Version, or Shepherd.  Only 
Committers should set those.

Writing and reviewing PRs
* Remember to add the `@Since("VERSION")` annotation to new public APIs.
* *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
review greatly helps to improve others' code as well as yours.*

h2. For Committers

Adding to this roadmap
* You can update the roadmap by (a) adding issues to this list and (b) setting 
Target Versions.  Only Committers may make these changes.
* *If you add an issue to this roadmap or set a Target Version, you _must_ 
assign yourself or another Committer as Shepherd.*
* This list should be actively managed during the release.
* If you target a significant item for the next release, please list the item 
on this roadmap.
* If you commit to shepherding a new public API, you implicitly commit to 
shepherding the follow-up issues as well (Python/R APIs, docs).

Creating JIRA issues
* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add a "starter" label to starter tasks.

[jira] [Resolved] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-12-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-4105.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's another 

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-12-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736685#comment-15736685
 ] 

Joseph K. Bradley commented on SPARK-15581:
---

I like a lot of the points made here.  A few thoughts with each:

* Clearly messaging what we WILL get done + limiting promises based on reviewer 
bandwidth
** I'll try to draft some ideas for how to do this.  I'd really like to make 
good use of JIRA fields like Target Version, Priority, and labels in order to 
make it easy to write searches to help contributors explore JIRA.

* Umbrella vs. specific JIRAs.  Broad efforts vs. targeted efforts.
** [~holdenk] I like umbrellas for organization, coverage, and coordination, 
and I agree with you that we should not get rid of them---and that the answer 
is to be stricter about specifying Priority.

* Short-term (next minor release) vs long-term (next major release) efforts
** I worry about promising specific JIRAs by the next major release because 
those JIRAs could easily pile up to make the final list huge.  We will have to 
limit those to critical or breaking changes.

* Open JIRAs not on roadmaps
** The roadmap could have links to tags to help users find and participate in 
these conversations.

* Spark R: I don't have full solutions but do have a few concrete suggestions:
** Committers (myself included) need to be more diligent about creating 
follow-up tasks.  When any new API is added in Scala, the committer or 
contributor should create follow-up tasks for Python, R, and documentation, and 
those should be targeted at the same release.  I.e., when a committer agrees to 
shepherd a feature, they agree to shepherd all language APIs and docs.
** As far as how to make R easier to work with, I'll take your suggestions!
** Supporting Pipelines and advanced use cases: There really needs to be more 
design discussion around SparkR.  [~felixcheung] would you be interested in 
leading some discussion?  I'm envisioning something similar to what was done a 
while back for Pipelines in Scala/Java/Python, where we consider several use 
cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, 
and working with multiple languages.  That should help inform what APIs should 
look like in Spark R.

[~sethah] Thanks for aggregating that list of issues!  I do think it's a pretty 
ambitious list for one release, but I'll definitely use it to help identify 
items I'd like to mark myself down for shepherding in the next release.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start 

[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736671#comment-15736671
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

Sure - That will be great.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736657#comment-15736657
 ] 

Brendan Dwyer commented on SPARK-15799:
---

I think that's the best solution. Can I create a pull request to make this 
change?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18812) Clarify "Spark ML"

2016-12-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-18812:
-

 Summary: Clarify "Spark ML"
 Key: SPARK-18812
 URL: https://issues.apache.org/jira/browse/SPARK-18812
 Project: Spark
  Issue Type: Documentation
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736616#comment-15736616
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

I see - it looks like it's controlled by the `spark.sql.warehouse.dir` flag [1]. 
One change we can make is to check whether the user has supplied a value for this 
config flag in sparkR.session() [2] and, if not, set it to tempdir().

The one question this raises is that if the user wants to access some of these 
tables after the end of their session, that won't be possible. 


[1] 
https://github.com/apache/spark/blob/d60ab5fd9b6af9aa5080a2d13b3589d8b79c5c5c/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L968
[2] 
https://github.com/apache/spark/blob/d60ab5fd9b6af9aa5080a2d13b3589d8b79c5c5c/R/pkg/R/sparkR.R#L365
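
To make the idea concrete, here is a minimal sketch of the "only default the warehouse 
dir when the user has not set it" check. It is written in Scala to match the other 
snippets in this thread; the actual change would live in sparkR.session() on the R side 
(R/pkg/R/sparkR.R), and the object name is made up:

{code}
import java.nio.file.Files

import org.apache.spark.SparkConf

object WarehouseDefaultSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Only override the warehouse location if the user did not set it explicitly.
    if (!conf.contains("spark.sql.warehouse.dir")) {
      // Fall back to a session-scoped temporary directory instead of ./spark-warehouse.
      conf.set("spark.sql.warehouse.dir",
        Files.createTempDirectory("spark-warehouse").toString)
    }
    println(conf.get("spark.sql.warehouse.dir"))
  }
}
{code}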

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736591#comment-15736591
 ] 

Brendan Dwyer commented on SPARK-15799:
---

On my machine it gets created in my home directory when I call sparkR.session()

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736588#comment-15736588
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

Where does it get created right now? We could set the default value for that 
config flag to be the R temp dir in SparkR.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-12-09 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736572#comment-15736572
 ] 

Brendan Dwyer commented on SPARK-15799:
---

[CRAN policy|https://cran.r-project.org/web/packages/policies.html] states:
{quote}
- Packages should not write in the users’ home filespace, nor anywhere else on 
the file system apart from the R session’s temporary directory (or during 
installation in the location pointed to by TMPDIR: and such usage should be 
cleaned up). Installing into the system’s R installation (e.g., scripts to its 
bin directory) is not allowed.
Limited exceptions may be allowed in interactive sessions if the package 
obtains confirmation from the user.
- Packages should not modify the global environment (user’s workspace). 
{quote}

Do we need to move the location of spark-warehouse to a temporary directory?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18802) java.lang.ClassCastException in a simple spark application

2016-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18802.
---
Resolution: Duplicate

Please don't reopen without a meaningful change in the discussion. It's still a 
duplicate, and if you search for this issue in JIRA you will find more info 
about the problem and resolution. This is something users need to do. It's a 
developer framework. The mini-rant is irrelevant here.

> java.lang.ClassCastException in a simple spark application
> --
>
> Key: SPARK-18802
> URL: https://issues.apache.org/jira/browse/SPARK-18802
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Bingozz
>
> I installed spark-2.0.1-bin-hadoop2.7 on my spark cluster with a master and 
> four workers.
> The Scala version is 2.11.8 on both my local machine and the Spark cluster 
> machines, and it runs well if I use the spark-shell to run apps such as 
> WordCount against the local and the remote master.
> On my local machine, I added the dependencies simply from the directory 
> `spark-2.0.1-bin-hadoop2.7/jars` in my project in IntelliJ IDEA. It runs well 
> if I just load the file from HDFS, but fails if I do some WordCount work based 
> on the loaded file.
> My code is below:
> ```
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> object topK {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("test_spark")
>   .setMaster("spark://10.112.29.56:7077")
> val sc = new SparkContext(conf)
> val lines = sc.textFile("hdfs://10.112.28.38:9000/user/root/covtype")
> println(lines.count())
> //val count = lines.flatMap(s=>s.split(",")).map(s=>(s, 
> 1)).reduceByKey((a, b) => a+b)
> //println(count.count() + "\n")
> sc.stop()
> println("helloworld")
>   }
> }
> ```
> And the error is below:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 0.3 in stage 0.0 (TID 5, 10.112.29.80): 
> java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> 

[jira] [Closed] (SPARK-18802) java.lang.ClassCastException in a simple spark application

2016-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-18802.
-

> java.lang.ClassCastException in a simple spark application
> --
>
> Key: SPARK-18802
> URL: https://issues.apache.org/jira/browse/SPARK-18802
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Bingozz
>
> I installed spark-2.0.1-bin-hadoop2.7 on my spark cluster with a master and 
> four workers.
> The Scala version is 2.11.8 on both my local machine and the Spark cluster 
> machines, and it runs well if I use the spark-shell to run apps such as 
> WordCount against the local and the remote master.
> On my local machine, I added the dependencies simply from the directory 
> `spark-2.0.1-bin-hadoop2.7/jars` in my project in IntelliJ IDEA. It runs well 
> if I just load the file from HDFS, but fails if I do some WordCount work based 
> on the loaded file.
> My code is below:
> ```
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> object topK {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("test_spark")
>   .setMaster("spark://10.112.29.56:7077")
> val sc = new SparkContext(conf)
> val lines = sc.textFile("hdfs://10.112.28.38:9000/user/root/covtype")
> println(lines.count())
> //val count = lines.flatMap(s=>s.split(",")).map(s=>(s, 
> 1)).reduceByKey((a, b) => a+b)
> //println(count.count() + "\n")
> sc.stop()
> println("helloworld")
>   }
> }
> ```
> And the error is below:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 0.3 in stage 0.0 (TID 5, 10.112.29.80): 
> java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
>   at 
> 

[jira] [Resolved] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)

2016-12-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18745.
---
   Resolution: Fixed
Fix Version/s: 2.1.1
   2.0.3

> java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
> -
>
> Key: SPARK-18745
> URL: https://issues.apache.org/jira/browse/SPARK-18745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: JESSE CHEN
>Assignee: Kazuaki Ishizaki
>Priority: Critical
> Fix For: 2.0.3, 2.1.1
>
>
> Running query 68 with decreased executor memory (using 12GB executors instead 
> of 24GB) on 100TB parquet database using the Spark master dated 11/04 gave 
> IndexOutOfBoundsException.
> The query is as follows:
> {noformat}
> [select  c_last_name
>,c_first_name
>,ca_city
>,bought_city
>,ss_ticket_number
>,extended_price
>,extended_tax
>,list_price
>  from (select ss_ticket_number
>  ,ss_customer_sk
>  ,ca_city bought_city
>  ,sum(ss_ext_sales_price) extended_price 
>  ,sum(ss_ext_list_price) list_price
>  ,sum(ss_ext_tax) extended_tax 
>from store_sales
>,date_dim
>,store
>,household_demographics
>,customer_address 
>where store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  and store_sales.ss_store_sk = store.s_store_sk  
> and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
> and store_sales.ss_addr_sk = customer_address.ca_address_sk
> and date_dim.d_dom between 1 and 2 
> and (household_demographics.hd_dep_count = 8 or
>  household_demographics.hd_vehicle_count= -1)
> and date_dim.d_year in (2000,2000+1,2000+2)
> and store.s_city in ('Plainview','Rogers')
>group by ss_ticket_number
>,ss_customer_sk
>,ss_addr_sk,ca_city) dn
>   ,customer
>   ,customer_address current_addr
>  where ss_customer_sk = c_customer_sk
>and customer.c_current_addr_sk = current_addr.ca_address_sk
>and current_addr.ca_city <> bought_city
>  order by c_last_name
>  ,ss_ticket_number
>   limit 100]
> {noformat}
> Spark output that showed the exception:
> {noformat}
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at 
> org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
>   at 
> 

[jira] [Closed] (SPARK-18798) Expose the kill Executor in Yarn Mode

2016-12-09 Thread Narendra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narendra closed SPARK-18798.


> Expose the kill Executor in Yarn Mode
> -
>
> Key: SPARK-18798
> URL: https://issues.apache.org/jira/browse/SPARK-18798
> Project: Spark
>  Issue Type: Improvement
>Reporter: Narendra
>
> Expose the kill Executor in Yarn Mode
> I can see Spark has already exposed the kill-executor method through the Spark 
> context for Mesos; if Spark exposes the same method for YARN, it would be a 
> good feature for anyone who wants to test application stability by randomly 
> killing executors.
> I see Spark already has a kill-executor mechanism in YarnAllocator, so it should 
> not take much time to expose it; anyone can work on it, and I can as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18811:


Assignee: Apache Spark

> Stream Source resolution should happen in StreamExecution thread, not main 
> thread
> -
>
> Key: SPARK-18811
> URL: https://issues.apache.org/jira/browse/SPARK-18811
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> When you start a stream, if we are trying to resolve the source of the 
> stream, for example if we need to resolve partition columns, this could take 
> a long time. This long execution time should not block the main thread on which 
> `query.start()` was called. It should happen in the stream execution 
> thread, possibly before starting any triggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread

2016-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736427#comment-15736427
 ] 

Apache Spark commented on SPARK-18811:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/16238

> Stream Source resolution should happen in StreamExecution thread, not main 
> thread
> -
>
> Key: SPARK-18811
> URL: https://issues.apache.org/jira/browse/SPARK-18811
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> When you start a stream, if we are trying to resolve the source of the 
> stream, for example if we need to resolve partition columns, this could take 
> a long time. This long execution time should not block the main thread on which 
> `query.start()` was called. It should happen in the stream execution 
> thread, possibly before starting any triggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18811:


Assignee: (was: Apache Spark)

> Stream Source resolution should happen in StreamExecution thread, not main 
> thread
> -
>
> Key: SPARK-18811
> URL: https://issues.apache.org/jira/browse/SPARK-18811
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> When you start a stream, if we are trying to resolve the source of the 
> stream, for example if we need to resolve partition columns, this could take 
> a long time. This long execution time should not block the main thread on which 
> `query.start()` was called. It should happen in the stream execution 
> thread, possibly before starting any triggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18808:
--
Issue Type: Improvement  (was: Bug)

Agree, the private predict() method should accept a reference to a broadcast if 
possible, and transform should create that broadcast. Go ahead and try it.
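
To illustrate the suggestion, here is a minimal standalone sketch of the broadcast 
pattern (not the actual ml.KMeansModel internals; the object name, helper, and the 
tiny data set are made up for the example):

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

object BroadcastPredictSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("broadcast-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Plain squared Euclidean distance, to keep the sketch self-contained.
    def sqdist(a: Vector, b: Vector): Double =
      a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum

    // Stand-in for KMeansModel.clusterCenters, computed once on the driver.
    val centers: Array[Vector] = Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0))
    val bcCenters = sc.broadcast(centers)  // shipped to executors once

    val points = sc.parallelize(Seq(Vectors.dense(0.1, 0.2), Vectors.dense(4.8, 5.1)))

    // Per-row prediction only reads the broadcast value; nothing is recomputed per row.
    val predictions = points.map { p =>
      bcCenters.value.zipWithIndex.minBy { case (c, _) => sqdist(p, c) }._2
    }
    predictions.collect().foreach(println)

    spark.stop()
  }
}
{code}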

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
>
> The function ml.KMeansModel.transform will call the 
> parentModel.predict(features) method on each row, which in turn will 
> normalize all clusterCenters from mllib.KMeansModel.clusterCentersWithNorm 
> every time!
> This is a serious waste of resources!  In my profiling, 
> clusterCentersWithNorm accounts for 99% of the samples!  
> This should have been implemented with a broadcast variable as it is done in 
> other functions like computeCost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18620.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16114
[https://github.com/apache/spark/pull/16114]

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Assignee: Takeshi Yamamuro
>Priority: Minor
>  Labels: kinesis
> Fix For: 2.2.0
>
> Attachments: Apply_limit in_spark_with_my_patch.png, Apply_limit 
> in_vanilla_spark.png, Apply_no_limit.png
>
>
> I am calling spark-submit passing maxRate; I have a single Kinesis receiver 
> and batches of 1s:
> spark-submit  --conf spark.streaming.receiver.maxRate=10 
> However, a single batch can greatly exceed the established maxRate, e.g. I'm 
> getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor which has a lot of 
> default values for the configuration. One of those values is 
> DEFAULT_MAX_RECORDS which is constantly set to 10,000 records.
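
For context, a hedged sketch of how a per-fetch record cap could be wired through the 
KCL configuration. This is not the actual patch from pull request 16114; the app, 
stream, worker, and region names and the 1-second batch assumption are placeholders:

{code}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.{InitialPositionInStream, KinesisClientLibConfiguration}

object KinesisMaxRecordsSketch {
  def main(args: Array[String]): Unit = {
    val maxRatePerSecond = 10   // what spark.streaming.receiver.maxRate asks for
    val batchIntervalSeconds = 1

    val kinesisClientLibConfiguration =
      new KinesisClientLibConfiguration("checkpointApp", "myStream",
        new DefaultAWSCredentialsProviderChain(), "worker-1")
        .withKinesisEndpoint("https://kinesis.us-east-1.amazonaws.com")
        .withInitialPositionInStream(InitialPositionInStream.LATEST)
        .withTaskBackoffTimeMillis(500)
        .withRegionName("us-east-1")
        // Override DEFAULT_MAX_RECORDS (10,000) so a single fetch respects the limit.
        .withMaxRecords(maxRatePerSecond * batchIntervalSeconds)

    println(kinesisClientLibConfiguration.getMaxRecords)
  }
}
{code}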



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18620:
--
Assignee: Takeshi Yamamuro

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Assignee: Takeshi Yamamuro
>Priority: Minor
>  Labels: kinesis
> Fix For: 2.2.0
>
> Attachments: Apply_limit in_spark_with_my_patch.png, Apply_limit 
> in_vanilla_spark.png, Apply_no_limit.png
>
>
> I am calling spark-submit passing maxRate; I have a single Kinesis receiver 
> and batches of 1s:
> spark-submit  --conf spark.streaming.receiver.maxRate=10 
> However, a single batch can greatly exceed the established maxRate, e.g. I'm 
> getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor which has a lot of 
> default values for the configuration. One of those values is 
> DEFAULT_MAX_RECORDS which is constantly set to 10,000 records.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18806) driverwrapper and executor doesn't exit when worker killed

2016-12-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736396#comment-15736396
 ] 

Sean Owen commented on SPARK-18806:
---

Can you elaborate -- what is the effect of this? can you reproduce on master? 
what's the possible resolution?

> driverwrapper and executor doesn't exit when worker killed
> --
>
> Key: SPARK-18806
> URL: https://issues.apache.org/jira/browse/SPARK-18806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
> Environment: java1.8
>Reporter: liujianhui
>
> Submit an application in standalone-cluster mode, and the master will then 
> launch an executor and a DriverWrapper on a worker. Both start a WorkerWatcher 
> to watch the worker. As a result, when the worker is killed manually, the 
> DriverWrapper and executor sometimes do not exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18316) Spark MLlib, GraphX 2.1 QA umbrella

2016-12-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18316.
---
   Resolution: Done
Fix Version/s: 2.1.0

Marking everything done!  Thanks very much to everyone who helped out.

> Spark MLlib, GraphX 2.1 QA umbrella
> ---
>
> Key: SPARK-18316
> URL: https://issues.apache.org/jira/browse/SPARK-18316
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.1.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-18329].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> * Major new algorithms: MinHash, RandomProjection
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18779) Messages being received only from one partition when using Spark Streaming integration for Kafka 0.10 with kafka client library at 0.10.1

2016-12-09 Thread Pranav Nakhe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Nakhe reopened SPARK-18779:
--

Well, there seems to be an issue in Spark after all.

In the ConsumerStrategy class there is a pause on the consumer:

  // we've called poll, we must pause or next poll may consume messages and set position
  consumer.pause(consumer.assignment())

We never resume the consumer, and that seems to be causing the issue. The 
KafkaConsumer implementation changed between 10.0.1 and 10.1.0, which has 
exposed this issue. The solution is to resume the consumer before 
we find the position in the DirectKafkaInputDStream class, in the latestOffsets 
method.

c.pause(newPartitions.asJava)
// find latest available offsets
c.seekToEnd(currentOffsets.keySet.asJava)
c.resume(newPartitions.asJava) /* part of fix */
c.resume(c.assignment())       /* part of fix - resuming what was paused in the ConsumerStrategy class */
parts.map(tp => tp -> c.position(tp)).toMap

I have tested this fix and it works fine. The reason the issue is not seen in 
the current setup is that the pause/resume logic changed in the latest Kafka 
version. We don't seem to have a resume matching the pause, hence this fix is 
necessary. I would be happy to make these changes if they seem fine.

> Messages being received only from one partition when using Spark Streaming 
> integration for Kafka 0.10 with kafka client library at 0.10.1
> -
>
> Key: SPARK-18779
> URL: https://issues.apache.org/jira/browse/SPARK-18779
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Pranav Nakhe
>
> I apologize for the earlier description, which wasn't very clear about the 
> issue. I will give a detailed description and my use case now -
> I have a Spark application running which consumes Kafka messages using the 
> Spark Kafka 0.10 integration. I now need to stop my Spark application, and the 
> user would then specify the timestamp in the past from which the application 
> should start reading messages (replaying messages). The timestamp is mapped to a 
> Kafka offset using the 'offsetsForTimes' API in KafkaConsumer, introduced 
> in the 10.1.0 client of Kafka. That offset is then used to create the DStream.
> Because Kafka 10.0.1 does not have the 'offsetsForTimes' API, I need to use Kafka 
> 10.1.0. 
> So to achieve that behavior I replaced the 10.0.1 jar in the Spark environment 
> with the 10.1.0 jar. Things started working for me, but the application could 
> only read messages from the first partition.
> To recreate the issue I wrote a local program and had 10.1.0 jar in the 
> classpath
> 
> val topics = Set("Z1Topic")
> val topicPartitionOffsetMap = new HashMap[TopicPartition, Long]()
> topicPartitionOffsetMap.put(new TopicPartition("Z1Topic",0), 10L) //hardcoded 
> offset to 10 instead of getting the offset from 'offsetsForTimes'
> topicPartitionOffsetMap.put(new TopicPartition("Z1Topic",1), 10L)
> import scala.collection.JavaConversions._
> val stream = KafkaUtils.createDirectStream[String, String](ssc, 
> PreferBrokers, Subscribe[String, String](topics, kafkaParams, 
> topicPartitionOffsetMap))
> val x = stream.map(x => x.value())
> x.print()
> 
> This printed only the messages in the first partition from offset 10.  (This is 
> with the 10.1.0 client.)
> If I use the Kafka 10.0.1 client for the above program, things work fine 
> and I receive messages from all partitions, but I can't use the 
> 'offsetsForTimes' API (because it doesn't exist in the 10.0.1 client).
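
As a side note, here is a hedged sketch of the 'offsetsForTimes' lookup mentioned above (a 
made-up helper using only the standalone kafka-clients 0.10.1.0 API) that could replace the 
hardcoded 10L offsets in the snippet:

{code}
import java.util.{HashMap => JHashMap}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Hypothetical helper: map a replay timestamp to per-partition starting offsets.
def offsetsAtTime(kafkaParams: Map[String, Object],
                  partitions: Seq[TopicPartition],
                  timestampMs: Long): Map[TopicPartition, Long] = {
  val consumer = new KafkaConsumer[String, String](kafkaParams.asJava)
  try {
    val query = new JHashMap[TopicPartition, java.lang.Long]()
    partitions.foreach(tp => query.put(tp, timestampMs))
    // offsetsForTimes returns null entries for partitions with no offset at/after the timestamp.
    consumer.offsetsForTimes(query).asScala.collect {
      case (tp, offsetAndTimestamp) if offsetAndTimestamp != null =>
        tp -> offsetAndTimestamp.offset()
    }.toMap
  } finally {
    consumer.close()
  }
}
{code}

The resulting map would then be passed as the offsets argument to Subscribe, as in the 
snippet above.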



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18455) General support for correlated subquery processing

2016-12-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736300#comment-15736300
 ] 

Nattavut Sutyanyong commented on SPARK-18455:
-

Thanks for the comments. I will respond to them in the document, and I will start 
opening new JIRAs as sub-tasks of this work. As part of this work, I plan to 
extend support to deep correlation by introducing extra joins to de-correlate 
(which is termed unnesting in Neumann's paper). There are subtle details on 
cases of potential incorrect results that I want to work through in my thought 
process before I describe this de-correlation technique in detail (hence 
the plan to write another separate document).

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
>
> Subquery support has been introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use case: the ones used in TPC queries for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregates and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.
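
To make the supported/unsupported split concrete, here is a hedged illustration using a 
SparkSession named spark and made-up orders/payments tables: the first query is an 
aggregated, equality-correlated scalar subquery (supported today), while the second 
correlates through a non-equality predicate under the aggregate, one of the cases this 
JIRA aims to enable.

{code}
// Supported today: aggregated correlated scalar subquery with an equality predicate.
val supported = spark.sql("""
  SELECT o.order_id,
         (SELECT MAX(p.amount) FROM payments p WHERE p.order_id = o.order_id) AS max_payment
  FROM orders o
""")

// Not yet supported: non-equality correlation predicate below the aggregate.
val notYetSupported = spark.sql("""
  SELECT o.order_id,
         (SELECT COUNT(*) FROM payments p WHERE p.paid_at > o.created_at) AS later_payments
  FROM orders o
""")
{code}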



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18811) Stream Source resolution should happen in StreamExecution thread, not main thread

2016-12-09 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18811:
---

 Summary: Stream Source resolution should happen in StreamExecution 
thread, not main thread
 Key: SPARK-18811
 URL: https://issues.apache.org/jira/browse/SPARK-18811
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.0.2, 2.1.0
Reporter: Burak Yavuz


When you start a stream, resolving the source of the stream, for example resolving 
partition columns, can take a long time. This long-running work should not block 
the main thread on which `query.start()` was called. It should happen in the stream 
execution thread, possibly before starting any triggers.
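
For reference, a minimal sketch of the user-facing call path in question (the schema, 
input path and checkpoint location are hypothetical); the proposal is that the expensive 
source resolution triggered here should run on the StreamExecution thread so that 
start() returns promptly:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("stream-start-sketch").getOrCreate()
val schema = new StructType().add("id", LongType).add("ts", TimestampType)

val query = spark.readStream
  .schema(schema)
  .json("/data/events")                                      // heavily partitioned directory
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .start()                                                   // can currently block on resolution
{code}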



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18323) Update MLlib, GraphX websites for 2.1

2016-12-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18323.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Update MLlib, GraphX websites for 2.1
> -
>
> Key: SPARK-18323
> URL: https://issues.apache.org/jira/browse/SPARK-18323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.1.0
>
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18808) ml.KMeansModel.transform is very inefficient

2016-12-09 Thread Michel Lemay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736254#comment-15736254
 ] 

Michel Lemay commented on SPARK-18808:
--

Subclassing/overriding/adding methods in KMeans/KMeansModel is a pain because of all 
the private members.
I cannot even add methods implicitly, because parentModel is private and I have 
no way of calling the proper method on it.  I've seen other JIRAs complaining 
about that lack of flexibility as well.

Right now, the only option I have is to write a brand new KMeans* from scratch.

> ml.KMeansModel.transform is very inefficient
> 
>
> Key: SPARK-18808
> URL: https://issues.apache.org/jira/browse/SPARK-18808
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Michel Lemay
>
> The function ml.KMeansModel.transform calls the 
> parentModel.predict(features) method on each row, which in turn 
> normalizes all clusterCenters via mllib.KMeansModel.clusterCentersWithNorm 
> every time!
> This is a serious waste of resources!  In my profiling, 
> clusterCentersWithNorm represents 99% of the sampling!
> This should have been implemented with a broadcast variable, as is done in 
> other functions like computeCost.
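
A minimal sketch of the broadcast pattern being suggested, written as a standalone helper 
because the real centers-with-norms field is private in ml/mllib; the helper name and the 
"features"/"prediction" column names are assumptions:

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: broadcast the centers once per call, as computeCost does,
// instead of letting every row re-derive normalized centers.
def assignClusters(spark: SparkSession, df: DataFrame, centers: Array[Vector]): DataFrame = {
  val bcCenters = spark.sparkContext.broadcast(centers)
  val predictUdf = udf { features: Vector =>
    bcCenters.value.zipWithIndex
      .minBy { case (center, _) => Vectors.sqdist(features, center) }
      ._2
  }
  df.withColumn("prediction", predictUdf(col("features")))
}
{code}

The centers themselves are public on ml.KMeansModel, so this could be driven with 
assignClusters(spark, df, model.clusterCenters).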



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736230#comment-15736230
 ] 

Felix Cheung commented on SPARK-18810:
--

Also, to expand on the earlier note above, I think the main thing is to be able to 
run existing tests, build vignettes and so on
- without having to change any code
or
- without having to manually call install.spark in a separate session first to 
cache the Spark jar.

This is why I think it makes sense to have an environment variable override instead 
of an API parameter switch.


> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-12-09 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736236#comment-15736236
 ] 

liyunzhang_intel commented on SPARK-13955:
--

[~jerryshao]:
After testing, I can use "spark.yarn.archive" with the following steps:
1. cd $SPARK_HOME/jars
2. zip spark-archive.zip ./*  # enter the directory, then zip
3. When you test spark-archive.zip with "unzip -t spark-archive.zip", you will 
see
{code}
 unzip -t spark-archive.zip 
Archive:  spark-archive.zip
testing: activation-1.1.1.jar OK
testing: antlr4-runtime-4.5.3.jar   OK
testing: aopalliance-1.0.jar  OK
testing: aopalliance-repackaged-2.4.0-b34.jar   OK
testing: apacheds-i18n-2.0.0-M15.jar   OK
testing: apacheds-kerberos-codec-2.0.0-M15.jar   OK
testing: api-asn1-api-1.0.0-M20.jar   OK
testing: api-util-1.0.0-M20.jar   OK
testing: arpack_combined_all-0.1.jar   OK
{code}
4. Copy spark-archive.zip to HDFS, e.g. "hadoop fs -copyFromLocal 
spark-archive.zip hdfs://bdpe42:8020/"
5. Append "spark.yarn.archive=hdfs://bdpe42:8020/spark-archive.zip" to 
conf/spark-defaults.conf


> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the Spark 
> assembly jar is not uploaded to HDFS. This may be a known issue from the 
> SPARK-11157 work; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736233#comment-15736233
 ] 

Shivaram Venkataraman commented on SPARK-18810:
---

Yeah, I think that sounds good. This need not be an advertised feature that we 
tell users about, but more of a flag we use for testing.

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736217#comment-15736217
 ] 

Felix Cheung commented on SPARK-18810:
--

For RC, it actually expects to have a subdirectory `spark-2.1.0` (==version) so 
it doesn't exactly match `spark-2.1.0-rc2-bin`
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/R/pkg/R/install.R#L71


> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736217#comment-15736217
 ] 

Felix Cheung edited comment on SPARK-18810 at 12/9/16 8:06 PM:
---

For RC, it actually expects to have a subdirectory `spark-2.1.0` (==version) so 
it doesn't exactly match `spark-2.1.0-rc2-bin` in 
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/

https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/R/pkg/R/install.R#L71



was (Author: felixcheung):
For RC, it actually expects to have a subdirectory `spark-2.1.0` (==version) so 
it doesn't exactly match `spark-2.1.0-rc2-bin`
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/R/pkg/R/install.R#L71


> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18455) General support for correlated subquery processing

2016-12-09 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736195#comment-15736195
 ] 

Herman van Hovell commented on SPARK-18455:
---

Thanks for adding the doc. I look forward to the PRs!

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
>
> Subquery support has been introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use case: the ones used in TPC queries for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregates and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736182#comment-15736182
 ] 

Shivaram Venkataraman commented on SPARK-18810:
---

I think the snapshot case and the RC case are probably a bit different.
- In the case of RCs the artifact name matches what would be the final release 
(for example 
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/) so we 
only need to change the base url (environment variable could work for this)
- For nightly builds the artifact name also changes, and this probably needs 
some more thought. I guess having a way to override the entire URL would solve 
both cases?

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension

2016-12-09 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736179#comment-15736179
 ] 

Herman van Hovell commented on SPARK-18799:
---

It is way too late to put this in Spark 2.1. So yeah 2.2.

> Spark SQL expose interface for plug-gable parser extension 
> ---
>
> Key: SPARK-18799
> URL: https://issues.apache.org/jira/browse/SPARK-18799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jihong MA
>
> There used to be an interface for plugging in a parser extension through 
> ParserDialect in HiveContext in all Spark 1.x versions. Starting with the Spark 2.x 
> releases, Apache Spark moved to a new parser (ANTLR 4), and there is no longer a 
> way to extend the default SQL parser through the SparkSession interface. However, 
> this is a real pain and hard to work around when integrating other data 
> sources with Spark with extended support such as INSERT, UPDATE, or DELETE 
> statements or any other data management statement. 
> It would be very nice to continue to expose an interface for parser extensions 
> to make data source integration easier and smoother. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18807) Should suppress output print for calls to JVM methods with void return values

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18807:


Assignee: Apache Spark  (was: Felix Cheung)

> Should suppress output print for calls to JVM methods with void return values
> -
>
> Key: SPARK-18807
> URL: https://issues.apache.org/jira/browse/SPARK-18807
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> Several SparkR APIs that call into JVM methods with void return values have 
> their results printed out, especially when running in a REPL or IDE.
> Example:
> > setLogLevel("WARN")
> NULL
> We should fix this to make the output clearer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18807) Should suppress output print for calls to JVM methods with void return values

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18807:


Assignee: Felix Cheung  (was: Apache Spark)

> Should suppress output print for calls to JVM methods with void return values
> -
>
> Key: SPARK-18807
> URL: https://issues.apache.org/jira/browse/SPARK-18807
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> Several SparkR APIs that call into JVM methods with void return values have 
> their results printed out, especially when running in a REPL or IDE.
> Example:
> > setLogLevel("WARN")
> NULL
> We should fix this to make the output clearer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18807) Should suppress output print for calls to JVM methods with void return values

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18807:


Assignee: Felix Cheung  (was: Apache Spark)

> Should suppress output print for calls to JVM methods with void return values
> -
>
> Key: SPARK-18807
> URL: https://issues.apache.org/jira/browse/SPARK-18807
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> Several SparkR APIs that call into JVM methods with void return values have 
> their results printed out, especially when running in a REPL or IDE.
> Example:
> > setLogLevel("WARN")
> NULL
> We should fix this to make the output clearer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18807) Should suppress output print for calls to JVM methods with void return values

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18807:


Assignee: Apache Spark  (was: Felix Cheung)

> Should suppress output print for calls to JVM methods with void return values
> -
>
> Key: SPARK-18807
> URL: https://issues.apache.org/jira/browse/SPARK-18807
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> Several SparkR APIs that call into JVM methods with void return values have 
> their results printed out, especially when running in a REPL or IDE.
> Example:
> > setLogLevel("WARN")
> NULL
> We should fix this to make the output clearer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736161#comment-15736161
 ] 

Felix Cheung commented on SPARK-18810:
--

I've found the same issue while testing as well, and was going to propose a 
change to support this.

Essentially, for snapshot and RC builds, since the jar is not on the Apache 
mirrors, install.spark is unable to download it. We need a way to 
override the URL (details: since it constructs the URL from a base URL and a 
version path, it expects the source in a certain directory structure; 
currently this structure does not match how the snapshot and RC builds are 
published, so we need a way to override the entire URL).

I propose we have an environment variable instead of a parameter, since we want 
to be able to run everything the same way without having to make code changes.


> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18798) Expose the kill Executor in Yarn Mode

2016-12-09 Thread Narendra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736153#comment-15736153
 ] 

Narendra commented on SPARK-18798:
--

Thanks for clarifying, it helped.

> Expose the kill Executor in Yarn Mode
> -
>
> Key: SPARK-18798
> URL: https://issues.apache.org/jira/browse/SPARK-18798
> Project: Spark
>  Issue Type: Improvement
>Reporter: Narendra
>
> Expose the kill Executor in Yarn Mode
> I can see Spark has already exposed the kill executor method through the Spark 
> context for Mesos; if Spark can expose the same method for YARN, it would be a 
> good feature for anyone who wants to test application stability by randomly 
> killing executors.
> I see Spark has kill executor support in YarnAllocator; it shouldn't take much 
> time to expose this. Anyone can work on it; I can as well.
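
For what it's worth, SparkContext already has a developer API for this; here is a hedged 
sketch of the kind of chaos-testing helper described above (executor IDs are assumed to be 
collected from the executors page or REST API, and whether the kill request is honored 
depends on the cluster manager):

{code}
import scala.util.Random
import org.apache.spark.SparkContext

// Hypothetical helper: ask the cluster manager to kill one randomly chosen executor.
def killRandomExecutor(sc: SparkContext, executorIds: Seq[String]): Boolean = {
  if (executorIds.isEmpty) {
    false
  } else {
    val victim = executorIds(Random.nextInt(executorIds.length))
    sc.killExecutor(victim)  // developer API; executor IDs look like "1", "2", ...
  }
}
{code}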



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-09 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-18810:
-

 Summary: SparkR install.spark does not work for RCs, snapshots
 Key: SPARK-18810
 URL: https://issues.apache.org/jira/browse/SPARK-18810
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.2, 2.1.0
Reporter: Shivaram Venkataraman


We publish source archives of the SparkR package now in RCs and in nightly 
snapshot builds. One of the problems that still remains is that `install.spark` 
does not work for these as it looks for the final Spark version to be present 
in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18809) Kinesis deaggregation issue on master

2016-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18809:


Assignee: Apache Spark

> Kinesis deaggregation issue on master
> -
>
> Key: SPARK-18809
> URL: https://issues.apache.org/jira/browse/SPARK-18809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Brian ONeill
>Assignee: Apache Spark
>
> Fix for SPARK-14421 was never applied to master.
> https://github.com/apache/spark/pull/16236
> Upgrade KCL to 1.6.2 to support deaggregation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


