[jira] [Commented] (SPARK-15101) Audit: ml.clustering and ml.recommendation

2016-05-20 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294737#comment-15294737
 ] 

yuhao yang commented on SPARK-15101:


Resolving the issue here, as all sub-tasks are finished. cc [~josephkb].

> Audit: ml.clustering and ml.recommendation
> --
>
> Key: SPARK-15101
> URL: https://issues.apache.org/jira/browse/SPARK-15101
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.






[jira] [Assigned] (SPARK-15461) modify python test script using default version 2.7

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15461:


Assignee: (was: Apache Spark)

> modify python test script using default version 2.7
> ---
>
> Key: SPARK-15461
> URL: https://issues.apache.org/jira/browse/SPARK-15461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For Spark 2.0, the python test script does not support python 2.6, so the 
> default python version used in python/run_tests.py needs to be updated to python 2.7.






[jira] [Assigned] (SPARK-15461) modify python test script using default version 2.7

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15461:


Assignee: Apache Spark

> modify python test script using default version 2.7
> ---
>
> Key: SPARK-15461
> URL: https://issues.apache.org/jira/browse/SPARK-15461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Critical
>
> For Spark 2.0, the python test script does not support python 2.6, so the 
> default python version used in python/run_tests.py needs to be updated to python 2.7.






[jira] [Commented] (SPARK-15461) modify python test script using default version 2.7

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294732#comment-15294732
 ] 

Apache Spark commented on SPARK-15461:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/13240

> modify python test script using default version 2.7
> ---
>
> Key: SPARK-15461
> URL: https://issues.apache.org/jira/browse/SPARK-15461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For Spark 2.0, the python test script does not support python 2.6, so the 
> default python version used in python/run_tests.py needs to be updated to python 2.7.






[jira] [Updated] (SPARK-15461) modify python test script using default version 2.7

2016-05-20 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-15461:
---
Component/s: Tests
 PySpark

> modify python test script using default version 2.7
> ---
>
> Key: SPARK-15461
> URL: https://issues.apache.org/jira/browse/SPARK-15461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For Spark 2.0, the python test script does not support python 2.6, so the 
> default python version used in python/run_tests.py needs to be updated to python 2.7.






[jira] [Created] (SPARK-15460) Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting hive.metastore.warehouse.dir

2016-05-20 Thread Xiao Li (JIRA)
Xiao Li created SPARK-15460:
---

 Summary: Issue Exceptions from Thrift Server and Spark-SQL Cli 
when Users Inputting hive.metastore.warehouse.dir
 Key: SPARK-15460
 URL: https://issues.apache.org/jira/browse/SPARK-15460
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Multiple test suites still pass hive.metastore.warehouse.dir as an input 
parameter to the Thrift Server or the Spark-SQL CLI. Before this PR, this 
parameter is simply ignored without any message, even though it did change the 
default database location in the warehouse prior to Spark 2.0.0.

To ensure users switch to the new parameter, this PR issues an exception if 
users specify hive.metastore.warehouse.dir in the Thrift Server or Spark SQL CLI:
{noformat}
Exception in thread "main" java.lang.Error: hive.metastore.warehouse.dir is 
deprecated. Instead, use spark.sql.warehouse.dir to specify the default 
location of database.
{noformat}
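
For concreteness, a minimal Scala sketch of the kind of check described above. This is not the actual Spark code; `hiveconf` here stands for whatever --hiveconf key/value pairs the CLI or Thrift Server received.

{code}
// Sketch only: fail fast when the deprecated Hive property is supplied,
// mirroring the error message quoted above.
object WarehouseDirCheck {
  def assertNoDeprecatedWarehouseDir(hiveconf: Map[String, String]): Unit = {
    if (hiveconf.contains("hive.metastore.warehouse.dir")) {
      throw new Error(
        "hive.metastore.warehouse.dir is deprecated. Instead, use " +
          "spark.sql.warehouse.dir to specify the default location of database.")
    }
  }
}
{code}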







[jira] [Commented] (SPARK-15460) Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting hive.metastore.warehouse.dir

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294730#comment-15294730
 ] 

Apache Spark commented on SPARK-15460:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13111

> Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting 
> hive.metastore.warehouse.dir
> ---
>
> Key: SPARK-15460
> URL: https://issues.apache.org/jira/browse/SPARK-15460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Multiple test suites still pass hive.metastore.warehouse.dir as an input 
> parameter to the Thrift Server or the Spark-SQL CLI. Before this PR, this 
> parameter is simply ignored without any message, even though it did change 
> the default database location in the warehouse prior to Spark 2.0.0.
> To ensure users switch to the new parameter, this PR issues an exception if 
> users specify hive.metastore.warehouse.dir in the Thrift Server or Spark SQL CLI:
> {noformat}
> Exception in thread "main" java.lang.Error: hive.metastore.warehouse.dir is 
> deprecated. Instead, use spark.sql.warehouse.dir to specify the default 
> location of database.
> {noformat}






[jira] [Created] (SPARK-15461) modify python test script using default version 2.7

2016-05-20 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-15461:
--

 Summary: modify python test script using default version 2.7
 Key: SPARK-15461
 URL: https://issues.apache.org/jira/browse/SPARK-15461
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Weichen Xu
Priority: Critical


For Spark 2.0, the python test script does not support python 2.6, so the default 
python version used in python/run_tests.py needs to be updated to python 2.7.







[jira] [Assigned] (SPARK-15460) Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting hive.metastore.warehouse.dir

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15460:


Assignee: (was: Apache Spark)

> Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting 
> hive.metastore.warehouse.dir
> ---
>
> Key: SPARK-15460
> URL: https://issues.apache.org/jira/browse/SPARK-15460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Multiple test suites still pass hive.metastore.warehouse.dir as an input 
> parameter to the Thrift Server or the Spark-SQL CLI. Before this PR, this 
> parameter is simply ignored without any message, even though it did change 
> the default database location in the warehouse prior to Spark 2.0.0.
> To ensure users switch to the new parameter, this PR issues an exception if 
> users specify hive.metastore.warehouse.dir in the Thrift Server or Spark SQL CLI:
> {noformat}
> Exception in thread "main" java.lang.Error: hive.metastore.warehouse.dir is 
> deprecated. Instead, use spark.sql.warehouse.dir to specify the default 
> location of database.
> {noformat}






[jira] [Updated] (SPARK-15460) Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting hive.metastore.warehouse.dir

2016-05-20 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-15460:

Issue Type: Improvement  (was: Bug)

> Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting 
> hive.metastore.warehouse.dir
> ---
>
> Key: SPARK-15460
> URL: https://issues.apache.org/jira/browse/SPARK-15460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Multiple test suites still pass hive.metastore.warehouse.dir as an input 
> parameter to the Thrift Server or the Spark-SQL CLI. Before this PR, this 
> parameter is simply ignored without any message, even though it did change 
> the default database location in the warehouse prior to Spark 2.0.0.
> To ensure users switch to the new parameter, this PR issues an exception if 
> users specify hive.metastore.warehouse.dir in the Thrift Server or Spark SQL CLI:
> {noformat}
> Exception in thread "main" java.lang.Error: hive.metastore.warehouse.dir is 
> deprecated. Instead, use spark.sql.warehouse.dir to specify the default 
> location of database.
> {noformat}






[jira] [Assigned] (SPARK-15460) Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting hive.metastore.warehouse.dir

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15460:


Assignee: Apache Spark

> Issue Exceptions from Thrift Server and Spark-SQL Cli when Users Inputting 
> hive.metastore.warehouse.dir
> ---
>
> Key: SPARK-15460
> URL: https://issues.apache.org/jira/browse/SPARK-15460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Multiple test suites still pass hive.metastore.warehouse.dir as an input 
> parameter to the Thrift Server or the Spark-SQL CLI. Before this PR, this 
> parameter is simply ignored without any message, even though it did change 
> the default database location in the warehouse prior to Spark 2.0.0.
> To ensure users switch to the new parameter, this PR issues an exception if 
> users specify hive.metastore.warehouse.dir in the Thrift Server or Spark SQL CLI:
> {noformat}
> Exception in thread "main" java.lang.Error: hive.metastore.warehouse.dir is 
> deprecated. Instead, use spark.sql.warehouse.dir to specify the default 
> location of database.
> {noformat}






[jira] [Closed] (SPARK-15320) Spark-SQL Cli Ignores Parameter hive.metastore.warehouse.dir

2016-05-20 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-15320.
---
Resolution: Not A Problem

> Spark-SQL Cli Ignores Parameter hive.metastore.warehouse.dir 
> -
>
> Key: SPARK-15320
> URL: https://issues.apache.org/jira/browse/SPARK-15320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When overriding {{hive.metastore.warehouse.dir}} in the spark-sql command 
> line, it does not work. This is a regression. It works in the previous 
> release.
> For example, 
> {noformat}
> ./spark-sql --hiveconf hive.metastore.warehouse.dir=/Users/xiaoli/a/b
> {noformat}
> However, the log shows the value is overridden by the default value of 
> "spark.sql.warehouse.dir". 
> {noformat}
> 16/05/13 13:43:35 INFO HiveClientImpl: Warehouse location for Hive client 
> (version 1.2.1) is 
> /Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse
> {noformat}
> We also can see the same usage in the CliSuite:
> https://github.com/apache/spark/blob/890abd1279014d692548c9f3b557483644a0ee32/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L92
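
For reference, in Spark 2.0 the default warehouse location is taken from spark.sql.warehouse.dir rather than hive.metastore.warehouse.dir. A spark-shell style Scala sketch of setting it programmatically, reusing the path from the example above:

{code}
import org.apache.spark.sql.SparkSession

// Set the Spark 2.0 warehouse location explicitly instead of the deprecated
// hive.metastore.warehouse.dir property.
val spark = SparkSession.builder()
  .appName("warehouse-location")
  .config("spark.sql.warehouse.dir", "/Users/xiaoli/a/b")
  .enableHiveSupport()
  .getOrCreate()
{code}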






[jira] [Commented] (SPARK-15098) Audit: ml.classification

2016-05-20 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294719#comment-15294719
 ] 

yuhao yang commented on SPARK-15098:


I've made a pass and found no notable changes required for user guide and 
examples. 


> Audit: ml.classification
> 
>
> Key: SPARK-15098
> URL: https://issues.apache.org/jira/browse/SPARK-15098
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.






[jira] [Resolved] (SPARK-15437) Failed to create HiveContext in SparkR

2016-05-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15437.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.0.0

> Failed to create HiveContext in SparkR
> --
>
> Key: SPARK-15437
> URL: https://issues.apache.org/jira/browse/SPARK-15437
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Failed to create HiveContext in SparkR, even if we build the project with 
> Hive support. 
> {noformat}
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive 
> -Phive-thriftserver -Psparkr -DskipTests clean package
> {noformat}
> {noformat}
>  Welcome to
>   __ 
>/ __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT 
> /_/ 
>  Spark context is available as sc, SQL context is available as sqlContext
> > hiveContext <- sparkRHive.init(sc)
> 16/05/19 22:49:45 ERROR RBackendHandler:  on 
> org.apache.spark.sql.hive.HiveContext failed
> Error in value[[3L]](cond) : Spark SQL is not built with Hive support
> {noformat}






[jira] [Resolved] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module

2016-05-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15424.
-
Resolution: Fixed

> Revert SPARK-14807 Create a hivecontext-compatibility module
> 
>
> Key: SPARK-15424
> URL: https://issues.apache.org/jira/browse/SPARK-15424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> I initially asked to create a hivecontext-compatibility module to put the 
> HiveContext in. But we are so close to the Spark 2.0 release and there is only 
> a single class in it. It seems overkill, and more inconvenient, to have an 
> entire module for a single class.






[jira] [Commented] (SPARK-15459) Make Range logical and physical explain consistent

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294686#comment-15294686
 ] 

Apache Spark commented on SPARK-15459:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13239

> Make Range logical and physical explain consistent
> --
>
> Key: SPARK-15459
> URL: https://issues.apache.org/jira/browse/SPARK-15459
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-15459) Make Range logical and physical explain consistent

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15459:


Assignee: Reynold Xin  (was: Apache Spark)

> Make Range logical and physical explain consistent
> --
>
> Key: SPARK-15459
> URL: https://issues.apache.org/jira/browse/SPARK-15459
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-15459) Make Range logical and physical explain consistent

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15459:


Assignee: Apache Spark  (was: Reynold Xin)

> Make Range logical and physical explain consistent
> --
>
> Key: SPARK-15459
> URL: https://issues.apache.org/jira/browse/SPARK-15459
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-15459) Make Range logical and physical explain consistent

2016-05-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15459:
---

 Summary: Make Range logical and physical explain consistent
 Key: SPARK-15459
 URL: https://issues.apache.org/jira/browse/SPARK-15459
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Comment Edited] (SPARK-15429) When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate the receiving rate accurately.

2016-05-20 Thread Albert Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292908#comment-15292908
 ] 

Albert Cheng edited comment on SPARK-15429 at 5/21/16 2:58 AM:
---

I have an idea about this issue.

First, add a new parameter `concurrentJobs` to PIDRateEstimator.
Second, we can change `error = latestRate - processingRate` to `error = 
latestRate - processingRate * concurrentJobs.toDouble`, and change 
`historicalError = schedulingDelay.toDouble * processingRate / 
batchIntervalMillis` to `historicalError = schedulingDelay.toDouble * 
processingRate * concurrentJobs.toDouble / batchIntervalMillis`.

Does that sound right?
I would like to fix this.
[~apachespark]
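
For concreteness, a small self-contained Scala sketch of the adjustment proposed above. This is not the actual PIDRateEstimator from Spark; `concurrentJobs` is the hypothetical new parameter, and the derivative term of the real estimator is left out.

{code}
// Sketch only: scale the observed per-job processing rate by the number of
// concurrently running jobs before computing the PID error terms.
class ConcurrencyAwareRateEstimator(
    batchIntervalMillis: Long,
    proportional: Double,
    integral: Double,
    minRate: Double,
    concurrentJobs: Int) {

  def compute(latestRate: Double, processingRate: Double, schedulingDelay: Long): Double = {
    val error = latestRate - processingRate * concurrentJobs.toDouble
    val historicalError =
      schedulingDelay.toDouble * processingRate * concurrentJobs.toDouble / batchIntervalMillis
    // PID-style update without the derivative term, floored at minRate.
    math.max(latestRate - proportional * error - integral * historicalError, minRate)
  }
}
{code}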


was (Author: cq365423762):
I have a idea about this issue.

First, add a new parameter `concurrentJobs` to PIDRateEstimator.
Second, We can change the `error = latestRate - processingRate` to `error = 
latestRate - processingRate * concurrentJobs.toDouble`. And change the 
`historicalError = schedulingDelay.toDouble * processingRate / 
batchIntervalMillis` to `historicalError = schedulingDelay.toDouble * 
processingRate * concurrentJobs.toDouble / batchIntervalMillis`.

Is it right?
I would like to fix this. 

> When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate 
> the receiving rate accurately.
> --
>
> Key: SPARK-15429
> URL: https://issues.apache.org/jira/browse/SPARK-15429
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Albert Cheng
>
> When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate 
> the receiving rate accurately.
> For example, if the batch duration is set to 10 seconds, each rdd in the 
> dstream will take 20s to compute. By changing 
> `spark.streaming.concurrentJobs=2`, each rdd in the dstream still takes 20s 
> to consume the data, which leads to poor estimation of backpressure by 
> PIDRateEstimator.






[jira] [Commented] (SPARK-15329) When start spark with yarn: spark.SparkContext: Error initializing SparkContext.

2016-05-20 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294653#comment-15294653
 ] 

Saisai Shao commented on SPARK-15329:
-

{code}
2016-05-15 00:06:08,368 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Process tree for container: container_1463267120616_0001_01_01 has 
processes older than 1 iteration running over the configured limit. 
Limit=2254857728, current usage = 2331357184
2016-05-15 00:06:08,374 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=1,containerID=container_1463267120616_0001_01_01] is 
running beyond virtual memory limits. Current usage: 264.2 MB of 1 GB physical 
memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1463267120616_0001_01_01 :
{code}

Please check the NodeManager log: the container is killed by the NM because it 
exceeded the virtual memory limit. Please increase the pmem-to-vmem ratio 
(yarn.nodemanager.vmem-pmem-ratio) or turn off the vmem check 
(yarn.nodemanager.vmem-check-enabled).

If you hit a problem running Spark that is not a bug, please send mail to the 
user mailing list first; JIRA is not used for Q&A.

>  When start spark with yarn: spark.SparkContext: Error initializing 
> SparkContext. 
> --
>
> Key: SPARK-15329
> URL: https://issues.apache.org/jira/browse/SPARK-15329
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Jon
>
> Hi, I'm trying to start spark with yarn-client, like this "spark-shell 
> --master yarn-client", but I'm getting the error below.
> If I start spark just with "spark-shell" everything works fine.
> I have a single node machine where I have all hadoop processes running, and a 
> hive metastore server running.
> I already try more than 30 different configurations, but nothing is working, 
> the config that I have now is this:
> core-site.xml:
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://masternode:9000</value>
>   </property>
> </configuration>
> hdfs-site.xml:
> <configuration>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
> </configuration>
> yarn-site.xml:
> <configuration>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address</name>
>     <value>masternode:8031</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.address</name>
>     <value>masternode:8032</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address</name>
>     <value>masternode:8030</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address</name>
>     <value>masternode:8033</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address</name>
>     <value>masternode:8088</value>
>   </property>
> </configuration>
> About spark confs:
> spark-env.sh:
> HADOOP_CONF_DIR=/usr/local/hadoop-2.7.1/hadoop
> SPARK_MASTER_IP=masternode
> spark-defaults.conf
> spark.master spark://masternode:7077
> spark.serializer org.apache.spark.serializer.KryoSerializer
> Do you understand why this is happening?
> hadoopadmin@mn:~$ spark-shell --master yarn-client
> 16/05/14 23:21:07 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/05/14 23:21:07 INFO spark.SecurityManager: Changing view acls to: 
> hadoopadmin
> 16/05/14 23:21:07 INFO spark.SecurityManager: Changing modify acls to: 
> hadoopadmin
> 16/05/14 23:21:07 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hadoopadmin); 
> users with modify permissions: Set(hadoopadmin)
> 16/05/14 23:21:08 INFO spark.HttpServer: Starting HTTP Server
> 16/05/14 23:21:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/05/14 23:21:08 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:36979
> 16/05/14 23:21:08 INFO util.Utils: Successfully started service 'HTTP class 
> server' on port 36979.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/05/14 23:21:12 INFO spark.SparkContext: Running Spark version 1.6.1
> 16/05/14 23:21:12 INFO spark.SecurityManager: Changing view acls to: 
> hadoopadmin
> 16/05/14 23:21:12 INFO spark.SecurityManager: Changing modify acls to: 
> hadoopadmin
> 16/05/14 23:21:12 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hadoopadmin); 
> users with modify permissions: Set(hadoopadmin)
> 16/05/14 23:21:12 INFO util.Utils: Successfully started service 'sparkDriver' 
> on port 33128.
> 16/05/14 23:21:13 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/05/14 23:21:13 INFO Remoting: Starting remoting
> 16/05/14 23:21:13 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.15.0.11:34382]
> 16/05/14 23:21:13 INFO util.Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 34382.
> 16/05/14 23:21:13 INFO 

[jira] [Updated] (SPARK-15423) why it is very slow to clean resources in Spark-2.0.0-preview

2016-05-20 Thread zszhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zszhong updated SPARK-15423:

Summary: why it is very slow to clean resources in Spark-2.0.0-preview  
(was: why it is very slow to clean resources in Spark)

> why it is very slow to clean resources in Spark-2.0.0-preview
> -
>
> Key: SPARK-15423
> URL: https://issues.apache.org/jira/browse/SPARK-15423
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, MLlib
>Affects Versions: 2.0.0
> Environment: RedHat 6.5 (64 bit), JDK 1.8, Standalone mode
>Reporter: zszhong
>  Labels: newbie, starter
>
> Hi, everyone! I'm new to Spark. Originally I submitted a post at 
> [http://stackoverflow.com/questions/37331226/why-it-is-very-slow-to-clean-resources-in-spark],
> but somebody thinks that it is off-topic. Thus I post here to ask for your 
> help. If this post is not relevant here, please feel free to delete it. I just 
> copied the content here; I don't know how to edit the code to be more readable, 
> so please refer to the link on stackoverflow.
> I've submitted a very simple task into a standalone Spark environment 
> (`spark-2.0.0-preview`, `jdk 1.8`, `48 CPU cores`, `250 Gb memory`) with the 
> following command:
> bin/spark-submit.sh --master spark://hostname.domain:7077 --conf 
> "spark.executor.memory=8G" ../SimpleApp.py ../data/train/ ../data/val/
> where the `SimpleApp.py` is:
> from __future__ import print_function
> import sys
> import os
> import numpy as np
> from pyspark import SparkContext 
> from pyspark.mllib.tree import RandomForest, RandomForestModel
> from pyspark.mllib.util import MLUtils 
> trainDataPath = sys.argv[1]
> valDataPath = sys.argv[2]
> sc = SparkContext(appName="Classification using Spark Random Forest")
> trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
> valData = MLUtils.loadLibSVMFile(sc, valDataPath) 
> model = RandomForest.trainClassifier(trainData, numClasses=6, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)
> predictions = model.predict(valData.map(lambda x: x.features))
> labelsAndPredictions = valData.map(lambda lp: lp.label).zip(predictions)
> testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(valData.count())
> print('Test Error = ' + str(testErr))
> And the task is running OK and can output the `Test Error` as follows:
> Test Error = 0.380580779161
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
> 127.0.0.1:59714 in memory (size: 12.1 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
> 127.0.0.1:37978 in memory (size: 12.1 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
> 127.0.0.1:37978 in memory (size: 10.9 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
> 127.0.0.1:59714 in memory (size: 10.9 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
> 127.0.0.1:59714 in memory (size: 4.6 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
> 127.0.0.1:37978 in memory (size: 4.6 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
> 127.0.0.1:59714 in memory (size: 4.0 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
> 127.0.0.1:37978 in memory (size: 4.0 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
> 127.0.0.1:59714 in memory (size: 455.0 B, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
> 127.0.0.1:37978 in memory (size: 455.0 B, free: 4.5 GB)
> 16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 4
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
> 127.0.0.1:59714 in memory (size: 9.2 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
> 127.0.0.1:37978 in memory (size: 9.2 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
> 127.0.0.1:59714 in memory (size: 3.6 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
> 127.0.0.1:37978 in memory (size: 3.6 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 
> 127.0.0.1:59714 in memory (size: 389.0 B, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 
> 127.0.0.1:37978 in memory (size: 389.0 B, free: 4.5 GB)
> 

[jira] [Commented] (SPARK-15423) why it is very slow to clean resources in Spark

2016-05-20 Thread zszhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294637#comment-15294637
 ] 

zszhong commented on SPARK-15423:
-

I've also downloaded spark-1.6.1 to run the same code and application. It seems 
to work well in spark-1.6.1 and exits correctly in a reasonable time.
Thus this problem might be related to `spark-2.0.0-preview`.

> why it is very slow to clean resources in Spark
> ---
>
> Key: SPARK-15423
> URL: https://issues.apache.org/jira/browse/SPARK-15423
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, MLlib
>Affects Versions: 2.0.0
> Environment: RedHat 6.5 (64 bit), JDK 1.8, Standalone mode
>Reporter: zszhong
>  Labels: newbie, starter
>
> Hi, everyone! I'm new to Spark. Originally I submitted a post at 
> [http://stackoverflow.com/questions/37331226/why-it-is-very-slow-to-clean-resources-in-spark],
> but somebody thinks that it is off-topic. Thus I post here to ask for your 
> help. If this post is not relevant here, please feel free to delete it. I just 
> copied the content here; I don't know how to edit the code to be more readable, 
> so please refer to the link on stackoverflow.
> I've submitted a very simple task into a standalone Spark environment 
> (`spark-2.0.0-preview`, `jdk 1.8`, `48 CPU cores`, `250 Gb memory`) with the 
> following command:
> bin/spark-submit.sh --master spark://hostname.domain:7077 --conf 
> "spark.executor.memory=8G" ../SimpleApp.py ../data/train/ ../data/val/
> where the `SimpleApp.py` is:
> from __future__ import print_function
> import sys
> import os
> import numpy as np
> from pyspark import SparkContext 
> from pyspark.mllib.tree import RandomForest, RandomForestModel
> from pyspark.mllib.util import MLUtils 
> trainDataPath = sys.argv[1]
> valDataPath = sys.argv[2]
> sc = SparkContext(appName="Classification using Spark Random Forest")
> trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
> valData = MLUtils.loadLibSVMFile(sc, valDataPath) 
> model = RandomForest.trainClassifier(trainData, numClasses=6, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)
> predictions = model.predict(valData.map(lambda x: x.features))
> labelsAndPredictions = valData.map(lambda lp: lp.label).zip(predictions)
> testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(valData.count())
> print('Test Error = ' + str(testErr))
> And the task is running OK and can output the `Test Error` as follows:
> Test Error = 0.380580779161
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
> 127.0.0.1:59714 in memory (size: 12.1 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
> 127.0.0.1:37978 in memory (size: 12.1 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
> 127.0.0.1:37978 in memory (size: 10.9 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
> 127.0.0.1:59714 in memory (size: 10.9 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
> 127.0.0.1:59714 in memory (size: 4.6 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
> 127.0.0.1:37978 in memory (size: 4.6 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
> 127.0.0.1:59714 in memory (size: 4.0 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
> 127.0.0.1:37978 in memory (size: 4.0 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
> 127.0.0.1:59714 in memory (size: 455.0 B, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
> 127.0.0.1:37978 in memory (size: 455.0 B, free: 4.5 GB)
> 16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 4
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
> 127.0.0.1:59714 in memory (size: 9.2 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
> 127.0.0.1:37978 in memory (size: 9.2 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
> 127.0.0.1:59714 in memory (size: 3.6 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
> 127.0.0.1:37978 in memory (size: 3.6 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 
> 127.0.0.1:59714 in memory (size: 389.0 B, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: 

[jira] [Commented] (SPARK-15453) Sort Merge Join to use bucketing metadata to optimize query plan

2016-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294627#comment-15294627
 ] 

Tejas Patil commented on SPARK-15453:
-

[~rxin] Yes. I updated the jira title. If we avoid the exchange and sort for 
bucketed tables, then it would essentially be equivalent to what an SMB join in 
Hive would do. I have put out an initial PR adding the support for the 
datasource path. I am looking for some early feedback on the PR just to be sure 
I am not missing some cases.

Once that's confirmed, I plan to go ahead and add that support for reading from 
Hive tables.

> Sort Merge Join to use bucketing metadata to optimize query plan
> 
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables, but performing joins 
> on such tables still does not utilize this metadata to produce an optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}
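
For context, a self-contained Scala sketch of how such bucketing + sort metadata is written with the datasource API. Table and column names are illustrative, not from the PR.

{code}
import org.apache.spark.sql.SparkSession

object BucketedTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.range(0, 1000).selectExpr("id AS j", "id % 10 AS k", "id % 7 AS i")

    // Hash-partition rows into 200 buckets on the join keys and keep each
    // bucket sorted, so a sort-merge join on (j, k, i) can, in principle,
    // skip the Exchange and Sort steps shown in the plan above.
    df.write
      .bucketBy(200, "j", "k", "i")
      .sortBy("j", "k", "i")
      .saveAsTable("table7")
  }
}
{code}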






[jira] [Comment Edited] (SPARK-15453) Sort Merge Join to use bucketing metadata to optimize query plan

2016-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294627#comment-15294627
 ] 

Tejas Patil edited comment on SPARK-15453 at 5/21/16 1:42 AM:
--

[~rxin] Yes. I have updated the jira title. If we avoid the exchange and sort 
for bucketed tables, then it would essentially be equivalent to what an SMB join 
in Hive would do. I have put out an initial PR adding the support for the 
datasource path. I am looking for some early feedback on the PR just to be sure 
I am not missing some cases.

Once that's confirmed, I plan to go ahead and add that support for reading from 
Hive tables.


was (Author: tejasp):
[~rxin] Yes. I updated the jira title. If we avoid the exchange and sort for 
bucketed tables, then it would essentially be equivalent to what SMB join in 
Hive would do. I have put out an initial PR for adding the support for 
datasource. I am looking for some early feedback over the PR just to be sure if 
I am not missing some cases. 

Once thats confirmed, I plan to go ahead and add that support for reading from 
Hive tables.

> Sort Merge Join to use bucketing metadata to optimize query plan
> 
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables, but performing joins 
> on such tables still does not utilize this metadata to produce an optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}






[jira] [Assigned] (SPARK-15458) Disable schema inference for streaming datasets on file streams

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15458:


Assignee: Apache Spark  (was: Tathagata Das)

> Disable schema inference for streaming datasets on file streams
> ---
>
> Key: SPARK-15458
> URL: https://issues.apache.org/jira/browse/SPARK-15458
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Relying on the schema being inferred in file streams can break easily for 
> multiple reasons:
> - accidentally running on a directory which has no data
> - the schema changing underneath
> - on restart, the query will infer the schema again, and may unexpectedly 
> infer an incorrect schema, as the files in the directory may be different at 
> the time of the restart.
> To avoid these complicated scenarios, for Spark 2.0 we are going to disable 
> schema inference by default behind a config, so that the user is forced to 
> consider explicitly what schema they want, rather than the system trying to 
> infer it and running into weird corner cases.






[jira] [Commented] (SPARK-15458) Disable schema inference for streaming datasets on file streams

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294623#comment-15294623
 ] 

Apache Spark commented on SPARK-15458:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13238

> Disable schema inference for streaming datasets on file streams
> ---
>
> Key: SPARK-15458
> URL: https://issues.apache.org/jira/browse/SPARK-15458
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Relying on the schema being inferred in file streams can break easily for 
> multiple reasons:
> - accidentally running on a directory which has no data
> - the schema changing underneath
> - on restart, the query will infer the schema again, and may unexpectedly 
> infer an incorrect schema, as the files in the directory may be different at 
> the time of the restart.
> To avoid these complicated scenarios, for Spark 2.0 we are going to disable 
> schema inference by default behind a config, so that the user is forced to 
> consider explicitly what schema they want, rather than the system trying to 
> infer it and running into weird corner cases.






[jira] [Assigned] (SPARK-15458) Disable schema inference for streaming datasets on file streams

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15458:


Assignee: Tathagata Das  (was: Apache Spark)

> Disable schema inference for streaming datasets on file streams
> ---
>
> Key: SPARK-15458
> URL: https://issues.apache.org/jira/browse/SPARK-15458
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Relying on the schema being inferred in file streams can break easily for 
> multiple reasons:
> - accidentally running on a directory which has no data
> - the schema changing underneath
> - on restart, the query will infer the schema again, and may unexpectedly 
> infer an incorrect schema, as the files in the directory may be different at 
> the time of the restart.
> To avoid these complicated scenarios, for Spark 2.0 we are going to disable 
> schema inference by default behind a config, so that the user is forced to 
> consider explicitly what schema they want, rather than the system trying to 
> infer it and running into weird corner cases.






[jira] [Created] (SPARK-15458) Disable schema inference for streaming datasets on file streams

2016-05-20 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15458:
-

 Summary: Disable schema inference for streaming datasets on file 
streams
 Key: SPARK-15458
 URL: https://issues.apache.org/jira/browse/SPARK-15458
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Tathagata Das
Assignee: Tathagata Das


Relying on the schema being inferred in file streams can break easily for 
multiple reasons:
- accidentally running on a directory which has no data
- the schema changing underneath
- on restart, the query will infer the schema again, and may unexpectedly infer 
an incorrect schema, as the files in the directory may be different at the time 
of the restart.

To avoid these complicated scenarios, for Spark 2.0 we are going to disable 
schema inference by default behind a config, so that the user is forced to 
consider explicitly what schema they want, rather than the system trying to 
infer it and running into weird corner cases.
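
For illustration, a spark-shell style Scala sketch of what supplying an explicit schema to a file stream looks like once inference is off. The path and field names are made up.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("file-stream-schema").getOrCreate()

// User-declared schema instead of letting the source infer it from files.
val schema = new StructType()
  .add("ts", TimestampType)
  .add("value", DoubleType)

val events = spark.readStream
  .schema(schema)            // explicit schema
  .json("/data/incoming")    // directory monitored by the file stream source
{code}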






[jira] [Commented] (SPARK-15457) Eliminate MLlib 2.0 build warnings from deprecations

2016-05-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294591#comment-15294591
 ] 

Joseph K. Bradley commented on SPARK-15457:
---

By the way, I plan to deprecate spark.mllib examples which use the deprecated 
SGD algorithms, unless they are easy to change.  Does that sound reasonable?  
If there are not matching spark.ml examples, I will make sure to create JIRAs 
for them.

> Eliminate MLlib 2.0 build warnings from deprecations
> 
>
> Key: SPARK-15457
> URL: https://issues.apache.org/jira/browse/SPARK-15457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Several classes and methods have been deprecated and are creating lots of 
> build warnings in branch-2.0.  This issue is to identify and fix those items:
> * *WithSGD classes: Change to make class not deprecated, object deprecated, 
> and public class constructor deprecated.  Any public use will require a 
> deprecated API.  We need to keep a non-deprecated private API since we cannot 
> eliminate certain uses: Python API, streaming algs, and examples.
> ** Use in PythonMLlibAPI: Change to using private constructors
> ** Streaming algs: No warnings after we un-deprecate the classes
> ** Examples: Deprecate or change ones which use deprecated APIs
> * others (to be listed)
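
For illustration, a minimal Scala sketch of the deprecation pattern described in the first bullet above (class kept usable, companion object and no-arg public constructor deprecated, a non-deprecated package-private path retained for internal callers). The names are made-up stand-ins, not the real MLlib classes.

{code}
package example

// Made-up stand-in for a *WithSGD class; not the actual MLlib code.
// The class itself stays non-deprecated, and its package-private primary
// constructor remains available for internal callers (Python API, streaming
// algorithms, examples).
class ExampleWithSGD private[example] (
    private var stepSize: Double,
    private var numIterations: Int) {

  // Any public construction goes through a deprecated constructor.
  @deprecated("Use the spark.ml equivalent instead", "2.0.0")
  def this() = this(1.0, 100)
}

// The companion object (the old ExampleWithSGD.train(...) style entry point)
// is deprecated as a whole, so any public use hits a deprecated API.
@deprecated("Use the spark.ml equivalent instead", "2.0.0")
object ExampleWithSGD {
  def train(stepSize: Double, numIterations: Int): ExampleWithSGD =
    new ExampleWithSGD(stepSize, numIterations)
}
{code}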






[jira] [Commented] (SPARK-15457) Eliminate MLlib 2.0 build warnings from deprecations

2016-05-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294590#comment-15294590
 ] 

Joseph K. Bradley commented on SPARK-15457:
---

OK I have a WIP one for the SGD issues.  Shall we separate them, or shall I 
send you my branch to incorporate?

> Eliminate MLlib 2.0 build warnings from deprecations
> 
>
> Key: SPARK-15457
> URL: https://issues.apache.org/jira/browse/SPARK-15457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Several classes and methods have been deprecated and are creating lots of 
> build warnings in branch-2.0.  This issue is to identify and fix those items:
> * *WithSGD classes: Change to make class not deprecated, object deprecated, 
> and public class constructor deprecated.  Any public use will require a 
> deprecated API.  We need to keep a non-deprecated private API since we cannot 
> eliminate certain uses: Python API, streaming algs, and examples.
> ** Use in PythonMLlibAPI: Change to using private constructors
> ** Streaming algs: No warnings after we un-deprecate the classes
> ** Examples: Deprecate or change ones which use deprecated APIs
> * others (to be listed)






[jira] [Commented] (SPARK-7159) Support multiclass logistic regression in spark.ml

2016-05-20 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294588#comment-15294588
 ] 

DB Tsai commented on SPARK-7159:


Hello [~sethah],

I think we will make it a separate SoftmaxRegression or 
MultinomialLogisticRegression class, since they have different behavior when 
pivoting. See 
https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions
 for detail. As a result, GLMNET has two different, independent 
implementations. In MLOR, people normally regularize the coefficients without 
pivoting; as a result, you will have n * k coefficients, where n is the 
dimension of the features and k is the number of classes. In binary LOR, by 
default, pivoting is performed, so we end up with n coefficients. Note 
that you can of course pivot in MLOR, but choosing which class to pivot on 
will create different solutions, and that's why in MLOR people don't pivot.
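
For reference, the two formulations being contrasted, written out for n features and k classes (w_c is the coefficient vector for class c); this is the standard textbook form, not code from any PR:

{noformat}
Unpivoted (softmax / MLOR), n x k coefficients:

  P(y = c | x) = \frac{\exp(w_c^\top x)}{\sum_{j=1}^{k} \exp(w_j^\top x)}

Pivoted on class k (a set of independent binary regressions), n x (k-1) coefficients:

  P(y = c | x) = \frac{\exp(w_c^\top x)}{1 + \sum_{j=1}^{k-1} \exp(w_j^\top x)},  c = 1, ..., k-1

  P(y = k | x) = \frac{1}{1 + \sum_{j=1}^{k-1} \exp(w_j^\top x)}
{noformat}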

I already started to work on this, and if you have time to help, I'm willing to 
give it to you, and help you to implement this. Let me know what you think. 

Thanks.

> Support multiclass logistic regression in spark.ml
> --
>
> Key: SPARK-7159
> URL: https://issues.apache.org/jira/browse/SPARK-7159
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>Priority: Critical
>
> This should be implemented by checking the input DataFrame's label column for 
> feature metadata specifying the number of classes.






[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-05-20 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294580#comment-15294580
 ] 

Liwei Lin commented on SPARK-15406:
---

Hi [~c...@koeninger.org], any plan on this? Thanks!

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index






[jira] [Resolved] (SPARK-15456) PySpark Shell fails to create SparkContext if HiveConf not found

2016-05-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15456.
---
  Resolution: Fixed
Assignee: Bryan Cutler
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> PySpark Shell fails to create SparkContext if HiveConf not found
> 
>
> Key: SPARK-15456
> URL: https://issues.apache.org/jira/browse/SPARK-15456
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.0.0
>
>
> When starting the PySpark shell, if HiveConf is not available then it will 
> fall back to creating a SparkSession from a SparkContext.  This is attempted 
> with the variable {{sc}}, which hasn't been initialized yet.






[jira] [Commented] (SPARK-15439) Failed to run unit test in SparkR

2016-05-20 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294522#comment-15294522
 ] 

Miao Wang commented on SPARK-15439:
---

Reproduced. Now I am analyzing the reason.

> Failed to run unit test in SparkR
> -
>
> Key: SPARK-15439
> URL: https://issues.apache.org/jira/browse/SPARK-15439
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>
> Failed to run ./R/run-tests.sh around a recent commit (May 19, 2016).
> It might be related to permissions: it seems that when I used `sudo 
> ./R/run-tests.sh` it sometimes worked. Without permissions, maybe we couldn't 
> access the /tmp directory. However, the SparkR unit testing is still brittle.
> [error 
> message|https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15273) YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect OnOutOfMemoryError parameter given by user

2016-05-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15273:
--
Assignee: Ted Yu

> YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect 
> OnOutOfMemoryError parameter given by user
> -
>
> Key: SPARK-15273
> URL: https://issues.apache.org/jira/browse/SPARK-15273
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
> Fix For: 2.0.0
>
>
> As Nirav reported in this thread:
> http://search-hadoop.com/m/q3RTtdF3yNLMd7u
> YarnSparkHadoopUtil#getOutOfMemoryErrorArgument previously specified 'kill 
> %p' unconditionally.
> We should respect the parameter given by user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15273) YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect OnOutOfMemoryError parameter given by user

2016-05-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15273.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13057
[https://github.com/apache/spark/pull/13057]

> YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect 
> OnOutOfMemoryError parameter given by user
> -
>
> Key: SPARK-15273
> URL: https://issues.apache.org/jira/browse/SPARK-15273
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
> Fix For: 2.0.0
>
>
> As Nirav reported in this thread:
> http://search-hadoop.com/m/q3RTtdF3yNLMd7u
> YarnSparkHadoopUtil#getOutOfMemoryErrorArgument previously specified 'kill 
> %p' unconditionally.
> We should respect the parameter given by user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15457) Eliminate MLlib 2.0 build warnings from deprecations

2016-05-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294495#comment-15294495
 ] 

Sean Owen commented on SPARK-15457:
---

Yeah, I have a PR brewing that will fix some of them, like the ones introduced 
by deprecating precision/recall in MulticlassMetrics (all should use accuracy).

> Eliminate MLlib 2.0 build warnings from deprecations
> 
>
> Key: SPARK-15457
> URL: https://issues.apache.org/jira/browse/SPARK-15457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Several classes and methods have been deprecated and are creating lots of 
> build warnings in branch-2.0.  This issue is to identify and fix those items:
> * *WithSGD classes: Change to make class not deprecated, object deprecated, 
> and public class constructor deprecated.  Any public use will require a 
> deprecated API.  We need to keep a non-deprecated private API since we cannot 
> eliminate certain uses: Python API, streaming algs, and examples.
> ** Use in PythonMLlibAPI: Change to using private constructors
> ** Streaming algs: No warnings after we un-deprecate the classes
> ** Examples: Deprecate or change ones which use deprecated APIs
> * others (to be listed)
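(As a rough sketch of the deprecation pattern the quoted plan describes; the 
class name and package below are hypothetical, not the actual spark.mllib code.)
{code}
package example

// Hypothetical sketch: the class stays usable internally, but the companion
// object and the public constructor are deprecated.
class FooWithSGD @deprecated("Use the spark.ml equivalent", "2.0.0") (
    private var stepSize: Double) {

  // Non-deprecated, package-private constructor kept for internal callers
  // (Python API bridge, streaming algorithms, examples).
  private[example] def this() = this(1.0)
}

@deprecated("Use the spark.ml equivalent", "2.0.0")
object FooWithSGD {
  def train(stepSize: Double): FooWithSGD = new FooWithSGD(stepSize)
}
{code}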



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB

2016-05-20 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294494#comment-15294494
 ] 

Kazuaki Ishizaki commented on SPARK-15285:
--

I can take it today if they are busy.

> Generated SpecificSafeProjection.apply method grows beyond 64 KB
> 
>
> Key: SPARK-15285
> URL: https://issues.apache.org/jira/browse/SPARK-15285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Konstantin Shaposhnikov
>Assignee: Wenchen Fan
>
> The following code snippet results in 
> {noformat}
>  org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> {noformat}
> {code}
> case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", 
> s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", 
> s10:String="10", s11:String="11", s12:String="12", s13:String="13", 
> s14:String="14", s15:String="15", s16:String="16", s17:String="17", 
> s18:String="18", s19:String="19", s20:String="20", s21:String="21", 
> s22:String="22", s23:String="23", s24:String="24", s25:String="25", 
> s26:String="26", s27:String="27", s28:String="28", s29:String="29", 
> s30:String="30", s31:String="31", s32:String="32", s33:String="33", 
> s34:String="34", s35:String="35", s36:String="36", s37:String="37", 
> s38:String="38", s39:String="39", s40:String="40", s41:String="41", 
> s42:String="42", s43:String="43", s44:String="44", s45:String="45", 
> s46:String="46", s47:String="47", s48:String="48", s49:String="49", 
> s50:String="50", s51:String="51", s52:String="52", s53:String="53", 
> s54:String="54", s55:String="55", s56:String="56", s57:String="57", 
> s58:String="58", s59:String="59", s60:String="60", s61:String="61", 
> s62:String="62", s63:String="63", s64:String="64", s65:String="65", 
> s66:String="66", s67:String="67", s68:String="68", s69:String="69", 
> s70:String="70", s71:String="71", s72:String="72", s73:String="73", 
> s74:String="74", s75:String="75", s76:String="76", s77:String="77", 
> s78:String="78", s79:String="79", s80:String="80", s81:String="81", 
> s82:String="82", s83:String="83", s84:String="84", s85:String="85", 
> s86:String="86", s87:String="87", s88:String="88", s89:String="89", 
> s90:String="90", s91:String="91", s92:String="92", s93:String="93", 
> s94:String="94", s95:String="95", s96:String="96", s97:String="97", 
> s98:String="98", s99:String="99", s100:String="100")
> case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: 
> S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: 
> S100=S100(), s9: S100=S100(), s10: S100=S100())
> val ds = Seq(S(),S(),S()).toDS
> ds.show()
> {code}
> I could reproduce this with Spark built from 1.6 branch and with 
> https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15456) PySpark Shell fails to create SparkContext if HiveConf not found

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15456:


Assignee: Apache Spark

> PySpark Shell fails to create SparkContext if HiveConf not found
> 
>
> Key: SPARK-15456
> URL: https://issues.apache.org/jira/browse/SPARK-15456
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>
> When starting the PySpark shell, if HiveConf is not available then it will 
> fall back to creating a SparkSession from a SparkContext.  This is attempted 
> with the variable {{sc}}, which hasn't been initialized yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15456) PySpark Shell fails to create SparkContext if HiveConf not found

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15456:


Assignee: (was: Apache Spark)

> PySpark Shell fails to create SparkContext if HiveConf not found
> 
>
> Key: SPARK-15456
> URL: https://issues.apache.org/jira/browse/SPARK-15456
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Bryan Cutler
>
> When starting the PySpark shell, if HiveConf is not available then it will 
> fall back to creating a SparkSession from a SparkContext.  This is attempted 
> with the variable {{sc}}, which hasn't been initialized yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15456) PySpark Shell fails to create SparkContext if HiveConf not found

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294462#comment-15294462
 ] 

Apache Spark commented on SPARK-15456:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/13237

> PySpark Shell fails to create SparkContext if HiveConf not found
> 
>
> Key: SPARK-15456
> URL: https://issues.apache.org/jira/browse/SPARK-15456
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Bryan Cutler
>
> When starting the PySpark shell, if HiveConf is not available then it will 
> fall back to creating a SparkSession from a SparkContext.  This is attempted 
> with the variable {{sc}}, which hasn't been initialized yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15457) Eliminate MLlib 2.0 build warnings from deprecations

2016-05-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15457:
--
Description: 
Several classes and methods have been deprecated and are creating lots of build 
warnings in branch-2.0.  This issue is to identify and fix those items:
* *WithSGD classes: Change to make class not deprecated, object deprecated, and 
public class constructor deprecated.  Any public use will require a deprecated 
API.  We need to keep a non-deprecated private API since we cannot eliminate 
certain uses: Python API, streaming algs, and examples.
** Use in PythonMLlibAPI: Change to using private constructors
** Streaming algs: No warnings after we un-deprecate the classes
** Examples: Deprecate or change ones which use deprecated APIs
* others (to be listed)

  was:
Several classes and methods have been deprecated and are creating lots of build 
warnings in branch-2.0.  This issue is to identify and fix those items:
* *WithSGD classes
* others (to be listed)


> Eliminate MLlib 2.0 build warnings from deprecations
> 
>
> Key: SPARK-15457
> URL: https://issues.apache.org/jira/browse/SPARK-15457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Several classes and methods have been deprecated and are creating lots of 
> build warnings in branch-2.0.  This issue is to identify and fix those items:
> * *WithSGD classes: Change to make class not deprecated, object deprecated, 
> and public class constructor deprecated.  Any public use will require a 
> deprecated API.  We need to keep a non-deprecated private API since we cannot 
> eliminate certain uses: Python API, streaming algs, and examples.
> ** Use in PythonMLlibAPI: Change to using private constructors
> ** Streaming algs: No warnings after we un-deprecate the classes
> ** Examples: Deprecate or change ones which use deprecated APIs
> * others (to be listed)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15455) For IsolatedClientLoader, we need to provide a conf to disable sharing Hadoop classes

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294454#comment-15294454
 ] 

Apache Spark commented on SPARK-15455:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/13236

> For IsolatedClientLoader, we need to provide a conf to disable sharing Hadoop 
> classes
> -
>
> Key: SPARK-15455
> URL: https://issues.apache.org/jira/browse/SPARK-15455
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, we always share Hadoop classes between the Spark side and the 
> metastore client side (HiveClientImpl). However, the Hadoop used by the 
> metastore client may be a different version of Hadoop, and in that case we 
> cannot share Hadoop classes. Once we disable sharing Hadoop classes, we cannot 
> pass a Hadoop Configuration to HiveClientImpl, because Configuration will be 
> loaded by different classloaders.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15457) Eliminate MLlib 2.0 build warnings from deprecations

2016-05-20 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-15457:
-

 Summary: Eliminate MLlib 2.0 build warnings from deprecations
 Key: SPARK-15457
 URL: https://issues.apache.org/jira/browse/SPARK-15457
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


Several classes and methods have been deprecated and are creating lots of build 
warnings in branch-2.0.  This issue is to identify and fix those items:
* *WithSGD classes
* others (to be listed)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15456) PySpark Shell fails to create SparkContext if HiveConf not found

2016-05-20 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-15456:


 Summary: PySpark Shell fails to create SparkContext if HiveConf 
not found
 Key: SPARK-15456
 URL: https://issues.apache.org/jira/browse/SPARK-15456
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Bryan Cutler


When starting the PySpark shell, if HiveConf is not available then it will fall 
back to creating a SparkSession from a SparkContext.  This is attempted with the 
variable {{sc}}, which hasn't been initialized yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15456) PySpark Shell fails to create SparkContext if HiveConf not found

2016-05-20 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294446#comment-15294446
 ] 

Bryan Cutler commented on SPARK-15456:
--

I can submit a fix for this

> PySpark Shell fails to create SparkContext if HiveConf not found
> 
>
> Key: SPARK-15456
> URL: https://issues.apache.org/jira/browse/SPARK-15456
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Bryan Cutler
>
> When starting the PySpark shell, if HiveConf is not available then it will 
> fall back to creating a SparkSession from a SparkContext.  This is attempted 
> with the variable {{sc}}, which hasn't been initialized yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15327) Catalyst code generation fails with complex data structure

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294437#comment-15294437
 ] 

Apache Spark commented on SPARK-15327:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/13235

> Catalyst code generation fails with complex data structure
> --
>
> Key: SPARK-15327
> URL: https://issues.apache.org/jira/browse/SPARK-15327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
> Attachments: full_exception.txt
>
>
> Spark code generation fails with the following error when loading parquet 
> files with a complex structure:
> {code}
> : java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an 
> rvalue
> {code}
> The generated code on line 158 looks like:
> {code}
> /* 153 */ this.scan_arrayWriter23 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 154 */ this.scan_rowWriter40 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  1);
> /* 155 */   }
> /* 156 */   
> /* 157 */   private void scan_apply_0(InternalRow scan_row) {
> /* 158 */ if (scan_isNull) {
> /* 159 */   scan_rowWriter.setNullAt(0);
> /* 160 */ } else {
> /* 161 */   // Remember the current cursor so that we can calculate how 
> many bytes are
> /* 162 */   // written later.
> /* 163 */   final int scan_tmpCursor = scan_holder.cursor;
> /* 164 */   
> {code}
> How to reproduce (Pyspark): 
> {code}
> # Some complex structure
> json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": 
> 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], 
> "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": 
> {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": 
> [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": 
> "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": 
> [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", 
> "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": 
> "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}'
> # Write to parquet file
> sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test')
> # Try to read from parquet file (this generates an exception)
> sqlContext.read.parquet('test').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15327) Catalyst code generation fails with complex data structure

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15327:


Assignee: Apache Spark  (was: Davies Liu)

> Catalyst code generation fails with complex data structure
> --
>
> Key: SPARK-15327
> URL: https://issues.apache.org/jira/browse/SPARK-15327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Apache Spark
> Attachments: full_exception.txt
>
>
> Spark code generation fails with the following error when loading parquet 
> files with a complex structure:
> {code}
> : java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an 
> rvalue
> {code}
> The generated code on line 158 looks like:
> {code}
> /* 153 */ this.scan_arrayWriter23 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 154 */ this.scan_rowWriter40 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  1);
> /* 155 */   }
> /* 156 */   
> /* 157 */   private void scan_apply_0(InternalRow scan_row) {
> /* 158 */ if (scan_isNull) {
> /* 159 */   scan_rowWriter.setNullAt(0);
> /* 160 */ } else {
> /* 161 */   // Remember the current cursor so that we can calculate how 
> many bytes are
> /* 162 */   // written later.
> /* 163 */   final int scan_tmpCursor = scan_holder.cursor;
> /* 164 */   
> {code}
> How to reproduce (Pyspark): 
> {code}
> # Some complex structure
> json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": 
> 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], 
> "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": 
> {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": 
> [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": 
> "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": 
> [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", 
> "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": 
> "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}'
> # Write to parquet file
> sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test')
> # Try to read from parquet file (this generates an exception)
> sqlContext.read.parquet('test').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15327) Catalyst code generation fails with complex data structure

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15327:


Assignee: Davies Liu  (was: Apache Spark)

> Catalyst code generation fails with complex data structure
> --
>
> Key: SPARK-15327
> URL: https://issues.apache.org/jira/browse/SPARK-15327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
> Attachments: full_exception.txt
>
>
> Spark code generation fails with the following error when loading parquet 
> files with a complex structure:
> {code}
> : java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an 
> rvalue
> {code}
> The generated code on line 158 looks like:
> {code}
> /* 153 */ this.scan_arrayWriter23 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 154 */ this.scan_rowWriter40 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  1);
> /* 155 */   }
> /* 156 */   
> /* 157 */   private void scan_apply_0(InternalRow scan_row) {
> /* 158 */ if (scan_isNull) {
> /* 159 */   scan_rowWriter.setNullAt(0);
> /* 160 */ } else {
> /* 161 */   // Remember the current cursor so that we can calculate how 
> many bytes are
> /* 162 */   // written later.
> /* 163 */   final int scan_tmpCursor = scan_holder.cursor;
> /* 164 */   
> {code}
> How to reproduce (Pyspark): 
> {code}
> # Some complex structure
> json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": 
> 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], 
> "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": 
> {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": 
> [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": 
> "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": 
> [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", 
> "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": 
> "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}'
> # Write to parquet file
> sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test')
> # Try to read from parquet file (this generates an exception)
> sqlContext.read.parquet('test').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15451) Spark PR builder should fail if code doesn't compile against JDK 7

2016-05-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294426#comment-15294426
 ] 

Marcelo Vanzin commented on SPARK-15451:


I think there's still some value even if 2.1.0 switches to JDK 8; it would help 
make sure backports don't break things, for example.

I'll see whether there's something simple that we can do.

> Spark PR builder should fail if code doesn't compile against JDK 7
> --
>
> Key: SPARK-15451
> URL: https://issues.apache.org/jira/browse/SPARK-15451
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> We need to compile certain parts of the build using jdk8, so that we test 
> things like lambdas. But when possible, we should either compile using jdk7, 
> or provide jdk7's rt.jar to javac. Otherwise it's way too easy to slip in 
> jdk8-specific library calls.
> I'll take a look at fixing the maven / sbt files, but I'm not sure how to 
> update the PR builders since this will most probably require at least a new 
> env variable (to say where jdk7 is).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15449) Wrong Data Format - Documentation Issue

2016-05-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15449:
--
Target Version/s:   (was: 1.6.1)
   Fix Version/s: (was: 1.6.1)

[~wangmiao1981] the problem is in the Java example only.
[~kiranbpatil] read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first 
if you please; it doesn't make sense to target 1.6.1, which you shouldn't set 
anyway.

> Wrong Data Format - Documentation Issue
> ---
>
> Key: SPARK-15449
> URL: https://issues.apache.org/jira/browse/SPARK-15449
> Project: Spark
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: 1.6.1
>Reporter: Kiran Biradarpatil
>Priority: Minor
>
> The Java example given for MLlib NaiveBayes at 
> http://spark.apache.org/docs/latest/mllib-naive-bayes.html expects the data 
> in LibSVM format. But the example data in MLlib's 
> data/mllib/sample_naive_bayes_data.txt is not in the right format. 
> So please rectify either the sample data file or the implementation example.
> Thanks!
> Kiran 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15078) Add all TPCDS 1.4 benchmark queries for SparkSQL

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15078.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13188
[https://github.com/apache/spark/pull/13188]

> Add all TPCDS 1.4 benchmark queries for SparkSQL
> 
>
> Key: SPARK-15078
> URL: https://issues.apache.org/jira/browse/SPARK-15078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7159) Support multiclass logistic regression in spark.ml

2016-05-20 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294372#comment-15294372
 ] 

Seth Hendrickson edited comment on SPARK-7159 at 5/20/16 10:17 PM:
---

[~dbtsai][~josephkb] I'd like to take this one if it's still open. I have an 
implementation that is functional except for some corner cases, and can have a 
PR submitted before too long. 

One part of the design that needs to be discussed (as far as I can tell), is 
how to handle passing the coefficients/intercepts to the model without breaking 
the API. If we were not concerned about the API compatibility, I'd say the best 
way would be to make the intercept a {{Vector}} and the coefficients a 
{{Vector}} (flattened) or a {{Matrix}}. I can't think of a way that would be 
both easy to use and not break the API. With that in mind, another option may 
be to stick with the same convention used in MLlib where the 
intercept/coefficients follow the obvious convention for binary logistic 
regression, but in the case of multinomial the intercept is always zero 
(meaningless), and the coefficients are a flattened {{Vector}} with the 
intercepts baked in. This is not a user-friendly solution IMO, but it would not 
break the API. Perhaps this has already been discussed? 

Thanks for your input!


was (Author: sethah):
[~dbtsai][~josephkb] I'd like to take this one if it's still open. I have an 
implementation that is functional except for some corner cases, and can have a 
PR submitted before too long. 

One part of the design that needs to be discussed (as far as I can tell), is 
how to handle passing the coefficients/intercepts to the model without breaking 
the API. If we were not concerned about the API compatibility, I'd say the best 
way would be to make the intercept an {{Vector}} and the coefficients a 
{{Vector}} (flattened) or a {{Matrix}}. I can't think of a way that would be 
both easy to use and not break the API. With that in mind, another option may 
be to stick with the same convention used in MLlib where the 
intercept/coefficients follow the obvious convention for binary logistic 
regression, but in the case of multinomial the intercept is always zero 
(meaningless), and the coefficients are a flattened {{Vector}} with the 
intercepts baked in. This is not a user-friendly solution IMO, but it would not 
break the API. Perhaps this has already been discussed? 

Thanks for your input!

> Support multiclass logistic regression in spark.ml
> --
>
> Key: SPARK-7159
> URL: https://issues.apache.org/jira/browse/SPARK-7159
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>Priority: Critical
>
> This should be implemented by checking the input DataFrame's label column for 
> feature metadata specifying the number of classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15446) catalyst using BigInteger.longValueExact that not supporting java 7 and compile error

2016-05-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15446.
---
Resolution: Duplicate

> catalyst using BigInteger.longValueExact that not supporting java 7 and 
> compile error
> -
>
> Key: SPARK-15446
> URL: https://issues.apache.org/jira/browse/SPARK-15446
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>
> Catalyst uses BigInteger.longValueExact, which is not supported on Java 7, so 
> the build fails to compile.
> In source file:
> org.apache.spark.sql.types.Decimal.scala, line 137
> Compiling with a Java 7 JDK fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15446) catalyst using BigInteger.longValueExact that not supporting java 7 and compile error

2016-05-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-15446:
---

[~WeichenXu123] yes please search JIRA first, and "Fixed" is not the correct 
resolution

> catalyst using BigInteger.longValueExact that not supporting java 7 and 
> compile error
> -
>
> Key: SPARK-15446
> URL: https://issues.apache.org/jira/browse/SPARK-15446
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>
> Catalyst uses BigInteger.longValueExact, which is not supported on Java 7, so 
> the build fails to compile.
> In source file:
> org.apache.spark.sql.types.Decimal.scala, line 137
> Compiling with a Java 7 JDK fails.
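(For reference, a Java 7-friendly equivalent can be written by hand; this is a 
REPL-style sketch only, not necessarily the fix that was applied.)
{code}
import java.math.BigInteger

// Sketch only: BigInteger.longValueExact exists only on Java 8+. A value fits
// into a Long exactly when its two's-complement bit length (sign excluded) is
// at most 63.
def longValueExactCompat(v: BigInteger): Long = {
  require(v.bitLength() <= 63, s"$v does not fit into a Long")
  v.longValue()
}
{code}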



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7159) Support multiclass logistic regression in spark.ml

2016-05-20 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294372#comment-15294372
 ] 

Seth Hendrickson commented on SPARK-7159:
-

[~dbtsai][~josephkb] I'd like to take this one if it's still open. I have an 
implementation that is functional except for some corner cases, and can have a 
PR submitted before too long. 

One part of the design that needs to be discussed (as far as I can tell), is 
how to handle passing the coefficients/intercepts to the model without breaking 
the API. If we were not concerned about the API compatibility, I'd say the best 
way would be to make the intercept a {{Vector}} and the coefficients a 
{{Vector}} (flattened) or a {{Matrix}}. I can't think of a way that would be 
both easy to use and not break the API. With that in mind, another option may 
be to stick with the same convention used in MLlib where the 
intercept/coefficients follow the obvious convention for binary logistic 
regression, but in the case of multinomial the intercept is always zero 
(meaningless), and the coefficients are a flattened {{Vector}} with the 
intercepts baked in. This is not a user-friendly solution IMO, but it would not 
break the API. Perhaps this has already been discussed? 

Thanks for your input!
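
(To make the two shapes concrete, a spark-shell style sketch; the dimensions are 
placeholders and the org.apache.spark.mllib.linalg types are used purely for 
illustration.)
{code}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

val numFeatures = 4 // placeholder
val numClasses  = 3 // placeholder

// Option A (would change the API): a Matrix of coefficients plus a Vector of
// per-class intercepts.
val coefficientMatrix = Matrices.dense(numClasses, numFeatures,
  Array.fill(numClasses * numFeatures)(0.0))
val interceptVector = Vectors.dense(Array.fill(numClasses)(0.0))

// Option B (the spark.mllib convention mentioned above): a single flattened
// Vector with the intercepts baked in, numClasses * (numFeatures + 1) values.
val flattened = Vectors.dense(Array.fill(numClasses * (numFeatures + 1))(0.0))
{code}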

> Support multiclass logistic regression in spark.ml
> --
>
> Key: SPARK-7159
> URL: https://issues.apache.org/jira/browse/SPARK-7159
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>Priority: Critical
>
> This should be implemented by checking the input DataFrame's label column for 
> feature metadata specifying the number of classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15451) Spark PR builder should fail if code doesn't compile against JDK 7

2016-05-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294370#comment-15294370
 ] 

Sean Owen commented on SPARK-15451:
---

That's fine with me, but so is just going ahead and requiring Java 8. That may 
be upon us if there's no objection to requiring it in 2.1.0, which means master 
would soon require it. Just saying don't go to a lot of trouble.

> Spark PR builder should fail if code doesn't compile against JDK 7
> --
>
> Key: SPARK-15451
> URL: https://issues.apache.org/jira/browse/SPARK-15451
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> We need to compile certain parts of the build using jdk8, so that we test 
> things like lambdas. But when possible, we should either compile using jdk7, 
> or provide jdk7's rt.jar to javac. Otherwise it's way too easy to slip in 
> jdk8-specific library calls.
> I'll take a look at fixing the maven / sbt files, but I'm not sure how to 
> update the PR builders since this will most probably require at least a new 
> env variable (to say where jdk7 is).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15454) HadoopFsRelation should filter out files starting with _

2016-05-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15454.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> HadoopFsRelation should filter out files starting with _
> 
>
> Key: SPARK-15454
> URL: https://issues.apache.org/jira/browse/SPARK-15454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Many other systems (e.g. Impala) use _xxx as staging output, and Spark should 
> not be reading those files.
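(A sketch of the intended rule, illustrative only and not the actual 
HadoopFsRelation listing code.)
{code}
// Skip files whose names start with "_" (e.g. another system's staging output)
// when listing input files.
def isStagingFile(fileName: String): Boolean = fileName.startsWith("_")
{code}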



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15327) Catalyst code generation fails with complex data structure

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15327:
--

Assignee: Davies Liu

> Catalyst code generation fails with complex data structure
> --
>
> Key: SPARK-15327
> URL: https://issues.apache.org/jira/browse/SPARK-15327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
> Attachments: full_exception.txt
>
>
> Spark code generation fails with the following error when loading parquet 
> files with a complex structure:
> {code}
> : java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an 
> rvalue
> {code}
> The generated code on line 158 looks like:
> {code}
> /* 153 */ this.scan_arrayWriter23 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 154 */ this.scan_rowWriter40 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  1);
> /* 155 */   }
> /* 156 */   
> /* 157 */   private void scan_apply_0(InternalRow scan_row) {
> /* 158 */ if (scan_isNull) {
> /* 159 */   scan_rowWriter.setNullAt(0);
> /* 160 */ } else {
> /* 161 */   // Remember the current cursor so that we can calculate how 
> many bytes are
> /* 162 */   // written later.
> /* 163 */   final int scan_tmpCursor = scan_holder.cursor;
> /* 164 */   
> {code}
> How to reproduce (Pyspark): 
> {code}
> # Some complex structure
> json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": 
> 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], 
> "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": 
> {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": 
> [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": 
> "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": 
> [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", 
> "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", 
> "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": 
> "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}'
> # Write to parquet file
> sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test')
> # Try to read from parquet file (this generates an exception)
> sqlContext.read.parquet('test').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15455) For IsolatedClientLoader, we need to provide a conf to disable sharing Hadoop classes

2016-05-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15455:


 Summary: For IsolatedClientLoader, we need to provide a conf to 
disable sharing Hadoop classes
 Key: SPARK-15455
 URL: https://issues.apache.org/jira/browse/SPARK-15455
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yin Huai
Assignee: Yin Huai


Right now, we always share Hadoop classes between the Spark side and the 
metastore client side (HiveClientImpl). However, the Hadoop used by the 
metastore client may be a different version of Hadoop, and in that case we 
cannot share Hadoop classes. Once we disable sharing Hadoop classes, we cannot 
pass a Hadoop Configuration to HiveClientImpl, because Configuration will be 
loaded by different classloaders.
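
(For context, a spark-shell style sketch of the existing knobs that point the 
metastore client at its own Hive/Hadoop jars; the version and paths are 
placeholders. The new conf this issue proposes is not named here, so it is not 
shown.)
{code}
import org.apache.spark.sql.SparkSession

// Placeholder values; this shows only the existing metastore-isolation configs.
val spark = SparkSession.builder()
  .appName("isolated-metastore-client")
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars",
    "/path/to/hive/lib/*:/path/to/hadoop/client/lib/*")
  .enableHiveSupport()
  .getOrCreate()
{code}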



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294242#comment-15294242
 ] 

Apache Spark commented on SPARK-8426:
-

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/13234

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15449) Wrong Data Format - Documentation Issue

2016-05-20 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294237#comment-15294237
 ] 

Miao Wang commented on SPARK-15449:
---

This example doesn't require LibSVM format, as it has its own data-parsing 
function.

Otherwise, it could use sqlContext.read.format("libsvm") to load the data.
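For example, in spark-shell (the path is the sample file shipped with Spark):
{code}
// Load LibSVM-formatted data through the DataFrame reader instead of the
// example's hand-rolled parser.
val training = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")
{code}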

> Wrong Data Format - Documentation Issue
> ---
>
> Key: SPARK-15449
> URL: https://issues.apache.org/jira/browse/SPARK-15449
> Project: Spark
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: 1.6.1
>Reporter: Kiran Biradarpatil
>Priority: Minor
> Fix For: 1.6.1
>
>
> The Java example given for MLlib NaiveBayes at 
> http://spark.apache.org/docs/latest/mllib-naive-bayes.html expects the data 
> in LibSVM format. But the example data in MLlib's 
> data/mllib/sample_naive_bayes_data.txt is not in the right format. 
> So please rectify either the sample data file or the implementation example.
> Thanks!
> Kiran 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294219#comment-15294219
 ] 

Apache Spark commented on SPARK-11827:
--

User 'ted-yu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13233

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Assignee: kevin yu
>Priority: Minor
> Fix For: 2.0.0
>
>
> I get the below exception when creating DataFrame using RDD of JavaBean 
> having a property of type java.math.BigInteger
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see the support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15405) YARN uploading the same __spark_conf__.zip twice

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15405:


Assignee: Apache Spark

> YARN uploading the same __spark_conf__.zip twice
> 
>
> Key: SPARK-15405
> URL: https://issues.apache.org/jira/browse/SPARK-15405
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> I was running 2.0 and noticed we are now uploading what appears to be the 
> same __spark_conf__.zip file twice.
> This was introduced when we changed how the cache files are handled:
> https://github.com/apache/spark/commit/f47dbf27fa034629fab12d0f3c89ab75edb03f86
> If they are truly the same we should be able to just use the same zip file:
> 16/05/19 14:31:22 INFO Client: Uploading resource 
> file:/tmp/spark-ad014dac-9682-4d83-af7a-53b16e5d6423/__spark_conf__717768860288979034.zip
>  -> 
> hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/tgraves/.sparkStaging/application_1463551738094_11599/__spark_conf__717768860288979034.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15405) YARN uploading the same __spark_conf__.zip twice

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15405:


Assignee: (was: Apache Spark)

> YARN uploading the same __spark_conf__.zip twice
> 
>
> Key: SPARK-15405
> URL: https://issues.apache.org/jira/browse/SPARK-15405
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>
> I was running 2.0 and noticed we are now uploading what appears to be the 
> same __spark_conf__.zip file twice.
> This was introduced when we changed how the cache files are handled:
> https://github.com/apache/spark/commit/f47dbf27fa034629fab12d0f3c89ab75edb03f86
> If they are truly the same we should be able to just use the same zip file:
> 16/05/19 14:31:22 INFO Client: Uploading resource 
> file:/tmp/spark-ad014dac-9682-4d83-af7a-53b16e5d6423/__spark_conf__717768860288979034.zip
>  -> 
> hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/tgraves/.sparkStaging/application_1463551738094_11599/__spark_conf__717768860288979034.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15405) YARN uploading the same __spark_conf__.zip twice

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294212#comment-15294212
 ] 

Apache Spark commented on SPARK-15405:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13232

> YARN uploading the same __spark_conf__.zip twice
> 
>
> Key: SPARK-15405
> URL: https://issues.apache.org/jira/browse/SPARK-15405
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>
> I was running 2.0 and noticed we are now uploading what appears to be the 
> same __spark_conf__.zip file twice.
> This was introduced when we changed how the cache files are handled:
> https://github.com/apache/spark/commit/f47dbf27fa034629fab12d0f3c89ab75edb03f86
> If they are truly the same we should be able to just use the same zip file:
> 16/05/19 14:31:22 INFO Client: Uploading resource 
> file:/tmp/spark-ad014dac-9682-4d83-af7a-53b16e5d6423/__spark_conf__717768860288979034.zip
>  -> 
> hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/tgraves/.sparkStaging/application_1463551738094_11599/__spark_conf__717768860288979034.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-05-20 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294181#comment-15294181
 ] 

Miles Crawford commented on SPARK-4563:
---

Any chance we could boost the priority of this? I think running driver 
processes locally against a shared cluster is not an uncommon use case.

Being able to configure the executors to connect to a specific driver IP is 
commonly necessary to deal with NAT or any situation where the source address 
observed by the executor cannot be used to connect directly to the driver.

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver's bind IP and advertised IP are not separately configurable. 
> spark.driver.host only sets the bind IP, and SPARK_PUBLIC_DNS does not work 
> for the Spark driver. Allow an option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB

2016-05-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294178#comment-15294178
 ] 

Davies Liu commented on SPARK-15285:


cc [~cloud_fan]

> Generated SpecificSafeProjection.apply method grows beyond 64 KB
> 
>
> Key: SPARK-15285
> URL: https://issues.apache.org/jira/browse/SPARK-15285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Konstantin Shaposhnikov
>Assignee: Wenchen Fan
>
> The following code snippet results in 
> {noformat}
>  org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> {noformat}
> {code}
> case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", 
> s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", 
> s10:String="10", s11:String="11", s12:String="12", s13:String="13", 
> s14:String="14", s15:String="15", s16:String="16", s17:String="17", 
> s18:String="18", s19:String="19", s20:String="20", s21:String="21", 
> s22:String="22", s23:String="23", s24:String="24", s25:String="25", 
> s26:String="26", s27:String="27", s28:String="28", s29:String="29", 
> s30:String="30", s31:String="31", s32:String="32", s33:String="33", 
> s34:String="34", s35:String="35", s36:String="36", s37:String="37", 
> s38:String="38", s39:String="39", s40:String="40", s41:String="41", 
> s42:String="42", s43:String="43", s44:String="44", s45:String="45", 
> s46:String="46", s47:String="47", s48:String="48", s49:String="49", 
> s50:String="50", s51:String="51", s52:String="52", s53:String="53", 
> s54:String="54", s55:String="55", s56:String="56", s57:String="57", 
> s58:String="58", s59:String="59", s60:String="60", s61:String="61", 
> s62:String="62", s63:String="63", s64:String="64", s65:String="65", 
> s66:String="66", s67:String="67", s68:String="68", s69:String="69", 
> s70:String="70", s71:String="71", s72:String="72", s73:String="73", 
> s74:String="74", s75:String="75", s76:String="76", s77:String="77", 
> s78:String="78", s79:String="79", s80:String="80", s81:String="81", 
> s82:String="82", s83:String="83", s84:String="84", s85:String="85", 
> s86:String="86", s87:String="87", s88:String="88", s89:String="89", 
> s90:String="90", s91:String="91", s92:String="92", s93:String="93", 
> s94:String="94", s95:String="95", s96:String="96", s97:String="97", 
> s98:String="98", s99:String="99", s100:String="100")
> case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: 
> S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: 
> S100=S100(), s9: S100=S100(), s10: S100=S100())
> val ds = Seq(S(),S(),S()).toDS
> ds.show()
> {code}
> I could reproduce this with Spark built from 1.6 branch and with 
> https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15285:
---
Assignee: Wenchen Fan

> Generated SpecificSafeProjection.apply method grows beyond 64 KB
> 
>
> Key: SPARK-15285
> URL: https://issues.apache.org/jira/browse/SPARK-15285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Konstantin Shaposhnikov
>Assignee: Wenchen Fan
>
> The following code snippet results in 
> {noformat}
>  org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> {noformat}
> {code}
> case class S100(s1:String="1", s2:String="2", s3:String="3", s4:String="4", 
> s5:String="5", s6:String="6", s7:String="7", s8:String="8", s9:String="9", 
> s10:String="10", s11:String="11", s12:String="12", s13:String="13", 
> s14:String="14", s15:String="15", s16:String="16", s17:String="17", 
> s18:String="18", s19:String="19", s20:String="20", s21:String="21", 
> s22:String="22", s23:String="23", s24:String="24", s25:String="25", 
> s26:String="26", s27:String="27", s28:String="28", s29:String="29", 
> s30:String="30", s31:String="31", s32:String="32", s33:String="33", 
> s34:String="34", s35:String="35", s36:String="36", s37:String="37", 
> s38:String="38", s39:String="39", s40:String="40", s41:String="41", 
> s42:String="42", s43:String="43", s44:String="44", s45:String="45", 
> s46:String="46", s47:String="47", s48:String="48", s49:String="49", 
> s50:String="50", s51:String="51", s52:String="52", s53:String="53", 
> s54:String="54", s55:String="55", s56:String="56", s57:String="57", 
> s58:String="58", s59:String="59", s60:String="60", s61:String="61", 
> s62:String="62", s63:String="63", s64:String="64", s65:String="65", 
> s66:String="66", s67:String="67", s68:String="68", s69:String="69", 
> s70:String="70", s71:String="71", s72:String="72", s73:String="73", 
> s74:String="74", s75:String="75", s76:String="76", s77:String="77", 
> s78:String="78", s79:String="79", s80:String="80", s81:String="81", 
> s82:String="82", s83:String="83", s84:String="84", s85:String="85", 
> s86:String="86", s87:String="87", s88:String="88", s89:String="89", 
> s90:String="90", s91:String="91", s92:String="92", s93:String="93", 
> s94:String="94", s95:String="95", s96:String="96", s97:String="97", 
> s98:String="98", s99:String="99", s100:String="100")
> case class S(s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: 
> S100=S100(), s5: S100=S100(), s6: S100=S100(), s7: S100=S100(), s8: 
> S100=S100(), s9: S100=S100(), s10: S100=S100())
> val ds = Seq(S(),S(),S()).toDS
> ds.show()
> {code}
> I could reproduce this with Spark built from 1.6 branch and with 
> https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15453) Sort Merge Join to use bucketing metadata to optimize query plan

2016-05-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15453:

Summary: Sort Merge Join to use bucketing metadata to optimize query plan  
(was: Improve join planning for bucketed / sorted tables)

> Sort Merge Join to use bucketing metadata to optimize query plan
> 
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}
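For context, a minimal sketch of how one of the bucketed and sorted tables in the 
plan above can be created through the datasource API (the data is made up, and a 
Hive-enabled SparkSession named `spark` is assumed since the scan is ORC); with the 
change proposed here, a join of two such tables on (j, k, i) should no longer need 
the Exchange and Sort:

{code}
// Sketch: write a table bucketed and sorted on the join keys.
spark.range(0, 1000000)
  .selectExpr("id AS j", "id % 100 AS k", "id % 10 AS i")
  .write
  .format("orc")
  .bucketBy(200, "j", "k", "i")
  .sortBy("j", "k", "i")
  .saveAsTable("table7")
{code}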



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14331) Exceptions saving to parquetFile after join from dataframes in master

2016-05-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294168#comment-15294168
 ] 

Davies Liu commented on SPARK-14331:


Could you post the full stacktrace? This exception should be caused by another 
one.

> Exceptions saving to parquetFile after join from dataframes in master
> -
>
> Key: SPARK-14331
> URL: https://issues.apache.org/jira/browse/SPARK-14331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I'm trying to use master and write a dataframe to a parquet file, but am seeing 
> the exception below. I'm not sure of the exact state of dataframes right now, so 
> let me know if this is a known issue.
> I read 2 sources of parquet files, joined them, then saved them back.
>  val df_pixels = sqlContext.read.parquet("data1")
> val df_pixels_renamed = df_pixels.withColumnRenamed("photo_id", 
> "pixels_photo_id")
> val df_meta = sqlContext.read.parquet("data2")
> val df = df_meta.as("meta").join(df_pixels_renamed, $"meta.photo_id" === 
> $"pixels_photo_id", "inner").drop("pixels_photo_id")
> df.write.parquet(args(0))
> 16/04/01 17:21:34 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange hashpartitioning(pixels_photo_id#3, 2), None
> +- WholeStageCodegen
>:  +- Filter isnotnull(pixels_photo_id#3)
>: +- INPUT
>+- Coalesce 0
>   +- WholeStageCodegen
>  :  +- Project [img_data#0,photo_id#1 AS pixels_photo_id#3]
>  : +- Scan HadoopFiles[img_data#0,photo_id#1] Format: 
> ParquetFormat, PushedFilters: [], ReadSchema: 
> struct
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:109)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)
> at 
> org.apache.spark.sql.execution.InputAdapter.upstreams(WholeStageCodegen.scala:236)
> at org.apache.spark.sql.execution.Sort.upstreams(Sort.scala:104)
> at 
> org.apache.spark.sql.execution.WholeStageCodegen.doExecute(WholeStageCodegen.scala:351)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)
> at 
> org.apache.spark.sql.execution.InputAdapter.doExecute(WholeStageCodegen.scala:228)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:137)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15453) Improve join planning for bucketed / sorted tables

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294165#comment-15294165
 ] 

Apache Spark commented on SPARK-15453:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/13231

> Improve join planning for bucketed / sorted tables
> --
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15453) Improve join planning for bucketed / sorted tables

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15453:


Assignee: Apache Spark

> Improve join planning for bucketed / sorted tables
> --
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15453) Improve join planning for bucketed / sorted tables

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15453:


Assignee: (was: Apache Spark)

> Improve join planning for bucketed / sorted tables
> --
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294144#comment-15294144
 ] 

Apache Spark commented on SPARK-15165:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/13230

> Codegen can break because toCommentSafeString is not actually safe
> --
>
> Key: SPARK-15165
> URL: https://issues.apache.org/jira/browse/SPARK-15165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking 
> codegen.
> But if an even number of "\" is put before "u" in a string literal in the 
> query, like "\\u", codegen can still break.
> The following code causes a compilation error.
> {code}
> val df = Seq(...).toDF
> df.select("'\\u002A/'").show
> {code}
> The compilation error occurs because "\\u002A/" is translated into "*/" (the 
> end of a comment).
> Due to this unsafety, arbitrary code can be injected, as follows.
> {code}
> val df = Seq(...).toDF
> // Inject "System.exit(1)"
> df.select("'\\u002A/{System.exit(1);}/*'").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15205) Codegen can compile the same source code more than twice

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294145#comment-15294145
 ] 

Apache Spark commented on SPARK-15205:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/13230

> Codegen can compile the same source code more than twice
> 
>
> Key: SPARK-15205
> URL: https://issues.apache.org/jira/browse/SPARK-15205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> Sometimes we generate code that is identical except for the comments.
> Here is one example.
> {code}
> val df = sc.parallelize(1 to 10).toDF
> df.selectExpr("value + 1").show // query1
> df.selectExpr("value + 2").show // query2
> {code}
> The following code is one of generated code when query1 above is executed.
> {code}
> /* 001 */ 
> /* 002 */ public java.lang.Object generate(Object[] references) {
> /* 003 */   return new SpecificSafeProjection(references);
> /* 004 */ }
> /* 005 */ 
> /* 006 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 007 */   
> /* 008 */   private Object[] references;
> /* 009 */   private MutableRow mutableRow;
> /* 010 */   private Object[] values;
> /* 011 */   private org.apache.spark.sql.types.StructType schema;
> /* 012 */   
> /* 013 */   
> /* 014 */   public SpecificSafeProjection(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 017 */ 
> /* 018 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 019 */   }
> /* 020 */   
> /* 021 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 022 */ InternalRow i = (InternalRow) _i;
> /* 023 */ /* createexternalrow(if (isnull(input[0, int])) null else 
> input[0, int], StructField((value + 1),IntegerType,false)) */
> /* 024 */ values = new Object[1];
> /* 025 */ /* if (isnull(input[0, int])) null else input[0, int] */
> /* 026 */ /* isnull(input[0, int]) */
> /* 027 */ /* input[0, int] */
> /* 028 */ int value3 = i.getInt(0);
> /* 029 */ boolean isNull1 = false;
> /* 030 */ int value1 = -1;
> /* 031 */ if (!false && false) {
> /* 032 */   /* null */
> /* 033 */   final int value4 = -1;
> /* 034 */   isNull1 = true;
> /* 035 */   value1 = value4;
> /* 036 */ } else {
> /* 037 */   /* input[0, int] */
> /* 038 */   int value5 = i.getInt(0);
> /* 039 */   isNull1 = false;
> /* 040 */   value1 = value5;
> /* 041 */ }
> /* 042 */ if (isNull1) {
> /* 043 */   values[0] = null;
> /* 044 */ } else {
> /* 045 */   values[0] = value1;
> /* 046 */ }
> /* 047 */ 
> /* 048 */ final org.apache.spark.sql.Row value = new 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, 
> this.schema);
> /* 049 */ if (false) {
> /* 050 */   mutableRow.setNullAt(0);
> /* 051 */ } else {
> /* 052 */   
> /* 053 */   mutableRow.update(0, value);
> /* 054 */ }
> /* 055 */ 
> /* 056 */ return mutableRow;
> /* 057 */   }
> /* 058 */ }
> /* 059 */ 
> {code}
> On the other hand, the following code is for query2.
> {code}
> /* 001 */ 
> /* 002 */ public java.lang.Object generate(Object[] references) {
> /* 003 */   return new SpecificSafeProjection(references);
> /* 004 */ }
> /* 005 */ 
> /* 006 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 007 */   
> /* 008 */   private Object[] references;
> /* 009 */   private MutableRow mutableRow;
> /* 010 */   private Object[] values;
> /* 011 */   private org.apache.spark.sql.types.StructType schema;
> /* 012 */   
> /* 013 */   
> /* 014 */   public SpecificSafeProjection(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 017 */ 
> /* 018 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 019 */   }
> /* 020 */   
> /* 021 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 022 */ InternalRow i = (InternalRow) _i;
> /* 023 */ /* createexternalrow(if (isnull(input[0, int])) null else 
> input[0, int], StructField((value + 2),IntegerType,false)) */
> /* 024 */ values = new Object[1];
> /* 025 */ /* if (isnull(input[0, int])) null else input[0, int] */
> /* 026 */ /* isnull(input[0, int]) */
> /* 027 */ /* input[0, int] */
> /* 028 */ int value3 = i.getInt(0);
> /* 029 */ boolean isNull1 = false;
> /* 030 */ int value1 = -1;
> /* 031 */ if (!false && false) {
> /* 032 */   /* null */

[jira] [Closed] (SPARK-15448) Flaky test:pyspark.ml.tests.DefaultValuesTests.test_java_params

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-15448.
--
   Resolution: Duplicate
Fix Version/s: 2.0.0

> Flaky test:pyspark.ml.tests.DefaultValuesTests.test_java_params
> ---
>
> Key: SPARK-15448
> URL: https://issues.apache.org/jira/browse/SPARK-15448
> Project: Spark
>  Issue Type: Test
>Affects Versions: 2.0.0
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> {code}
> ==
> FAIL [1.284s]: test_java_params (pyspark.ml.tests.DefaultValuesTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/ml/tests.py",
>  line 1161, in test_java_params
> self.check_params(cls())
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/ml/tests.py",
>  line 1136, in check_params
> % (p.name, str(py_stage)))
> AssertionError: True != False : Default value mismatch of param 
> linkPredictionCol for Params GeneralizedLinearRegression_4a78b84aab05b0ed2192
> --
> {code}
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3003/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14031:


Assignee: Apache Spark  (was: Davies Liu)

> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Assignee: Apache Spark
>Priority: Critical
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write the results of a 
> dataframe to csv, system performance enters a high-CPU state and the write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data 
> and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651
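For orientation, a minimal sketch of the kind of read-then-write job being timed 
above (paths are placeholders, and the built-in csv source is used for brevity 
instead of the spark-csv package mentioned in the report; the exact code is in the 
linked gists):

{code}
// Sketch of the timed job: load 2008.csv and write it back out.
val flights = spark.read
  .option("header", "true")
  .csv("2008.csv")

flights.write.parquet("/tmp/flights_parquet")   // the Parquet write: fast on all versions
flights.write.csv("/tmp/flights_csv")           // the csv write: the slow step on 2.0
{code}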



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14031:


Assignee: Davies Liu  (was: Apache Spark)

> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Assignee: Davies Liu
>Priority: Critical
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write the results of a 
> dataframe to csv, system performance enters a high-CPU state and the write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data 
> and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294127#comment-15294127
 ] 

Apache Spark commented on SPARK-14031:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/13229

> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Assignee: Davies Liu
>Priority: Critical
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write the results of a 
> dataframe to csv, system performance enters a high-CPU state and the write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data 
> and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15438) Improve the explain of whole-stage codegen

2016-05-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15438.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Improve the explain of whole-stage codegen
> --
>
> Key: SPARK-15438
> URL: https://issues.apache.org/jira/browse/SPARK-15438
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Currently, the explain of a query with whole-stage codegen looks like this
> {code}
> >>> df = sqlCtx.range(1000);df2 = 
> >>> sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 
> >>> 'id').explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#1L]
> : +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
> ::- Range 0, 1, 4, 1000, [id#1L]
> :+- INPUT
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>+- WholeStageCodegen
>   :  +- Range 0, 1, 4, 1000, [id#4L]
> {code}
> The problem is that the plan looks very different from the logical plan, making 
> it hard to understand (especially when the logical plan is not shown alongside 
> it).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294116#comment-15294116
 ] 

Nick Pentreath commented on SPARK-15447:


[~mengxr] yes, I will aim to run some tests early next week.

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?
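For reference, a rough sketch of the kind of synthetic-data run such a test could 
use (the rating-generation scheme and parameter values below are assumptions, not an 
agreed benchmark); numRatings would be scaled from 1 million up toward 1 billion:

{code}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.functions._

// Sketch: generate synthetic ratings and time ALS.fit.
val numUsers   = 100000L
val numItems   = 10000L
val numRatings = 10000000L   // scale this up for the larger runs

val ratings = spark.range(numRatings).select(
  (rand() * numUsers).cast("int").as("userId"),
  (rand() * numItems).cast("int").as("itemId"),
  (rand() * 5).cast("float").as("rating"))

val als = new ALS()
  .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
  .setRank(10).setMaxIter(10).setRegParam(0.1)

val start = System.nanoTime()
val model = als.fit(ratings)
println(s"ALS fit took ${(System.nanoTime() - start) / 1e9} s")
{code}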



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15294) Add pivot functionality to SparkR

2016-05-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294091#comment-15294091
 ] 

Felix Cheung commented on SPARK-15294:
--

Feel free to ping me if you need any help!






On Thu, May 19, 2016 at 4:42 AM -0700, "Mikołaj Hnatiuk (JIRA)" 
 wrote:






[ 
https://issues.apache.org/jira/browse/SPARK-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15290961#comment-15290961
 ]

Mikołaj Hnatiuk commented on SPARK-15294:
-

I'm on it. I'm having some trouble debugging the whole SparkR API -> backend.R -> 
Spark pipeline and I'm getting error messages that are hard for me to digest, 
but I guess there is no guide on how to do this, so I will just take my time 
doing this :)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


> Add pivot functionality to SparkR
> -
>
> Key: SPARK-15294
> URL: https://issues.apache.org/jira/browse/SPARK-15294
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Mikołaj Hnatiuk
>Priority: Minor
>  Labels: pivot
>
> R users are very used to transforming data using functions such as dcast 
> (pkg:reshape2). https://github.com/apache/spark/pull/7841 introduces such 
> functionality to the Scala and Python APIs. I'd like to suggest adding this 
> functionality to the SparkR API to pivot DataFrames.
> I'd love to do this; however, my knowledge of Scala is still limited, but 
> with proper guidance I can give it a try.
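For reference, the Scala side of the pivot API introduced by the PR above looks 
roughly like the sketch below (the sales data and column names are made up for 
illustration); the SparkR work would expose the same pivot step on a grouped 
SparkDataFrame:

{code}
import spark.implicits._   // assumes a SparkSession named `spark`

// Sketch of the existing Scala pivot API that SparkR would mirror.
val sales = Seq(
  ("2015", "US", 100.0),
  ("2015", "UK", 80.0),
  ("2016", "US", 120.0)
).toDF("year", "country", "amount")

val pivoted = sales
  .groupBy("year")
  .pivot("country", Seq("US", "UK"))   // roughly dcast(year ~ country) in reshape2
  .sum("amount")

pivoted.show()
{code}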



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15442:


Assignee: Apache Spark  (was: Nick Pentreath)

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15442:


Assignee: Nick Pentreath  (was: Apache Spark)

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-15442:
--

Assignee: Nick Pentreath

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14031) Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete

2016-05-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-14031:
--

Assignee: Davies Liu

> Dataframe to csv IO, system performance enters high CPU state and write 
> operation takes 1 hour to complete
> --
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 
> -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
>Reporter: Vincent Ohprecio
>Assignee: Davies Liu
>Priority: Critical
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write the results of a 
> dataframe to csv, system performance enters a high-CPU state and the write 
> operation takes 1 hour to complete. 
> * Affecting: [Stage 5:>  (0 + 2) / 21]
> * Stage 5 elapsed time 348827227ns
> In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data 
> and Stage 5 csv write times were between 2 - 22 seconds. 
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were 
> similar, between 2 - 22 seconds.
> Files 
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time 
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB 
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete 
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15453) Improve join planning for bucketed / sorted tables

2016-05-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294053#comment-15294053
 ] 

Reynold Xin commented on SPARK-15453:
-

[~tejasp] there are multiple issues here, right? The ticket is actually not 
about SMJ, but rather about avoiding exchanges if the inputs are already 
co-partitioned, and also avoiding sorts if the inputs are already sorted?


> Improve join planning for bucketed / sorted tables
> --
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15453) Improve join planning for bucketed / sorted tables

2016-05-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15453:

Summary: Improve join planning for bucketed / sorted tables  (was: Support 
for SMB Join)

> Improve join planning for bucketed / sorted tables
> --
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15454) HadoopFsRelation should filter out files starting with _

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15454:


Assignee: Reynold Xin  (was: Apache Spark)

> HadoopFsRelation should filter out files starting with _
> 
>
> Key: SPARK-15454
> URL: https://issues.apache.org/jira/browse/SPARK-15454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Many other systems (e.g. Impala) use _xxx as staging, and Spark should not 
> be reading those files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15454) HadoopFsRelation should filter out files starting with _

2016-05-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294047#comment-15294047
 ] 

Apache Spark commented on SPARK-15454:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13227

> HadoopFsRelation should filter out files starting with _
> 
>
> Key: SPARK-15454
> URL: https://issues.apache.org/jira/browse/SPARK-15454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Many other systems (e.g. Impala) use _xxx as staging, and Spark should not 
> be reading those files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15454) HadoopFsRelation should filter out files starting with _

2016-05-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15454:


Assignee: Apache Spark  (was: Reynold Xin)

> HadoopFsRelation should filter out files starting with _
> 
>
> Key: SPARK-15454
> URL: https://issues.apache.org/jira/browse/SPARK-15454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Many other systems (e.g. Impala) use _xxx as staging, and Spark should not 
> be reading those files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15451) Spark PR builder should fail if code doesn't compile against JDK 7

2016-05-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293958#comment-15293958
 ] 

Marcelo Vanzin edited comment on SPARK-15451 at 5/20/16 7:38 PM:
-

I'm not suggesting building twice. I'm suggesting building whatever needs to be 
compatible with jdk7 using jdk7 (or at least using jdk7's rt.jar).


was (Author: vanzin):
I'm not suggestion building twice. I'm suggesting building whatever needs to be 
compatible with jdk7 using jdk7 (or at least using jdk7's rt.jar).

> Spark PR builder should fail if code doesn't compile against JDK 7
> --
>
> Key: SPARK-15451
> URL: https://issues.apache.org/jira/browse/SPARK-15451
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> We need to compile certain parts of the build using jdk8, so that we test 
> things like lambdas. But when possible, we should either compile using jdk7, 
> or provide jdk7's rt.jar to javac. Otherwise it's way too easy to slip in 
> jdk8-specific library calls.
> I'll take a look at fixing the maven / sbt files, but I'm not sure how to 
> update the PR builders since this will most probably require at least a new 
> env variable (to say where jdk7 is).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15190) Support using SQLUserDefinedType for case classes

2016-05-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15190.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12965
[https://github.com/apache/spark/pull/12965]

> Support using SQLUserDefinedType for case classes
> -
>
> Key: SPARK-15190
> URL: https://issues.apache.org/jira/browse/SPARK-15190
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Right now inferring the schema for case classes happens before searching the 
> SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case 
> classes doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15454) HadoopFsRelation should filter out files starting with _

2016-05-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15454:
---

 Summary: HadoopFsRelation should filter out files starting with _
 Key: SPARK-15454
 URL: https://issues.apache.org/jira/browse/SPARK-15454
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Many other systems (e.g. Impala) use _xxx as staging, and Spark should not be 
reading those files.
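A minimal sketch of the proposed rule only (not the actual HadoopFsRelation change): 
treat files whose names start with an underscore, like Impala's staging output, as 
non-data files and skip them when listing a relation's input files.

{code}
import org.apache.hadoop.fs.Path

// Sketch of the filtering rule being proposed, not the real implementation:
// skip staging/metadata files whose names start with "_" (or ".").
def isDataFile(path: Path): Boolean = {
  val name = path.getName
  !name.startsWith("_") && !name.startsWith(".")
}
{code}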



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15453) Support for SMB Join

2016-05-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15453:

Description: 
Datasource allows creation of bucketed and sorted tables but performing joins 
on such tables still does not utilize this metadata to produce optimal query 
plan.

As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
be hashed + sorted on relevant columns.

{noformat}
== Physical Plan ==
WholeStageCodegen
:  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
: :- INPUT
: +- INPUT
:- WholeStageCodegen
:  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
:  : +- INPUT
:  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
: +- WholeStageCodegen
::  +- Project [j#20,k#21,i#22]
:: +- Filter (isnotnull(k#21) && isnotnull(j#20))
::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
+- WholeStageCodegen
   :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
   : +- INPUT
   +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
  +- WholeStageCodegen
 :  +- Project [j#23,k#24,i#25]
 : +- Filter (isnotnull(k#24) && isnotnull(j#23))
 :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
{noformat}

  was:
Datasource allows creation of bucketed and sorted tables but performing joins 
on such tables still does not utilize this metadata to produce optimal query 
plan.

As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
be hashed + sorted on relevant columns.

{quote}
== Physical Plan ==
WholeStageCodegen
:  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
: :- INPUT
: +- INPUT
:- WholeStageCodegen
:  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
:  : +- INPUT
:  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
: +- WholeStageCodegen
::  +- Project [j#20,k#21,i#22]
:: +- Filter (isnotnull(k#21) && isnotnull(j#20))
::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
+- WholeStageCodegen
   :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
   : +- INPUT
   +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
  +- WholeStageCodegen
 :  +- Project [j#23,k#24,i#25]
 : +- Filter (isnotnull(k#24) && isnotnull(j#23))
 :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
{quote}


> Support for SMB Join
> 
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Updated] (SPARK-15453) Support for SMB Join

2016-05-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15453:

Description: 
Datasource allows creation of bucketed and sorted tables but performing joins 
on such tables still does not utilize this metadata to produce optimal query 
plan.

As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
be hashed + sorted on relevant columns.

```
== Physical Plan ==
WholeStageCodegen
:  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
: :- INPUT
: +- INPUT
:- WholeStageCodegen
:  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
:  : +- INPUT
:  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
: +- WholeStageCodegen
::  +- Project [j#20,k#21,i#22]
:: +- Filter (isnotnull(k#21) && isnotnull(j#20))
::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
+- WholeStageCodegen
   :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
   : +- INPUT
   +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
  +- WholeStageCodegen
 :  +- Project [j#23,k#24,i#25]
 : +- Filter (isnotnull(k#24) && isnotnull(j#23))
 :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
```
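
A minimal usage sketch of the join itself, assuming the bucketed tables above and a spark-shell session where `spark` is the SparkSession: with the current planner the printed plan still contains the Exchange and Sort operators shown in this report.

{noformat}
// Join the two bucketed, sorted tables on their bucketing/sort keys and
// print the physical plan. Eliminating the Exchange + Sort below the
// SortMergeJoin for this case is what this issue asks for.
val joined = spark.table("table7")
  .join(spark.table("table8"), Seq("j", "k", "i"))

joined.explain()
{noformat}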

> Support for SMB Join
> 
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables, but joins on such 
> tables still do not use this metadata to produce an optimal query plan.
> As shown below, the `Exchange` and `Sort` can be avoided if the tables are 
> known to be hash-partitioned and sorted on the relevant columns.
> ```
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15453) Support for SMB Join

2016-05-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15453:

Description: 
Datasource allows creation of bucketed and sorted tables, but joins on such 
tables still do not use this metadata to produce an optimal query plan.

As shown below, the `Exchange` and `Sort` can be avoided if the tables are 
known to be hash-partitioned and sorted on the relevant columns.

{quote}
== Physical Plan ==
WholeStageCodegen
:  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
: :- INPUT
: +- INPUT
:- WholeStageCodegen
:  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
:  : +- INPUT
:  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
: +- WholeStageCodegen
::  +- Project [j#20,k#21,i#22]
:: +- Filter (isnotnull(k#21) && isnotnull(j#20))
::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
+- WholeStageCodegen
   :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
   : +- INPUT
   +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
  +- WholeStageCodegen
 :  +- Project [j#23,k#24,i#25]
 : +- Filter (isnotnull(k#24) && isnotnull(j#23))
 :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
{quote}

  was:
Datasource allows creation of bucketed and sorted tables, but joins on such 
tables still do not use this metadata to produce an optimal query plan.

As shown below, the `Exchange` and `Sort` can be avoided if the tables are 
known to be hash-partitioned and sorted on the relevant columns.

```
== Physical Plan ==
WholeStageCodegen
:  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
: :- INPUT
: +- INPUT
:- WholeStageCodegen
:  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
:  : +- INPUT
:  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
: +- WholeStageCodegen
::  +- Project [j#20,k#21,i#22]
:: +- Filter (isnotnull(k#21) && isnotnull(j#20))
::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
+- WholeStageCodegen
   :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
   : +- INPUT
   +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
  +- WholeStageCodegen
 :  +- Project [j#23,k#24,i#25]
 : +- Filter (isnotnull(k#24) && isnotnull(j#23))
 :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct
```
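
One more hedged sketch: whether the planner sees the bucketing metadata can be checked from the catalog, and the metadata can be ignored for comparison via the `spark.sql.sources.bucketing.enabled` flag (assumed here to be the Spark 2.0 switch for datasource bucketing). The table names again follow the plans above.

{noformat}
// Inspect the table metadata; the bucket columns, sort columns and bucket
// count written earlier should be listed in the output.
spark.sql("DESCRIBE EXTENDED table7").show(100, false)

// Plan with bucketing metadata available to the planner.
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")
spark.table("table7").join(spark.table("table8"), Seq("j", "k", "i")).explain()

// Plan with bucketing metadata ignored, for comparison.
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")
spark.table("table7").join(spark.table("table8"), Seq("j", "k", "i")).explain()
{noformat}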


> Support for SMB Join
> 
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables, but joins on such 
> tables still do not use this metadata to produce an optimal query plan.
> As shown below, the `Exchange` and `Sort` can be avoided if the tables are 
> known to be hash-partitioned and sorted on the relevant columns.
> {quote}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15453) Support for SMB Join

2016-05-20 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-15453:
---

 Summary: Support for SMB Join
 Key: SPARK-15453
 URL: https://issues.apache.org/jira/browse/SPARK-15453
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Tejas Patil
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


