[jira] [Comment Edited] (SPARK-10525) Add Python example for VectorSlicer to user guide

2016-04-30 Thread Amit Shinde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263489#comment-15263489
 ] 

Amit Shinde edited comment on SPARK-10525 at 5/1/16 3:46 AM:
-

Hi,

I was looking at this JIRA and found a similar one that was logged and fixed in 
[SPARK-14514|https://issues.apache.org/jira/browse/SPARK-14514].

The pull request is here: https://github.com/apache/spark/pull/12282

Does that resolve this JIRA as well?

[~josephkb]
--
Amit


was (Author: ashinde1):
Hi :

I was looking at this JIRA and found a similar JIRA logged and fixed here 
[SPARK-14514|https://issues.apache.org/jira/browse/SPARK-14514] .

The pull request is here : https://github.com/apache/spark/pull/12282

Does this resolve this JIRA as well ?

--
Amit

> Add Python example for VectorSlicer to user guide
> -
>
> Key: SPARK-10525
> URL: https://issues.apache.org/jira/browse/SPARK-10525
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>







[jira] [Assigned] (SPARK-13425) Documentation for CSV datasource options

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13425:


Assignee: (was: Apache Spark)

> Documentation for CSV datasource options
> 
>
> Key: SPARK-13425
> URL: https://issues.apache.org/jira/browse/SPARK-13425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As mentioned in https://github.com/apache/spark/pull/11262#discussion_r53508815, 
> the CSV data source was added for Spark 2.0.0, so its options should be added to 
> the documentation.
> The options can be found 
> [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf]
> in the Parsing Options section.






[jira] [Commented] (SPARK-13425) Documentation for CSV datasource options

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265594#comment-15265594
 ] 

Apache Spark commented on SPARK-13425:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12817

> Documentation for CSV datasource options
> 
>
> Key: SPARK-13425
> URL: https://issues.apache.org/jira/browse/SPARK-13425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As mentioned in https://github.com/apache/spark/pull/11262#discussion_r53508815, 
> the CSV data source was added for Spark 2.0.0, so its options should be added to 
> the documentation.
> The options can be found 
> [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf]
> in the Parsing Options section.






[jira] [Assigned] (SPARK-13425) Documentation for CSV datasource options

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13425:


Assignee: Apache Spark

> Documentation for CSV datasource options
> 
>
> Key: SPARK-13425
> URL: https://issues.apache.org/jira/browse/SPARK-13425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> As mentioned in https://github.com/apache/spark/pull/11262#discussion_r53508815, 
> the CSV data source was added for Spark 2.0.0, so its options should be added to 
> the documentation.
> The options can be found 
> [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf]
> in the Parsing Options section.






[jira] [Resolved] (SPARK-15033) fix a flaky test in CachedTableSuite

2016-04-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15033.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> fix a flaky test in CachedTableSuite
> 
>
> Key: SPARK-15033
> URL: https://issues.apache.org/jira/browse/SPARK-15033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-14927) DataFrame.saveAsTable creates RDD partitions but not Hive partitions

2016-04-30 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265591#comment-15265591
 ] 

Xin Wu commented on SPARK-14927:


Since Spark 2.0.0 has moved a lot of things around, including splitting 
HiveMetaStoreCatalog into two files for resolving and creating tables 
respectively, I tried this on Spark 2.0.0:

{code}
scala> spark.sql("create database if not exists tmp")
16/04/30 19:59:12 WARN ObjectStore: Failed to get database tmp, returning 
NoSuchObjectException
res23: org.apache.spark.sql.DataFrame = []

scala> 
df.write.partitionBy("year").mode(SaveMode.Append).saveAsTable("tmp.tmp1")
16/04/30 19:59:50 WARN CreateDataSourceTableUtils: Persisting partitioned data 
source relation `tmp`.`tmp1` into Hive metastore in Spark SQL specific format, 
which is NOT compatible with Hive. Input path(s): 
file:/home/xwu0226/spark/spark-warehouse/tmp.db/tmp1

scala> spark.sql("select * from tmp.tmp1").show
+---++
|val|year|
+---++
|  a|2012|
+---++
{code}

For a data source table created as above, SparkSQL creates the table as a 
Hive-managed table that is not compatible with Hive. SparkSQL puts the partition 
column information (along with other things such as the column schema and 
bucket/sort columns) into serdeInfo.parameters. When querying the table, SparkSQL 
resolves the table and parses that information back out of serdeInfo.parameters.

Spark 2.0.0 no longer passes this command to Hive (most DDL commands now run 
natively in SparkSQL), so "SHOW PARTITIONS ..." currently does not support 
data source tables:

{code}
scala> spark.sql("show partitions tmp.tmp1").show
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a 
datasource table: tmp.tmp1;
  at 
org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(commands.scala:196)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:132)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:129)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:112)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
  at org.apache.spark.sql.Dataset.(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:529)
  ... 48 elided
{code}

Hope this helps. 
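A minimal sketch of a Hive-compatible alternative (an editor's illustration, not 
from this thread; the table name {{tmp.tmp2}} is hypothetical): create the 
partitioned table through Hive DDL first and append with {{insertInto}}, so that 
partitions are registered in the Hive metastore and SHOW PARTITIONS works.

{code}
// assumes a Hive-enabled SparkSession and the df used above (columns: year, val)
spark.sql("CREATE TABLE tmp.tmp2 (val STRING) PARTITIONED BY (year INT) STORED AS PARQUET")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") // allow dynamic partition appends
df.select("val", "year")                                        // partition column must come last
  .write.mode(SaveMode.Append).insertInto("tmp.tmp2")
spark.sql("SHOW PARTITIONS tmp.tmp2").show()
{code}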

> DataFrame.saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a follow-up to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive.
> I tried the suggestions in the answers but couldn't make it work in Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> 

[jira] [Commented] (SPARK-14422) Improve handling of optional configs in SQLConf

2016-04-30 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265589#comment-15265589
 ] 

Marcelo Vanzin commented on SPARK-14422:


Hi [~techaddict], you're free to take any bug that is not assigned to anyone.

> Improve handling of optional configs in SQLConf
> ---
>
> Key: SPARK-14422
> URL: https://issues.apache.org/jira/browse/SPARK-14422
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> As Michael showed here: 
> https://github.com/apache/spark/pull/12119/files/69aa1a005cc7003ab62d6dfcdef42181b053eaed#r58634150
> Handling of optional configs in SQLConf is a little sub-optimal right now. We 
> should clean that up.






[jira] [Commented] (SPARK-14422) Improve handling of optional configs in SQLConf

2016-04-30 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265573#comment-15265573
 ] 

Sandeep Singh commented on SPARK-14422:
---

Hi Marcelo,
Do you mind if I take this up?

> Improve handling of optional configs in SQLConf
> ---
>
> Key: SPARK-14422
> URL: https://issues.apache.org/jira/browse/SPARK-14422
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> As Michael showed here: 
> https://github.com/apache/spark/pull/12119/files/69aa1a005cc7003ab62d6dfcdef42181b053eaed#r58634150
> Handling of optional configs in SQLConf is a little sub-optimal right now. We 
> should clean that up.






[jira] [Commented] (SPARK-13425) Documentation for CSV datasource options

2016-04-30 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265572#comment-15265572
 ] 

Hyukjin Kwon commented on SPARK-13425:
--

[~rxin] I will. Thanks! (I believe R one is not yet, 
https://issues.apache.org/jira/browse/SPARK-13174)

> Documentation for CSV datasource options
> 
>
> Key: SPARK-13425
> URL: https://issues.apache.org/jira/browse/SPARK-13425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As mentioned in https://github.com/apache/spark/pull/11262#discussion_r53508815, 
> the CSV data source was added for Spark 2.0.0, so its options should be added to 
> the documentation.
> The options can be found 
> [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf]
> in the Parsing Options section.






[jira] [Commented] (SPARK-14684) Verification of partition specs in SessionCatalog

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265570#comment-15265570
 ] 

Apache Spark commented on SPARK-14684:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12801

> Verification of partition specs in SessionCatalog
> -
>
> Key: SPARK-14684
> URL: https://issues.apache.org/jira/browse/SPARK-14684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When users input an invalid partition spec, we might not be able to catch it 
> and issue an error message. Sometimes this can have disastrous results. For 
> example, previously, when altering a table and dropping a partition with an 
> invalid spec, we could drop all the partitions due to a bug/defect in the Hive 
> Metastore API.






[jira] [Commented] (SPARK-13425) Documentation for CSV datasource options

2016-04-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265564#comment-15265564
 ] 

Reynold Xin commented on SPARK-13425:
-

[~hyukjin.kwon] want to submit a pr now for this documentation? Remember we 
have scala, python, and maybe R (not sure if CSV data source exists in R yet).
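For reference, a minimal sketch (an editor's illustration, not code from this 
thread) of the kind of usage the documentation would cover, assuming the Spark 2.0 
reader API and a few of the parsing options:

{code}
// read a CSV file with a header row, inferring column types and treating "NA" as null
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .csv("people.csv")
{code}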


> Documentation for CSV datasource options
> 
>
> Key: SPARK-13425
> URL: https://issues.apache.org/jira/browse/SPARK-13425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As mentioned in https://github.com/apache/spark/pull/11262#discussion_r53508815, 
> the CSV data source was added for Spark 2.0.0, so its options should be added to 
> the documentation.
> The options can be found 
> [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf]
> in the Parsing Options section.






[jira] [Resolved] (SPARK-14143) Options for parsing NaNs, Infinity and nulls for numeric types

2016-04-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14143.
-
   Resolution: Fixed
 Assignee: Hossein Falaki
Fix Version/s: 2.0.0

> Options for parsing NaNs, Infinity and nulls for numeric types
> --
>
> Key: SPARK-14143
> URL: https://issues.apache.org/jira/browse/SPARK-14143
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-15036) When creating a database, we need to qualify its path

2016-04-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15036.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> When creating a database, we need to qualify its path
> -
>
> Key: SPARK-15036
> URL: https://issues.apache.org/jira/browse/SPARK-15036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir

2016-04-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15034.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Use the value of spark.sql.warehouse.dir as the warehouse location instead of 
> using hive.metastore.warehouse.dir
> 
>
> Key: SPARK-15034
> URL: https://issues.apache.org/jira/browse/SPARK-15034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf used to set 
> the warehouse location. We will no longer use hive.metastore.warehouse.dir.
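A minimal sketch (editor's illustration; the path is hypothetical) of setting the 
new conf when building a session:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse example")
  .config("spark.sql.warehouse.dir", "/data/spark-warehouse") // replaces hive.metastore.warehouse.dir
  .enableHiveSupport()
  .getOrCreate()
{code}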






[jira] [Resolved] (SPARK-15035) SessionCatalog needs to set the location for default DB

2016-04-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15035.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> SessionCatalog needs to set the location for default DB
> ---
>
> Key: SPARK-15035
> URL: https://issues.apache.org/jira/browse/SPARK-15035
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> Right now, in SessionCatalog, the default location of the database is an 
> empty string. This breaks the create table command when we use SparkSession 
> without Hive support.






[jira] [Updated] (SPARK-14931) Mismatched default Param values between pipelines in Spark and PySpark

2016-04-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14931:
--
Summary: Mismatched default Param values between pipelines in Spark and 
PySpark  (was: Mismatched default values between pipelines in Spark and PySpark)

> Mismatched default Param values between pipelines in Spark and PySpark
> --
>
> Key: SPARK-14931
> URL: https://issues.apache.org/jira/browse/SPARK-14931
> Project: Spark
>  Issue Type: Bug
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: ML, PySpark
>
> Mismatched default values between pipelines in Spark and PySpark lead to 
> different pipelines in PySpark after saving and loading.
> Find generic ways to check JavaParams then fix them.






[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0

2016-04-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13448:
--
Description: 
This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: The intercept will not be regularized if users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
because it calls the ML LogisticRegression implementation. Meanwhile, if users 
train without regularization, training with or without feature scaling will 
return the same solution at the same convergence rate (because they run the same 
code path); this behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
spark.mllib
* SPARK-14768: Remove expectedType arg for PySpark Param
* SPARK-14931: Mismatched default Param values between pipelines in Spark and 
PySpark

  was:
This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: Intercept will not be regularized if users train binary 
classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, because 
it calls ML LogisticRegresson implementation. Meanwhile if users set without 
regularization, training with or without feature scaling will return the same 
solution by the same convergence rate(because they run the same code route), 
this behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
spark.mllib
* SPARK-14768: Remove expectedType arg for PySpark Param


> Document MLlib behavior changes in Spark 2.0
> 
>
> Key: SPARK-13448
> URL: https://issues.apache.org/jira/browse/SPARK-13448
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
> remember to add them to the migration guide / release notes.
> * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
> to 1e-6.
> * SPARK-7780: The intercept will not be regularized if users train a binary 
> classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
> because it calls the ML LogisticRegression implementation. Meanwhile, if users 
> train without regularization, training with or without feature scaling will 
> return the same solution at the same convergence rate (because they run the same 
> code path); this behavior is different from the old API.
> * SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
> results
> * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
> default, if checkpointing is being used.
> * SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
> not handle them correctly.
> * SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
> spark.mllib
> * SPARK-14768: Remove expectedType arg for PySpark Param
> * SPARK-14931: Mismatched default Param values between pipelines in Spark and 
> PySpark






[jira] [Updated] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-04-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-15043:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0
 Priority: Blocker  (was: Major)
  Component/s: MLlib
  Summary: Fix and re-enable flaky test: 
mllib.stat.JavaStatisticsSuite.testCorr  (was: Flaky test: 
mllib.stat.JavaStatisticsSuite.testCorr)

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.






[jira] [Created] (SPARK-15043) Flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-04-30 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-15043:
--

 Summary: Flaky test: mllib.stat.JavaStatisticsSuite.testCorr
 Key: SPARK-15043
 URL: https://issues.apache.org/jira/browse/SPARK-15043
 Project: Spark
  Issue Type: Bug
Reporter: Josh Rosen


It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
flaky:

https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr

The first observed failure was in 
https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816

{code}
java.lang.AssertionError: expected:<0.9986422261219262> but 
was:<0.9986422261219272>
at 
org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
{code}

I'm going to ignore this test now, but we need to come back and fix it.
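One possible direction (an editor's sketch, not the suite's current code) is to 
compare the correlation up to a small tolerance rather than for exact equality, 
which avoids failures from floating-point non-determinism like the ~1e-15 
difference above:

{code}
// tolerance-based comparison instead of exact equality
def assertAlmostEqual(expected: Double, actual: Double, eps: Double = 1e-7): Unit =
  assert(math.abs(expected - actual) < eps, s"$actual was not within $eps of $expected")

assertAlmostEqual(0.9986422261219262, 0.9986422261219272) // the two values from the failure above
{code}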






[jira] [Updated] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark

2016-04-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14931:
--
Assignee: Xusen Yin

> Mismatched default values between pipelines in Spark and PySpark
> 
>
> Key: SPARK-14931
> URL: https://issues.apache.org/jira/browse/SPARK-14931
> Project: Spark
>  Issue Type: Bug
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: ML, PySpark
>
> Mismatched default values between pipelines in Spark and PySpark lead to 
> different pipelines in PySpark after saving and loading.
> Find generic ways to check JavaParams then fix them.






[jira] [Updated] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark

2016-04-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14931:
--
Shepherd: Joseph K. Bradley

> Mismatched default values between pipelines in Spark and PySpark
> 
>
> Key: SPARK-14931
> URL: https://issues.apache.org/jira/browse/SPARK-14931
> Project: Spark
>  Issue Type: Bug
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: ML, PySpark
>
> Mismatched default values between pipelines in Spark and PySpark lead to 
> different pipelines in PySpark after saving and loading.
> Find generic ways to check JavaParams then fix them.






[jira] [Commented] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265518#comment-15265518
 ] 

Apache Spark commented on SPARK-14931:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12816

> Mismatched default values between pipelines in Spark and PySpark
> 
>
> Key: SPARK-14931
> URL: https://issues.apache.org/jira/browse/SPARK-14931
> Project: Spark
>  Issue Type: Bug
>Reporter: Xusen Yin
>  Labels: ML, PySpark
>
> Mismatched default values between pipelines in Spark and PySpark lead to 
> different pipelines in PySpark after saving and loading.
> Find generic ways to check JavaParams then fix them.






[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python/Java example.

2016-04-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15031:
--
Description: 
This PR aims to update the Scala/Python/Java examples by replacing `SQLContext` 
with the newly added `SparkSession`. For this, two new `SparkSession` ctors are 
added, and the following examples are also fixed.

**sql.py**
{code}
-people = sqlContext.jsonFile(path)
+people = sqlContext.read.json(path)
-people.registerAsTable("people")
+people.registerTempTable("people")
{code}

**dataframe_example.py**
{code}
- features = df.select("features").map(lambda r: r.features)
+ features = df.select("features").rdd.map(lambda r: r.features)
{code}

Note that the following examples are untouched in this PR since they fail with an 
unknown issue.

- `simple_params_example.py`
- `aft_survival_regression.py`

  was:
This PR aims to update Scala/Python/Java examples by replacing SQLContext with 
newly added SparkSession. For this, two new `SparkSesion` ctor are added, and 
also fixes the following examples.

**sql.py**
{code}
-people = sqlContext.jsonFile(path)
+people = sqlContext.read.json(path)
-people.registerAsTable("people")
+people.registerTempTable("people")
{code}

**dataframe_example.py**
{code}
- features = df.select("features").map(lambda r: r.features)
+ features = df.select("features").rdd.map(lambda r: r.features)
{code}

Note that the following examples are untouched in this PR since it fails some 
unknown issue.

- `simple_params_example.py`
- `aft_survival_regression.py`


> Use SparkSession in Scala/Python/Java example.
> --
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Dongjoon Hyun
>
> This PR aims to update the Scala/Python/Java examples by replacing `SQLContext` 
> with the newly added `SparkSession`. For this, two new `SparkSession` ctors are 
> added, and the following examples are also fixed.
> **sql.py**
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}
> **dataframe_example.py**
> {code}
> - features = df.select("features").map(lambda r: r.features)
> + features = df.select("features").rdd.map(lambda r: r.features)
> {code}
> Note that the following examples are untouched in this PR since they fail with an 
> unknown issue.
> - `simple_params_example.py`
> - `aft_survival_regression.py`
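A minimal Scala sketch (an editor's illustration of the pattern the examples move 
to, not code from the PR; the JSON path is the standard example dataset):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()
val people = spark.read.json("examples/src/main/resources/people.json") // replaces sqlContext.jsonFile(path)
people.registerTempTable("people")                                      // replaces the removed registerAsTable
spark.sql("SELECT * FROM people").show()
{code}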






[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python/Java example.

2016-04-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15031:
--
   Priority: Major  (was: Trivial)
Description: 
This PR aims to update the Scala/Python/Java examples by replacing SQLContext with 
the newly added SparkSession. For this, two new `SparkSession` ctors are added, and 
the following examples are also fixed.

**sql.py**
{code}
-people = sqlContext.jsonFile(path)
+people = sqlContext.read.json(path)
-people.registerAsTable("people")
+people.registerTempTable("people")
{code}

**dataframe_example.py**
{code}
- features = df.select("features").map(lambda r: r.features)
+ features = df.select("features").rdd.map(lambda r: r.features)
{code}

Note that the following examples are untouched in this PR since they fail with an 
unknown issue.

- `simple_params_example.py`
- `aft_survival_regression.py`

  was:
This PR aims to update Scala/Python examples by replacing SQLContext with newly 
added SparkSession. Also, this fixes the following examples.

**sql.py**
{code}
-people = sqlContext.jsonFile(path)
+people = sqlContext.read.json(path)
-people.registerAsTable("people")
+people.registerTempTable("people")
{code}

**dataframe_example.py**
{code}
- features = df.select("features").map(lambda r: r.features)
+ features = df.select("features").rdd.map(lambda r: r.features)
{code}

Note that the following examples are untouched in this PR since it fails some 
unknown issue.

- `simple_params_example.py`
- `aft_survival_regression.py`

Summary: Use SparkSession in Scala/Python/Java example.  (was: Use 
SparkSession in Scala/Python example.)

> Use SparkSession in Scala/Python/Java example.
> --
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Dongjoon Hyun
>
> This PR aims to update the Scala/Python/Java examples by replacing SQLContext 
> with the newly added SparkSession. For this, two new `SparkSession` ctors are 
> added, and the following examples are also fixed.
> **sql.py**
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}
> **dataframe_example.py**
> {code}
> - features = df.select("features").map(lambda r: r.features)
> + features = df.select("features").rdd.map(lambda r: r.features)
> {code}
> Note that the following examples are untouched in this PR since they fail with an 
> unknown issue.
> - `simple_params_example.py`
> - `aft_survival_regression.py`






[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in test suites

2016-04-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265472#comment-15265472
 ] 

Dongjoon Hyun commented on SPARK-15037:
---

I'll add the following constructor to `SparkSession` and proceed with SPARK-15031 
first.
{code}
  def this(sparkContext: JavaSparkContext) = this(sparkContext.sc)
{code}

> Use SparkSession instead of SQLContext in test suites
> -
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Test
>Reporter: Dongjoon Hyun
>
> This issue aims to update the existing test suites to use `SparkSession` 
> instead of `SQLContext`, since `SQLContext` exists just for backward 
> compatibility.






[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in test suites

2016-04-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265470#comment-15265470
 ] 

Dongjoon Hyun commented on SPARK-15037:
---

It's because `JavaSparkContext` cannot be converted to SparkContext in the 
following code.
{code}
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
SparkSession spark = new SparkSession(ctx)
{code}

> Use SparkSession instead of SQLContext in test suites
> -
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Test
>Reporter: Dongjoon Hyun
>
> This issue aims to update the existing test suites to use `SparkSession` 
> instead of `SQLContext`, since `SQLContext` exists just for backward 
> compatibility.






[jira] [Updated] (SPARK-15042) ConnectedComponents fails to compute graph with 200 vertices (but long paths)

2016-04-30 Thread Philipp Claßen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Claßen updated SPARK-15042:
---
Description: 
ConnectedComponents takes forever and eventually fails with OutOfMemory when 
computing this graph: {code}{ (i, i+1) | i <- { 1..200 } }{code}

If you generate the example graph, e.g., with this bash command

{code}
for i in {1..200} ; do echo "$i $(($i+1))" ; done > input.graph
{code}

... then you should be able to reproduce it in the spark-shell by running:

{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
val graph = GraphLoader.edgeListFile(sc, "input.graph").cache()

ConnectedComponents.run(graph)
{code}

It seems to take forever, and spawns these warnings from time to time:

{code}
16/04/30 20:06:24 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@7af98fbd,BlockManagerId(driver, localhost, 
43440))] in 1 attempts
{code}

For additional information, here is a link to my related question on 
Stackoverflow:
http://stackoverflow.com/q/36892272/783510

One comment so far was that the number of skipped tasks grows exponentially.

---

Here is the complete output of a spark-shell session:

{noformat}
phil@terra-arch:~/tmp/spark-graph$ spark-shell 
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Spark context available as sc.
SQL context available as sqlContext.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
 
Using Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._

scala> import org.apache.spark.graphx.lib._
import org.apache.spark.graphx.lib._

scala> 

scala> val graph = GraphLoader.edgeListFile(sc, "input.graph").cache()
graph: org.apache.spark.graphx.Graph[Int,Int] = 
org.apache.spark.graphx.impl.GraphImpl@1fa9692b

scala> ConnectedComponents.run(graph)
16/04/30 20:05:29 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@50432fd2,BlockManagerId(driver, localhost, 
43440))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 

[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in test suites

2016-04-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265417#comment-15265417
 ] 

Reynold Xin commented on SPARK-15037:
-

Why do we need JavaSparkSession? SparkSession itself should be Java friendly.
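A minimal sketch (editor's illustration, assuming the builder API rather than a new 
constructor) of obtaining a SparkSession in test code; the underlying SparkContext 
stays reachable for suites that need a JavaSparkContext:

{code}
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("test")
  .getOrCreate()
val jsc = new JavaSparkContext(spark.sparkContext) // wrap the existing SparkContext for Java-facing tests
{code}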


> Use SparkSession instead of SQLContext in test suites
> -
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Test
>Reporter: Dongjoon Hyun
>
> This issue aims to update the existing test suites to use `SparkSession` 
> instead of `SQLContext`, since `SQLContext` exists just for backward 
> compatibility.






[jira] [Commented] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features

2016-04-30 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265411#comment-15265411
 ] 

Gayathri Murali commented on SPARK-15041:
-

I can work on this

> adding mode strategy for ml.feature.Imputer for categorical features
> 
>
> Key: SPARK-15041
> URL: https://issues.apache.org/jira/browse/SPARK-15041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Adding a mode strategy to ml.feature.Imputer for categorical features. This 
> needs to wait until the PR for SPARK-13568 is merged.
> https://github.com/apache/spark/pull/11601
> From the comments of jkbradley and Nick Pentreath in the PR:
> {quote}
> Investigate efficiency of approaches using DataFrame/Dataset and/or approx 
> approaches such as frequentItems or Count-Min Sketch (will require an update 
> to CMS to return "heavy-hitters").
> investigate if we can use metadata to only allow mode for categorical 
> features (or perhaps as an easier alternative, allow mode for only Int/Long 
> columns)
> {quote}






[jira] [Created] (SPARK-15042) ConnectedComponents fails to compute graph with 200 vertices (but long paths)

2016-04-30 Thread Philipp Claßen (JIRA)
Philipp Claßen created SPARK-15042:
--

 Summary: ConnectedComponents fails to compute graph with 200 
vertices (but long paths)
 Key: SPARK-15042
 URL: https://issues.apache.org/jira/browse/SPARK-15042
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.6.1
 Environment: Local cluster (1 instance) running on Arch Linux
Scala 2.11.7, Java 1.8.0_92
Reporter: Philipp Claßen


ConnectedComponents takes forever and eventually fails with OutOfMemory when 
computing this graph: {code}{ (i, i+1) | i <- { 1..200 } }{code}

If you generate the example graph, e.g., with this bash command

{code}
for i in {1..200} ; do echo "$i $(($i+1))" ; done > input.graph
{code}

... then you should be able to reproduce it in the spark-shell by running:

{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
val graph = GraphLoader.edgeListFile(sc, "input.graph").cache()

ConnectedComponents.run(graph)
{code}

For additional information, here is a link to my related question on 
Stackoverflow:
http://stackoverflow.com/q/36892272/783510

One comment so far was that the number of skipped tasks grows exponentially.







[jira] [Commented] (SPARK-14993) Inconsistent behavior of partitioning discovery

2016-04-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265397#comment-15265397
 ] 

Xiao Li commented on SPARK-14993:
-

Ok, if nobody starts it, I will work on this. Thanks!

> Inconsistent behavior of partitioning discovery
> ---
>
> Key: SPARK-14993
> URL: https://issues.apache.org/jira/browse/SPARK-14993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we load a dataset, if we set the path to {{/path/a=1}}, we will not take 
> a as the partitioning column. However, if we set the path to 
> {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows 
> up in the schema. We should make the behaviors of these two cases consistent 
> by not putting a into the schema for the second case.
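An editor's sketch of the two cases (paths are hypothetical), plus the {{basePath}} 
option, which makes the partition-discovery root explicit regardless of which form 
the path takes:

{code}
val df1 = spark.read.parquet("/path/a=1")               // `a` is not treated as a partition column
val df2 = spark.read.parquet("/path/a=1/file.parquet")  // `a` currently shows up in the schema
val df3 = spark.read.option("basePath", "/path")
  .parquet("/path/a=1/file.parquet")                    // discovery starts at /path in both forms
{code}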






[jira] [Created] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features

2016-04-30 Thread yuhao yang (JIRA)
yuhao yang created SPARK-15041:
--

 Summary: adding mode strategy for ml.feature.Imputer for 
categorical features
 Key: SPARK-15041
 URL: https://issues.apache.org/jira/browse/SPARK-15041
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
Priority: Minor


Adding a mode strategy to ml.feature.Imputer for categorical features. This needs 
to wait until the PR for SPARK-13568 is merged.
https://github.com/apache/spark/pull/11601

From the comments of jkbradley and Nick Pentreath in the PR:
{quote}
Investigate efficiency of approaches using DataFrame/Dataset and/or approx 
approaches such as frequentItems or Count-Min Sketch (will require an update to 
CMS to return "heavy-hitters").
investigate if we can use metadata to only allow mode for categorical features 
(or perhaps as an easier alternative, allow mode for only Int/Long columns)
{quote}
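An editor's sketch (hypothetical column name and toy data, not the eventual Imputer 
API) of the naive mode computation that the comments above compare against 
approximate alternatives such as frequentItems or Count-Min Sketch:

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{col, desc}

val df = Seq(Some("a"), Some("b"), Some("a"), None).toDF("category") // toy data with a missing value

// most frequent non-null value of the categorical column, then fill missing values with it
val mode = df.filter(col("category").isNotNull)
  .groupBy("category").count()
  .orderBy(desc("count"))
  .select("category")
  .first().get(0)
val imputed = df.na.fill(Map("category" -> mode))
{code}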







[jira] [Created] (SPARK-15040) PySpark impl for ml.feature.Imputer

2016-04-30 Thread yuhao yang (JIRA)
yuhao yang created SPARK-15040:
--

 Summary: PySpark impl for ml.feature.Imputer
 Key: SPARK-15040
 URL: https://issues.apache.org/jira/browse/SPARK-15040
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
Priority: Minor


PySpark impl for ml.feature.Imputer.

This needs to wait until the PR for SPARK-13568 is merged.
https://github.com/apache/spark/pull/11601







[jira] [Updated] (SPARK-15039) Kinesis receiver does not work in YARN

2016-04-30 Thread Tsai Li Ming (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsai Li Ming updated SPARK-15039:
-
Description: 
Hi,

Using the pyspark kinesis example, it does not receive any messages from 
Kinesis when submitting to a YARN cluster, though it is working fine when using 
local mode. 

{code}
spark-submit \
--executor-cores 4 \
--num-executors 4 \
--packages 
com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1
 
{code}

I had to downgrade the package to 1.5.1; 1.6.1 does not work either.

Not sure whether this is related to SPARK-12453

  was:
Hi,

Using the pyspark kinesis example, it does not receive any messages from 
Kinesis when submitting to a YARN cluster, though it is working fine when using 
local mode. 

{code}
spark-submit \
--executor-cores 4 \
--num-executors 4 \
--packages 
com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1
 
{code}

I had to downgrade the package to 1.5.1. 1.6.1 does not work too. 


> Kinesis receiver does not work in YARN
> --
>
> Key: SPARK-15039
> URL: https://issues.apache.org/jira/browse/SPARK-15039
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: YARN
> HDP 2.4.0
>Reporter: Tsai Li Ming
>
> Hi,
> Using the pyspark kinesis example, it does not receive any messages from 
> Kinesis when submitting to a YARN cluster, though it is working fine when 
> using local mode. 
> {code}
> spark-submit \
> --executor-cores 4 \
> --num-executors 4 \
> --packages 
> com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1
>  
> {code}
> I had to downgrade the package to 1.5.1; 1.6.1 does not work either.
> Not sure whether this is related to SPARK-12453






[jira] [Commented] (SPARK-14785) Support correlated scalar subquery

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265357#comment-15265357
 ] 

Apache Spark commented on SPARK-14785:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/12815

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, Q6 required this
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.
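An editor's sketch of the same rewrite in the DataFrame API (assuming {{t}} and 
{{t2}} are DataFrames with the columns used in the example above; the toy rows are 
hypothetical):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{avg, col}

val t  = Seq((1, "x", 10.0), (2, "y", 1.0)).toDF("id", "a", "b")
val t2 = Seq((1, 5.0), (1, 7.0), (2, 3.0)).toDF("id", "c")

val t3 = t2.groupBy("id").agg(avg("c").as("avg_c"))
val result = t.join(t3, "id").where(col("b") > col("avg_c")).select("a")
{code}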






[jira] [Assigned] (SPARK-14785) Support correlated scalar subquery

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14785:


Assignee: (was: Apache Spark)

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, Q6 required this
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.






[jira] [Assigned] (SPARK-14785) Support correlated scalar subquery

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14785:


Assignee: Apache Spark

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, Q6 required this
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.






[jira] [Updated] (SPARK-15039) Kinesis receiver does not work in YARN

2016-04-30 Thread Tsai Li Ming (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsai Li Ming updated SPARK-15039:
-
Description: 
Hi,

Using the pyspark kinesis example, it does not receive any messages from 
Kinesis when submitting to a YARN cluster, though it is working fine when using 
local mode. 

{code}
spark-submit \
--executor-cores 4 \
--num-executors 4 \
--packages 
com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1
 
{code}

I had to downgrade the package to 1.5.1; 1.6.1 does not work either.

  was:
Hi,

Using the pyspark kinesis example, it does not receive any messages from 
Kinesis when submitting to a YARN cluster, though it is working when using 
local mode. 

```
spark-submit \
--executor-cores 4 \
--num-executors 4 \
--packages 
com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1
 
```

I had to downgrade the package to 1.5.1 before it can work. 


> Kinesis receiver does not work in YARN
> --
>
> Key: SPARK-15039
> URL: https://issues.apache.org/jira/browse/SPARK-15039
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: YARN
> HDP 2.4.0
>Reporter: Tsai Li Ming
>
> Hi,
> When running the pyspark kinesis example on a YARN cluster, it does not receive 
> any messages from Kinesis, though it works fine in local mode. 
> {code}
> spark-submit \
> --executor-cores 4 \
> --num-executors 4 \
> --packages 
> com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1
>  
> {code}
> I had to downgrade the package to 1.5.1; 1.6.1 does not work either. 
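
For context, a rough Scala sketch of the stream-creation step the pyspark kinesis example performs (a sketch only; the app name, stream name, endpoint, region and batch interval below are placeholder assumptions):

{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-sketch"), Seconds(10))

// Placeholder application/stream/endpoint/region names.
val stream = KinesisUtils.createStream(
  ssc, "my-kcl-app", "my-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

// Records arrive as byte arrays; print them to verify the receiver is getting data.
stream.map(bytes => new String(bytes)).print()

ssc.start()
ssc.awaitTermination()
{code}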



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14952) Remove methods that were deprecated in 1.6.0

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265356#comment-15265356
 ] 

Apache Spark commented on SPARK-14952:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/12815

> Remove methods that were deprecated in 1.6.0
> 
>
> Key: SPARK-14952
> URL: https://issues.apache.org/jira/browse/SPARK-14952
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.0
>
>
> Running {{grep -inr "@deprecated"}} I found a few methods that were 
> deprecated in Spark 1.6:
> {noformat}
> ./core/src/main/scala/org/apache/spark/input/PortableDataStream.scala:193:  
> @deprecated("Closing the PortableDataStream is not needed anymore.", "1.6.0")
> ./mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala:392:
>   @deprecated("Use coefficients instead.", "1.6.0")
> ./mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala:483:
>   @deprecated("Use coefficients instead.", "1.6.0")
> {noformat}
> Let's remove those as part of 2.0
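
For callers of the second and third deprecations above (the accessor that "Use coefficients instead." points away from), a minimal migration sketch; the toy training set is made up for illustration and the Spark 2.0 ml.linalg API is assumed:

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coefficients-migration").getOrCreate()

// Made-up training data with the standard "label"/"features" columns.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (0.0, Vectors.dense(2.0, -1.0)),
  (1.0, Vectors.dense(0.0, 1.2))
)).toDF("label", "features")

val model = new LinearRegression().setRegParam(0.01).fit(training)

// Replacement for the removed, deprecated weights accessor.
println(model.coefficients)
{code}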



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15039) Kinesis receiver does not work in YARN

2016-04-30 Thread Tsai Li Ming (JIRA)
Tsai Li Ming created SPARK-15039:


 Summary: Kinesis receiver does not work in YARN
 Key: SPARK-15039
 URL: https://issues.apache.org/jira/browse/SPARK-15039
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.6.0
 Environment: YARN
HDP 2.4.0
Reporter: Tsai Li Ming


Hi,

When running the pyspark kinesis example on a YARN cluster, it does not receive any 
messages from Kinesis, though it works fine in local mode. 

```
spark-submit \
--executor-cores 4 \
--num-executors 4 \
--packages 
com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1
 
```

I had to downgrade the package to 1.5.1 before it would work. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15030) Support formula in spark.kmeans in SparkR

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15030.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12813
[https://github.com/apache/spark/pull/12813]

> Support formula in spark.kmeans in SparkR
> -
>
> Key: SPARK-15030
> URL: https://issues.apache.org/jira/browse/SPARK-15030
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> In SparkR, spark.kmeans takes a DataFrame with double columns. This is 
> different from other ML methods we implemented, which support R model 
> formula. We should add support for that as well.
> {code:none}
> spark.kmeans(data = df, formula = ~ lat + lon, ...)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15038) Add ability to do broadcasts in SQL at execution time

2016-04-30 Thread Patrick Woody (JIRA)
Patrick Woody created SPARK-15038:
-

 Summary: Add ability to do broadcasts in SQL at execution time
 Key: SPARK-15038
 URL: https://issues.apache.org/jira/browse/SPARK-15038
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.1
Reporter: Patrick Woody


Currently, the automatic broadcasting done in Spark SQL is asynchronous and happens at 
query planning time. If you have a large query with many broadcasts, this can 
end up creating a large amount of memory pressure and possible OOMs all at once 
when it actually isn't necessary.

The current workaround for these types of queries is to disable broadcast 
joins, which can be prohibitive performance-wise. The proposal for this ticket 
is to add a config point that toggles between doing these broadcasts 
eagerly/asynchronously (as today) and doing them lazily at execution time.
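
For reference, the existing workaround mentioned above comes down to one setting; a minimal sketch against the 1.6 API (the session setup is shown only for completeness):

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("disable-auto-broadcast"))
val sqlContext = new SQLContext(sc)

// -1 disables automatic broadcast joins entirely, trading the planning-time
// broadcasts (and their memory pressure) for shuffle joins.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}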



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14785) Support correlated scalar subquery

2016-04-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265338#comment-15265338
 ] 

Xiao Li commented on SPARK-14785:
-

Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. 
Thanks!

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, Q6 required this
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14785) Support correlated scalar subquery

2016-04-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14785:

Description: 
For example:
{code}
SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
{code}
it could be rewritten as 

{code}
SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
t3.id = t.id where b > avg_c
{code}

TPCDS Q92, Q81, Q6 required this

Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.

  was:
For example:
{code}
SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
{code}
it could be rewritten as 

{code}
SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
t3.id = t.id where b > avg_c
{code}

TPCDS Q92, Q81, Q6 required this


> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, Q6 required this
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.
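
Until the subquery form is supported, a hedged sketch of applying the same rewrite by hand through the DataFrame API; the tiny stand-in tables below are made up, and the real t/t2 are assumed to exist:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("scalar-subquery-rewrite").getOrCreate()
import spark.implicits._

// Tiny stand-ins for t(id, a, b) and t2(id, c).
val t  = Seq((1, "x", 10.0), (2, "y", 3.0)).toDF("id", "a", "b")
val t2 = Seq((1, 4.0), (1, 6.0), (2, 8.0)).toDF("id", "c")

// Manual form of the rewrite described above: aggregate t2 per id, then join and filter.
val t3 = t2.groupBy("id").agg(avg("c").as("avg_c"))
val rewritten = t.join(t3, t("id") === t3("id"))
  .where(t("b") > t3("avg_c"))
  .select(t("a"))

rewritten.show()
{code}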



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing

2016-04-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265335#comment-15265335
 ] 

Xiao Li commented on SPARK-14968:
-

[~hvanhovell]

Yeah, you are right. After trying to reproduce it, I got the following error 
message: "org.apache.spark.sql.AnalysisException: Correlated scalar subqueries 
are not supported"

Glad to know you are working on support for correlated scalar subqueries. 
Thanks! 

Xiao

> TPC-DS query 1 resolved attribute(s) missing
> 
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in build from 0421.
> The error is:
> {noformat}
> 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from 
> processCmd at CliDriver.java:376
> 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes.
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
> ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
> ctr_store_sk#2);
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static/sql,null}
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
> {noformat}
> The query is:
> {noformat}
> with customer_total_return as
> (select sr_customer_sk as ctr_customer_sk
> ,sr_store_sk as ctr_store_sk
> ,sum(SR_RETURN_AMT) as ctr_total_return
> from store_returns
> ,date_dim
> where sr_returned_date_sk = d_date_sk
> and d_year =2000
> group by sr_customer_sk
> ,sr_store_sk)
>  select  c_customer_id
> from customer_total_return ctr1
> ,store
> ,customer
> where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
> from customer_total_return ctr2
> where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
> and s_store_sk = ctr1.ctr_store_sk
> and s_state = 'TN'
> and ctr1.ctr_customer_sk = c_customer_sk
> order by c_customer_id
>  limit 100
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14952) Remove methods that were deprecated in 1.6.0

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14952:
--
Assignee: Herman van Hovell

> Remove methods that were deprecated in 1.6.0
> 
>
> Key: SPARK-14952
> URL: https://issues.apache.org/jira/browse/SPARK-14952
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.0
>
>
> Running {{grep -inr "@deprecated"}} I found a few methods that were 
> deprecated in Spark 1.6:
> {noformat}
> ./core/src/main/scala/org/apache/spark/input/PortableDataStream.scala:193:  
> @deprecated("Closing the PortableDataStream is not needed anymore.", "1.6.0")
> ./mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala:392:
>   @deprecated("Use coefficients instead.", "1.6.0")
> ./mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala:483:
>   @deprecated("Use coefficients instead.", "1.6.0")
> {noformat}
> Let's remove those as part of 2.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14952) Remove methods that were deprecated in 1.6.0

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14952.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12732
[https://github.com/apache/spark/pull/12732]

> Remove methods that were deprecated in 1.6.0
> 
>
> Key: SPARK-14952
> URL: https://issues.apache.org/jira/browse/SPARK-14952
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Reporter: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.0
>
>
> Running {{grep -inr "@deprecated"}} I found a few methods that were 
> deprecated in Spark 1.6:
> {noformat}
> ./core/src/main/scala/org/apache/spark/input/PortableDataStream.scala:193:  
> @deprecated("Closing the PortableDataStream is not needed anymore.", "1.6.0")
> ./mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala:392:
>   @deprecated("Use coefficients instead.", "1.6.0")
> ./mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala:483:
>   @deprecated("Use coefficients instead.", "1.6.0")
> {noformat}
> Let's remove those as part of 2.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265330#comment-15265330
 ] 

Apache Spark commented on SPARK-14850:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/12814

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]
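
As a toy illustration of the boxing concern (this is not Spark's internal code, just the JVM-level effect being described):

{code}
// A primitive array stores unboxed doubles contiguously; pushing the same values
// through a non-specialized, Object-typed container allocates one java.lang.Double
// wrapper per element, which is the overhead a non-specialized container pays.
val unboxed: Array[Double] = Array.tabulate(1000000)(_.toDouble) // no per-element objects
val boxed: Array[Any]      = unboxed.map(x => x: Any)            // one boxed Double per element

println(s"${unboxed.length} primitives vs ${boxed.length} boxed values")
{code}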



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14653) Remove NumericParser and jackson dependency from mllib-local

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14653.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12802
[https://github.com/apache/spark/pull/12802]

> Remove NumericParser and jackson dependency from mllib-local
> 
>
> Key: SPARK-14653
> URL: https://issues.apache.org/jira/browse/SPARK-14653
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.0.0
>
>
> After SPARK-14549, we should remove NumericParser and jackson from 
> mllib-local, which were introduced very early and are now replaced by UDTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14162) java.lang.IllegalStateException: Did not find registered driver with class oracle.jdbc.OracleDriver

2016-04-30 Thread Martin Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265310#comment-15265310
 ] 

Martin Hall commented on SPARK-14162:
-

I got the same error when I had forgotten to copy the Oracle JDBC jar file 
(ojdbc6.jar) to one of the Spark worker nodes.
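
For anyone else hitting this, a hedged sketch of the two knobs usually involved: ship the jar so the executors can see it (e.g. via --jars), and name the driver class explicitly through the JDBC source's driver option. The connection string and table below are placeholders, not taken from this report:

{code}
// Assumes a spark-shell/pyspark-style session where sqlContext is already defined.
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder connection string
  .option("dbtable", "bi.contact")
  .option("driver", "oracle.jdbc.OracleDriver")               // explicitly names the JDBC driver class
  .load()

println(df.count())
{code}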

> java.lang.IllegalStateException: Did not find registered driver with class 
> oracle.jdbc.OracleDriver
> ---
>
> Key: SPARK-14162
> URL: https://issues.apache.org/jira/browse/SPARK-14162
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Zoltan Fedor
>
> This is an interesting one.
> We are using JupyterHub with Python to connect to a Hadoop cluster to run 
> Spark jobs, and as new Spark versions come out I compile them and add them as 
> new kernels to JupyterHub.
> There are also some libraries we are using, like ojdbc to connect to an 
> Oracle database.
> Now the interesting thing is that ojdbc worked fine in Spark 1.6.0 but suddenly 
> "it cannot be found" in 1.6.1.
> Everything, all settings are the same when starting pyspark 1.6.1 and 1.6.0, 
> so there is no reason for it not to work in 1.6.1 if it works in 1.6.0.
> This is the pyspark code I am running in both 1.6.1 and 1.6.0:
> {quote}
> df = 
> sqlContext.read.format('jdbc').options(url='jdbc:oracle:thin:'+connection_script+'',
>  dbtable='bi.contact').load()
> print(df.count()){quote}
> And it throws this error in 1.6.1 only:
> {quote}
> java.lang.IllegalStateException: Did not find registered driver with class 
> oracle.jdbc.OracleDriver
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745){quote}
> I know that this usually means that the ojdbc driver is not available on the 
> executor, but it is. Spark is being started the exact same way in 1.6.1 as in 
> 1.6.0 and it does find it on 1.6.0.
> I can reproduce this consistently, so the only conclusion is that something must 
> have changed between 1.6.0 and 1.6.1 to cause this, but I have seen no 
> deprecation notice of anything that could cause it.
> Environment variables set when starting pyspark 1.6.1:
> {quote}
>   "SPARK_HOME": "/usr/lib/spark-1.6.1-hive",
>   "SCALA_HOME": "/usr/lib/scala",
>   "HADOOP_CONF_DIR": "/etc/hadoop/venus-hadoop-conf",
>   "HADOOP_HOME": "/usr/bin/hadoop",
>   "HIVE_HOME": "/usr/bin/hive",
>   "LD_LIBRARY_PATH": "/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH",
>   "YARN_HOME": "",
>   "SPARK_DIST_CLASSPATH": 
> "/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*",
>   "SPARK_LIBRARY_PATH": "/usr/lib/hadoop/lib",
>   "PATH": 
> 

[jira] [Commented] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing

2016-04-30 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265296#comment-15265296
 ] 

Herman van Hovell commented on SPARK-14968:
---

[~jfc...@us.ibm.com] This is a correlated scalar subquery and this does not 
work yet. I am currently working on it.

> TPC-DS query 1 resolved attribute(s) missing
> 
>
> Key: SPARK-14968
> URL: https://issues.apache.org/jira/browse/SPARK-14968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
>
> This is a regression from a week ago. Failed to generate plan for query 1 in 
> TPCDS using 0427 build from 
> people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/.
> Was working in build from 0421.
> The error is:
> {noformat}
> 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from 
> processCmd at CliDriver.java:376
> 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes.
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from 
> ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = 
> ctr_store_sk#2);
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static/sql,null}
> 16/04/27 07:00:59 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
> {noformat}
> The query is:
> {noformat}
> with customer_total_return as
> (select sr_customer_sk as ctr_customer_sk
> ,sr_store_sk as ctr_store_sk
> ,sum(SR_RETURN_AMT) as ctr_total_return
> from store_returns
> ,date_dim
> where sr_returned_date_sk = d_date_sk
> and d_year =2000
> group by sr_customer_sk
> ,sr_store_sk)
>  select  c_customer_id
> from customer_total_return ctr1
> ,store
> ,customer
> where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
> from customer_total_return ctr2
> where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
> and s_store_sk = ctr1.ctr_store_sk
> and s_state = 'TN'
> and ctr1.ctr_customer_sk = c_customer_sk
> order by c_customer_id
>  limit 100
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python example.

2016-04-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15031:
--
Description: 
This PR aims to update the Scala/Python examples by replacing SQLContext with the 
newly added SparkSession. Also, this fixes the following examples.

**sql.py**
{code}
-people = sqlContext.jsonFile(path)
+people = sqlContext.read.json(path)
-people.registerAsTable("people")
+people.registerTempTable("people")
{code}

**dataframe_example.py**
{code}
- features = df.select("features").map(lambda r: r.features)
+ features = df.select("features").rdd.map(lambda r: r.features)
{code}

Note that the following examples are untouched in this PR since they fail with some 
unknown issue.

- `simple_params_example.py`
- `aft_survival_regression.py`

  was:
Currently, Python SQL example, `sql.py`, fails.

{code}
bin/spark-submit examples/src/main/python/sql.py
Traceback (most recent call last):
  File 
"/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", line 
60, in 
people = sqlContext.jsonFile(path)
AttributeError: 'SQLContext' object has no attribute 'jsonFile'
{code}

{code}
Traceback (most recent call last):
  File 
"/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", line 
72, in 
people.registerAsTable("people")
  File "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", 
line 795, in __getattr__
AttributeError: 'DataFrame' object has no attribute 'registerAsTable'
{code}

This issue fixes them by the following fix.
{code}
-people = sqlContext.jsonFile(path)
+people = sqlContext.read.json(path)
...
-people.registerAsTable("people")
+people.registerTempTable("people")
{code}


> Use SparkSession in Scala/Python example.
> -
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This PR aims to update the Scala/Python examples by replacing SQLContext with 
> the newly added SparkSession. Also, this fixes the following examples.
> **sql.py**
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}
> **dataframe_example.py**
> {code}
> - features = df.select("features").map(lambda r: r.features)
> + features = df.select("features").rdd.map(lambda r: r.features)
> {code}
> Note that the following examples are untouched in this PR since they fail with 
> some unknown issue.
> - `simple_params_example.py`
> - `aft_survival_regression.py`
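
For reference, a minimal Scala sketch of the SparkSession-based pattern the examples are moving to (the JSON path is the one shipped with the Spark examples; other names are assumed):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sparksession-example").getOrCreate()

// Replaces the removed sqlContext.jsonFile(path) / registerAsTable(...) calls.
val people = spark.read.json("examples/src/main/resources/people.json")
people.registerTempTable("people")

val teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()
{code}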



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python example.

2016-04-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15031:
--
Issue Type: Improvement  (was: Bug)

> Use SparkSession in Scala/Python example.
> -
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, Python SQL example, `sql.py`, fails.
> {code}
> bin/spark-submit examples/src/main/python/sql.py
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 60, in <module>
> people = sqlContext.jsonFile(path)
> AttributeError: 'SQLContext' object has no attribute 'jsonFile'
> {code}
> {code}
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 72, in <module>
> people.registerAsTable("people")
>   File 
> "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 795, in __getattr__
> AttributeError: 'DataFrame' object has no attribute 'registerAsTable'
> {code}
> This issue fixes them by the following fix.
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> ...
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python example.

2016-04-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15031:
--
Summary: Use SparkSession in Scala/Python example.  (was: Fix SQL python 
example)

> Use SparkSession in Scala/Python example.
> -
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, Python SQL example, `sql.py`, fails.
> {code}
> bin/spark-submit examples/src/main/python/sql.py
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 60, in <module>
> people = sqlContext.jsonFile(path)
> AttributeError: 'SQLContext' object has no attribute 'jsonFile'
> {code}
> {code}
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 72, in <module>
> people.registerAsTable("people")
>   File 
> "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 795, in __getattr__
> AttributeError: 'DataFrame' object has no attribute 'registerAsTable'
> {code}
> This issue fixes them by the following fix.
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> ...
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package

2016-04-30 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265288#comment-15265288
 ] 

Liang-Chi Hsieh commented on SPARK-14906:
-

Yes.

> Move VectorUDT and MatrixUDT in PySpark to new ML package
> -
>
> Key: SPARK-14906
> URL: https://issues.apache.org/jira/browse/SPARK-14906
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Liang-Chi Hsieh
>
> As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark 
> codes should be moved too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15030) Support formula in spark.kmeans in SparkR

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15030:


Assignee: Yanbo Liang  (was: Apache Spark)

> Support formula in spark.kmeans in SparkR
> -
>
> Key: SPARK-15030
> URL: https://issues.apache.org/jira/browse/SPARK-15030
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> In SparkR, spark.kmeans takes a DataFrame with double columns. This is 
> different from other ML methods we implemented, which support R model 
> formula. We should add support for that as well.
> {code:none}
> spark.kmeans(data = df, formula = ~ lat + lon, ...)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15030) Support formula in spark.kmeans in SparkR

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15030:


Assignee: Apache Spark  (was: Yanbo Liang)

> Support formula in spark.kmeans in SparkR
> -
>
> Key: SPARK-15030
> URL: https://issues.apache.org/jira/browse/SPARK-15030
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> In SparkR, spark.kmeans takes a DataFrame with double columns. This is 
> different from other ML methods we implemented, which support R model 
> formula. We should add support for that as well.
> {code:none}
> spark.kmeans(data = df, formula = ~ lat + lon, ...)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15030) Support formula in spark.kmeans in SparkR

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265282#comment-15265282
 ] 

Apache Spark commented on SPARK-15030:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12813

> Support formula in spark.kmeans in SparkR
> -
>
> Key: SPARK-15030
> URL: https://issues.apache.org/jira/browse/SPARK-15030
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> In SparkR, spark.kmeans takes a DataFrame with double columns. This is 
> different from other ML methods we implemented, which support R model 
> formula. We should add support for that as well.
> {code:none}
> spark.kmeans(data = df, formula = ~ lat + lon, ...)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14858) Push predicates with subquery

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14858:
--
Assignee: Herman van Hovell

> Push predicates with subquery 
> --
>
> Key: SPARK-14858
> URL: https://issues.apache.org/jira/browse/SPARK-14858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> Currently we rewrite the subquery as a Join at the beginning of the Optimizer; we 
> should defer that to enable predicate pushdown (because a Join can't be 
> easily pushed down).
> cc [~hvanhovell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14981) CatalogTable should contain sorting directions of sorting columns

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14981:
--
Assignee: Cheng Lian

> CatalogTable should contain sorting directions of sorting columns
> -
>
> Key: SPARK-14981
> URL: https://issues.apache.org/jira/browse/SPARK-14981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> For a bucketed table with sorting columns, {{CatalogTable}} only records 
> sorting column names, while sorting directions (ASC/DESC) are missing.
> Our SQL parser supports the syntax, but sorting directions are silently 
> dropped.
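
For illustration, a hedged sketch of the parser-level DDL in question (made-up table; syntax sketch only, execution support aside). The DESC below is the part CatalogTable currently drops:

{code}
// Assumes a SparkSession named spark.
spark.sql("""
  CREATE TABLE events (id INT, ts BIGINT)
  CLUSTERED BY (id) SORTED BY (ts DESC) INTO 8 BUCKETS
""")
{code}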



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13289:
--
Assignee: Nick Pentreath

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>Assignee: Nick Pentreath
>  Labels: features
> Fix For: 2.0.0
>
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> a large text dataset, but found that when the number of iterations is larger 
> than 5, the distances between words are all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14737) Kafka Brokers are down - spark stream should retry

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14737.
---
Resolution: Not A Problem

Given the problem statement here, I think this is not a Spark problem.

> Kafka Brokers are down - spark stream should retry
> --
>
> Key: SPARK-14737
> URL: https://issues.apache.org/jira/browse/SPARK-14737
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
> Environment: Suse Linux, Cloudera Enterprise 5.4.8 (#7 built by 
> jenkins on 20151023-1205 git: d7dbdf29ac1d57ae9fb19958502d50dcf4e4fffd), 
> kafka_2.10-0.8.2.2
>Reporter: Faisal
>
> I have a Spark Streaming application that uses direct streaming, listening to 
> a Kafka topic.
> {code}
> HashMap<String, String> kafkaParams = new HashMap<String, String>();
> kafkaParams.put("metadata.broker.list", "broker1,broker2,broker3");
> kafkaParams.put("auto.offset.reset", "largest");
> HashSet<String> topicsSet = new HashSet<String>();
> topicsSet.add("Topic1");
> JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
> jssc, 
> String.class, 
> String.class,
> StringDecoder.class, 
> StringDecoder.class, 
> kafkaParams, 
> topicsSet
> );
> {code}
> I notice that when I stop/shut down the Kafka brokers, my Spark application also 
> shuts down.
> Here is the spark execution script
> {code}
> spark-submit \
> --master yarn-cluster \
> --files /home/siddiquf/spark/log4j-spark.xml
> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
> --conf 
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
> --class com.example.MyDataStreamProcessor \
> myapp.jar 
> {code}
> The Spark job is submitted successfully and I can track the application driver and 
> worker/executor nodes.
> Everything works fine, but my only concern is: if the Kafka brokers are offline or 
> restarted, my application, which is managed by YARN, should not shut down, yet it does.
> If this is expected behavior, then how do we handle such a situation with the least 
> maintenance? Keep in mind that the Kafka cluster is not part of the Hadoop cluster and 
> is managed by a different team, which is why our application needs to be 
> resilient.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14750) Make historyServer refer application log in hdfs

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14750.
---
Resolution: Won't Fix

> Make historyServer refer application log in hdfs
> 
>
> Key: SPARK-14750
> URL: https://issues.apache.org/jira/browse/SPARK-14750
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: SuYan
>
> Make the history server refer to application logs in HDFS, just like the MR history server



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13289.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11812
[https://github.com/apache/spark/pull/11812]

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
> Fix For: 2.0.0
>
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> a large text dataset, but found that when the number of iterations is larger 
> than 5, the distances between words are all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14985) Update LinearRegression, LogisticRegression summary internals to handle model copy

2016-04-30 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265261#comment-15265261
 ] 

Benjamin Fradet commented on SPARK-14985:
-

I'll take this one if you guys don't mind.

> Update LinearRegression, LogisticRegression summary internals to handle model 
> copy
> --
>
> Key: SPARK-14985
> URL: https://issues.apache.org/jira/browse/SPARK-14985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See parent JIRA + the PR for [SPARK-14852] for details.  The summaries should 
> handle creating an internal copy of the model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14989) Upgrade to Jackson 2.7.3

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14989:
--
Priority: Blocker  (was: Major)

This is one of a handful that I think actually have to be resolved before 2.0.0 
one way or the other given it's a dependency change. 

> Upgrade to Jackson 2.7.3
> 
>
> Key: SPARK-14989
> URL: https://issues.apache.org/jira/browse/SPARK-14989
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> For Spark 2.0, we should upgrade to a newer version of Jackson (2.7.3).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12154) Upgrade to Jersey 2

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12154:
--
Priority: Blocker  (was: Major)

This is one of a handful that I think actually have to be resolved before 2.0.0 
one way or the other given it's a dependency change. 

> Upgrade to Jersey 2
> ---
>
> Key: SPARK-12154
> URL: https://issues.apache.org/jira/browse/SPARK-12154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 1.5.2
>Reporter: Matt Cheah
>Priority: Blocker
>
> Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. 
> Library conflicts for Jersey are difficult to work around - see the discussion on 
> SPARK-11081. It's easier to upgrade Jersey entirely, but we should target 
> Spark 2.0 since this may be a breaking change for users who were using Jersey 1 in 
> their Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15014) Spark Shell does not work with Ammonite Shell

2016-04-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265257#comment-15265257
 ] 

Sean Owen commented on SPARK-15014:
---

Why is this a Spark problem per se? Spark has its own shell (derived of course 
from the Scala shell), but it isn't pluggable.

> Spark Shell does not work with Ammonite Shell
> -
>
> Key: SPARK-15014
> URL: https://issues.apache.org/jira/browse/SPARK-15014
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: All
>Reporter: John-Michael Reed
>Priority: Minor
>  Labels: shell, shell-script
>
> Lihaoyi has an enhanced Scala Shell called Ammonite. 
> https://github.com/lihaoyi/Ammonite
> Users of Ammonite shell have tried to use it with Apache Spark. 
> https://github.com/lihaoyi/Ammonite/issues/382
> Spark Shell does not work with Ammonite Shell, but I want it to because the 
> Ammonite REPL offers enhanced auto-complete, pretty printing, and other 
> features. See http://www.lihaoyi.com/Ammonite/#Ammonite-REPL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites

2016-04-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265256#comment-15265256
 ] 

Dongjoon Hyun commented on SPARK-15037:
---

Hi, [~rxin]. 

So far, it seems there are two issues.

- `object SQLContext` still has its own unique functions. We cannot replace 
`SQLContext` completely because `SharedSQLContext` uses it, and so does 
`MLlibTestSparkContext`.
- Also, a constructor `SparkSession(JavaSparkSession)` or a `JavaSparkSession` 
class is needed for the Java test suites.

We had better handle these as separate issues before this kind of refactoring. 
What do you think?

> Use SparkSession instead of SQLContext in testsuites
> -
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Test
>Reporter: Dongjoon Hyun
>
> This issue aims to update the existing testsuites to use `SparkSession` 
> instead of `SQLContext` since `SQLContext` exists just for backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites

2016-04-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265256#comment-15265256
 ] 

Dongjoon Hyun edited comment on SPARK-15037 at 4/30/16 9:08 AM:


Hi, [~rxin]. 

So far, it seems there are two issues.

- `object SQLContext` still has its own unique functions. We cannot replace 
`SQLContext` completely because `SharedSQLContext` uses it, and so does 
`MLlibTestSparkContext`.
- A constructor `SparkSession(JavaSparkSession)` or a `JavaSparkSession` class is 
needed for the Java test suites.

We had better handle these as separate issues before this kind of refactoring. 
What do you think?


was (Author: dongjoon):
Hi, [~rxin]. 

Until now, it seems there are two issues.

- `object SQLContext` has still its own unique functions. We cannot replace 
`SQLContext` completely because `SharedSQLContext` uses it. Also, 
`MLlibTestSparkContext` does.
- Also, constructor `SparkSession(JavaSparkSession)` or `JavaSparkSession` 
class is needed for Java testsuite.

We had better handle them as separate issues before this kind of refactoring 
issue. How do you think about this?

> Use SparkSession instead of SQLContext in testsuites
> -
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Test
>Reporter: Dongjoon Hyun
>
> This issue aims to update the existing testsuites to use `SparkSession` 
> instead of `SQLContext` since `SQLContext` exists just for backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15015) Log statements lack file name/number

2016-04-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265255#comment-15265255
 ] 

Sean Owen commented on SPARK-15015:
---

Hm, is it actually possible to know the line number at runtime? It's present in 
the bytecode, but I'm not sure how a logging API would reach it. Here, it's your IDE 
providing this info.

> Log statements lack file name/number
> 
>
> Key: SPARK-15015
> URL: https://issues.apache.org/jira/browse/SPARK-15015
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.1
> Environment: All
>Reporter: John-Michael Reed
>Priority: Trivial
>  Labels: debug, log
>
> I would like it if the Apache Spark project had file names and line numbers 
> in its log statements like this:
> http://i.imgur.com/4hvGQ0t.png
> The example uses my library, http://johnreedlol.github.io/scala-trace-debug/, 
> but https://github.com/lihaoyi/sourcecode is also useful for this purpose. 
> The real benefit of doing this is that the user of an IDE can jump to the 
> location of a log statement without having to set breakpoints.
> http://s29.postimg.org/ud0knou1j/debug_Screenshot_Crop.png
> Note that the arrow will go to the next log statement if each log statement 
> is hyperlinked.
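
For reference, a hedged sketch of how the sourcecode library linked above surfaces this information at compile time rather than at runtime (the helper name is made up, and the library is assumed to be on the classpath):

{code}
// sourcecode.File / sourcecode.Line are implicit macro values filled in at each call site,
// so the printed location points back to the caller with no runtime stack inspection.
def logAt(msg: String)(implicit file: sourcecode.File, line: sourcecode.Line): Unit =
  println(s"${file.value}:${line.value} $msg")

logAt("starting stage")   // e.g. /path/to/Caller.scala:42 starting stage
{code}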



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites

2016-04-30 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-15037:
-

 Summary: Use SparkSession instead of SQLContext in testsuites
 Key: SPARK-15037
 URL: https://issues.apache.org/jira/browse/SPARK-15037
 Project: Spark
  Issue Type: Bug
Reporter: Dongjoon Hyun


This issue aims to update the existing testsuites to use `SparkSession` 
instead of `SQLContext` since `SQLContext` exists just for backward 
compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites

2016-04-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15037:
--
Issue Type: Test  (was: Bug)

> Use SparkSession instead of SQLContext in testsuites
> -
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Test
>Reporter: Dongjoon Hyun
>
> This issue aims to update the existing testsuites to use `SparkSession` 
> instead of `SQLContext` since `SQLContext` exists just for backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15028) Remove Hive config override

2016-04-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15028.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12806
[https://github.com/apache/spark/pull/12806]

> Remove Hive config override
> ---
>
> Key: SPARK-15028
> URL: https://issues.apache.org/jira/browse/SPARK-15028
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14113) Consider marking JobConf closure-cleaning in HadoopRDD as optional

2016-04-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14113.
---
Resolution: Won't Fix

See PR discussion

> Consider marking JobConf closure-cleaning in HadoopRDD as optional
> --
>
> Key: SPARK-14113
> URL: https://issues.apache.org/jira/browse/SPARK-14113
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> In HadoopRDD, the following code was introduced as a part of SPARK-6943.
> {noformat}
>   if (initLocalJobConfFuncOpt.isDefined) {
> sparkContext.clean(initLocalJobConfFuncOpt.get)
>   }
> {noformat}
> When working on one of the changes in OrcRelation, I tried passing 
> initLocalJobConfFuncOpt to HadoopRDD and that incurred a significant performance 
> penalty (due to closure cleaning) with large RDDs. This would be invoked for 
> every HadoopRDD initialization, causing the bottleneck.
> An example thread stack is given below:
> {noformat}
> at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402)
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
> at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390)
> at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
> at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224)
> at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223)
> at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2079)
> at 
> org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112){noformat}
> Creating this JIRA to explore the possibility of removing it or marking it 
> optional.
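
For illustration, a hedged sketch of what the opt-out could look like, extending the snippet quoted above; the `cleanInitLocalJobConfFunc` flag is an assumption for illustration, not an actual HadoopRDD parameter.

{code}
// Hypothetical sketch only: gate the closure cleaning behind an opt-out flag.
// `cleanInitLocalJobConfFunc` is an assumed constructor parameter / config value,
// not part of the real HadoopRDD API.
if (initLocalJobConfFuncOpt.isDefined && cleanInitLocalJobConfFunc) {
  sparkContext.clean(initLocalJobConfFuncOpt.get)
}
{code}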



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15036) When creating a database, we need to qualify its path

2016-04-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15036:


 Summary: When creating a database, we need to qualify its path
 Key: SPARK-15036
 URL: https://issues.apache.org/jira/browse/SPARK-15036
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15035) SessionCatalog needs to set the location for default DB

2016-04-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15035:


 Summary: SessionCatalog needs to set the location for default DB
 Key: SPARK-15035
 URL: https://issues.apache.org/jira/browse/SPARK-15035
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai


Right now, in SessionCatalog, the default location of the database is an empty 
string. This breaks the CREATE TABLE command when we use SparkSession without 
Hive support.
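
A hedged sketch of deriving a non-empty default database location from the warehouse path; the helper name and wiring are assumptions for illustration, not the actual SessionCatalog code.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Assumed helper: qualify "<warehouse>/default.db" so the default database
// location is never an empty string.
def defaultDbLocation(warehousePath: String): String = {
  val path = new Path(new Path(warehousePath), "default.db")
  val fs = path.getFileSystem(new Configuration())
  // makeQualified fills in the scheme and authority (e.g. file:/ or hdfs://host)
  fs.makeQualified(path).toString
}
{code}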



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir

2016-04-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15034:


 Summary: Use the value of spark.sql.warehouse.dir as the warehouse 
location instead of using hive.metastore.warehouse.dir
 Key: SPARK-15034
 URL: https://issues.apache.org/jira/browse/SPARK-15034
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai


Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf used to set the 
warehouse location. We will no longer use hive.metastore.warehouse.dir.
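
For context, a minimal sketch of how an application would set the new conf, assuming the Spark 2.0 behavior described above; the app name and path are illustrative.

{code}
import org.apache.spark.sql.SparkSession

// spark.sql.warehouse.dir replaces hive.metastore.warehouse.dir as the
// warehouse location; the path here is illustrative only.
val spark = SparkSession.builder()
  .appName("WarehouseDirSketch")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .getOrCreate()

// New databases and managed tables land under the configured warehouse dir.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
{code}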



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15033) fix a flaky test in CachedTableSuite

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265240#comment-15265240
 ] 

Apache Spark commented on SPARK-15033:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12811

> fix a flaky test in CachedTableSuite
> 
>
> Key: SPARK-15033
> URL: https://issues.apache.org/jira/browse/SPARK-15033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15033) fix a flaky test in CachedTableSuite

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15033:


Assignee: Apache Spark  (was: Wenchen Fan)

> fix a flaky test in CachedTableSuite
> 
>
> Key: SPARK-15033
> URL: https://issues.apache.org/jira/browse/SPARK-15033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15033) fix a flaky test in CachedTableSuite

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15033:


Assignee: Wenchen Fan  (was: Apache Spark)

> fix a flaky test in CachedTableSuite
> 
>
> Key: SPARK-15033
> URL: https://issues.apache.org/jira/browse/SPARK-15033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15033) fix a flaky test in CachedTableSuite

2016-04-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-15033:
---

 Summary: fix a flaky test in CachedTableSuite
 Key: SPARK-15033
 URL: https://issues.apache.org/jira/browse/SPARK-15033
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Target Version/s:   (was: 2.0.0)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` 
> to be consistent with other APIs under spark.ml and it also leaves space for 
> Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265238#comment-15265238
 ] 

Xiangrui Meng commented on SPARK-15027:
---

It might be tricky to use Dataset due to encoders and generic ID types. But if 
we use DataFrame as input and output, it seems feasible. It would be great if 
you can take a look.

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` 
> to be consistent with other APIs under spark.ml and it also leaves space for 
> Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Assignee: (was: Xiangrui Meng)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` 
> to be consistent with other APIs under spark.ml and it also leaves space for 
> Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265232#comment-15265232
 ] 

Nick Pentreath commented on SPARK-15027:


Ok - it would make sense to have it in 2.0 if possible even though it is 
DeveloperApi. I can do it.

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` 
> to be consistent with other APIs under spark.ml and it also leaves space for 
> Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229
 ] 

Xiangrui Meng edited comment on SPARK-15027 at 4/30/16 7:50 AM:


Just API change. I guess there are still gaps to use DataFrame for the 
implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer 
API.


was (Author: mengxr):
No, just API change. I guess there are still gaps to use DataFrame for the 
implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer 
API.

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` 
> to be consistent with other APIs under spark.ml and it also leaves space for 
> Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229
 ] 

Xiangrui Meng commented on SPARK-15027:
---

No, just API change. I guess there are still gaps to use DataFrame for the 
implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer 
API.

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` 
> to be consistent with other APIs under spark.ml and it also leaves space for 
> Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265228#comment-15265228
 ] 

Nick Pentreath commented on SPARK-15027:


[~mengxr] are you intending this to be a more "superficial" change (as in, 
change the signature of train to take a Dataset, but still operate on RDDs 
inside the method), or try to have the entire algorithm operate on Dataset?
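
For illustration, a hedged sketch of the "superficial" variant: the public signature accepts a DataFrame while the body converts back to an RDD and reuses the existing path. The case class, column names, and helper below are assumptions, not the actual ml.ALS internals.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Assumed rating type for illustration; ml.ALS uses its own generic Rating.
case class SimpleRating(user: Int, item: Int, rating: Float)

// Sketch: DataFrame at the API boundary, RDD-based implementation underneath.
def toRatingRDD(ratings: DataFrame): RDD[SimpleRating] =
  ratings.select("user", "item", "rating").rdd.map { r =>
    SimpleRating(r.getInt(0), r.getInt(1), r.getFloat(2))
  }
// The returned RDD would then feed the existing RDD-based ALS.train path unchanged.
{code}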

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This continues the work from SPARK-14412 to update 
> `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and 
> `finalRDDStorageLevel` to `finalStorageLevel`. We should also update 
> `ALS.train` to use `Dataset` instead of `RDD`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Description: We should also update `ALS.train` to use `Dataset/DataFrame` 
instead of `RDD` to be consistent with other APIs under spark.ml and it also 
leaves space for Tungsten-based optimization.  (was: This continues the work 
from SPARK-14412 to update `intermediateRDDStorageLevel` to 
`intermediateStorageLevel`, and `finalRDDStorageLevel` to `finalStorageLevel`. 
We should also update `ALS.train` to use `Dataset` instead of `RDD`.)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` 
> to be consistent with other APIs under spark.ml and it also leaves space for 
> Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Summary: ALS.train should use DataFrame instead of RDD  (was: ml.ALS params 
and ALS.train should not depend on RDD)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This continues the work from SPARK-14412 to update 
> `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and 
> `finalRDDStorageLevel` to `finalStorageLevel`. We should also update 
> `ALS.train` to use `Dataset` instead of `RDD`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive

2016-04-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15032:


 Summary: When we create a new JDBC session, we may need to create 
a new session of executionHive
 Key: SPARK-15032
 URL: https://issues.apache.org/jira/browse/SPARK-15032
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


Right now, we only use executionHive in the thriftserver. When we create a new JDBC 
session, we probably need to create a new session of executionHive as well. I am not 
sure what would break if we leave the code as is, but it feels safer 
to create a new session of executionHive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15029) Bad error message for two generators in the project clause

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15029:


Assignee: Apache Spark

> Bad error message for two generators in the project clause
> --
>
> Key: SPARK-15029
> URL: https://issues.apache.org/jira/browse/SPARK-15029
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> {code}
> scala> spark.range(1000).map(i => (Array[Long](i), 
> Array[Long](i))).selectExpr("explode(_1)", "explode(_2)").explain(true)
> org.apache.spark.sql.AnalysisException: Only one generator allowed per select 
> but Generate and and Explode found.;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1275)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1272)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> {code}
> It's confusing to call one "Generator" and the other "Explode". There are also 
> two "and"s.
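
A hedged sketch of a clearer message that names every generator the same way; the inputs here are plain strings rather than the actual Analyzer expressions.

{code}
// Sketch only: consistent naming and correct grammar in the error message.
def tooManyGeneratorsMessage(generatorNames: Seq[String]): String =
  s"Only one generator allowed per select clause but found ${generatorNames.size}: " +
    generatorNames.mkString(", ")

// e.g. tooManyGeneratorsMessage(Seq("explode(_1)", "explode(_2)")) yields
// "Only one generator allowed per select clause but found 2: explode(_1), explode(_2)"
{code}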



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15029) Bad error message for two generators in the project clause

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265226#comment-15265226
 ] 

Apache Spark commented on SPARK-15029:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12810

> Bad error message for two generators in the project clause
> --
>
> Key: SPARK-15029
> URL: https://issues.apache.org/jira/browse/SPARK-15029
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> scala> spark.range(1000).map(i => (Array[Long](i), 
> Array[Long](i))).selectExpr("explode(_1)", "explode(_2)").explain(true)
> org.apache.spark.sql.AnalysisException: Only one generator allowed per select 
> but Generate and and Explode found.;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1275)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1272)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> {code}
> It's confusing to call one "Generator" and the other "Explode". There are also 
> two "and"s.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15029) Bad error message for two generators in the project clause

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15029:


Assignee: (was: Apache Spark)

> Bad error message for two generators in the project clause
> --
>
> Key: SPARK-15029
> URL: https://issues.apache.org/jira/browse/SPARK-15029
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> scala> spark.range(1000).map(i => (Array[Long](i), 
> Array[Long](i))).selectExpr("explode(_1)", "explode(_2)").explain(true)
> org.apache.spark.sql.AnalysisException: Only one generator allowed per select 
> but Generate and and Explode found.;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1275)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1272)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> {code}
> It's confusing to call one "Generator" and the other "Explode". There are also 
> two "and"s.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.

2016-04-30 Thread Partha Talukder (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265219#comment-15265219
 ] 

Partha Talukder commented on SPARK-14975:
-

Thanks Joseph. I would keep that in mind.

> Predicted Probability per training instance for Gradient Boosted Trees in 
> mllib. 
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Priority: Minor
>  Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT. This is in contrast to the "gbm" package in 
> R, where we can specify the distribution and get predicted probabilities or 
> classes. I understand that this algorithm works with "Classification" and 
> "Regression" algorithms. Is there any way in GBT to get predicted 
> probabilities or to provide thresholds to the model?
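
Until such an API exists, a commonly cited workaround is to recover a probability from the raw GBT margin. Below is a hedged sketch, assuming a binary-classification model trained with log loss; this is not an official MLlib API.

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Workaround sketch: sum the weighted raw tree outputs to get the margin,
// then squash it through the logistic function. Assumes binary classification
// trained with log loss; not an official MLlib API.
def predictedProbability(model: GradientBoostedTreesModel, features: Vector): Double = {
  val margin = model.trees.zip(model.treeWeights).map {
    case (tree, weight) => tree.predict(features) * weight
  }.sum
  1.0 / (1.0 + math.exp(-2.0 * margin))
}
{code}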



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0

2016-04-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13485.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> (Dataset-oriented) API evolution in Spark 2.0
> -
>
> Key: SPARK-13485
> URL: https://issues.apache.org/jira/browse/SPARK-13485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
> Attachments: API Evolution in Spark 2.0.pdf
>
>
> As part of Spark 2.0, we want to create a stable API foundation for Dataset 
> to become the main user-facing API in Spark. This ticket tracks various tasks 
> related to that.
> The main high level changes are:
> 1. Merge Dataset/DataFrame
> 2. Create a more natural entry point for Dataset (SQLContext/HiveContext are 
> not ideal because of the name "SQL"/"Hive", and "SparkContext" is not ideal 
> because of its heavy dependency on RDDs)
> 3. First class support for sessions
> 4. First class support for some system catalog
> See the design doc for more details.
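
For readers following along, a minimal sketch of what the session-centric entry point looks like with the SparkSession API; the app name, master, and query are illustrative.

{code}
import org.apache.spark.sql.SparkSession

// SparkSession as the single entry point replacing SQLContext/HiveContext.
val spark = SparkSession.builder()
  .appName("EntryPointSketch")
  .master("local[*]")
  .getOrCreate()

// The merged Dataset/DataFrame API and the catalog hang off the session.
val ids = spark.range(10).toDF("id")
ids.createOrReplaceTempView("ids")
spark.sql("SELECT count(*) AS n FROM ids").show()
spark.catalog.listTables().show()   // first-class catalog access
{code}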



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0

2016-04-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13485:

Priority: Blocker  (was: Major)

> (Dataset-oriented) API evolution in Spark 2.0
> -
>
> Key: SPARK-13485
> URL: https://issues.apache.org/jira/browse/SPARK-13485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: API Evolution in Spark 2.0.pdf
>
>
> As part of Spark 2.0, we want to create a stable API foundation for Dataset 
> to become the main user-facing API in Spark. This ticket tracks various tasks 
> related to that.
> The main high level changes are:
> 1. Merge Dataset/DataFrame
> 2. Create a more natural entry point for Dataset (SQLContext/HiveContext are 
> not ideal because of the name "SQL"/"Hive", and "SparkContext" is not ideal 
> because of its heavy dependency on RDDs)
> 3. First class support for sessions
> 4. First class support for some system catalog
> See the design doc for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15031) Fix SQL python example

2016-04-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15031:


Assignee: (was: Apache Spark)

> Fix SQL python example
> --
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, the Python SQL example, `sql.py`, fails.
> {code}
> bin/spark-submit examples/src/main/python/sql.py
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 60, in <module>
> people = sqlContext.jsonFile(path)
> AttributeError: 'SQLContext' object has no attribute 'jsonFile'
> {code}
> {code}
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 72, in <module>
> people.registerAsTable("people")
>   File 
> "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 795, in __getattr__
> AttributeError: 'DataFrame' object has no attribute 'registerAsTable'
> {code}
> This issue fixes them with the following change.
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> ...
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15031) Fix SQL python example

2016-04-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265218#comment-15265218
 ] 

Apache Spark commented on SPARK-15031:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12809

> Fix SQL python example
> --
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, the Python SQL example, `sql.py`, fails.
> {code}
> bin/spark-submit examples/src/main/python/sql.py
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 60, in <module>
> people = sqlContext.jsonFile(path)
> AttributeError: 'SQLContext' object has no attribute 'jsonFile'
> {code}
> {code}
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", 
> line 72, in <module>
> people.registerAsTable("people")
>   File 
> "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 795, in __getattr__
> AttributeError: 'DataFrame' object has no attribute 'registerAsTable'
> {code}
> This issue fixes them with the following change.
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> ...
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


