[jira] [Commented] (SPARK-10883) Be able to build each module individually
[ https://issues.apache.org/jira/browse/SPARK-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940800#comment-14940800 ] Jean-Baptiste Onofré commented on SPARK-10883: -- Fair enough. Thanks for the update, Marcelo. What do you think if I update the README.md with a quick note about this? > Be able to build each module individually > - > > Key: SPARK-10883 > URL: https://issues.apache.org/jira/browse/SPARK-10883 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Jean-Baptiste Onofré > > Right now, due to the location of scalastyle-config.xml, it's > not possible to build an individual module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE
[ https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940778#comment-14940778 ] Simeon Simeonov commented on SPARK-9761: [~yhuai] What about this one? The problem survives a restart, so it doesn't seem to be caused by lack of refreshing. > Inconsistent metadata handling with ALTER TABLE > --- > > Key: SPARK-9761 > URL: https://issues.apache.org/jira/browse/SPARK-9761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > Labels: hive, sql > > Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. > The table in question was created with {{HiveContext.read.json()}}. > Steps: > # {{alter table dimension_components add columns (z string);}} succeeds. > # {{describe dimension_components;}} does not show the new column, even after > restarting spark-sql. > # A second {{alter table dimension_components add columns (z string);}} fails > with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: > Duplicate column name: z > Full spark-sql output > [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde].
[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column
[ https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940776#comment-14940776 ] Simeon Simeonov commented on SPARK-9762: [~yhuai] the Hive compatibility section of the documentation should be updated to identify these cases. It is unfortunate to trust the docs only to discover a known lack of compatibility that was not documented. > ALTER TABLE cannot find column > -- > > Key: SPARK-9762 > URL: https://issues.apache.org/jira/browse/SPARK-9762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} > lists. > In the case of a table generated with {{HiveContext.read.json()}}, the output > of {{DESCRIBE dimension_components}} is: > {code} > comp_config > struct > comp_criteria string > comp_data_model string > comp_dimensions > struct,template:string,variation:bigint> > comp_disabled boolean > comp_id bigint > comp_path string > comp_placementDatastruct > comp_slot_types array > {code} > However, {{alter table dimension_components change comp_dimensions > comp_dimensions > struct,template:string,variation:bigint,z:string>;}} > fails with: > {code} > 15/08/08 23:13:07 ERROR exec.DDLTask: > org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference > comp_dimensions > at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584) > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312) > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) > at > org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503) > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911) > at 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:901) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155) > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326) > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473) > ... > {code} > Meanwhile, {{SHOW COLUMNS in dimension_components}} lists two columns: > {{col}} (which does not exist in the table) and {{z}}, which was just added. > This suggests that DDL operations in Spark SQL use table metadata > inconsistently. > Full spark-sql output > [here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].
[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column
[ https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940763#comment-14940763 ] Yin Huai commented on SPARK-9762: - [~simeons] Different versions of Hive have different internal restrictions, so it is not always possible to store the metadata in a Hive-compatible way. For example, if the metastore uses Hive 0.13, Hive will reject the create-table call for a Parquet table whose columns include a binary or a decimal column. So, to save the table's metadata at all, we have to work around this and save it in a way that is not compatible with Hive. The reason you see two different outputs for DESCRIBE TABLE and SHOW COLUMNS is that Spark SQL has implemented DESCRIBE TABLE natively, but we still delegate the SHOW COLUMNS command to Hive. Because the metadata is not Hive-compatible, the SHOW COLUMNS command gives you a different output. We have been gradually adding native support for more kinds of commands. If there are any specific commands that are important to your use cases, please feel free to create JIRAs. > ALTER TABLE cannot find column > -- > > Key: SPARK-9762 > URL: https://issues.apache.org/jira/browse/SPARK-9762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} > lists.
[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column
[ https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940757#comment-14940757 ] Simeon Simeonov commented on SPARK-9762: [~yhuai] Refreshing is not the issue here. The issue is that {{DESCRIBE tbl}} and {{SHOW COLUMNS tbl}} show different columns for a table even without altering it, which suggests that Spark SQL is not managing table metadata correctly. > ALTER TABLE cannot find column > -- > > Key: SPARK-9762 > URL: https://issues.apache.org/jira/browse/SPARK-9762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} > lists.
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940731#comment-14940731 ] Joseph K. Bradley commented on SPARK-5874: -- It was delayed because it took longer than expected to finalize the rest of the API. However, it's scheduled for 1.6 now, and at least partial coverage should be complete for 1.6. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedback about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940702#comment-14940702 ] Yongjia Wang commented on SPARK-5874: - The ability to force save/load of all pipeline components is very important. The design doc says this would be done in 1.4 for the new Transformer/Estimator framework under the .ml package. We are at 1.5.0 right now and nothing has happened on that path. I wonder if there were major conceptual changes or just a workload/resource issue. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940694#comment-14940694 ] Felix Cheung commented on SPARK-10903: -- toDF already does these checks in preference order. Btw, which functions do we want to automatically find the sqlContext?
{code}
setMethod("toDF", signature(x = "RDD"),
          function(x, ...) {
            sqlContext <- if (exists(".sparkRHivesc", envir = .sparkREnv)) {
              get(".sparkRHivesc", envir = .sparkREnv)
            } else if (exists(".sparkRSQLsc", envir = .sparkREnv)) {
              get(".sparkRSQLsc", envir = .sparkREnv)
            } else {
              stop("no SQL context available")
            }
            createDataFrame(sqlContext, x, ...)
          })
{code}
> Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
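The preference-order lookup in toDF can be modeled generically. Below is a hypothetical Python sketch of the same fallback pattern (not SparkR code; the dictionary and names are illustrative stand-ins for `.sparkREnv` and its keys):

```python
# Illustrative sketch of SparkR's toDF fallback pattern: prefer a
# Hive-backed context, fall back to a plain SQL context, and fail
# loudly if neither has been initialized.

_spark_env = {}  # stands in for SparkR's .sparkREnv environment

def default_sql_context(env=_spark_env):
    for key in (".sparkRHivesc", ".sparkRSQLsc"):  # preference order
        if key in env:
            return env[key]
    raise RuntimeError("no SQL context available")

_spark_env[".sparkRSQLsc"] = "sql-context"
assert default_sql_context() == "sql-context"
_spark_env[".sparkRHivesc"] = "hive-context"
assert default_sql_context() == "hive-context"  # Hive context wins
```

Any function that today takes an explicit sqlContext argument could default to such a lookup, which is presumably what making the context "global" would amount to.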
[jira] [Resolved] (SPARK-9867) Move utilities for binary data into ByteArray
[ https://issues.apache.org/jira/browse/SPARK-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9867. Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 1.6.0 > Move utilities for binary data into ByteArray > - > > Key: SPARK-9867 > URL: https://issues.apache.org/jira/browse/SPARK-9867 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 1.6.0 > > > Utilities for binary data, such as Substring#substringBinarySQL and > BinaryPrefixComparator#computePrefix, are put together in > ByteArray for readability.
[jira] [Assigned] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10904: Assignee: Apache Spark > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang >Assignee: Apache Spark > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Assigned] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10904: Assignee: (was: Apache Spark) > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940691#comment-14940691 ] Apache Spark commented on SPARK-10904: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/8961 > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang
[jira] [Commented] (SPARK-7135) Expression for monotonically increasing IDs
[ https://issues.apache.org/jira/browse/SPARK-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940688#comment-14940688 ] Reynold Xin commented on SPARK-7135: Can you explain your use case a bit more? > Expression for monotonically increasing IDs > --- > > Key: SPARK-7135 > URL: https://issues.apache.org/jira/browse/SPARK-7135 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: dataframe > Fix For: 1.4.0 > > > Seems like a common use case that users might want a unique ID for each row. > It is more expensive to have consecutive IDs, since that'd require two passes > over the data. However, many use cases can be satisfied by just having unique > ids.
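Why unique-but-not-consecutive IDs avoid the second pass can be sketched as follows. This is a hypothetical Python model of the kind of scheme the expression uses (partition ID in the upper bits, per-partition record number in the lower bits); the exact bit layout here is an illustrative assumption, not the Spark source:

```python
# Illustrative sketch: generate IDs that are unique and increasing
# across partitions without a global pass over the data.
# Assumption: partition IDs fit in 31 bits, per-partition counts in 33 bits.

def monotonic_ids(partitions):
    """partitions: list of lists of rows; yields (row, id) pairs."""
    for pid, rows in enumerate(partitions):
        for offset, row in enumerate(rows):
            # IDs within a partition are consecutive; across partitions
            # they jump, but they remain unique and increasing.
            yield row, (pid << 33) | offset

parts = [["a", "b"], ["c"], ["d", "e"]]
ids = [i for _, i in monotonic_ids(parts)]
assert ids == sorted(ids) and len(set(ids)) == len(ids)
```

Each partition can compute its IDs independently; consecutive IDs would instead require first counting every partition's rows, hence the extra pass.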
[jira] [Commented] (SPARK-7275) Make LogicalRelation public
[ https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940687#comment-14940687 ] Reynold Xin commented on SPARK-7275: Sure we can. Do you want to submit a pull request? > Make LogicalRelation public > --- > > Key: SPARK-7275 > URL: https://issues.apache.org/jira/browse/SPARK-7275 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Santiago M. Mola >Priority: Minor > > It seems LogicalRelation is the only part of the LogicalPlan that is not > public. This makes it harder to work with full logical plans from third party > packages.
[jira] [Created] (SPARK-10908) ClassCastException in HadoopRDD.getJobConf
Naden Franciscus created SPARK-10908: Summary: ClassCastException in HadoopRDD.getJobConf Key: SPARK-10908 URL: https://issues.apache.org/jira/browse/SPARK-10908 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2 Reporter: Naden Franciscus Whilst running a Spark SQL job (I can't provide an explain plan, as many of these are happening concurrently), the following exception is thrown: java.lang.ClassCastException: [B cannot be cast to org.apache.spark.util.SerializableConfiguration at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:78)
[jira] [Commented] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940678#comment-14940678 ] Apache Spark commented on SPARK-10906: -- User 'rahulpalamuttam' has created a pull request for this issue: https://github.com/apache/spark/pull/8960 > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient.
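The proposed optimization can be sketched in a few lines. This is a hypothetical Python model over CSC-style arrays, not the actual MLlib or Breeze implementation, and it assumes both matrices are in canonical form (sorted indices, no explicitly stored zeros):

```python
# Illustrative sketch: compare two sparse matrices by their compressed
# representations instead of iterating over every cell, so cost is
# O(nnz) rather than O(rows * cols) including implicit zeros.
# Assumes canonical CSC form: sorted indices, no explicitly stored zeros.

def sparse_equals(a, b):
    """a, b: dicts with keys shape, col_ptrs, row_indices, values."""
    return (a["shape"] == b["shape"]
            and a["col_ptrs"] == b["col_ptrs"]
            and a["row_indices"] == b["row_indices"]
            and a["values"] == b["values"])

m = {"shape": (3, 2), "col_ptrs": [0, 1, 2],
     "row_indices": [0, 2], "values": [1.0, 3.0]}
assert sparse_equals(m, dict(m))
```

A real implementation would also need to handle non-canonical inputs (e.g. explicit zeros, or comparison against a dense matrix), which is where the equality semantics get subtle.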
[jira] [Assigned] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10906: Assignee: (was: Apache Spark) > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor
[jira] [Assigned] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10906: Assignee: Apache Spark > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor
[jira] [Commented] (SPARK-10505) windowed form of count ( star ) fails with No handler for udf class
[ https://issues.apache.org/jira/browse/SPARK-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940675#comment-14940675 ] Xin Wu commented on SPARK-10505: This error is triggered in HiveFunctionRegistry.lookupFunction() in org.apache.spark.sql.hive.hiveUDFs.scala. The logic falls through to this line: sys.error(s"No handler for udf ${functionInfo.getFunctionClass}"). The reason is that the Hive class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount, although it is an aggregate function class, does not extend org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver the way other aggregate function classes such as GenericUDAFAverage do. The Spark code in HiveFunctionRegistry.lookupFunction() checks whether AbstractGenericUDAFResolver is assignable from the function class, which GenericUDAFCount obviously does not satisfy, so the logic never reaches HiveUDAFFunction(new HiveFunctionWrapper(functionClassName), children) as it would for GenericUDAFAverage. Furthermore, the interface org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2 is implemented by all the aggregate function classes, including GenericUDAFCount. So I am wondering whether the solution may be to replace AbstractGenericUDAFResolver with GenericUDAFResolver2 in the condition else if (classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)). This assumes that GenericUDAFCount is supposed to handle "count(*) over (partition by c1)". Spark/Hive experts, any comments? > windowed form of count ( star ) fails with No handler for udf class > --- > > Key: SPARK-10505 > URL: https://issues.apache.org/jira/browse/SPARK-10505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell > > The following statement will parse/execute in Hive 0.13 but fails in SPARK.
> {code} > -- create a simple ORC table in Hive > create table if not exists TOLAP (RNUM int , C1 string, C2 string, C3 int, > C4 int) TERMINATED BY '\n' > STORED AS orc ; > select rnum, c1, c2, c3, count(*) over(partition by c1) from tolap > Error: java.lang.RuntimeException: No handler for udf class > org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
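The type-hierarchy mismatch described in the comment can be sketched with minimal stand-in classes (hypothetical stand-ins for illustration; the real classes live in org.apache.hadoop.hive.ql.udf.generic):

```scala
// Stand-ins mirroring the Hive hierarchy described above (hypothetical).
trait GenericUDAFResolver2                                     // implemented by all aggregate function classes
abstract class AbstractGenericUDAFResolver extends GenericUDAFResolver2
class GenericUDAFAverage extends AbstractGenericUDAFResolver   // extends the abstract base
class GenericUDAFCount extends GenericUDAFResolver2            // implements only the interface

object LookupSketch {
  // The current check in HiveFunctionRegistry.lookupFunction():
  def handledByCurrentCheck(c: Class[_]): Boolean =
    classOf[AbstractGenericUDAFResolver].isAssignableFrom(c)

  // The proposed check, which would also match GenericUDAFCount:
  def handledByProposedCheck(c: Class[_]): Boolean =
    classOf[GenericUDAFResolver2].isAssignableFrom(c)
}
```

With these stand-ins, `handledByCurrentCheck` accepts GenericUDAFAverage but rejects GenericUDAFCount (hence the "No handler for udf" error), while `handledByProposedCheck` accepts both.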
[jira] [Comment Edited] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940670#comment-14940670 ] Weide Zhang edited comment on SPARK-5575 at 10/2/15 1:17 AM: - Hi Alexander, The features I am looking to add/have include: 1. more activation functions, such as ReLU, LeakyReLU, and max pooling 2. support for a simultaneous testing and training phase, similar to what Caffe does 3. scalability changes (including support for larger models and a parameter server; this is long term). So far I haven't made any of these changes yet. If other people have already made such changes to current Spark, I will be happy to take those as well. was (Author: weidezhang): Hi Alexander, The features I am looking to add include : 1. more activation function such as ReLU, LeakyReLU, max pooling 2. support simultaneous testing and training phase similar to what caffe does 3. scalability change (including support larger model, parameter server, this is long term) > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. 
[jira] [Created] (SPARK-10907) Get rid of pending unroll memory
Andrew Or created SPARK-10907: - Summary: Get rid of pending unroll memory Key: SPARK-10907 URL: https://issues.apache.org/jira/browse/SPARK-10907 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.4.0 Reporter: Andrew Or It's incredibly complicated to have both unroll memory and pending unroll memory in MemoryStore.scala. We can probably express it with only unroll memory through some minor refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940670#comment-14940670 ] Weide Zhang commented on SPARK-5575: Hi Alexander, The features I am looking to add include: 1. more activation functions, such as ReLU, LeakyReLU, and max pooling 2. support for a simultaneous testing and training phase, similar to what Caffe does 3. scalability changes (including support for larger models and a parameter server; this is long term) > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc.
[jira] [Commented] (SPARK-9158) PyLint should only fail on error
[ https://issues.apache.org/jira/browse/SPARK-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940664#comment-14940664 ] Alan Chin commented on SPARK-9158: -- I'd like the opportunity to work on this. > PyLint should only fail on error > > > Key: SPARK-9158 > URL: https://issues.apache.org/jira/browse/SPARK-9158 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Davies Liu >Priority: Critical > > It's boring to fight with warning from Pylint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940640#comment-14940640 ] Rahul Palamuttam edited comment on SPARK-10906 at 10/2/15 12:32 AM: Hi, Can I tackle this? I have been working on a patch and will create a PR shortly. was (Author: rahul palamuttam): Hi, Can I tackle this? I have been working on a patch and will create a PR shortly. - Rahul P > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940640#comment-14940640 ] Rahul Palamuttam commented on SPARK-10906: -- Hi, Can I tackle this? I have been working on a patch and will create a pull request shortly. - Rahul P > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient.
[jira] [Comment Edited] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940640#comment-14940640 ] Rahul Palamuttam edited comment on SPARK-10906 at 10/2/15 12:31 AM: Hi, Can I tackle this? I have been working on a patch and will create a PR shortly. - Rahul P was (Author: rahul palamuttam): Hi, Can I tackle this? I have been working on a patch and will create a pull RQ shortly. - Rahul P > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
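The inefficiency described in the report — Breeze iterating over every position, implicit zeros included — can be avoided by comparing only stored entries. A sketch with a toy coordinate-map representation (hypothetical; MLlib's SparseMatrix actually stores CSC arrays, not a Map):

```scala
// Toy sparse matrix for illustration only; MLlib's SparseMatrix uses
// CSC storage (colPtrs/rowIndices/values) rather than a Map.
final case class ToySparse(numRows: Int, numCols: Int,
                           entries: Map[(Int, Int), Double]) {
  // Drop explicitly stored zeros so matrices with different storage
  // patterns but identical values still compare equal.
  private def active: Map[(Int, Int), Double] = entries.filter(_._2 != 0.0)

  // O(nnz) comparison over stored entries only, instead of iterating
  // over all numRows * numCols positions the way Breeze's equals does.
  def sameAs(other: ToySparse): Boolean =
    numRows == other.numRows && numCols == other.numCols &&
      active == other.active
}
```

For CSC storage the same idea would walk the two active iterators in lockstep, skipping explicit zeros on either side.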
[jira] [Resolved] (SPARK-10400) Rename or deprecate SQL option "spark.sql.parquet.followParquetFormatSpec"
[ https://issues.apache.org/jira/browse/SPARK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-10400. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8566 [https://github.com/apache/spark/pull/8566] > Rename or deprecate SQL option "spark.sql.parquet.followParquetFormatSpec" > -- > > Key: SPARK-10400 > URL: https://issues.apache.org/jira/browse/SPARK-10400 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.6.0 > > > We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while > working on implementing Parquet backwards-compatibility rules in SPARK-6777. > It indicates whether we should use legacy Parquet format adopted by Spark 1.4 > and prior versions or the standard format defined in parquet-format spec. > However, the name of this option is somewhat confusing, because it's not > super intuitive why we shouldn't follow the spec. Would be nice to rename it > to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they > have opposite meanings). Note that this option is not "public" ({{isPublic}} > is false). > At the moment of writing, 1.5 RC3 has already been cut. If we can't make this > one into 1.5, we can deprecate the old option with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
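Assuming the rename lands as proposed, writers would opt into the legacy layout explicitly; a hypothetical usage sketch of the new option name (semantics are the inverse of "spark.sql.parquet.followParquetFormatSpec"):

```scala
// Hypothetical: setting the renamed option on a SQLContext.
// true  -> write Spark 1.4-style legacy Parquet files
// false -> follow the parquet-format spec (the inverted default)
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
```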
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940535#comment-14940535 ] Naden Franciscus commented on SPARK-10474: -- [~yhuai] Standalone > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) 
AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940527#comment-14940527 ] Yin Huai commented on SPARK-10474: -- [~nadenf] Are you running Spark on Mesos or YARN? Or are you using the standalone mode? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) 
AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940524#comment-14940524 ] Davies Liu commented on SPARK-10342: This will be used internally for SQL. For example, aggregation and sort-merge join both acquire large pages to do in-memory aggregation or sorting; one can use most of the memory, and then the other can't get enough memory to work. Currently, each operator reserves a page to make sure it can start (it may have to work with only that one page). A better solution could be: when one operator (for example, aggregation) needs more memory, other operators could be notified to release some memory by spilling. This would improve memory utilization (no need to reserve a page anymore) and avoid OOM. > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > > We have had memory starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory usage (reduce unnecessary spilling) and also > reduce the risk of OOM, we should manage memory in a cooperative way: all > memory consumers should also be responsive to others' requests to release > memory (by spilling). > The requests for memory can differ: hard requirements (will crash if not > allocated) or soft requirements (worse performance if not allocated). The > costs of spilling also differ. We could introduce some kind of priority to > make them work together better.
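The cooperative scheme sketched in the comment — consumers that can be asked to release memory when another operator needs it — might look like this toy pool (a hypothetical API for illustration, not Spark's actual memory manager):

```scala
// Toy cooperative memory pool (hypothetical; not Spark's MemoryManager).
trait MemoryConsumer {
  def used: Long               // bytes currently held by this consumer
  def spill(need: Long): Long  // release up to `need` bytes; returns bytes freed
}

final class CooperativePool(capacity: Long) {
  private var available = capacity
  private var consumers = List.empty[MemoryConsumer]

  def register(c: MemoryConsumer): Unit = consumers ::= c

  // Try to satisfy a request; if the pool is short, ask the other
  // consumers (largest holders first) to spill, instead of having each
  // operator reserve a page up front.
  def acquire(requester: MemoryConsumer, bytes: Long): Boolean = {
    if (available < bytes) {
      for (c <- consumers.filter(_ ne requester).sortBy(-_.used)
           if available < bytes) {
        available += c.spill(bytes - available)
      }
    }
    if (available >= bytes) { available -= bytes; true } else false
  }
}

// A trivial consumer for illustration: tracks held bytes, spills on demand.
final class Buffer(var held: Long) extends MemoryConsumer {
  def used: Long = held
  def spill(need: Long): Long = {
    val freed = math.min(need, held)
    held -= freed
    freed
  }
}
```

In this sketch, when an aggregation asks for more memory than is free, the pool notifies the sort-merge join (or whichever consumer holds the most) to spill just enough to cover the shortfall, which is the utilization improvement the comment describes.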
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940508#comment-14940508 ] Davies Liu commented on SPARK-10903: LGTM. Another question: can we have different SQLContexts at the same time? One HiveContext and one SQLContext. > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940505#comment-14940505 ] Naden Franciscus commented on SPARK-10474: -- Can confirm also getting this issue now. There must be something common to both though right. An acquire should never fail unless the OS is out of memory right ? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table.
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940494#comment-14940494 ] Naden Franciscus commented on SPARK-10309: -- It has been difficult to get a clean stack trace/explain output because we are executing lots of SQL commands in parallel and don't know which one is failing. We are definitely doing lots of joins/aggregations/sorts. I have tried increasing shuffle.memoryFraction to 0.8, but that didn't help. This is still an issue with the latest Spark 1.5.2 branch. > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finish after a retry.
[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940487#comment-14940487 ] Naden Franciscus edited comment on SPARK-10474 at 10/1/15 9:58 PM: --- I can't provide the explain plan since we are executing 1000s of SQL statement and hard to tell which is which. Have increased heap to 50GB + shuffle.memoryFraction to 0.6 and 0.8. No change. Will file this in another ticket. was (Author: nadenf): I can't provide the explain plan since we are executing 1000s of SQL statement and hard to tell which is which. Have increased heap to 50GB + shuffle.memoryFraction to 0.6 and 0.8. No change. @Andrew: is there is a ticket for this ? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. 
> {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940487#comment-14940487 ] Naden Franciscus commented on SPARK-10474: -- I can't provide the explain plan since we are executing 1000s of SQL statement and hard to tell which is which. Have increased heap to 50GB + shuffle.memoryFraction to 0.6 and 0.8. No change. @Andrew: is there is a ticket for this ? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940481#comment-14940481 ] Joseph K. Bradley commented on SPARK-10780: --- Sure, please do! > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940463#comment-14940463 ] Jayant Shekhar commented on SPARK-10780: Hi [~josephkb], can I work on this? > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version.
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940446#comment-14940446 ] Alexander Ulanov commented on SPARK-5575: - Hi Weide, sounds good! What kind of feature are you planning to add? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed-forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > Boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc.
[jira] [Created] (SPARK-10906) More efficient SparseMatrix.equals
Joseph K. Bradley created SPARK-10906: - Summary: More efficient SparseMatrix.equals Key: SPARK-10906 URL: https://issues.apache.org/jira/browse/SPARK-10906 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals method. However, it looks like Breeze's equals is inefficient: [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] Breeze iterates over all values, including implicit zeros. We could make this more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
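The improvement the ticket asks for can be sketched outside of Spark: implicit and explicit zeros compare as equal, so after normalizing away explicit zeros only the stored non-zero entries need to be checked — O(nnz) rather than O(rows × cols). A minimal Python illustration (the real SparseMatrix is Scala and stores CSC arrays; the dict-of-entries representation below is purely for exposition):

```python
# Sketch (not Spark's API): sparse-matrix equality that touches only the
# explicit entries, instead of iterating over every (row, col) cell the way
# Breeze's Matrix.equals does.

def sparse_equals(a, b):
    """a, b: (num_rows, num_cols, entries) where entries maps (i, j) -> value.

    Implicit zeros are equal by definition, so after dropping explicitly
    stored zeros we only need to compare the remaining non-zero entries.
    """
    (ra, ca, ea), (rb, cb, eb) = a, b
    if (ra, ca) != (rb, cb):          # dimensions must match first
        return False
    nz_a = {k: v for k, v in ea.items() if v != 0}
    nz_b = {k: v for k, v in eb.items() if v != 0}
    return nz_a == nz_b

m1 = (3, 3, {(0, 0): 1.0, (2, 1): 5.0})
m2 = (3, 3, {(0, 0): 1.0, (2, 1): 5.0, (1, 1): 0.0})  # has an explicit zero
m3 = (3, 3, {(0, 0): 1.0})

print(sparse_equals(m1, m2))  # True: explicit zero equals implicit zero
print(sparse_equals(m1, m3))  # False
```

The same idea in Scala would walk the two CSC index/value arrays in lockstep, skipping stored zeros on either side.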
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940444#comment-14940444 ] Weiqiang Zhuang commented on SPARK-10904: - That works because it invokes select with list(). > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10894) Add 'drop' support for DataFrame's subset function
[ https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940442#comment-14940442 ] Shivaram Venkataraman commented on SPARK-10894: --- Yes, as [~felixcheung] said this is by design. The main reason is that we use `df$Age` as an easy handle or a reference to a column in a distributed data frame that can be passed to other functions without using strings ("Age"). The `df$A` also auto completes and is easy to use. The square brackets API is meant to provide some basic compatibility with R (e.g. df[, df$Age] or df[, "Age"]). However my opinion is that the overall DataFrame API is targeted to work more like dplyr and I don't think supporting all aspects of R data.frames is a design goal. > Add 'drop' support for DataFrame's subset function > -- > > Key: SPARK-10894 > URL: https://issues.apache.org/jira/browse/SPARK-10894 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Weiqiang Zhuang > > SparkR DataFrame can be subset to get one or more columns of the dataset. The > current '[' implementation does not support 'drop' when is asked for just one > column. This is not consistent with the R syntax: > x[i, j, ... , drop = TRUE] > # in R, when drop is FALSE, remain as data.frame > > class(iris[, "Sepal.Width", drop=F]) > [1] "data.frame" > # when drop is TRUE (default), drop to be a vector > > class(iris[, "Sepal.Width", drop=T]) > [1] "numeric" > > class(iris[,"Sepal.Width"]) > [1] "numeric" > > df <- createDataFrame(sqlContext, iris) > # in SparkR, 'drop' argument has no impact > > class(df[,"Sepal_Width", drop=F]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > # should have dropped to be a Column class instead > > class(df[,"Sepal_Width", drop=T]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > > class(df[,"Sepal_Width"]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > We should add the 'drop' support. 
[jira] [Updated] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmytro Bielievtsov updated SPARK-10872: --- Description: Starting from Spark 1.4.0 (works well on 1.3.1), the following code fails with "XSDB6: Another instance of Derby may have already booted the database ~/metastore_db":
{code:python}
from pyspark import SparkContext, HiveContext
sc = SparkContext("local[*]", "app1")
sql = HiveContext(sc)
sql.createDataFrame([[1]]).collect()
sc.stop()
sc = SparkContext("local[*]", "app2")
sql = HiveContext(sc)
sql.createDataFrame([[1]]).collect()  # Py4J error
{code}
This is related to [#SPARK-9539], and I intend to restart the Spark context several times for isolated jobs to prevent cache cluttering and GC errors. Here's a larger part of the full error trace:
{noformat}
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details. org.datanucleus.exceptions.NucleusDataStoreException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) at org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) at 
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199) at org.apache.hadoop.hive.ql.metad
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940432#comment-14940432 ] Shivaram Venkataraman commented on SPARK-10903: --- Yeah, this sounds like a good idea, as we probably don't want to support multiple SQL contexts inside the same R session. cc [~davies] [~falaki] to see if they have any scenarios where this might be a problem. > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
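The proposal boils down to a process-global default context that API functions fall back to when no context is passed explicitly. A sketch of that pattern, shown in Python rather than R (all names here are hypothetical; SparkR would implement this in R against its JVM backend):

```python
# Sketch of a session-global default context with an explicit override --
# the pattern SPARK-10903 proposes for SparkR. Not Spark's actual API.

_default_sql_context = None

def init_sql_context(ctx):
    """Register ctx as the session-wide default (one per R/Python session)."""
    global _default_sql_context
    _default_sql_context = ctx

def create_data_frame(data, sql_context=None):
    """Use the explicit context if given, else fall back to the global one."""
    ctx = sql_context if sql_context is not None else _default_sql_context
    if ctx is None:
        raise RuntimeError("no SQLContext registered; call init_sql_context()")
    return (ctx, data)   # stand-in for the real local-to-distributed conversion

init_sql_context("sqlContext")
print(create_data_frame([1, 2, 3]))   # ('sqlContext', [1, 2, 3])
```

With this shape, `createDataFrame(iris)` just works after the context is initialized once, while code that genuinely needs a different context can still pass one.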
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940429#comment-14940429 ] Felix Cheung commented on SPARK-10904: -- `head(df[,c("Sepal_Width", "Sepal_Length")])` this works, I guess that's why I'm surprised. I think I know how to fix this. I will take this. > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940416#comment-14940416 ] Felix Cheung commented on SPARK-10904: -- I added that line of comment about "df$age"; I could clarify that. I think we should make `select(df, c("col1", "col2"))` work. > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Commented] (SPARK-1762) Add functionality to pin RDDs in cache
[ https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940395#comment-14940395 ] FangzhouXing commented on SPARK-1762: - What is the current eviction policy? Instead of pinning, what if we just make the eviction policy smarter? (From a quick look, it seems like the current policy is FIFO.) We want developers to have to think less, not more, about how much memory the system has. > Add functionality to pin RDDs in cache > -- > > Key: SPARK-1762 > URL: https://issues.apache.org/jira/browse/SPARK-1762 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > Right now, all RDDs are created equal, and there is no mechanism to identify > a certain RDD to be more important than the rest. This is a problem if the > RDD fraction is small, because just caching a few RDDs can evict more > important ones. > A side effect of this feature is that we can now more safely allocate a > smaller spark.storage.memoryFraction if we know how large our important RDDs > are, without having to worry about them being evicted. This allows us to use > more memory for shuffles, for instance, and avoid disk spills. 
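For illustration, here is a toy cache showing how pinning would interact with a smarter (LRU) eviction policy: eviction walks entries from least to most recently used but skips anything pinned. This is a sketch of the proposal only, not Spark's actual BlockManager:

```python
from collections import OrderedDict

class PinnableCache:
    """Toy LRU cache with pinning -- an illustration of SPARK-1762's idea,
    not Spark's BlockManager implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> value, ordered least- to most-recently used
        self.pinned = set()

    def put(self, key, value, pin=False):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if pin:
            self.pinned.add(key)
        # Evict least-recently-used *unpinned* entries until we fit.
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:         # everything left is pinned
                raise MemoryError("cache full of pinned entries")
            del self.entries[victim]

    def get(self, key):
        self.entries.move_to_end(key)  # refresh recency on access
        return self.entries[key]

cache = PinnableCache(capacity=2)
cache.put("important_rdd", "blocks", pin=True)
cache.put("rdd_a", "blocks")
cache.put("rdd_b", "blocks")           # evicts rdd_a, never the pinned RDD
print(list(cache.entries))             # ['important_rdd', 'rdd_b']
```

Note the comment's point still stands either way: whether by pinning or by a recency-based policy, the goal is that caching a few extra RDDs cannot silently evict the ones a job depends on.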
[jira] [Commented] (SPARK-10821) RandomForest serialization OOM during findBestSplits
[ https://issues.apache.org/jira/browse/SPARK-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940370#comment-14940370 ] Jay Luan commented on SPARK-10821: -- Thank you for the insight, do you know what the status of the new implementation of decision tree is or a possible ETA for when it will be ready? Maybe I can help with either testing or contributing to the code. > RandomForest serialization OOM during findBestSplits > > > Key: SPARK-10821 > URL: https://issues.apache.org/jira/browse/SPARK-10821 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.0, 1.5.0 > Environment: Amazon EC2 Linux >Reporter: Jay Luan > Labels: OOM, out-of-memory > > I am getting OOM during serialization for a relatively small dataset for a > RandomForest. Even with spark.serializer.objectStreamReset at 1, It is still > running out of memory when attempting to serialize my data. > Stack Trace: > Traceback (most recent call last): > File "/root/random_forest/random_forest_spark.py", line 198, in > main() > File "/root/random_forest/random_forest_spark.py", line 166, in main > trainModel(dset) > File "/root/random_forest/random_forest_spark.py", line 191, in trainModel > impurity='gini', maxDepth=4, maxBins=32) > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 352, > in trainClassifier > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, > in _train > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line > 130, in callMLlibFunc > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line > 123, in callJavaFunc > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line > 300, in get_return_value > py4j.protocol.Py4JJavaError15/09/25 00:44:41 DEBUG BlockManagerSlaveEndpoint: > Done removing RDD 7, response is 0 > 15/09/25 00:44:41 
DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to > AkkaRpcEndpointRef(Actor[akka://sparkDriver/temp/$Mj]) > : An error occurred while calling o89.trainRandomForestModel. > : java.lang.OutOfMemoryError > at > java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) > at > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2021) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702) > at > org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625) > at > 
org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235) > at > org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291) > at > org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflect
[jira] [Resolved] (SPARK-7218) Create a real iterator with open/close for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7218. Resolution: Fixed Assignee: Reynold Xin Target Version/s: 1.6.0 (was: ) Ah forgot to close this: this has been fixed already. Code in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/local/LocalNode.scala > Create a real iterator with open/close for Spark SQL > > > Key: SPARK-7218 > URL: https://issues.apache.org/jira/browse/SPARK-7218 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
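The open/next/fetch/close operator style used by LocalNode can be sketched compactly in Python (illustrative only; the real interface is Scala, and operates on InternalRow batches rather than plain values):

```python
# Sketch of the Volcano-style operator interface (open/next/fetch/close)
# that the linked LocalNode code implements. Names mirror that style but
# this is a standalone toy, not Spark code.

class SeqScanNode:
    """Leaf operator over an in-memory sequence."""
    def __init__(self, rows):
        self._rows = rows
    def open(self):
        self._it = iter(self._rows)
        self._current = None
    def next(self):                    # advance; False when exhausted
        self._current = next(self._it, None)
        return self._current is not None
    def fetch(self):                   # current row after a successful next()
        return self._current
    def close(self):                   # release resources (files, buffers, ...)
        self._it = None

class FilterNode:
    """Unary operator: forwards only rows matching the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):
        while self.child.next():
            if self.predicate(self.child.fetch()):
                return True
        return False
    def fetch(self):
        return self.child.fetch()
    def close(self):
        self.child.close()

node = FilterNode(SeqScanNode([1, 2, 3, 4]), lambda r: r % 2 == 0)
node.open()
out = []
while node.next():
    out.append(node.fetch())
node.close()
print(out)  # [2, 4]
```

The point of the explicit open/close lifecycle, versus a bare Scala Iterator, is that each operator gets a well-defined place to acquire and release resources.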
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940368#comment-14940368 ] Weiqiang Zhuang commented on SPARK-10904: - Yes, list() works. The question is whether or not c() will be supported. If not, the documentation for select should be updated. There are a couple of errors in the given example: 1) select(df, c("col1", "col2")) does not work; 2) the claim that "$" is a similar method to select is false, because df$age returns a 'Column' class while select(df, 'age') returns a 'DataFrame' class.
Examples
## Not run:
select(df, "*")
select(df, "col1", "col2")
select(df, df$name, df$age + 1)
select(df, c("col1", "col2"))
select(df, list(df$name, df$age + 1))
# Similar to R data frames, columns can also be selected using `$`
df$age
> select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Updated] (SPARK-10905) Export freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Summary: Export freqItems() for DataFrameStatFunctions in SparkR (was: Implement freqItems() for DataFrameStatFunctions in SparkR) > Export freqItems() for DataFrameStatFunctions in SparkR > --- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > > Currently only crosstab is implemented. This subtask is about adding > freqItems() API to sparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10671) Calling a UDF with insufficient number of input arguments should throw an analysis error
[ https://issues.apache.org/jira/browse/SPARK-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10671: - Assignee: Wenchen Fan (was: Yin Huai) > Calling a UDF with insufficient number of input arguments should throw an > analysis error > > > Key: SPARK-10671 > URL: https://issues.apache.org/jira/browse/SPARK-10671 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Wenchen Fan > Fix For: 1.6.0 > > > {code} > import org.apache.spark.sql.functions._ > Seq((1,2)).toDF("a", "b").select(callUDF("percentile", $"a")) > {code} > This should throw an AnalysisException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10671) Calling a UDF with insufficient number of input arguments should throw an analysis error
[ https://issues.apache.org/jira/browse/SPARK-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-10671. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8941 [https://github.com/apache/spark/pull/8941] > Calling a UDF with insufficient number of input arguments should throw an > analysis error > > > Key: SPARK-10671 > URL: https://issues.apache.org/jira/browse/SPARK-10671 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > > > {code} > import org.apache.spark.sql.functions._ > Seq((1,2)).toDF("a", "b").select(callUDF("percentile", $"a")) > {code} > This should throw an AnalysisException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
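The behavior this issue asks for — rejecting a UDF call with the wrong number of arguments at analysis time, before anything runs — can be sketched outside Spark. The following Python sketch is purely illustrative: the names `UDFRegistry`, `analyze_call`, and the registered `percentile` stub are hypothetical and are not Spark's analyzer API.

```python
# Illustrative sketch of analysis-time arity checking for registered UDFs.
# Not Spark's implementation; all names here are made up for the example.

class AnalysisException(Exception):
    """Raised when a query is structurally invalid, before execution."""

class UDFRegistry:
    def __init__(self):
        self._udfs = {}

    def register(self, name, func, arity):
        # Remember the declared arity alongside the function itself.
        self._udfs[name] = (func, arity)

    def analyze_call(self, name, args):
        # Validate the call during "analysis", returning a thunk to run later.
        func, arity = self._udfs[name]
        if len(args) != arity:
            raise AnalysisException(
                f"UDF '{name}' expects {arity} argument(s), got {len(args)}")
        return lambda: func(*args)

registry = UDFRegistry()
registry.register("percentile", lambda col, p: (col, p), arity=2)

try:
    registry.analyze_call("percentile", ["a"])   # one argument, needs two
except AnalysisException as e:
    print(e)   # prints: UDF 'percentile' expects 2 argument(s), got 1
```

The point mirrored from the issue: the failure surfaces as a structured analysis error rather than a confusing runtime exception deep inside execution.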
[jira] [Commented] (SPARK-10894) Add 'drop' support for DataFrame's subset function
[ https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940359#comment-14940359 ] Weiqiang Zhuang commented on SPARK-10894: - Is there a good reason it was designed like this? The R code we have seen always uses df[,c(1)] and df$col1 interchangeably. > Add 'drop' support for DataFrame's subset function > -- > > Key: SPARK-10894 > URL: https://issues.apache.org/jira/browse/SPARK-10894 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Weiqiang Zhuang > > SparkR DataFrame can be subset to get one or more columns of the dataset. The > current '[' implementation does not support 'drop' when asked for just one > column. This is not consistent with the R syntax: > x[i, j, ... , drop = TRUE] > # in R, when drop is FALSE, the result remains a data.frame > > class(iris[, "Sepal.Width", drop=F]) > [1] "data.frame" > # when drop is TRUE (default), it drops to a vector > > class(iris[, "Sepal.Width", drop=T]) > [1] "numeric" > > class(iris[,"Sepal.Width"]) > [1] "numeric" > > df <- createDataFrame(sqlContext, iris) > # in SparkR, the 'drop' argument has no impact > > class(df[,"Sepal_Width", drop=F]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > # should have dropped to a Column class instead > > class(df[,"Sepal_Width", drop=T]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > > class(df[,"Sepal_Width"]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > We should add the 'drop' support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
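The `drop` semantics requested for SparkR's `[` can be illustrated in plain Python: selecting a single column with `drop=True` yields the bare column (analogous to an R vector or a SparkR Column), while `drop=False` keeps the table type. The `Frame` class below is a hypothetical stand-in for a DataFrame, for illustration only.

```python
# Minimal sketch of R-style 'drop' semantics, assuming a toy Frame type.
# Not SparkR's implementation; the class and method names are made up.

class Frame:
    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def subset(self, names, drop=True):
        if isinstance(names, str):
            names = [names]
        if drop and len(names) == 1:
            # "drop" to the bare column, like iris[, "Sepal.Width"] in R
            return self.columns[names[0]]
        # otherwise stay a Frame, like drop=FALSE in R
        return Frame({n: self.columns[n] for n in names})

df = Frame({"Sepal_Width": [3.5, 3.0], "Sepal_Length": [5.1, 4.9]})

print(type(df.subset("Sepal_Width", drop=True)).__name__)   # list
print(type(df.subset("Sepal_Width", drop=False)).__name__)  # Frame
```

This mirrors the R behavior quoted in the issue: `drop=TRUE` (the default) collapses a single-column selection, `drop=FALSE` preserves the frame type.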
[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10788: -- Priority: Minor (was: Major) > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
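The subtraction identity in the description, {{stats(B,C) = stats(A,B,C) - stats(A)}}, can be sketched with plain label histograms. This is an illustrative Python sketch, not the spark.ml code; the toy samples and helper names are made up.

```python
# Collect per-category label histograms once, plus the node total, and derive
# each split's right-hand stats by subtraction instead of collecting them.
from collections import Counter

# toy data: (category, label) pairs observed at one tree node
samples = [("A", 0), ("A", 1), ("B", 0), ("B", 0), ("C", 1), ("C", 1)]

node_stats = Counter(label for _, label in samples)   # stats(A,B,C)
left_stats = {cat: Counter() for cat in ("A", "B", "C")}
for cat, label in samples:
    left_stats[cat][label] += 1                       # stats(A), stats(B), ...

def right_stats(left_side):
    """Stats of the complement, e.g. stats(B,C) = stats(A,B,C) - stats(A)."""
    collected = Counter()
    for cat in left_side:
        collected += left_stats[cat]
    return node_stats - collected

print(dict(right_stats({"A"})))   # label counts for the {B, C} side
```

Only the left-hand subsets (and the node total) need to be communicated; the right-hand side of every candidate split falls out of one subtraction, which is the halving of communicated bins the issue describes.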
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940345#comment-14940345 ] Joseph K. Bradley commented on SPARK-10788: --- Though I should say: I should probably put this as Minor priority. It's not a huge savings, and it's likely a somewhat complex change. If you have other things you're working on, I'd prioritize those instead. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940343#comment-14940343 ] Joseph K. Bradley commented on SPARK-10788: --- OK, thanks! > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940316#comment-14940316 ] Seth Hendrickson edited comment on SPARK-10788 at 10/1/15 8:04 PM: --- Yes, much clearer, thanks! I can work on this task. was (Author: sethah): Yes, much clearer. I can work on this task. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940315#comment-14940315 ] FangzhouXing commented on SPARK-10342: -- In my understanding, inactive Spark programs will receive a memory warning when memory runs low, and then a handler implemented by the programmer will be called to reduce the program's memory usage, much like what happens in an iOS app. Is this correct? Also, what's an example use case for this? > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > > We have had memory-starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory usage (reduce unnecessary spilling) and also > reduce the risk of OOM, we should manage memory in a cooperative way: every > memory consumer should also be responsive to requests from others to release > memory (by spilling). > Memory requests can differ: a hard requirement (will crash if not allocated) > or a soft requirement (worse performance if not allocated). The costs of > spilling also differ. We could introduce some kind of priority to make them > work together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
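The "cooperative" idea in the description can be sketched as a manager that, when an allocation does not fit, asks other registered consumers to spill, cheapest first. This is a hypothetical Python sketch of the concept only; the class names, the `spill_cost` priority, and the toy numbers are invented and are not Spark's memory-manager API.

```python
# Hypothetical sketch: consumers register with the manager and expose a
# spill() callback; allocation requests may force others to release memory.

class Consumer:
    def __init__(self, name, used, spill_cost):
        self.name, self.used, self.spill_cost = name, used, spill_cost

    def spill(self):
        # Release everything this consumer holds; return the amount freed.
        freed, self.used = self.used, 0
        return freed

class MemoryManager:
    def __init__(self, capacity):
        self.capacity = capacity
        self.consumers = []

    def register(self, consumer):
        self.consumers.append(consumer)

    def allocate(self, requester, amount):
        free = self.capacity - sum(c.used for c in self.consumers)
        # Ask other consumers to spill, cheapest first, until the request fits.
        for other in sorted(self.consumers, key=lambda c: c.spill_cost):
            if free >= amount:
                break
            if other is not requester and other.used:
                free += other.spill()
        if free < amount:
            return False   # a hard requirement would OOM here; caller decides
        requester.used += amount
        return True

mm = MemoryManager(capacity=100)
sort_c = Consumer("sort", used=60, spill_cost=1)
join_c = Consumer("join", used=30, spill_cost=5)
mm.register(sort_c); mm.register(join_c)
print(mm.allocate(join_c, 40))   # True: the cheap-to-spill "sort" is spilled
```

The `spill_cost` ordering stands in for the priority mentioned in the description: consumers whose spills are cheap give way before expensive ones.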
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940316#comment-14940316 ] Seth Hendrickson commented on SPARK-10788: -- Yes, much clearer. I can work on this task. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940310#comment-14940310 ] Shivaram Venkataraman commented on SPARK-10904: --- I don't think we support passing in `c` for a list of things in SparkR because the serialization, deserialization was not supported for it. Changing it to `list("col1", "col2")` should work here cc [~sunrui] Some of the recent serializer fixes may help here ? > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*
[ https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940253#comment-14940253 ] Deron Eriksson edited comment on SPARK-10286 at 10/1/15 7:42 PM: - Hi, I see @since annotations on clustering.py, recommendation.py, regression.py, and tuning.py in pyspark.ml.* but not currently on others or in pyspark.ml.param.*, so I would like to work on this one. was (Author: deron): Hi, I see @since annotations on clustering.py, recommendation.py, regression.py, and tuning.py in pyspark.ml.* but not on currently on others or in pyspark.ml.param.*, so I would like to work on this one. > Add @since annotation to pyspark.ml.param and pyspark.ml.* > -- > > Key: SPARK-10286 > URL: https://issues.apache.org/jira/browse/SPARK-10286 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
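For readers unfamiliar with the annotation being discussed, a minimal sketch of what a `@since` decorator can look like in Python is below: it records the version a function was added in and appends a `versionadded` note to the docstring. This is an assumption-laden illustration, not pyspark's exact helper; the `_since` attribute and `freq_items` stub are made up.

```python
# Minimal sketch of a "@since" annotation (illustrative, not pyspark's code).

def since(version):
    def decorator(func):
        # Append a Sphinx-style versionadded note to the docstring and
        # remember the version on the function object itself.
        func.__doc__ = (func.__doc__ or "").rstrip() + (
            f"\n\n.. versionadded:: {version}\n")
        func._since = version
        return func
    return decorator

@since("1.6.0")
def freq_items(df, cols):
    """Return frequent items for the given columns."""
    return []

print(freq_items._since)   # prints: 1.6.0
```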
[jira] [Updated] (SPARK-10905) Implement freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Shepherd: Shivaram Venkataraman (was: Sun Rui) > Implement freqItems() for DataFrameStatFunctions in SparkR > -- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > > Currently only crosstab is implemented. This subtask is about adding > freqItems() API to sparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940092#comment-14940092 ] Marcelo Vanzin edited comment on SPARK-10901 at 10/1/15 7:39 PM: - As a potential workaround, he can add his app jar (or the kryo jar) to {{spark.(executor,driver).extraClassPath}}. If he distributes the jars with the application (using {{--jars}} in cluster mode or {{spark.yarn.dist.files}} in client mode), just add the jar names without any path and they should be prepended to the app's classpath. was (Author: vanzin): As a potential workaround, he can add his app jar (or the kryo jar) to {{spark.{executor,driver}.extraClassPath}}. If he distributes the jars with the application (using {{--jars}} in cluster mode or {{spark.yarn.dist.files}} in client mode), just add the jar names without any path and they should be prepended to the app's classpath. > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that i believe works for > local files but running on yarn using distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. 
> Note that in this case the user is trying to use different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
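The two delegation orders at issue here can be illustrated without a JVM: parent-first resolution always returns the framework's copy of a class, while child-first ("user classpath first") lets a user-supplied copy, such as a different kryo version, win. The dictionaries and version strings below are invented for the example and do not describe real jar contents.

```python
# Plain-Python illustration of classloader delegation order (not Spark's
# ChildFirstURLClassLoader). Each dict stands in for a jar's class index.

framework_jar = {"com.esotericsoftware.kryo.Kryo": "framework copy"}
user_jar      = {"com.esotericsoftware.kryo.Kryo": "user copy"}

def parent_first(name):
    # Default JVM-style delegation: ask the parent (framework) before the child.
    return framework_jar.get(name) or user_jar.get(name)

def child_first(name):
    # userClassPathFirst semantics: consult the user's jars before the parent.
    return user_jar.get(name) or framework_jar.get(name)

print(parent_first("com.esotericsoftware.kryo.Kryo"))  # framework copy
print(child_first("com.esotericsoftware.kryo.Kryo"))   # user copy
```

The LinkageError quoted in the issue arises when both orders end up loading the same class name through different loaders; the sketch only shows why the resolution order determines which copy an application sees.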
[jira] [Commented] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940289#comment-14940289 ] Marcelo Vanzin commented on SPARK-10901: Yes I did. I'd have to look more closely for why `userClassPathFirst` is not working for this case, nothing pops up at the moment. > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that i believe works for > local files but running on yarn using distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940264#comment-14940264 ] Thomas Graves commented on SPARK-10901: --- [~vanzin] thanks for the suggestion. Did you work on the new user class path stuff at all? Wondering if you might have ideas why that didn't work. It looks like its loading both versions. > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that i believe works for > local files but running on yarn using distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940263#comment-14940263 ] Weiqiang Zhuang commented on SPARK-10904: - Here you go: sc <- sparkR.init() sqlContext <- sparkRSQL.init(sc) df <- createDataFrame(sqlContext, iris) select(df, c("Sepal_Width","Sepal_Length")) > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10905) Implement freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Shepherd: Sun Rui Remaining Estimate: (was: 168h) Original Estimate: (was: 168h) Description: Currently only crosstab is implemented. This subtask is about adding freqItems() API to sparkR > Implement freqItems() for DataFrameStatFunctions in SparkR > -- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > > Currently only crosstab is implemented. This subtask is about adding > freqItems() API to sparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940257#comment-14940257 ] Weide Zhang commented on SPARK-5575: Hello, I plan to add more features for Spark DNN, especially more layer functionality as well as more types of activation functions. Shall I send a pull request to https://github.com/avulanov/spark/tree/ann-interface-gemm ? Thanks > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > Boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*
[ https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940253#comment-14940253 ] Deron Eriksson commented on SPARK-10286: Hi, I see @since annotations on clustering.py, recommendation.py, regression.py, and tuning.py in pyspark.ml.* but not on currently on others or in pyspark.ml.param.*, so I would like to work on this one. > Add @since annotation to pyspark.ml.param and pyspark.ml.* > -- > > Key: SPARK-10286 > URL: https://issues.apache.org/jira/browse/SPARK-10286 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10905) Implement freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Summary: Implement freqItems() for DataFrameStatFunctions in SparkR (was: Implement freqItems for DataFrameStatFunctions in SparkR) > Implement freqItems() for DataFrameStatFunctions in SparkR > -- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > Original Estimate: 168h > Remaining Estimate: 168h > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10894) Add 'drop' support for DataFrame's subset function
[ https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940252#comment-14940252 ] Felix Cheung commented on SPARK-10894: -- To Shivaram's point, I think it is intentional that df$Sepal_Width is Column and df[, df$Sepal_Width] is a DataFrame. > Add 'drop' support for DataFrame's subset function > -- > > Key: SPARK-10894 > URL: https://issues.apache.org/jira/browse/SPARK-10894 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Weiqiang Zhuang > > SparkR DataFrame can be subset to get one or more columns of the dataset. The > current '[' implementation does not support 'drop' when is asked for just one > column. This is not consistent with the R syntax: > x[i, j, ... , drop = TRUE] > # in R, when drop is FALSE, remain as data.frame > > class(iris[, "Sepal.Width", drop=F]) > [1] "data.frame" > # when drop is TRUE (default), drop to be a vector > > class(iris[, "Sepal.Width", drop=T]) > [1] "numeric" > > class(iris[,"Sepal.Width"]) > [1] "numeric" > > df <- createDataFrame(sqlContext, iris) > # in SparkR, 'drop' argument has no impact > > class(df[,"Sepal_Width", drop=F]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > # should have dropped to be a Column class instead > > class(df[,"Sepal_Width", drop=T]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > > class(df[,"Sepal_Width"]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > We should add the 'drop' support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10905) Implement freqItems for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940248#comment-14940248 ] rerngvit yanggratoke commented on SPARK-10905: -- I am going to work on this issue. > Implement freqItems for DataFrameStatFunctions in SparkR > > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > Original Estimate: 168h > Remaining Estimate: 168h >
[jira] [Created] (SPARK-10905) Implement freqItems for DataFrameStatFunctions in SparkR
rerngvit yanggratoke created SPARK-10905: Summary: Implement freqItems for DataFrameStatFunctions in SparkR Key: SPARK-10905 URL: https://issues.apache.org/jira/browse/SPARK-10905 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.5.0 Reporter: rerngvit yanggratoke Fix For: 1.6.0
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940239#comment-14940239 ] Felix Cheung commented on SPARK-10904: -- Do you have the repro steps/code? > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940232#comment-14940232 ] Andrew Or commented on SPARK-10474: --- [~nadenf] That's a different issue. In your stack trace Spark fails to acquire memory in the prepare phase. SPARK-10474 is already past that, but fails to allocate it after switching to sort-based aggregation. It's still an issue we should fix but it's a separate one. > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 
THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 >
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940231#comment-14940231 ] Felix Cheung commented on SPARK-10903: -- [~shivaram] what do you think? If ok I would love to take this change. > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940229#comment-14940229 ] Felix Cheung commented on SPARK-10903: -- +1 could/should we have a version of this that checks .sparkRSQLsc in .sparkREnv? (see https://github.com/NarineK/spark/blob/sparkrasDataFrame/R/pkg/R/sparkR.R#L224) > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
[jira] [Commented] (SPARK-10753) Implement freqItems() and sampleBy() in DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-10753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940226#comment-14940226 ] rerngvit yanggratoke commented on SPARK-10753: -- Can we break this into two sub-tasks: freqItems() and sampleBy()? I would like to work on adding the freqItems(). > Implement freqItems() and sampleBy() in DataFrameStatFunctions > -- > > Key: SPARK-10753 > URL: https://issues.apache.org/jira/browse/SPARK-10753 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Sun Rui >
[jira] [Resolved] (SPARK-10866) [Spark SQL] [UDF] the floor function got wrong return value type
[ https://issues.apache.org/jira/browse/SPARK-10866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10866. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8933 [https://github.com/apache/spark/pull/8933] > [Spark SQL] [UDF] the floor function got wrong return value type > > > Key: SPARK-10866 > URL: https://issues.apache.org/jira/browse/SPARK-10866 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou > Fix For: 1.6.0 > > > As per the floor definition, it should get a BIGINT return value > -floor(DOUBLE a) > -Returns the maximum BIGINT value that is equal to or less than a. > But in the current Spark implementation, it got the wrong value type. > e.g., > select floor(2642.12) from udf_test_web_sales limit 1; > 2642.0 > In the Hive implementation, it got the return value type like below: > hive> select floor(2642.12) from udf_test_web_sales limit 1; > OK > 2642
[jira] [Resolved] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
[ https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10865. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8933 [https://github.com/apache/spark/pull/8933] > [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type > --- > > Key: SPARK-10865 > URL: https://issues.apache.org/jira/browse/SPARK-10865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou > Fix For: 1.6.0 > > > As per the ceil/ceiling definition, it should get a BIGINT return value > -ceil(DOUBLE a), ceiling(DOUBLE a) > -Returns the minimum BIGINT value that is equal to or greater than a. > But in the current Spark implementation, it got the wrong value type. > e.g., > select ceil(2642.12) from udf_test_web_sales limit 1; > 2643.0 > In the Hive implementation, it got the return value type like below: > hive> select ceil(2642.12) from udf_test_web_sales limit 1; > OK > 2643
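The floor/ceil return-type mismatch described in SPARK-10866 and SPARK-10865 can be illustrated outside Spark. The sketch below is not Spark code: it uses java.lang.Math to contrast the DOUBLE-returning behavior that was reported with the integral (BIGINT-style) result Hive defines. The hiveFloor/hiveCeil helper names are mine, purely for illustration.

```java
// Illustrative sketch (not Spark code). Hive's floor()/ceil() on DOUBLE
// return BIGINT, i.e. an integral value. java.lang.Math mirrors the buggy
// DOUBLE-returning behavior, so an explicit cast models Hive's semantics.
public class FloorCeilSemantics {
    // Expected Hive-style semantics: integral (BIGINT-like) result
    static long hiveFloor(double a) { return (long) Math.floor(a); }
    static long hiveCeil(double a)  { return (long) Math.ceil(a); }

    public static void main(String[] args) {
        System.out.println(hiveFloor(2642.12)); // 2642
        System.out.println(hiveCeil(2642.12));  // 2643
        // The reported bug: results came back as DOUBLE instead
        System.out.println(Math.floor(2642.12)); // 2642.0
        System.out.println(Math.ceil(2642.12));  // 2643.0
    }
}
```

Note the cast is taken after Math.floor/Math.ceil, so rounding direction is preserved for negative inputs as well (floor(-2.5) is -3, not -2).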
[jira] [Commented] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib
[ https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940217#comment-14940217 ] Joseph K. Bradley commented on SPARK-10669: --- Sure, please do. > Link to each language's API in codetabs in ML docs: spark.mllib > --- > > Key: SPARK-10669 > URL: https://issues.apache.org/jira/browse/SPARK-10669 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Reporter: Joseph K. Bradley > > In the Markdown docs for the spark.mllib Programming Guide, we have code > examples with codetabs for each language. We should link to each language's > API docs within the corresponding codetab, but we are inconsistent about > this. For an example of what we want to do, see the "ChiSqSelector" section > in > [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md] > This JIRA is just for spark.mllib, not spark.ml
[jira] [Comment Edited] (SPARK-9695) Add random seed Param to ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940207#comment-14940207 ] Joseph K. Bradley edited comment on SPARK-9695 at 10/1/15 6:45 PM: --- That's what I would propose too. There are a few complications to figure out though. *API* * If a Pipeline stage has a seed explicitly set, should the Pipeline overwrite that seed? I'd vote for no. *What behavior do we want in the situation below?* Situation: * User creates a Pipeline with some stages * User sets pipeline.seed * User saves pipeline to FILE * User runs pipeline and produces model A * User loads Pipeline from FILE and runs it to produce model B I'd say that the ideal behavior will be for model A and B to produce exactly the same results. However, this will require us to guarantee that each Pipeline stage is given the same seed for both A and B; i.e., the random number generator used by the Pipeline should not change behavior across Spark versions. Is that a reasonable assumption? Note: We could have a Pipeline set its stages' seeds whenever Pipeline.setSeed is called, but that would cause problems with (a) the question above under "API" and (b) if stages are modified after the seed is modified. I'll try to think of other possible issues too. CC: [~mengxr] was (Author: josephkb): That's what I would propose too. There are a few complications to figure out though. *API* * If a Pipeline stage has a seed explicitly set, should the Pipeline overwrite that seed? I'd vote for no. *What behavior do we want in the situation below?* Situation: * User creates a Pipeline with some stages * User sets pipeline.seed * User saves pipeline to FILE * User runs pipeline and produces model A * User loads Pipeline from FILE and runs it to produce model B I'd say that the ideal behavior will be for model A and B to produce exactly the same results.
However, this will require us to guarantee that each Pipeline stage is given the same seed for both A and B; i.e., the random number generator used by the Pipeline should not change behavior across Spark versions. Is that a reasonable assumption? I'll try to think of other possible issues too. CC: [~mengxr] > Add random seed Param to ML Pipeline > > > Key: SPARK-9695 > URL: https://issues.apache.org/jira/browse/SPARK-9695 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Note this will require some discussion about whether to make HasSeed the main > API for whether an algorithm takes a seed.
[jira] [Commented] (SPARK-9695) Add random seed Param to ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940207#comment-14940207 ] Joseph K. Bradley commented on SPARK-9695: -- That's what I would propose too. There are a few complications to figure out though. *API* * If a Pipeline stage has a seed explicitly set, should the Pipeline overwrite that seed? I'd vote for no. *What behavior do we want in the situation below?* Situation: * User creates a Pipeline with some stages * User sets pipeline.seed * User saves pipeline to FILE * User runs pipeline and produces model A * User loads Pipeline from FILE and runs it to produce model B I'd say that the ideal behavior will be for model A and B to produce exactly the same results. However, this will require us to guarantee that each Pipeline stage is given the same seed for both A and B; i.e., the random number generator used by the Pipeline should not change behavior across Spark versions. Is that a reasonable assumption? I'll try to think of other possible issues too. CC: [~mengxr] > Add random seed Param to ML Pipeline > > > Key: SPARK-9695 > URL: https://issues.apache.org/jira/browse/SPARK-9695 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Note this will require some discussion about whether to make HasSeed the main > API for whether an algorithm takes a seed.
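A minimal sketch of the seed-propagation rule being discussed: do not overwrite an explicitly set stage seed, and derive unset seeds deterministically from the pipeline seed plus a stable per-stage identifier, so that a saved-and-reloaded pipeline reproduces the same run. All class and method names here are hypothetical, not Spark's actual Pipeline API.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the rule discussed above; not Spark's API.
class Stage {
    private Long seed;          // null means: not explicitly set by the user
    private final String uid;
    Stage(String uid) { this.uid = uid; }
    boolean isSeedSet() { return seed != null; }
    void setSeed(long s) { seed = s; }
    Long getSeed() { return seed; }
    String uid() { return uid; }
}

class Pipeline {
    static void propagateSeed(long pipelineSeed, List<Stage> stages) {
        for (Stage stage : stages) {
            if (!stage.isSeedSet()) {
                // Deterministic derivation: the same pipeline seed and stage
                // uid always yield the same stage seed, across save/load --
                // provided the derivation function itself stays stable,
                // which is exactly the assumption questioned in the comment.
                stage.setSeed(Objects.hash(pipelineSeed, stage.uid()));
            }
        }
    }
}
```

Deriving seeds at fit time (rather than eagerly in Pipeline.setSeed) also sidesteps problem (b) above: stages added or swapped after the seed is set still get a seed.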
[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940202#comment-14940202 ] Josh Rosen commented on SPARK-7448: --- I tried a hacky prototype of this and don't remember it showing a huge difference, but that's not to say that it's not worth trying again. Feel free to take a stab at this. > Implement custom byte array serializer for use in PySpark shuffle > > > Key: SPARK-7448 > URL: https://issues.apache.org/jira/browse/SPARK-7448 > Project: Spark > Issue Type: Improvement > Components: PySpark, Shuffle >Reporter: Josh Rosen >Priority: Minor > > PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We > should implement a custom Serializer for use in these shuffles. This will > allow us to take advantage of shuffle optimizations like SPARK-7311 for > PySpark without requiring users to change the default serializer to > KryoSerializer (this is useful for JobServer-type applications).
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940196#comment-14940196 ] Joseph K. Bradley commented on SPARK-10788: --- Updated. Does it make more sense now? > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml).
[jira] [Commented] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940197#comment-14940197 ] Joseph K. Bradley commented on SPARK-10413: --- SGTM > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance.
[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10788: -- Description: Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example. Say there are 3 categories A, B, C. We consider 3 splits: * A vs. B, C * A, B vs. C * A, C vs. B Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = stats(A,B,C) - stats(A)}}. We should eliminate these extra bins within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml). was: Decision trees in spark.ml (RandomForest.scala) effectively creates a second copy of each split. E.g., if there are 3 categories A, B, C, then we should consider 3 splits: * A vs. B, C * A, B vs. C * A, C vs. B Currently, we also consider the 3 flipped splits: * B,C vs. A * C vs. A, B * B vs. A, C This means we communicate twice as much data as needed for these features. We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml). > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. 
We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml).
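The subtraction trick in the updated description ({{stats(B,C) = stats(A,B,C) - stats(A)}}) is element-wise subtraction over label-statistic arrays: collect stats only for the left-hand subsets plus the whole node, then recover each right-hand subset by subtracting. A tiny illustration with hypothetical names (this is not the RandomForest.scala code):

```java
import java.util.Arrays;

// Sketch of the optimization above: given label statistics for the whole
// node and for a split's left-hand category subset, the right-hand subset's
// statistics are recovered by element-wise subtraction instead of being
// collected (and communicated) separately.
public class ComplementStats {
    static double[] subtract(double[] nodeTotal, double[] left) {
        double[] right = new double[nodeTotal.length];
        for (int i = 0; i < nodeTotal.length; i++) {
            right[i] = nodeTotal[i] - left[i];
        }
        return right;
    }

    public static void main(String[] args) {
        double[] statsABC = {10, 5}; // per-label counts for the whole node {A,B,C}
        double[] statsA   = {4, 2};  // collected for the left subset {A}
        // stats(B,C) = stats(A,B,C) - stats(A) -- never collected directly
        System.out.println(Arrays.toString(subtract(statsABC, statsA))); // [6.0, 3.0]
    }
}
```

This halves the statistics that workers must aggregate for unordered categorical features, at the cost of one extra stats vector for the node itself.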
[jira] [Assigned] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10901: Assignee: Thomas Graves (was: Apache Spark) > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that I believe works for > local files, but running on yarn using the distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use a different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo"
[jira] [Commented] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940192#comment-14940192 ] Apache Spark commented on SPARK-10901: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/8959 > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that I believe works for > local files, but running on yarn using the distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use a different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo"
[jira] [Assigned] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10901: Assignee: Apache Spark (was: Thomas Graves) > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that I believe works for > local files, but running on yarn using the distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use a different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo"
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940186#comment-14940186 ] Joseph K. Bradley commented on SPARK-10788: --- Reading what I wrote now, I realize I didn't actually phrase it correctly. I'll update the description. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) effectively creates a second > copy of each split. E.g., if there are 3 categories A, B, C, then we should > consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we also consider the 3 flipped splits: > * B,C vs. A > * C vs. A, B > * B vs. A, C > This means we communicate twice as much data as needed for these features. > We should eliminate these duplicate splits within the spark.ml implementation > since the spark.mllib implementation will be removed before long (and will > instead call into spark.ml).
[jira] [Commented] (SPARK-7398) Add back-pressure to Spark Streaming (umbrella JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940184#comment-14940184 ] Iulian Dragos commented on SPARK-7398: -- Hey, except the last point, everything is available in 1.5. You can go ahead and tackle the remaining ticket, of course. > Add back-pressure to Spark Streaming (umbrella JIRA) > > > Key: SPARK-7398 > URL: https://issues.apache.org/jira/browse/SPARK-7398 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.1 >Reporter: François Garillot >Assignee: Tathagata Das >Priority: Critical > Labels: streams > > Spark Streaming has trouble dealing with situations where > batch processing time > batch interval > Meaning a high throughput of input data w.r.t. Spark's ability to remove data > from the queue. > If this throughput is sustained for long enough, it leads to an unstable > situation where the memory of the Receiver's Executor is overflowed. > This aims at transmitting a back-pressure signal back to data ingestion to > help with dealing with that high throughput, in a backwards-compatible way. > The original design doc can be found here: > https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing > The second design doc, focusing [on the first > sub-task|https://issues.apache.org/jira/browse/SPARK-8834] (without all the > background info, and more centered on the implementation) can be found here: > https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing
[jira] [Updated] (SPARK-10787) Consider replacing ObjectOutputStream for serialization to prevent OOME
[ https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10787: --- Priority: Major (was: Minor) > Consider replacing ObjectOutputStream for serialization to prevent OOME > --- > > Key: SPARK-10787 > URL: https://issues.apache.org/jira/browse/SPARK-10787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu > > In the thread "Spark ClosureCleaner or java serializer OOM when trying to > grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that > ClosureCleaner#ensureSerializable() resulted in an OOME. > The cause was that ObjectOutputStream keeps a strong reference to every > object that was written to it. > This issue tries to avoid the OOME by considering an alternative to > ObjectOutputStream
[jira] [Updated] (SPARK-10787) Consider replacing ObjectOutputStream for serialization to prevent OOME
[ https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10787: --- Summary: Consider replacing ObjectOutputStream for serialization to prevent OOME (was: Reset ObjectOutputStream more often to prevent OOME) > Consider replacing ObjectOutputStream for serialization to prevent OOME > --- > > Key: SPARK-10787 > URL: https://issues.apache.org/jira/browse/SPARK-10787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > In the thread "Spark ClosureCleaner or java serializer OOM when trying to > grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that > ClosureCleaner#ensureSerializable() resulted in an OOME. > The cause was that ObjectOutputStream keeps a strong reference to every > object that was written to it. > This issue tries to avoid the OOME by calling reset() more often.
[jira] [Updated] (SPARK-10787) Consider replacing ObjectOutputStream for serialization to prevent OOME
[ https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10787: --- Description: In the thread "Spark ClosureCleaner or java serializer OOM when trying to grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in an OOME. The cause was that ObjectOutputStream keeps a strong reference to every object that was written to it. This issue tries to avoid the OOME by considering an alternative to ObjectOutputStream was: In the thread "Spark ClosureCleaner or java serializer OOM when trying to grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in an OOME. The cause was that ObjectOutputStream keeps a strong reference to every object that was written to it. This issue tries to avoid the OOME by calling reset() more often. > Consider replacing ObjectOutputStream for serialization to prevent OOME > --- > > Key: SPARK-10787 > URL: https://issues.apache.org/jira/browse/SPARK-10787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > In the thread "Spark ClosureCleaner or java serializer OOM when trying to > grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that > ClosureCleaner#ensureSerializable() resulted in an OOME. > The cause was that ObjectOutputStream keeps a strong reference to every > object that was written to it. > This issue tries to avoid the OOME by considering an alternative to > ObjectOutputStream
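The failure mode above has a close analogue in Python's stdlib that is easy to demonstrate without a JVM: like Java's ObjectOutputStream handle table, pickle.Pickler keeps a strong reference to every object written to it (its "memo") so repeats can be encoded as short back-references, and clear_memo() plays the role of ObjectOutputStream.reset(). The sketch below is only an illustration of that behaviour, not code from the Spark ClosureCleaner.

```python
import io
import pickle

# pickle.Pickler memoizes every object it writes, just as Java's
# ObjectOutputStream pins written objects in its handle table; a
# long-lived stream therefore grows memory without bound unless the
# memo is cleared.
big = "x" * 10_000
buf = io.BytesIO()
pickler = pickle.Pickler(buf)

pickler.dump(big)
n1 = buf.tell()        # full ~10 KB encoding of the string

pickler.dump(big)
n2 = buf.tell()        # memo hit: only a few bytes (a back-reference)

pickler.clear_memo()   # the analogue of ObjectOutputStream.reset()
pickler.dump(big)
n3 = buf.tell()        # re-serialized in full again

print(n1, n2 - n1, n3 - n2)
```

The trade-off is the same in both languages: clearing the table caps memory but forfeits back-references, so repeated objects are written out in full again, which is exactly why the ticket considers replacing the stream rather than just resetting it.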
[jira] [Commented] (SPARK-10172) History Server web UI gets messed up when sorting on any column
[ https://issues.apache.org/jira/browse/SPARK-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940156#comment-14940156 ] Josiah Samuel Sathiadass commented on SPARK-10172: -- [~tgraves], As per the current implementation, the Spark table pagination is not linked with the current sorting logic. This means we can only sort the table content that gets displayed on any particular page. Since table creation and data population are done in a generic way inside Spark, modifying such logic would have a wider impact and demand more UI testing. We went ahead with a quick fix by disabling sorting on a table if it contains multiple attempts. > History Server web UI gets messed up when sorting on any column > --- > > Key: SPARK-10172 > URL: https://issues.apache.org/jira/browse/SPARK-10172 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0, 1.4.1 >Reporter: Min Shen >Assignee: Josiah Samuel Sathiadass >Priority: Minor > Labels: regression > Fix For: 1.5.1, 1.6.0 > > Attachments: screen-shot.png > > > If the history web UI displays the "Attempt ID" column, when clicking the > table header to sort on any column, the entire page gets messed up. > This seems to be a problem with sorttable.js not being able to correctly handle > tables with rowspan.
[jira] [Commented] (SPARK-10897) Custom job/stage names
[ https://issues.apache.org/jira/browse/SPARK-10897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940157#comment-14940157 ] Nithin Asokan commented on SPARK-10897: --- {quote} For example if groupBy results in 3 stages, which one gets the name? if 3 method calls result in 1 stage? I don't think it's impossible but not sure about the details of the semantics. {quote} This is a good point; I had not considered this scenario. {quote} is the motivation really to just display something farther up the call stack? {quote} Yes, Crunch has a concept of DoFn which is similar to Function in Spark. These DoFns can take names that are usually displayed on a Job page in MR. I should not be comparing MR to Spark, but in my use case, we are migrating from MR to Spark, and our engineers are familiar with how Crunch creates a MR job with a nice job name that includes all DoFn names; this gives more context to a user about what the job is processing. For example: in MR, Crunch can create a job name like {{MyPipeline: Text("/input/path")+Filter valid lines+Text("/output/path")}}. In the case of Spark, we are missing that information, I believe partly because the Spark scheduler handles stage and job creation. A Spark job/stage name may appear as {code} sortByKey at PGroupedTableImpl.java:123 (job name) mapToPair at PGroupedTableImpl.java:108 (stage name) {code} While this gives an idea that it's processing/creating a PGroupedTable, it does not give me the full context (at least through Crunch) of the DoFns applied. If Spark allows users to set stage names, I think we can pass some DoFn information from Crunch. The next thing I would ask myself is: if Crunch does not know what stages are created, how can it know which DoFn name to pass to Spark? I'm not fully sure whether this can be supported, given my limited knowledge of Spark, but if others feel it's possible, it could be something helpful for Crunch.
> Custom job/stage names > -- > > Key: SPARK-10897 > URL: https://issues.apache.org/jira/browse/SPARK-10897 > Project: Spark > Issue Type: Wish > Components: Web UI >Reporter: Nithin Asokan >Priority: Minor > > Logging this jira to get some opinions about a discussion I started on > [user-list|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Job-Stage-names-tt24867.html] > I would like to get some thoughts about having custom stage/job names. > Currently I believe the stage names cannot be controlled by the user, but if > allowed, we can have libraries like Apache [Crunch|https://crunch.apache.org/] > dynamically set stage names based on the type of > processing (action/transformation) being performed. > Is it possible for Spark to support custom names? Will it make sense to allow > users to set stage names?
[jira] [Commented] (SPARK-10779) Set initialModel for KMeans model in PySpark (spark.mllib)
[ https://issues.apache.org/jira/browse/SPARK-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940153#comment-14940153 ] Joseph K. Bradley commented on SPARK-10779: --- Sounds good, thanks! > Set initialModel for KMeans model in PySpark (spark.mllib) > -- > > Key: SPARK-10779 > URL: https://issues.apache.org/jira/browse/SPARK-10779 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Joseph K. Bradley > > Provide initialModel param for pyspark.mllib.clustering.KMeans
[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940133#comment-14940133 ] Gayathri Murali commented on SPARK-7448: Is anyone working on this? If not, I would like to work on it. > Implement custom byte array serializer for use in PySpark shuffle > > > Key: SPARK-7448 > URL: https://issues.apache.org/jira/browse/SPARK-7448 > Project: Spark > Issue Type: Improvement > Components: PySpark, Shuffle >Reporter: Josh Rosen >Priority: Minor > > PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We > should implement a custom Serializer for use in these shuffles. This will > allow us to take advantage of shuffle optimizations like SPARK-7311 for > PySpark without requiring users to change the default serializer to > KryoSerializer (this is useful for JobServer-type applications).
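The core idea of the ticket, that byte arrays need no generic object serialization and can be framed as raw length-prefixed bytes, can be sketched in a few lines of plain Python. This is a hypothetical illustration of the framing, not Spark's actual Serializer interface; the class and method names are made up for the example.

```python
import io
import struct

class ByteArraySerializer:
    """Minimal length-prefixed framing for byte arrays.

    A dedicated byte-array serializer can copy the raw bytes straight
    through, skipping generic object-serialization machinery entirely.
    """

    def dump(self, payload: bytes, stream) -> None:
        # 4-byte big-endian length header, then the raw payload
        stream.write(struct.pack(">i", len(payload)))
        stream.write(payload)

    def load(self, stream) -> bytes:
        (length,) = struct.unpack(">i", stream.read(4))
        return stream.read(length)

ser = ByteArraySerializer()
buf = io.BytesIO()
for chunk in (b"key-1", b"value-1", b"key-2"):
    ser.dump(chunk, buf)
buf.seek(0)
out = [ser.load(buf) for _ in range(3)]
print(out)  # the three byte strings round-trip unchanged
```

Because the payload is already bytes, there is nothing to reflect over or encode, which is why such a serializer can be faster than a general-purpose default without users having to switch to Kryo.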
[jira] [Created] (SPARK-10904) select(df, c("col1", "col2")) fails
Weiqiang Zhuang created SPARK-10904: --- Summary: select(df, c("col1", "col2")) fails Key: SPARK-10904 URL: https://issues.apache.org/jira/browse/SPARK-10904 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Weiqiang Zhuang The help page for 'select' gives an example of select(df, c("col1", "col2")). However, this fails with an assertion error: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) at org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) at org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) After that, none of the functions work, failing with the following error: > head(df) Error in if (returnStatus != 0) { : argument is of length zero