[jira] [Commented] (SPARK-10883) Be able to build each module individually
[ https://issues.apache.org/jira/browse/SPARK-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940800#comment-14940800 ] Jean-Baptiste Onofré commented on SPARK-10883: -- Fair enough. Thanks for the update, Marcelo. What do you think if I update the README.md with a quick note about this? > Be able to build each module individually > - > > Key: SPARK-10883 > URL: https://issues.apache.org/jira/browse/SPARK-10883 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Jean-Baptiste Onofré > > Right now, due to the location of scalastyle-config.xml, it's > not possible to build an individual module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE
[ https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940778#comment-14940778 ] Simeon Simeonov commented on SPARK-9761: [~yhuai] What about this one? The problem survives a restart, so it doesn't seem to be caused by lack of refreshing. > Inconsistent metadata handling with ALTER TABLE > --- > > Key: SPARK-9761 > URL: https://issues.apache.org/jira/browse/SPARK-9761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > Labels: hive, sql > > Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. > The table in question was created with {{HiveContext.read.json()}}. > Steps: > # {{alter table dimension_components add columns (z string);}} succeeds. > # {{describe dimension_components;}} does not show the new column, even after > restarting spark-sql. > # A second {{alter table dimension_components add columns (z string);}} fails > with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: > Duplicate column name: z > Full spark-sql output > [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde].
[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column
[ https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940776#comment-14940776 ] Simeon Simeonov commented on SPARK-9762: [~yhuai] the Hive compatibility section of the documentation should be updated to identify these cases. It is unfortunate to trust the docs only to discover a known lack of compatibility that was not documented. > ALTER TABLE cannot find column > -- > > Key: SPARK-9762 > URL: https://issues.apache.org/jira/browse/SPARK-9762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} > lists. > In the case of a table generated with {{HiveContext.read.json()}}, the output > of {{DESCRIBE dimension_components}} is: > {code} > comp_config > struct > comp_criteria string > comp_data_model string > comp_dimensions > struct,template:string,variation:bigint> > comp_disabled boolean > comp_id bigint > comp_path string > comp_placementDatastruct > comp_slot_types array > {code} > However, {{alter table dimension_components change comp_dimensions > comp_dimensions > struct,template:string,variation:bigint,z:string>;}} > fails with: > {code} > 15/08/08 23:13:07 ERROR exec.DDLTask: > org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference > comp_dimensions > at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584) > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312) > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) > at > org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503) > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911) > at 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:901) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155) > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326) > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473) > ... > {code} > Meanwhile, {{SHOW COLUMNS in dimension_components}} lists two columns: > {{col}} (which does not exist in the table) and {{z}}, which was just added. > This suggests that DDL operations in Spark SQL use table metadata > inconsistently. > Full spark-sql output > [here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].
[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column
[ https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940763#comment-14940763 ] Yin Huai commented on SPARK-9762: - [~simeons] Different versions of Hive have different internal restrictions, so it is not always possible to store the metadata in a Hive-compatible way. For example, if the metastore uses Hive 0.13, Hive will reject the create-table call for a Parquet table whose columns include a binary or a decimal column. So, to save the table's metadata at all, we have to work around this and save it in a way that is not compatible with Hive. The reason you see two different outputs for DESCRIBE TABLE and SHOW COLUMNS is that Spark SQL has implemented DESCRIBE TABLE natively, but we still delegate the SHOW COLUMNS command to Hive. Because the metadata is not Hive-compatible, the SHOW COLUMNS command gives you a different output. We have been gradually adding native support for more kinds of commands. If there are any specific commands that are important to your use cases, please feel free to create JIRAs. > ALTER TABLE cannot find column > -- > > Key: SPARK-9762 > URL: https://issues.apache.org/jira/browse/SPARK-9762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} > lists.
[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column
[ https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940757#comment-14940757 ] Simeon Simeonov commented on SPARK-9762: [~yhuai] Refreshing is not the issue here. The issue is that {{DESCRIBE tbl}} and {{SHOW COLUMNS tbl}} show different columns for a table even without altering it, which suggests that Spark SQL is not managing table metadata correctly. > ALTER TABLE cannot find column > -- > > Key: SPARK-9762 > URL: https://issues.apache.org/jira/browse/SPARK-9762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} > lists.
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940731#comment-14940731 ] Joseph K. Bradley commented on SPARK-5874: -- It was delayed because it took longer than expected to finalize the rest of the API. However, it's scheduled for 1.6 now, and at least partial coverage should be complete for 1.6. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedback about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940702#comment-14940702 ] Yongjia Wang commented on SPARK-5874: - The ability to force save/load of all pipeline components is very important. The design doc says this would be done in 1.4 for the new Transformer/Estimator framework under the .ml package. We are at 1.5.0 right now and nothing has happened on that path. I wonder if there were major conceptual changes or just a workload/resource issue. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940694#comment-14940694 ] Felix Cheung commented on SPARK-10903: -- toDF already does these checks in preference order. Btw, which functions do we want to automatically find the sqlContext?
{code}
setMethod("toDF", signature(x = "RDD"),
          function(x, ...) {
            sqlContext <- if (exists(".sparkRHivesc", envir = .sparkREnv)) {
              get(".sparkRHivesc", envir = .sparkREnv)
            } else if (exists(".sparkRSQLsc", envir = .sparkREnv)) {
              get(".sparkRSQLsc", envir = .sparkREnv)
            } else {
              stop("no SQL context available")
            }
            createDataFrame(sqlContext, x, ...)
          })
{code}
> Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
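The preference-order lookup in toDF can be modeled generically. Below is a hypothetical Python sketch of the same fallback pattern (not SparkR code; the dictionary and names are illustrative stand-ins for `.sparkREnv` and its keys):

```python
# Illustrative sketch of SparkR's toDF fallback pattern: prefer a
# Hive-backed context, fall back to a plain SQL context, and fail
# loudly if neither has been initialized.

_spark_env = {}  # stands in for SparkR's .sparkREnv environment

def default_sql_context(env=_spark_env):
    for key in (".sparkRHivesc", ".sparkRSQLsc"):  # preference order
        if key in env:
            return env[key]
    raise RuntimeError("no SQL context available")

_spark_env[".sparkRSQLsc"] = "sql-context"
assert default_sql_context() == "sql-context"
_spark_env[".sparkRHivesc"] = "hive-context"
assert default_sql_context() == "hive-context"  # Hive context wins
```

Any function that today takes an explicit sqlContext argument could default to such a lookup, which is presumably what making the context "global" would amount to.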
[jira] [Resolved] (SPARK-9867) Move utilities for binary data into ByteArray
[ https://issues.apache.org/jira/browse/SPARK-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9867. Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 1.6.0 > Move utilities for binary data into ByteArray > - > > Key: SPARK-9867 > URL: https://issues.apache.org/jira/browse/SPARK-9867 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 1.6.0 > > > Utilities for binary data, such as Substring#substringBinarySQL and > BinaryPrefixComparator#computePrefix, are put together in > ByteArray for readability.
[jira] [Assigned] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10904: Assignee: Apache Spark > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang >Assignee: Apache Spark > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Assigned] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10904: Assignee: (was: Apache Spark) > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940691#comment-14940691 ] Apache Spark commented on SPARK-10904: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/8961 > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang
[jira] [Commented] (SPARK-7135) Expression for monotonically increasing IDs
[ https://issues.apache.org/jira/browse/SPARK-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940688#comment-14940688 ] Reynold Xin commented on SPARK-7135: Can you explain your use case a bit more? > Expression for monotonically increasing IDs > --- > > Key: SPARK-7135 > URL: https://issues.apache.org/jira/browse/SPARK-7135 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: dataframe > Fix For: 1.4.0 > > > Seems like a common use case that users might want a unique ID for each row. > It is more expensive to have consecutive IDs, since that'd require two passes > over the data. However, many use cases can be satisfied by just having unique > ids.
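Why unique-but-not-consecutive IDs avoid the second pass can be sketched as follows. This is a hypothetical Python model of the kind of scheme the expression uses (partition ID in the upper bits, per-partition record number in the lower bits); the exact bit layout here is an illustrative assumption, not the Spark source:

```python
# Illustrative sketch: generate IDs that are unique and increasing
# across partitions without a global pass over the data.
# Assumption: partition IDs fit in 31 bits, per-partition counts in 33 bits.

def monotonic_ids(partitions):
    """partitions: list of lists of rows; yields (row, id) pairs."""
    for pid, rows in enumerate(partitions):
        for offset, row in enumerate(rows):
            # IDs within a partition are consecutive; across partitions
            # they jump, but they remain unique and increasing.
            yield row, (pid << 33) | offset

parts = [["a", "b"], ["c"], ["d", "e"]]
ids = [i for _, i in monotonic_ids(parts)]
assert ids == sorted(ids) and len(set(ids)) == len(ids)
```

Each partition can compute its IDs independently; consecutive IDs would instead require first counting every partition's rows, hence the extra pass.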
[jira] [Commented] (SPARK-7275) Make LogicalRelation public
[ https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940687#comment-14940687 ] Reynold Xin commented on SPARK-7275: Sure we can. Do you want to submit a pull request? > Make LogicalRelation public > --- > > Key: SPARK-7275 > URL: https://issues.apache.org/jira/browse/SPARK-7275 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Santiago M. Mola >Priority: Minor > > It seems LogicalRelation is the only part of the LogicalPlan that is not > public. This makes it harder to work with full logical plans from third party > packages.
[jira] [Created] (SPARK-10908) ClassCastException in HadoopRDD.getJobConf
Naden Franciscus created SPARK-10908: Summary: ClassCastException in HadoopRDD.getJobConf Key: SPARK-10908 URL: https://issues.apache.org/jira/browse/SPARK-10908 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2 Reporter: Naden Franciscus Whilst running a Spark SQL job (I can't provide an explain plan, as many of these are happening concurrently), the following exception is thrown: java.lang.ClassCastException: [B cannot be cast to org.apache.spark.util.SerializableConfiguration at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:78)
[jira] [Commented] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940678#comment-14940678 ] Apache Spark commented on SPARK-10906: -- User 'rahulpalamuttam' has created a pull request for this issue: https://github.com/apache/spark/pull/8960 > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient.
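The proposed optimization can be sketched in a few lines. This is a hypothetical Python model over CSC-style arrays, not the actual MLlib or Breeze implementation, and it assumes both matrices are in canonical form (sorted indices, no explicitly stored zeros):

```python
# Illustrative sketch: compare two sparse matrices by their compressed
# representations instead of iterating over every cell, so cost is
# O(nnz) rather than O(rows * cols) including implicit zeros.
# Assumes canonical CSC form: sorted indices, no explicitly stored zeros.

def sparse_equals(a, b):
    """a, b: dicts with keys shape, col_ptrs, row_indices, values."""
    return (a["shape"] == b["shape"]
            and a["col_ptrs"] == b["col_ptrs"]
            and a["row_indices"] == b["row_indices"]
            and a["values"] == b["values"])

m = {"shape": (3, 2), "col_ptrs": [0, 1, 2],
     "row_indices": [0, 2], "values": [1.0, 3.0]}
assert sparse_equals(m, dict(m))
```

A real implementation would also need to handle non-canonical inputs (e.g. explicit zeros, or comparison against a dense matrix), which is where the equality semantics get subtle.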
[jira] [Assigned] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10906: Assignee: (was: Apache Spark) > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor
[jira] [Assigned] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10906: Assignee: Apache Spark > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor
[jira] [Commented] (SPARK-10505) windowed form of count ( star ) fails with No handler for udf class
[ https://issues.apache.org/jira/browse/SPARK-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940675#comment-14940675 ] Xin Wu commented on SPARK-10505: This error is triggered in HiveFunctionRegistry.lookupFunction() in org.apache.spark.sql.hive.hiveUDFs.scala. The logic falls through to this line: sys.error(s"No handler for udf ${functionInfo.getFunctionClass}"). The reason is that the Hive class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount, although it is an aggregate function class, does not extend org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver the way other aggregate function classes such as GenericUDAFAverage do. The Spark code in HiveFunctionRegistry.lookupFunction() checks whether AbstractGenericUDAFResolver is assignable from the function class, which GenericUDAFCount obviously does not satisfy, so the logic never reaches HiveUDAFFunction(new HiveFunctionWrapper(functionClassName), children) as it would for GenericUDAFAverage. Furthermore, the interface org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2 is implemented by all the aggregate function classes, including GenericUDAFCount. So I am wondering whether the solution may be to replace AbstractGenericUDAFResolver with GenericUDAFResolver2 in the condition else if (classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)). This assumes that GenericUDAFCount is supposed to handle "count(*) over (partition by c1)". Spark/Hive experts, any comments? > windowed form of count ( star ) fails with No handler for udf class > --- > > Key: SPARK-10505 > URL: https://issues.apache.org/jira/browse/SPARK-10505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell > > The following statement will parse/execute in Hive 0.13 but fails in SPARK.
> {code} > -- create a simple ORC table in Hive > create table if not exists TOLAP (RNUM int , C1 string, C2 string, C3 int, > C4 int) TERMINATED BY '\n' > STORED AS orc ; > select rnum, c1, c2, c3, count(*) over(partition by c1) from tolap > Error: java.lang.RuntimeException: No handler for udf class > org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
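The type-hierarchy mismatch described in the comment can be sketched with minimal stand-in classes (hypothetical stand-ins for illustration; the real classes live in org.apache.hadoop.hive.ql.udf.generic):

```scala
// Stand-ins mirroring the Hive hierarchy described above (hypothetical).
trait GenericUDAFResolver2                                     // implemented by all aggregate function classes
abstract class AbstractGenericUDAFResolver extends GenericUDAFResolver2
class GenericUDAFAverage extends AbstractGenericUDAFResolver   // extends the abstract base
class GenericUDAFCount extends GenericUDAFResolver2            // implements only the interface

object LookupSketch {
  // The current check in HiveFunctionRegistry.lookupFunction():
  def handledByCurrentCheck(c: Class[_]): Boolean =
    classOf[AbstractGenericUDAFResolver].isAssignableFrom(c)

  // The proposed check, which would also match GenericUDAFCount:
  def handledByProposedCheck(c: Class[_]): Boolean =
    classOf[GenericUDAFResolver2].isAssignableFrom(c)
}
```

With these stand-ins, `handledByCurrentCheck` accepts GenericUDAFAverage but rejects GenericUDAFCount (hence the "No handler for udf" error), while `handledByProposedCheck` accepts both.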
[jira] [Comment Edited] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940670#comment-14940670 ] Weide Zhang edited comment on SPARK-5575 at 10/2/15 1:17 AM: - Hi Alexander, The features I am looking to add/have include: 1. more activation functions, such as ReLU, LeakyReLU, and max pooling 2. support for a simultaneous testing and training phase, similar to what Caffe does 3. scalability changes (including support for larger models and a parameter server; this is long term). So far I haven't made any of these changes yet. If other people have already made such changes to current Spark, I will be happy to take those as well. was (Author: weidezhang): Hi Alexander, The features I am looking to add include : 1. more activation function such as ReLU, LeakyReLU, max pooling 2. support simultaneous testing and training phase similar to what caffe does 3. scalability change (including support larger model, parameter server, this is long term) > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. 
[jira] [Created] (SPARK-10907) Get rid of pending unroll memory
Andrew Or created SPARK-10907: - Summary: Get rid of pending unroll memory Key: SPARK-10907 URL: https://issues.apache.org/jira/browse/SPARK-10907 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.4.0 Reporter: Andrew Or It's incredibly complicated to have both unroll memory and pending unroll memory in MemoryStore.scala. We can probably express it with only unroll memory through some minor refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940670#comment-14940670 ] Weide Zhang commented on SPARK-5575: Hi Alexander, The features I am looking to add include: 1. more activation functions, such as ReLU, LeakyReLU, and max pooling 2. support for a simultaneous testing and training phase, similar to what Caffe does 3. scalability changes (including support for larger models and a parameter server; this is long term) > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc.
[jira] [Commented] (SPARK-9158) PyLint should only fail on error
[ https://issues.apache.org/jira/browse/SPARK-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940664#comment-14940664 ] Alan Chin commented on SPARK-9158: -- I'd like the opportunity to work on this. > PyLint should only fail on error > > > Key: SPARK-9158 > URL: https://issues.apache.org/jira/browse/SPARK-9158 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Davies Liu >Priority: Critical > > It's boring to fight with warning from Pylint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940640#comment-14940640 ] Rahul Palamuttam edited comment on SPARK-10906 at 10/2/15 12:32 AM: Hi, Can I tackle this? I have been working on a patch and will create a PR shortly. was (Author: rahul palamuttam): Hi, Can I tackle this? I have been working on a patch and will create a PR shortly. - Rahul P > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940640#comment-14940640 ] Rahul Palamuttam commented on SPARK-10906: -- Hi, Can I tackle this? I have been working on a patch and will create a pull request shortly. - Rahul P > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient.
[jira] [Comment Edited] (SPARK-10906) More efficient SparseMatrix.equals
[ https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940640#comment-14940640 ] Rahul Palamuttam edited comment on SPARK-10906 at 10/2/15 12:31 AM: Hi, Can I tackle this? I have been working on a patch and will create a PR shortly. - Rahul P was (Author: rahul palamuttam): Hi, Can I tackle this? I have been working on a patch and will create a pull RQ shortly. - Rahul P > More efficient SparseMatrix.equals > -- > > Key: SPARK-10906 > URL: https://issues.apache.org/jira/browse/SPARK-10906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals > method. However, it looks like Breeze's equals is inefficient: > [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] > Breeze iterates over all values, including implicit zeros. We could make > this more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
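The inefficiency described in the report — Breeze iterating over every position, implicit zeros included — can be avoided by comparing only stored entries. A sketch with a toy coordinate-map representation (hypothetical; MLlib's SparseMatrix actually stores CSC arrays, not a Map):

```scala
// Toy sparse matrix for illustration only; MLlib's SparseMatrix uses
// CSC storage (colPtrs/rowIndices/values) rather than a Map.
final case class ToySparse(numRows: Int, numCols: Int,
                           entries: Map[(Int, Int), Double]) {
  // Drop explicitly stored zeros so matrices with different storage
  // patterns but identical values still compare equal.
  private def active: Map[(Int, Int), Double] = entries.filter(_._2 != 0.0)

  // O(nnz) comparison over stored entries only, instead of iterating
  // over all numRows * numCols positions the way Breeze's equals does.
  def sameAs(other: ToySparse): Boolean =
    numRows == other.numRows && numCols == other.numCols &&
      active == other.active
}
```

For CSC storage the same idea would walk the two active iterators in lockstep, skipping explicit zeros on either side.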
[jira] [Resolved] (SPARK-10400) Rename or deprecate SQL option "spark.sql.parquet.followParquetFormatSpec"
[ https://issues.apache.org/jira/browse/SPARK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-10400. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8566 [https://github.com/apache/spark/pull/8566] > Rename or deprecate SQL option "spark.sql.parquet.followParquetFormatSpec" > -- > > Key: SPARK-10400 > URL: https://issues.apache.org/jira/browse/SPARK-10400 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.6.0 > > > We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while > working on implementing Parquet backwards-compatibility rules in SPARK-6777. > It indicates whether we should use legacy Parquet format adopted by Spark 1.4 > and prior versions or the standard format defined in parquet-format spec. > However, the name of this option is somewhat confusing, because it's not > super intuitive why we shouldn't follow the spec. Would be nice to rename it > to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they > have opposite meanings). Note that this option is not "public" ({{isPublic}} > is false). > At the moment of writing, 1.5 RC3 has already been cut. If we can't make this > one into 1.5, we can deprecate the old option with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
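Assuming the rename lands as proposed, writers would opt into the legacy layout explicitly; a hypothetical usage sketch of the new option name (semantics are the inverse of "spark.sql.parquet.followParquetFormatSpec"):

```scala
// Hypothetical: setting the renamed option on a SQLContext.
// true  -> write Spark 1.4-style legacy Parquet files
// false -> follow the parquet-format spec (the inverted default)
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
```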
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940535#comment-14940535 ] Naden Franciscus commented on SPARK-10474: -- [~yhuai] Standalone > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) 
AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940527#comment-14940527 ] Yin Huai commented on SPARK-10474: -- [~nadenf] Are you running Spark on Mesos or YARN? Or are you using the standalone mode? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) 
AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940524#comment-14940524 ] Davies Liu commented on SPARK-10342: This will be used internally for SQL. For example, aggregation and sort-merge join both acquire large pages to do in-memory aggregation or sorting; one can use most of the memory, and then the other can't get enough memory to work. Currently, each operator reserves a page to make sure it can start (it may have to work with only that one page). A better solution could be: when one operator (for example, aggregation) needs more memory, other operators could be notified to release some memory by spilling. This would improve memory utilization (no need to reserve a page anymore) and avoid OOM. > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > > We have had memory starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory usage (reduce unnecessary spilling) and also > reduce the risk of OOM, we should manage memory in a cooperative way: all > memory consumers should also be responsive to others' requests to release > memory (by spilling). > The requests for memory can differ: hard requirements (will crash if not > allocated) or soft requirements (worse performance if not allocated). The > costs of spilling also differ. We could introduce some kind of priority to > make them work together better.
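The cooperative scheme sketched in the comment — consumers that can be asked to release memory when another operator needs it — might look like this toy pool (a hypothetical API for illustration, not Spark's actual memory manager):

```scala
// Toy cooperative memory pool (hypothetical; not Spark's MemoryManager).
trait MemoryConsumer {
  def used: Long               // bytes currently held by this consumer
  def spill(need: Long): Long  // release up to `need` bytes; returns bytes freed
}

final class CooperativePool(capacity: Long) {
  private var available = capacity
  private var consumers = List.empty[MemoryConsumer]

  def register(c: MemoryConsumer): Unit = consumers ::= c

  // Try to satisfy a request; if the pool is short, ask the other
  // consumers (largest holders first) to spill, instead of having each
  // operator reserve a page up front.
  def acquire(requester: MemoryConsumer, bytes: Long): Boolean = {
    if (available < bytes) {
      for (c <- consumers.filter(_ ne requester).sortBy(-_.used)
           if available < bytes) {
        available += c.spill(bytes - available)
      }
    }
    if (available >= bytes) { available -= bytes; true } else false
  }
}

// A trivial consumer for illustration: tracks held bytes, spills on demand.
final class Buffer(var held: Long) extends MemoryConsumer {
  def used: Long = held
  def spill(need: Long): Long = {
    val freed = math.min(need, held)
    held -= freed
    freed
  }
}
```

In this sketch, when an aggregation asks for more memory than is free, the pool notifies the sort-merge join (or whichever consumer holds the most) to spill just enough to cover the shortfall, which is the utilization improvement the comment describes.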
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940508#comment-14940508 ] Davies Liu commented on SPARK-10903: LGTM. Another question: can we have different SQLContexts at the same time? One HiveContext and one SQLContext. > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940505#comment-14940505 ] Naden Franciscus commented on SPARK-10474: -- Can confirm also getting this issue now. There must be something common to both though right. An acquire should never fail unless the OS is out of memory right ? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table.
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940494#comment-14940494 ] Naden Franciscus commented on SPARK-10309: -- It has been difficult to get a clean stack trace/explain output because we are executing lots of SQL commands in parallel and don't know which one is failing. We are definitely doing lots of joins/aggregations/sorts. I have tried increasing shuffle.memoryFraction to 0.8, but that didn't help. This is still an issue with the latest Spark 1.5.2 branch. > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finish after a retry.
[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940487#comment-14940487 ] Naden Franciscus edited comment on SPARK-10474 at 10/1/15 9:58 PM: --- I can't provide the explain plan since we are executing 1000s of SQL statement and hard to tell which is which. Have increased heap to 50GB + shuffle.memoryFraction to 0.6 and 0.8. No change. Will file this in another ticket. was (Author: nadenf): I can't provide the explain plan since we are executing 1000s of SQL statement and hard to tell which is which. Have increased heap to 50GB + shuffle.memoryFraction to 0.6 and 0.8. No change. @Andrew: is there is a ticket for this ? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. 
> {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940487#comment-14940487 ] Naden Franciscus commented on SPARK-10474: -- I can't provide the explain plan since we are executing 1000s of SQL statement and hard to tell which is which. Have increased heap to 50GB + shuffle.memoryFraction to 0.6 and 0.8. No change. @Andrew: is there is a ticket for this ? > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE 
NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940481#comment-14940481 ] Joseph K. Bradley commented on SPARK-10780: --- Sure, please do! > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940463#comment-14940463 ] Jayant Shekhar commented on SPARK-10780: Hi [~josephkb], can I work on this? > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version.
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940446#comment-14940446 ] Alexander Ulanov commented on SPARK-5575: - Hi Weide, sounds good! What kind of feature are you planning to add? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed-forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > Boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc.
[jira] [Created] (SPARK-10906) More efficient SparseMatrix.equals
Joseph K. Bradley created SPARK-10906: - Summary: More efficient SparseMatrix.equals Key: SPARK-10906 URL: https://issues.apache.org/jira/browse/SPARK-10906 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals method. However, it looks like Breeze's equals is inefficient: [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132] Breeze iterates over all values, including implicit zeros. We could make this more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
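The improvement the ticket asks for can be sketched outside of Spark: implicit and explicit zeros compare as equal, so after normalizing away explicit zeros only the stored non-zero entries need to be checked — O(nnz) rather than O(rows × cols). A minimal Python illustration (the real SparseMatrix is Scala and stores CSC arrays; the dict-of-entries representation below is purely for exposition):

```python
# Sketch (not Spark's API): sparse-matrix equality that touches only the
# explicit entries, instead of iterating over every (row, col) cell the way
# Breeze's Matrix.equals does.

def sparse_equals(a, b):
    """a, b: (num_rows, num_cols, entries) where entries maps (i, j) -> value.

    Implicit zeros are equal by definition, so after dropping explicitly
    stored zeros we only need to compare the remaining non-zero entries.
    """
    (ra, ca, ea), (rb, cb, eb) = a, b
    if (ra, ca) != (rb, cb):          # dimensions must match first
        return False
    nz_a = {k: v for k, v in ea.items() if v != 0}
    nz_b = {k: v for k, v in eb.items() if v != 0}
    return nz_a == nz_b

m1 = (3, 3, {(0, 0): 1.0, (2, 1): 5.0})
m2 = (3, 3, {(0, 0): 1.0, (2, 1): 5.0, (1, 1): 0.0})  # has an explicit zero
m3 = (3, 3, {(0, 0): 1.0})

print(sparse_equals(m1, m2))  # True: explicit zero equals implicit zero
print(sparse_equals(m1, m3))  # False
```

The same idea in Scala would walk the two CSC index/value arrays in lockstep, skipping stored zeros on either side.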
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940444#comment-14940444 ] Weiqiang Zhuang commented on SPARK-10904: - That works because it invokes select with list(). > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10894) Add 'drop' support for DataFrame's subset function
[ https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940442#comment-14940442 ] Shivaram Venkataraman commented on SPARK-10894: --- Yes, as [~felixcheung] said this is by design. The main reason is that we use `df$Age` as an easy handle or a reference to a column in a distributed data frame that can be passed to other functions without using strings ("Age"). The `df$A` also auto completes and is easy to use. The square brackets API is meant to provide some basic compatibility with R (e.g. df[, df$Age] or df[, "Age"]). However my opinion is that the overall DataFrame API is targeted to work more like dplyr and I don't think supporting all aspects of R data.frames is a design goal. > Add 'drop' support for DataFrame's subset function > -- > > Key: SPARK-10894 > URL: https://issues.apache.org/jira/browse/SPARK-10894 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Weiqiang Zhuang > > SparkR DataFrame can be subset to get one or more columns of the dataset. The > current '[' implementation does not support 'drop' when is asked for just one > column. This is not consistent with the R syntax: > x[i, j, ... , drop = TRUE] > # in R, when drop is FALSE, remain as data.frame > > class(iris[, "Sepal.Width", drop=F]) > [1] "data.frame" > # when drop is TRUE (default), drop to be a vector > > class(iris[, "Sepal.Width", drop=T]) > [1] "numeric" > > class(iris[,"Sepal.Width"]) > [1] "numeric" > > df <- createDataFrame(sqlContext, iris) > # in SparkR, 'drop' argument has no impact > > class(df[,"Sepal_Width", drop=F]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > # should have dropped to be a Column class instead > > class(df[,"Sepal_Width", drop=T]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > > class(df[,"Sepal_Width"]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > We should add the 'drop' support. 
[jira] [Updated] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmytro Bielievtsov updated SPARK-10872: --- Description: Starting from Spark 1.4.0 (works well on 1.3.1), the following code fails with "XSDB6: Another instance of Derby may have already booted the database ~/metastore_db":
{code:python}
from pyspark import SparkContext, HiveContext
sc = SparkContext("local[*]", "app1")
sql = HiveContext(sc)
sql.createDataFrame([[1]]).collect()
sc.stop()
sc = SparkContext("local[*]", "app2")
sql = HiveContext(sc)
sql.createDataFrame([[1]]).collect()  # Py4J error
{code}
This is related to [#SPARK-9539], and I intend to restart the Spark context several times for isolated jobs to prevent cache cluttering and GC errors. Here's a larger part of the full error trace:
{noformat}
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details. org.datanucleus.exceptions.NucleusDataStoreException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) at org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) at 
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199) at org.apache.hadoop.hive.ql.metad
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940432#comment-14940432 ] Shivaram Venkataraman commented on SPARK-10903: --- Yeah, this sounds like a good idea, as we probably don't want to support multiple SQL contexts inside the same R session. cc [~davies] [~falaki] to see if they have any scenarios where this might be a problem. > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
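The proposal boils down to a process-global default context that API functions fall back to when no context is passed explicitly. A sketch of that pattern, shown in Python rather than R (all names here are hypothetical; SparkR would implement this in R against its JVM backend):

```python
# Sketch of a session-global default context with an explicit override --
# the pattern SPARK-10903 proposes for SparkR. Not Spark's actual API.

_default_sql_context = None

def init_sql_context(ctx):
    """Register ctx as the session-wide default (one per R/Python session)."""
    global _default_sql_context
    _default_sql_context = ctx

def create_data_frame(data, sql_context=None):
    """Use the explicit context if given, else fall back to the global one."""
    ctx = sql_context if sql_context is not None else _default_sql_context
    if ctx is None:
        raise RuntimeError("no SQLContext registered; call init_sql_context()")
    return (ctx, data)   # stand-in for the real local-to-distributed conversion

init_sql_context("sqlContext")
print(create_data_frame([1, 2, 3]))   # ('sqlContext', [1, 2, 3])
```

With this shape, `createDataFrame(iris)` just works after the context is initialized once, while code that genuinely needs a different context can still pass one.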
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940429#comment-14940429 ] Felix Cheung commented on SPARK-10904: -- `head(df[,c("Sepal_Width", "Sepal_Length")])` this works, I guess that's why I'm surprised. I think I know how to fix this. I will take this. > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940416#comment-14940416 ] Felix Cheung commented on SPARK-10904: -- I added that line of comment about "df$age"; I could clarify that. I think we should make `select(df, c("col1", "col2"))` work. > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Commented] (SPARK-1762) Add functionality to pin RDDs in cache
[ https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940395#comment-14940395 ] FangzhouXing commented on SPARK-1762: - What is the current eviction policy? Instead of pinning, what if we just make the eviction policy smarter? (From a quick look, it seems like the current policy is FIFO.) We want developers to have to think less, not more, about how much memory the system has. > Add functionality to pin RDDs in cache > -- > > Key: SPARK-1762 > URL: https://issues.apache.org/jira/browse/SPARK-1762 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > Right now, all RDDs are created equal, and there is no mechanism to identify > a certain RDD to be more important than the rest. This is a problem if the > RDD fraction is small, because just caching a few RDDs can evict more > important ones. > A side effect of this feature is that we can now more safely allocate a > smaller spark.storage.memoryFraction if we know how large our important RDDs > are, without having to worry about them being evicted. This allows us to use > more memory for shuffles, for instance, and avoid disk spills. 
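For illustration, here is a toy cache showing how pinning would interact with a smarter (LRU) eviction policy: eviction walks entries from least to most recently used but skips anything pinned. This is a sketch of the proposal only, not Spark's actual BlockManager:

```python
from collections import OrderedDict

class PinnableCache:
    """Toy LRU cache with pinning -- an illustration of SPARK-1762's idea,
    not Spark's BlockManager implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> value, ordered least- to most-recently used
        self.pinned = set()

    def put(self, key, value, pin=False):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if pin:
            self.pinned.add(key)
        # Evict least-recently-used *unpinned* entries until we fit.
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:         # everything left is pinned
                raise MemoryError("cache full of pinned entries")
            del self.entries[victim]

    def get(self, key):
        self.entries.move_to_end(key)  # refresh recency on access
        return self.entries[key]

cache = PinnableCache(capacity=2)
cache.put("important_rdd", "blocks", pin=True)
cache.put("rdd_a", "blocks")
cache.put("rdd_b", "blocks")           # evicts rdd_a, never the pinned RDD
print(list(cache.entries))             # ['important_rdd', 'rdd_b']
```

Note the comment's point still stands either way: whether by pinning or by a recency-based policy, the goal is that caching a few extra RDDs cannot silently evict the ones a job depends on.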
[jira] [Commented] (SPARK-10821) RandomForest serialization OOM during findBestSplits
[ https://issues.apache.org/jira/browse/SPARK-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940370#comment-14940370 ] Jay Luan commented on SPARK-10821: -- Thank you for the insight, do you know what the status of the new implementation of decision tree is or a possible ETA for when it will be ready? Maybe I can help with either testing or contributing to the code. > RandomForest serialization OOM during findBestSplits > > > Key: SPARK-10821 > URL: https://issues.apache.org/jira/browse/SPARK-10821 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.0, 1.5.0 > Environment: Amazon EC2 Linux >Reporter: Jay Luan > Labels: OOM, out-of-memory > > I am getting OOM during serialization for a relatively small dataset for a > RandomForest. Even with spark.serializer.objectStreamReset at 1, It is still > running out of memory when attempting to serialize my data. > Stack Trace: > Traceback (most recent call last): > File "/root/random_forest/random_forest_spark.py", line 198, in > main() > File "/root/random_forest/random_forest_spark.py", line 166, in main > trainModel(dset) > File "/root/random_forest/random_forest_spark.py", line 191, in trainModel > impurity='gini', maxDepth=4, maxBins=32) > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 352, > in trainClassifier > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, > in _train > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line > 130, in callMLlibFunc > File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line > 123, in callJavaFunc > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line > 300, in get_return_value > py4j.protocol.Py4JJavaError15/09/25 00:44:41 DEBUG BlockManagerSlaveEndpoint: > Done removing RDD 7, response is 0 > 15/09/25 00:44:41 
DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to > AkkaRpcEndpointRef(Actor[akka://sparkDriver/temp/$Mj]) > : An error occurred while calling o89.trainRandomForestModel. > : java.lang.OutOfMemoryError > at > java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) > at > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2021) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702) > at > org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625) > at > 
org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235) > at > org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291) > at > org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflect
[jira] [Resolved] (SPARK-7218) Create a real iterator with open/close for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7218. Resolution: Fixed Assignee: Reynold Xin Target Version/s: 1.6.0 (was: ) Ah forgot to close this: this has been fixed already. Code in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/local/LocalNode.scala > Create a real iterator with open/close for Spark SQL > > > Key: SPARK-7218 > URL: https://issues.apache.org/jira/browse/SPARK-7218 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
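The open/next/fetch/close operator style used by LocalNode can be sketched compactly in Python (illustrative only; the real interface is Scala, and operates on InternalRow batches rather than plain values):

```python
# Sketch of the Volcano-style operator interface (open/next/fetch/close)
# that the linked LocalNode code implements. Names mirror that style but
# this is a standalone toy, not Spark code.

class SeqScanNode:
    """Leaf operator over an in-memory sequence."""
    def __init__(self, rows):
        self._rows = rows
    def open(self):
        self._it = iter(self._rows)
        self._current = None
    def next(self):                    # advance; False when exhausted
        self._current = next(self._it, None)
        return self._current is not None
    def fetch(self):                   # current row after a successful next()
        return self._current
    def close(self):                   # release resources (files, buffers, ...)
        self._it = None

class FilterNode:
    """Unary operator: forwards only rows matching the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):
        while self.child.next():
            if self.predicate(self.child.fetch()):
                return True
        return False
    def fetch(self):
        return self.child.fetch()
    def close(self):
        self.child.close()

node = FilterNode(SeqScanNode([1, 2, 3, 4]), lambda r: r % 2 == 0)
node.open()
out = []
while node.next():
    out.append(node.fetch())
node.close()
print(out)  # [2, 4]
```

The point of the explicit open/close lifecycle, versus a bare Scala Iterator, is that each operator gets a well-defined place to acquire and release resources.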
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940368#comment-14940368 ] Weiqiang Zhuang commented on SPARK-10904: - Yes, list() works. The question is whether or not c() will be supported. If not, the documentation for select should be updated. There are a couple of errors in the given example: 1) select(df, c("col1", "col2")) does not work; 2) the claim that "$" is a similar method to select is false, because df$age returns a 'Column' class while select(df, 'age') returns a 'DataFrame' class.
Examples
## Not run:
select(df, "*")
select(df, "col1", "col2")
select(df, df$name, df$age + 1)
select(df, c("col1", "col2"))
select(df, list(df$name, df$age + 1))
# Similar to R data frames, columns can also be selected using `$`
df$age
> select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Updated] (SPARK-10905) Export freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Summary: Export freqItems() for DataFrameStatFunctions in SparkR (was: Implement freqItems() for DataFrameStatFunctions in SparkR) > Export freqItems() for DataFrameStatFunctions in SparkR > --- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > > Currently only crosstab is implemented. This subtask is about adding > freqItems() API to sparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10671) Calling a UDF with insufficient number of input arguments should throw an analysis error
[ https://issues.apache.org/jira/browse/SPARK-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10671: - Assignee: Wenchen Fan (was: Yin Huai) > Calling a UDF with insufficient number of input arguments should throw an > analysis error > > > Key: SPARK-10671 > URL: https://issues.apache.org/jira/browse/SPARK-10671 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Wenchen Fan > Fix For: 1.6.0 > > > {code} > import org.apache.spark.sql.functions._ > Seq((1,2)).toDF("a", "b").select(callUDF("percentile", $"a")) > {code} > This should throw an AnalysisException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10671) Calling a UDF with insufficient number of input arguments should throw an analysis error
[ https://issues.apache.org/jira/browse/SPARK-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-10671. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8941 [https://github.com/apache/spark/pull/8941] > Calling a UDF with insufficient number of input arguments should throw an > analysis error > > > Key: SPARK-10671 > URL: https://issues.apache.org/jira/browse/SPARK-10671 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > > > {code} > import org.apache.spark.sql.functions._ > Seq((1,2)).toDF("a", "b").select(callUDF("percentile", $"a")) > {code} > This should throw an AnalysisException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
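The behavior this issue asks for — rejecting a UDF call with the wrong number of arguments at analysis time, before anything runs — can be sketched outside Spark. The following Python sketch is purely illustrative: the names `UDFRegistry`, `analyze_call`, and the registered `percentile` stub are hypothetical and are not Spark's analyzer API.

```python
# Illustrative sketch of analysis-time arity checking for registered UDFs.
# Not Spark's implementation; all names here are made up for the example.

class AnalysisException(Exception):
    """Raised when a query is structurally invalid, before execution."""

class UDFRegistry:
    def __init__(self):
        self._udfs = {}

    def register(self, name, func, arity):
        # Remember the declared arity alongside the function itself.
        self._udfs[name] = (func, arity)

    def analyze_call(self, name, args):
        # Validate the call during "analysis", returning a thunk to run later.
        func, arity = self._udfs[name]
        if len(args) != arity:
            raise AnalysisException(
                f"UDF '{name}' expects {arity} argument(s), got {len(args)}")
        return lambda: func(*args)

registry = UDFRegistry()
registry.register("percentile", lambda col, p: (col, p), arity=2)

try:
    registry.analyze_call("percentile", ["a"])   # one argument, needs two
except AnalysisException as e:
    print(e)   # prints: UDF 'percentile' expects 2 argument(s), got 1
```

The point mirrored from the issue: the failure surfaces as a structured analysis error rather than a confusing runtime exception deep inside execution.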
[jira] [Commented] (SPARK-10894) Add 'drop' support for DataFrame's subset function
[ https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940359#comment-14940359 ] Weiqiang Zhuang commented on SPARK-10894: - Is there a good reason it was designed like this? The R code we have seen always uses df[,c(1)] and df$col1 interchangeably. > Add 'drop' support for DataFrame's subset function > -- > > Key: SPARK-10894 > URL: https://issues.apache.org/jira/browse/SPARK-10894 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Weiqiang Zhuang > > SparkR DataFrame can be subset to get one or more columns of the dataset. The > current '[' implementation does not support 'drop' when asked for just one > column. This is not consistent with the R syntax: > x[i, j, ... , drop = TRUE] > # in R, when drop is FALSE, the result remains a data.frame > > class(iris[, "Sepal.Width", drop=F]) > [1] "data.frame" > # when drop is TRUE (default), it drops to a vector > > class(iris[, "Sepal.Width", drop=T]) > [1] "numeric" > > class(iris[,"Sepal.Width"]) > [1] "numeric" > > df <- createDataFrame(sqlContext, iris) > # in SparkR, the 'drop' argument has no impact > > class(df[,"Sepal_Width", drop=F]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > # should have dropped to a Column class instead > > class(df[,"Sepal_Width", drop=T]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > > class(df[,"Sepal_Width"]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > We should add the 'drop' support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
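The `drop` semantics requested for SparkR's `[` can be illustrated in plain Python: selecting a single column with `drop=True` yields the bare column (analogous to an R vector or a SparkR Column), while `drop=False` keeps the table type. The `Frame` class below is a hypothetical stand-in for a DataFrame, for illustration only.

```python
# Minimal sketch of R-style 'drop' semantics, assuming a toy Frame type.
# Not SparkR's implementation; the class and method names are made up.

class Frame:
    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def subset(self, names, drop=True):
        if isinstance(names, str):
            names = [names]
        if drop and len(names) == 1:
            # "drop" to the bare column, like iris[, "Sepal.Width"] in R
            return self.columns[names[0]]
        # otherwise stay a Frame, like drop=FALSE in R
        return Frame({n: self.columns[n] for n in names})

df = Frame({"Sepal_Width": [3.5, 3.0], "Sepal_Length": [5.1, 4.9]})

print(type(df.subset("Sepal_Width", drop=True)).__name__)   # list
print(type(df.subset("Sepal_Width", drop=False)).__name__)  # Frame
```

This mirrors the R behavior quoted in the issue: `drop=TRUE` (the default) collapses a single-column selection, `drop=FALSE` preserves the frame type.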
[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10788: -- Priority: Minor (was: Major) > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
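The subtraction identity in the description, {{stats(B,C) = stats(A,B,C) - stats(A)}}, can be sketched with plain label histograms. This is an illustrative Python sketch, not the spark.ml code; the toy samples and helper names are made up.

```python
# Collect per-category label histograms once, plus the node total, and derive
# each split's right-hand stats by subtraction instead of collecting them.
from collections import Counter

# toy data: (category, label) pairs observed at one tree node
samples = [("A", 0), ("A", 1), ("B", 0), ("B", 0), ("C", 1), ("C", 1)]

node_stats = Counter(label for _, label in samples)   # stats(A,B,C)
left_stats = {cat: Counter() for cat in ("A", "B", "C")}
for cat, label in samples:
    left_stats[cat][label] += 1                       # stats(A), stats(B), ...

def right_stats(left_side):
    """Stats of the complement, e.g. stats(B,C) = stats(A,B,C) - stats(A)."""
    collected = Counter()
    for cat in left_side:
        collected += left_stats[cat]
    return node_stats - collected

print(dict(right_stats({"A"})))   # label counts for the {B, C} side
```

Only the left-hand subsets (and the node total) need to be communicated; the right-hand side of every candidate split falls out of one subtraction, which is the halving of communicated bins the issue describes.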
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940345#comment-14940345 ] Joseph K. Bradley commented on SPARK-10788: --- Though I should say: I should probably put this as Minor priority. It's not a huge savings, and it's likely a somewhat complex change. If you have other things you're working on, I'd prioritize those instead. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940343#comment-14940343 ] Joseph K. Bradley commented on SPARK-10788: --- OK, thanks! > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940316#comment-14940316 ] Seth Hendrickson edited comment on SPARK-10788 at 10/1/15 8:04 PM: --- Yes, much clearer, thanks! I can work on this task. was (Author: sethah): Yes, much clearer. I can work on this task. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940315#comment-14940315 ] FangzhouXing commented on SPARK-10342: -- In my understanding, inactive Spark programs will receive a memory warning when memory runs low, and then a handler implemented by the programmer will be called to reduce the program's memory usage, much like what happens in an iOS app. Is this correct? Also, what's an example use case for this? > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > > We have had memory-starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory usage (reduce unnecessary spilling) and also > reduce the risk of OOM, we should manage memory in a cooperative way: every > memory consumer should also be responsive to requests from others to release > memory (by spilling). > Memory requests can differ: a hard requirement (will crash if not allocated) > or a soft requirement (worse performance if not allocated). The costs of > spilling also differ. We could introduce some kind of priority to make them > work together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
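The "cooperative" idea in the description can be sketched as a manager that, when an allocation does not fit, asks other registered consumers to spill, cheapest first. This is a hypothetical Python sketch of the concept only; the class names, the `spill_cost` priority, and the toy numbers are invented and are not Spark's memory-manager API.

```python
# Hypothetical sketch: consumers register with the manager and expose a
# spill() callback; allocation requests may force others to release memory.

class Consumer:
    def __init__(self, name, used, spill_cost):
        self.name, self.used, self.spill_cost = name, used, spill_cost

    def spill(self):
        # Release everything this consumer holds; return the amount freed.
        freed, self.used = self.used, 0
        return freed

class MemoryManager:
    def __init__(self, capacity):
        self.capacity = capacity
        self.consumers = []

    def register(self, consumer):
        self.consumers.append(consumer)

    def allocate(self, requester, amount):
        free = self.capacity - sum(c.used for c in self.consumers)
        # Ask other consumers to spill, cheapest first, until the request fits.
        for other in sorted(self.consumers, key=lambda c: c.spill_cost):
            if free >= amount:
                break
            if other is not requester and other.used:
                free += other.spill()
        if free < amount:
            return False   # a hard requirement would OOM here; caller decides
        requester.used += amount
        return True

mm = MemoryManager(capacity=100)
sort_c = Consumer("sort", used=60, spill_cost=1)
join_c = Consumer("join", used=30, spill_cost=5)
mm.register(sort_c); mm.register(join_c)
print(mm.allocate(join_c, 40))   # True: the cheap-to-spill "sort" is spilled
```

The `spill_cost` ordering stands in for the priority mentioned in the description: consumers whose spills are cheap give way before expensive ones.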
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940316#comment-14940316 ] Seth Hendrickson commented on SPARK-10788: -- Yes, much clearer. I can work on this task. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940310#comment-14940310 ] Shivaram Venkataraman commented on SPARK-10904: --- I don't think we support passing in `c` for a list of things in SparkR because the serialization, deserialization was not supported for it. Changing it to `list("col1", "col2")` should work here cc [~sunrui] Some of the recent serializer fixes may help here ? > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*
[ https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940253#comment-14940253 ] Deron Eriksson edited comment on SPARK-10286 at 10/1/15 7:42 PM: - Hi, I see @since annotations on clustering.py, recommendation.py, regression.py, and tuning.py in pyspark.ml.* but not currently on others or in pyspark.ml.param.*, so I would like to work on this one. was (Author: deron): Hi, I see @since annotations on clustering.py, recommendation.py, regression.py, and tuning.py in pyspark.ml.* but not on currently on others or in pyspark.ml.param.*, so I would like to work on this one. > Add @since annotation to pyspark.ml.param and pyspark.ml.* > -- > > Key: SPARK-10286 > URL: https://issues.apache.org/jira/browse/SPARK-10286 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
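For readers unfamiliar with the annotation being discussed, a minimal sketch of what a `@since` decorator can look like in Python is below: it records the version a function was added in and appends a `versionadded` note to the docstring. This is an assumption-laden illustration, not pyspark's exact helper; the `_since` attribute and `freq_items` stub are made up.

```python
# Minimal sketch of a "@since" annotation (illustrative, not pyspark's code).

def since(version):
    def decorator(func):
        # Append a Sphinx-style versionadded note to the docstring and
        # remember the version on the function object itself.
        func.__doc__ = (func.__doc__ or "").rstrip() + (
            f"\n\n.. versionadded:: {version}\n")
        func._since = version
        return func
    return decorator

@since("1.6.0")
def freq_items(df, cols):
    """Return frequent items for the given columns."""
    return []

print(freq_items._since)   # prints: 1.6.0
```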
[jira] [Updated] (SPARK-10905) Implement freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Shepherd: Shivaram Venkataraman (was: Sun Rui) > Implement freqItems() for DataFrameStatFunctions in SparkR > -- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > > Currently only crosstab is implemented. This subtask is about adding > freqItems() API to sparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940092#comment-14940092 ] Marcelo Vanzin edited comment on SPARK-10901 at 10/1/15 7:39 PM: - As a potential workaround, he can add his app jar (or the kryo jar) to {{spark.(executor,driver).extraClassPath}}. If he distributes the jars with the application (using {{--jars}} in cluster mode or {{spark.yarn.dist.files}} in client mode), just add the jar names without any path and they should be prepended to the app's classpath. was (Author: vanzin): As a potential workaround, he can add his app jar (or the kryo jar) to {{spark.{executor,driver}.extraClassPath}}. If he distributes the jars with the application (using {{--jars}} in cluster mode or {{spark.yarn.dist.files}} in client mode), just add the jar names without any path and they should be prepended to the app's classpath. > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that i believe works for > local files but running on yarn using distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. 
> Note that in this case the user is trying to use different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
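The two delegation orders at issue here can be illustrated without a JVM: parent-first resolution always returns the framework's copy of a class, while child-first ("user classpath first") lets a user-supplied copy, such as a different kryo version, win. The dictionaries and version strings below are invented for the example and do not describe real jar contents.

```python
# Plain-Python illustration of classloader delegation order (not Spark's
# ChildFirstURLClassLoader). Each dict stands in for a jar's class index.

framework_jar = {"com.esotericsoftware.kryo.Kryo": "framework copy"}
user_jar      = {"com.esotericsoftware.kryo.Kryo": "user copy"}

def parent_first(name):
    # Default JVM-style delegation: ask the parent (framework) before the child.
    return framework_jar.get(name) or user_jar.get(name)

def child_first(name):
    # userClassPathFirst semantics: consult the user's jars before the parent.
    return user_jar.get(name) or framework_jar.get(name)

print(parent_first("com.esotericsoftware.kryo.Kryo"))  # framework copy
print(child_first("com.esotericsoftware.kryo.Kryo"))   # user copy
```

The LinkageError quoted in the issue arises when both orders end up loading the same class name through different loaders; the sketch only shows why the resolution order determines which copy an application sees.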
[jira] [Commented] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940289#comment-14940289 ] Marcelo Vanzin commented on SPARK-10901: Yes I did. I'd have to look more closely for why `userClassPathFirst` is not working for this case, nothing pops up at the moment. > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that i believe works for > local files but running on yarn using distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940264#comment-14940264 ] Thomas Graves commented on SPARK-10901: --- [~vanzin] thanks for the suggestion. Did you work on the new user class path stuff at all? Wondering if you might have ideas why that didn't work. It looks like its loading both versions. > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that i believe works for > local files but running on yarn using distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940263#comment-14940263 ] Weiqiang Zhuang commented on SPARK-10904: - Here you go: sc <- sparkR.init() sqlContext <- sparkRSQL.init(sc) df <- createDataFrame(sqlContext, iris) select(df, c("Sepal_Width","Sepal_Length")) > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10905) Implement freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Shepherd: Sun Rui Remaining Estimate: (was: 168h) Original Estimate: (was: 168h) Description: Currently only crosstab is implemented. This subtask is about adding freqItems() API to sparkR > Implement freqItems() for DataFrameStatFunctions in SparkR > -- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > > Currently only crosstab is implemented. This subtask is about adding > freqItems() API to sparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940257#comment-14940257 ] Weide Zhang commented on SPARK-5575: Hello, I plan to add more features for Spark DNN, especially more layer functionality as well as more types of activation functions. Shall I send a pull request to https://github.com/avulanov/spark/tree/ann-interface-gemm ? Thanks > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > Boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*
[ https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940253#comment-14940253 ] Deron Eriksson commented on SPARK-10286: Hi, I see @since annotations on clustering.py, recommendation.py, regression.py, and tuning.py in pyspark.ml.* but not on currently on others or in pyspark.ml.param.*, so I would like to work on this one. > Add @since annotation to pyspark.ml.param and pyspark.ml.* > -- > > Key: SPARK-10286 > URL: https://issues.apache.org/jira/browse/SPARK-10286 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10905) Implement freqItems() for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rerngvit yanggratoke updated SPARK-10905: - Summary: Implement freqItems() for DataFrameStatFunctions in SparkR (was: Implement freqItems for DataFrameStatFunctions in SparkR) > Implement freqItems() for DataFrameStatFunctions in SparkR > -- > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > Original Estimate: 168h > Remaining Estimate: 168h > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10894) Add 'drop' support for DataFrame's subset function
[ https://issues.apache.org/jira/browse/SPARK-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940252#comment-14940252 ] Felix Cheung commented on SPARK-10894: -- To Shivaram's point, I think it is intentional that df$Sepal_Width is Column and df[, df$Sepal_Width] is a DataFrame. > Add 'drop' support for DataFrame's subset function > -- > > Key: SPARK-10894 > URL: https://issues.apache.org/jira/browse/SPARK-10894 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Weiqiang Zhuang > > SparkR DataFrame can be subset to get one or more columns of the dataset. The > current '[' implementation does not support 'drop' when is asked for just one > column. This is not consistent with the R syntax: > x[i, j, ... , drop = TRUE] > # in R, when drop is FALSE, remain as data.frame > > class(iris[, "Sepal.Width", drop=F]) > [1] "data.frame" > # when drop is TRUE (default), drop to be a vector > > class(iris[, "Sepal.Width", drop=T]) > [1] "numeric" > > class(iris[,"Sepal.Width"]) > [1] "numeric" > > df <- createDataFrame(sqlContext, iris) > # in SparkR, 'drop' argument has no impact > > class(df[,"Sepal_Width", drop=F]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > # should have dropped to be a Column class instead > > class(df[,"Sepal_Width", drop=T]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > > class(df[,"Sepal_Width"]) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > We should add the 'drop' support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10905) Implement freqItems for DataFrameStatFunctions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940248#comment-14940248 ] rerngvit yanggratoke commented on SPARK-10905: -- I am going to work on this issue. > Implement freqItems for DataFrameStatFunctions in SparkR > > > Key: SPARK-10905 > URL: https://issues.apache.org/jira/browse/SPARK-10905 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: rerngvit yanggratoke > Fix For: 1.6.0 > > Original Estimate: 168h > Remaining Estimate: 168h >
[jira] [Created] (SPARK-10905) Implement freqItems for DataFrameStatFunctions in SparkR
rerngvit yanggratoke created SPARK-10905: Summary: Implement freqItems for DataFrameStatFunctions in SparkR Key: SPARK-10905 URL: https://issues.apache.org/jira/browse/SPARK-10905 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.5.0 Reporter: rerngvit yanggratoke Fix For: 1.6.0
[jira] [Commented] (SPARK-10904) select(df, c("col1", "col2")) fails
[ https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940239#comment-14940239 ] Felix Cheung commented on SPARK-10904: -- Do you have the repro steps/code? > select(df, c("col1", "col2")) fails > - > > Key: SPARK-10904 > URL: https://issues.apache.org/jira/browse/SPARK-10904 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Weiqiang Zhuang > > The help page for 'select' gives an example of > select(df, c("col1", "col2")) > However, this fails with assertion: > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) > at > org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) > And then none of the functions will work with following error: > > head(df) > Error in if (returnStatus != 0) { : argument is of length zero
[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940232#comment-14940232 ] Andrew Or commented on SPARK-10474: --- [~nadenf] That's a different issue. In your stack trace Spark fails to acquire memory in the prepare phase. SPARK-10474 is already past that, but fails to allocate it after switching to sort-based aggregation. It's still an issue we should fix but it's a separate one. > TungstenAggregation cannot acquire memory for pointer array after switching > to sort-based > - > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 
THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 >
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940231#comment-14940231 ] Felix Cheung commented on SPARK-10903: -- [~shivaram] what do you think? If ok I would love to take this change. > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940229#comment-14940229 ] Felix Cheung commented on SPARK-10903: -- +1 could/should we have a version of this that checks .sparkRSQLsc in .sparkREnv? (see https://github.com/NarineK/spark/blob/sparkrasDataFrame/R/pkg/R/sparkR.R#L224) > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
[jira] [Commented] (SPARK-10753) Implement freqItems() and sampleBy() in DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-10753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940226#comment-14940226 ] rerngvit yanggratoke commented on SPARK-10753: -- Can we break this into two sub-tasks: freqItems() and sampleBy()? I would like to work on adding the freqItems(). > Implement freqItems() and sampleBy() in DataFrameStatFunctions > -- > > Key: SPARK-10753 > URL: https://issues.apache.org/jira/browse/SPARK-10753 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Sun Rui >
[jira] [Resolved] (SPARK-10866) [Spark SQL] [UDF] the floor function got wrong return value type
[ https://issues.apache.org/jira/browse/SPARK-10866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10866. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8933 [https://github.com/apache/spark/pull/8933] > [Spark SQL] [UDF] the floor function got wrong return value type > > > Key: SPARK-10866 > URL: https://issues.apache.org/jira/browse/SPARK-10866 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou > Fix For: 1.6.0 > > > As per the floor definition, it should get a BIGINT return value > -floor(DOUBLE a) > -Returns the maximum BIGINT value that is equal to or less than a. > But in the current Spark implementation, it got the wrong value type. > e.g., > select floor(2642.12) from udf_test_web_sales limit 1; > 2642.0 > In the Hive implementation, it got the return value type like below: > hive> select floor(2642.12) from udf_test_web_sales limit 1; > OK > 2642
[jira] [Resolved] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
[ https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10865. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8933 [https://github.com/apache/spark/pull/8933] > [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type > --- > > Key: SPARK-10865 > URL: https://issues.apache.org/jira/browse/SPARK-10865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou > Fix For: 1.6.0 > > > As per the ceil/ceiling definition, it should get a BIGINT return value > -ceil(DOUBLE a), ceiling(DOUBLE a) > -Returns the minimum BIGINT value that is equal to or greater than a. > But in the current Spark implementation, it got the wrong value type. > e.g., > select ceil(2642.12) from udf_test_web_sales limit 1; > 2643.0 > In the Hive implementation, it got the return value type like below: > hive> select ceil(2642.12) from udf_test_web_sales limit 1; > OK > 2643
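The floor/ceil return-type mismatch described in SPARK-10866 and SPARK-10865 can be illustrated outside Spark. The sketch below is not Spark code: it uses java.lang.Math to contrast the DOUBLE-returning behavior that was reported with the integral (BIGINT-style) result Hive defines. The hiveFloor/hiveCeil helper names are mine, purely for illustration.

```java
// Illustrative sketch (not Spark code). Hive's floor()/ceil() on DOUBLE
// return BIGINT, i.e. an integral value. java.lang.Math mirrors the buggy
// DOUBLE-returning behavior, so an explicit cast models Hive's semantics.
public class FloorCeilSemantics {
    // Expected Hive-style semantics: integral (BIGINT-like) result
    static long hiveFloor(double a) { return (long) Math.floor(a); }
    static long hiveCeil(double a)  { return (long) Math.ceil(a); }

    public static void main(String[] args) {
        System.out.println(hiveFloor(2642.12)); // 2642
        System.out.println(hiveCeil(2642.12));  // 2643
        // The reported bug: results came back as DOUBLE instead
        System.out.println(Math.floor(2642.12)); // 2642.0
        System.out.println(Math.ceil(2642.12));  // 2643.0
    }
}
```

Note the cast is taken after Math.floor/Math.ceil, so rounding direction is preserved for negative inputs as well (floor(-2.5) is -3, not -2).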
[jira] [Commented] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib
[ https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940217#comment-14940217 ] Joseph K. Bradley commented on SPARK-10669: --- Sure, please do. > Link to each language's API in codetabs in ML docs: spark.mllib > --- > > Key: SPARK-10669 > URL: https://issues.apache.org/jira/browse/SPARK-10669 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Reporter: Joseph K. Bradley > > In the Markdown docs for the spark.mllib Programming Guide, we have code > examples with codetabs for each language. We should link to each language's > API docs within the corresponding codetab, but we are inconsistent about > this. For an example of what we want to do, see the "ChiSqSelector" section > in > [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md] > This JIRA is just for spark.mllib, not spark.ml
[jira] [Comment Edited] (SPARK-9695) Add random seed Param to ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940207#comment-14940207 ] Joseph K. Bradley edited comment on SPARK-9695 at 10/1/15 6:45 PM: --- That's what I would propose too. There are a few complications to figure out though. *API* * If a Pipeline stage has a seed explicitly set, should the Pipeline overwrite that seed? I'd vote for no. *What behavior do we want in the situation below?* Situation: * User creates a Pipeline with some stages * User sets pipeline.seed * User saves pipeline to FILE * User runs pipeline and produces model A * User loads Pipeline from FILE and runs it to produce model B I'd say that the ideal behavior will be for model A and B to produce exactly the same results. However, this will require us to guarantee that each Pipeline stage is given the same seed for both A and B; i.e., the random number generator used by the Pipeline should not change behavior across Spark versions. Is that a reasonable assumption? Note: We could have a Pipeline set its stages' seeds whenever Pipeline.setSeed is called, but that would cause problems with (a) the question above under "API" and (b) if stages are modified after the seed is modified. I'll try to think of other possible issues too. CC: [~mengxr] was (Author: josephkb): That's what I would propose too. There are a few complications to figure out though. *API* * If a Pipeline stage has a seed explicitly set, should the Pipeline overwrite that seed? I'd vote for no. *What behavior do we want in the situation below?* Situation: * User creates a Pipeline with some stages * User sets pipeline.seed * User saves pipeline to FILE * User runs pipeline and produces model A * User loads Pipeline from FILE and runs it to produce model B I'd say that the ideal behavior will be for model A and B to produce exactly the same results.
However, this will require us to guarantee that each Pipeline stage is given the same seed for both A and B; i.e., the random number generator used by the Pipeline should not change behavior across Spark versions. Is that a reasonable assumption? I'll try to think of other possible issues too. CC: [~mengxr] > Add random seed Param to ML Pipeline > > > Key: SPARK-9695 > URL: https://issues.apache.org/jira/browse/SPARK-9695 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Note this will require some discussion about whether to make HasSeed the main > API for whether an algorithm takes a seed.
[jira] [Commented] (SPARK-9695) Add random seed Param to ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940207#comment-14940207 ] Joseph K. Bradley commented on SPARK-9695: -- That's what I would propose too. There are a few complications to figure out though. *API* * If a Pipeline stage has a seed explicitly set, should the Pipeline overwrite that seed? I'd vote for no. *What behavior do we want in the situation below?* Situation: * User creates a Pipeline with some stages * User sets pipeline.seed * User saves pipeline to FILE * User runs pipeline and produces model A * User loads Pipeline from FILE and runs it to produce model B I'd say that the ideal behavior will be for model A and B to produce exactly the same results. However, this will require us to guarantee that each Pipeline stage is given the same seed for both A and B; i.e., the random number generator used by the Pipeline should not change behavior across Spark versions. Is that a reasonable assumption? I'll try to think of other possible issues too. CC: [~mengxr] > Add random seed Param to ML Pipeline > > > Key: SPARK-9695 > URL: https://issues.apache.org/jira/browse/SPARK-9695 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Note this will require some discussion about whether to make HasSeed the main > API for whether an algorithm takes a seed.
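A minimal sketch of the seed-propagation rule being discussed: do not overwrite an explicitly set stage seed, and derive unset seeds deterministically from the pipeline seed plus a stable per-stage identifier, so that a saved-and-reloaded pipeline reproduces the same run. All class and method names here are hypothetical, not Spark's actual Pipeline API.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the rule discussed above; not Spark's API.
class Stage {
    private Long seed;          // null means: not explicitly set by the user
    private final String uid;
    Stage(String uid) { this.uid = uid; }
    boolean isSeedSet() { return seed != null; }
    void setSeed(long s) { seed = s; }
    Long getSeed() { return seed; }
    String uid() { return uid; }
}

class Pipeline {
    static void propagateSeed(long pipelineSeed, List<Stage> stages) {
        for (Stage stage : stages) {
            if (!stage.isSeedSet()) {
                // Deterministic derivation: the same pipeline seed and stage
                // uid always yield the same stage seed, across save/load --
                // provided the derivation function itself stays stable,
                // which is exactly the assumption questioned in the comment.
                stage.setSeed(Objects.hash(pipelineSeed, stage.uid()));
            }
        }
    }
}
```

Deriving seeds at fit time (rather than eagerly in Pipeline.setSeed) also sidesteps problem (b) above: stages added or swapped after the seed is set still get a seed.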
[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940202#comment-14940202 ] Josh Rosen commented on SPARK-7448: --- I tried a hacky prototype of this and don't remember it showing a huge difference, but that's not to say that it's not worth trying again. Feel free to take a stab at this. > Implement custom byte array serializer for use in PySpark shuffle > > > Key: SPARK-7448 > URL: https://issues.apache.org/jira/browse/SPARK-7448 > Project: Spark > Issue Type: Improvement > Components: PySpark, Shuffle >Reporter: Josh Rosen >Priority: Minor > > PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We > should implement a custom Serializer for use in these shuffles. This will > allow us to take advantage of shuffle optimizations like SPARK-7311 for > PySpark without requiring users to change the default serializer to > KryoSerializer (this is useful for JobServer-type applications).
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940196#comment-14940196 ] Joseph K. Bradley commented on SPARK-10788: --- Updated. Does it make more sense now? > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml).
[jira] [Commented] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940197#comment-14940197 ] Joseph K. Bradley commented on SPARK-10413: --- SGTM > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance.
[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10788: -- Description: Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example. Say there are 3 categories A, B, C. We consider 3 splits: * A vs. B, C * A, B vs. C * A, C vs. B Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = stats(A,B,C) - stats(A)}}. We should eliminate these extra bins within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml). was: Decision trees in spark.ml (RandomForest.scala) effectively creates a second copy of each split. E.g., if there are 3 categories A, B, C, then we should consider 3 splits: * A vs. B, C * A, B vs. C * A, C vs. B Currently, we also consider the 3 flipped splits: * B,C vs. A * C vs. A, B * B vs. A, C This means we communicate twice as much data as needed for these features. We should eliminate these duplicate splits within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml). > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. 
We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml).
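The subtraction trick in the updated description ({{stats(B,C) = stats(A,B,C) - stats(A)}}) is element-wise subtraction over label-statistic arrays: collect stats only for the left-hand subsets plus the whole node, then recover each right-hand subset by subtracting. A tiny illustration with hypothetical names (this is not the RandomForest.scala code):

```java
import java.util.Arrays;

// Sketch of the optimization above: given label statistics for the whole
// node and for a split's left-hand category subset, the right-hand subset's
// statistics are recovered by element-wise subtraction instead of being
// collected (and communicated) separately.
public class ComplementStats {
    static double[] subtract(double[] nodeTotal, double[] left) {
        double[] right = new double[nodeTotal.length];
        for (int i = 0; i < nodeTotal.length; i++) {
            right[i] = nodeTotal[i] - left[i];
        }
        return right;
    }

    public static void main(String[] args) {
        double[] statsABC = {10, 5}; // per-label counts for the whole node {A,B,C}
        double[] statsA   = {4, 2};  // collected for the left subset {A}
        // stats(B,C) = stats(A,B,C) - stats(A) -- never collected directly
        System.out.println(Arrays.toString(subtract(statsABC, statsA))); // [6.0, 3.0]
    }
}
```

This halves the statistics that workers must aggregate for unordered categorical features, at the cost of one extra stats vector for the node itself.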
[jira] [Assigned] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10901: Assignee: Thomas Graves (was: Apache Spark) > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that I believe works for > local files, but running on yarn using the distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use a different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo"
[jira] [Commented] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940192#comment-14940192 ] Apache Spark commented on SPARK-10901: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/8959 > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that I believe works for > local files, but running on yarn using the distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use a different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo"
[jira] [Assigned] (SPARK-10901) spark.yarn.user.classpath.first doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10901: Assignee: Apache Spark (was: Thomas Graves) > spark.yarn.user.classpath.first doesn't work > > > Key: SPARK-10901 > URL: https://issues.apache.org/jira/browse/SPARK-10901 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Critical > > spark.yarn.user.classpath.first doesn't properly add the app jar to the > system class path first. It has some logic there that I believe works for > local files, but running on yarn using the distributed cache to distribute the app > jar doesn't put __app__.jar into the classpath at all. > This is a break in backwards compatibility. > Note that in this case the user is trying to use a different version of kryo > (which used to work in spark 1.2) and the new configs for this: > spark.{driver, executor}.userClassPathFirst don't allow this as it errors out > with: > User class threw exception: java.lang.LinkageError: loader constraint > violation: loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading > for a different type with name "com/esotericsoftware/kryo/Kryo"
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940186#comment-14940186 ] Joseph K. Bradley commented on SPARK-10788: --- Reading what I wrote now, I realize I didn't actually phrase it correctly. I'll update the description. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Decision trees in spark.ml (RandomForest.scala) effectively creates a second > copy of each split. E.g., if there are 3 categories A, B, C, then we should > consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we also consider the 3 flipped splits: > * B,C vs. A > * C vs. A, B > * B vs. A, C > This means we communicate twice as much data as needed for these features. > We should eliminate these duplicate splits within the spark.ml implementation > since the spark.mllib implementation will be removed before long (and will > instead call into spark.ml).
[jira] [Commented] (SPARK-7398) Add back-pressure to Spark Streaming (umbrella JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940184#comment-14940184 ] Iulian Dragos commented on SPARK-7398: -- Hey, except the last point, everything is available in 1.5. You can go ahead and tackle the remaining ticket, of course. > Add back-pressure to Spark Streaming (umbrella JIRA) > > > Key: SPARK-7398 > URL: https://issues.apache.org/jira/browse/SPARK-7398 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.1 >Reporter: François Garillot >Assignee: Tathagata Das >Priority: Critical > Labels: streams > > Spark Streaming has trouble dealing with situations where > batch processing time > batch interval > Meaning a high throughput of input data w.r.t. Spark's ability to remove data > from the queue. > If this throughput is sustained for long enough, it leads to an unstable > situation where the memory of the Receiver's Executor is overflowed. > This aims at transmitting a back-pressure signal back to data ingestion to > help with dealing with that high throughput, in a backwards-compatible way. > The original design doc can be found here: > https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing > The second design doc, focusing [on the first > sub-task|https://issues.apache.org/jira/browse/SPARK-8834] (without all the > background info, and more centered on the implementation) can be found here: > https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing
[jira] [Updated] (SPARK-10787) Consider replacing ObjectOutputStream for serialization to prevent OOME
[ https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10787: --- Priority: Major (was: Minor) > Consider replacing ObjectOutputStream for serialization to prevent OOME > --- > > Key: SPARK-10787 > URL: https://issues.apache.org/jira/browse/SPARK-10787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu > > In the thread "Spark ClosureCleaner or java serializer OOM when trying to > grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that > ClosureCleaner#ensureSerializable() resulted in an OOME. > The cause was that ObjectOutputStream keeps a strong reference to every > object that was written to it. > This issue tries to avoid the OOME by considering an alternative to > ObjectOutputStream
[jira] [Updated] (SPARK-10787) Consider replacing ObjectOutputStream for serialization to prevent OOME
[ https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10787: --- Summary: Consider replacing ObjectOutputStream for serialization to prevent OOME (was: Reset ObjectOutputStream more often to prevent OOME) > Consider replacing ObjectOutputStream for serialization to prevent OOME > --- > > Key: SPARK-10787 > URL: https://issues.apache.org/jira/browse/SPARK-10787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > In the thread "Spark ClosureCleaner or java serializer OOM when trying to > grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that > ClosureCleaner#ensureSerializable() resulted in an OOME. > The cause was that ObjectOutputStream keeps a strong reference to every > object that was written to it. > This issue tries to avoid the OOME by calling reset() more often.
[jira] [Updated] (SPARK-10787) Consider replacing ObjectOutputStream for serialization to prevent OOME
[ https://issues.apache.org/jira/browse/SPARK-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10787: --- Description: In the thread "Spark ClosureCleaner or java serializer OOM when trying to grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in an OOME. The cause was that ObjectOutputStream keeps a strong reference to every object that was written to it. This issue tries to avoid the OOME by considering an alternative to ObjectOutputStream was: In the thread "Spark ClosureCleaner or java serializer OOM when trying to grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that ClosureCleaner#ensureSerializable() resulted in an OOME. The cause was that ObjectOutputStream keeps a strong reference to every object that was written to it. This issue tries to avoid the OOME by calling reset() more often. > Consider replacing ObjectOutputStream for serialization to prevent OOME > --- > > Key: SPARK-10787 > URL: https://issues.apache.org/jira/browse/SPARK-10787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > In the thread "Spark ClosureCleaner or java serializer OOM when trying to > grow" (http://search-hadoop.com/m/q3RTtAr5X543dNn), Jay Luan reported that > ClosureCleaner#ensureSerializable() resulted in an OOME. > The cause was that ObjectOutputStream keeps a strong reference to every > object that was written to it. > This issue tries to avoid the OOME by considering an alternative to > ObjectOutputStream
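The failure mode above has a close analogue in Python's stdlib that is easy to demonstrate without a JVM: like Java's ObjectOutputStream handle table, pickle.Pickler keeps a strong reference to every object written to it (its "memo") so repeats can be encoded as short back-references, and clear_memo() plays the role of ObjectOutputStream.reset(). The sketch below is only an illustration of that behaviour, not code from the Spark ClosureCleaner.

```python
import io
import pickle

# pickle.Pickler memoizes every object it writes, just as Java's
# ObjectOutputStream pins written objects in its handle table; a
# long-lived stream therefore grows memory without bound unless the
# memo is cleared.
big = "x" * 10_000
buf = io.BytesIO()
pickler = pickle.Pickler(buf)

pickler.dump(big)
n1 = buf.tell()        # full ~10 KB encoding of the string

pickler.dump(big)
n2 = buf.tell()        # memo hit: only a few bytes (a back-reference)

pickler.clear_memo()   # the analogue of ObjectOutputStream.reset()
pickler.dump(big)
n3 = buf.tell()        # re-serialized in full again

print(n1, n2 - n1, n3 - n2)
```

The trade-off is the same in both languages: clearing the table caps memory but forfeits back-references, so repeated objects are written out in full again, which is exactly why the ticket considers replacing the stream rather than just resetting it.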
[jira] [Commented] (SPARK-10172) History Server web UI gets messed up when sorting on any column
[ https://issues.apache.org/jira/browse/SPARK-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940156#comment-14940156 ] Josiah Samuel Sathiadass commented on SPARK-10172: -- [~tgraves], As per the current implementation, the Spark table pagination is not linked with the current sorting logic. This means we can only sort the table content that gets displayed on any particular page. Since table creation and data population are done in a generic way inside Spark, modifying such logic would have a wider impact and demand more UI testing. We went ahead with a quick fix by disabling sorting on a table if it contains multiple attempts. > History Server web UI gets messed up when sorting on any column > --- > > Key: SPARK-10172 > URL: https://issues.apache.org/jira/browse/SPARK-10172 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0, 1.4.1 >Reporter: Min Shen >Assignee: Josiah Samuel Sathiadass >Priority: Minor > Labels: regression > Fix For: 1.5.1, 1.6.0 > > Attachments: screen-shot.png > > > If the history web UI displays the "Attempt ID" column, when clicking the > table header to sort on any column, the entire page gets messed up. > This seems to be a problem with sorttable.js not being able to correctly handle > tables with rowspan.
[jira] [Commented] (SPARK-10897) Custom job/stage names
[ https://issues.apache.org/jira/browse/SPARK-10897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940157#comment-14940157 ] Nithin Asokan commented on SPARK-10897: --- {quote} For example if groupBy results in 3 stages, which one gets the name? if 3 method calls result in 1 stage? I don't think it's impossible but not sure about the details of the semantics. {quote} This is a good point; I had not considered this scenario. {quote} is the motivation really to just display something farther up the call stack? {quote} Yes, Crunch has a concept of DoFn which is similar to Function in Spark. These DoFns can take names that are usually displayed on a Job page in MR. I should not be comparing MR to Spark, but in my use case, we are migrating from MR to Spark, and our engineers are familiar with how Crunch creates a MR job with a nice job name that includes all DoFn names; this gives more context to a user about what the job is processing. For example: in MR, Crunch can create a job name like {{MyPipeline: Text("/input/path")+Filter valid lines+Text("/output/path")}}. In the case of Spark, we are missing that information, I believe partly because the Spark scheduler handles stage and job creation. A Spark job/stage name may appear as {code} sortByKey at PGroupedTableImpl.java:123 (job name) mapToPair at PGroupedTableImpl.java:108 (stage name) {code} While this gives an idea that it's processing/creating a PGroupedTable, it does not give me the full context (at least through Crunch) of the DoFns applied. If Spark allows users to set stage names, I think we can pass some DoFn information from Crunch. The next thing I would ask myself is: if Crunch does not know what stages are created, how can it know which DoFn name to pass to Spark? I'm not fully sure whether this can be supported, given my limited knowledge of Spark, but if others feel it's possible, it could be something helpful for Crunch.
> Custom job/stage names > -- > > Key: SPARK-10897 > URL: https://issues.apache.org/jira/browse/SPARK-10897 > Project: Spark > Issue Type: Wish > Components: Web UI >Reporter: Nithin Asokan >Priority: Minor > > Logging this jira to get some opinions about a discussion I started on > [user-list|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Job-Stage-names-tt24867.html] > I would like to get some thoughts about having custom stage/job names. > Currently I believe the stage names cannot be controlled by the user, but if > allowed, we can have libraries like Apache [Crunch|https://crunch.apache.org/] > dynamically set stage names based on the type of > processing (action/transformation) being performed. > Is it possible for Spark to support custom names? Will it make sense to allow > users to set stage names?
[jira] [Commented] (SPARK-10779) Set initialModel for KMeans model in PySpark (spark.mllib)
[ https://issues.apache.org/jira/browse/SPARK-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940153#comment-14940153 ] Joseph K. Bradley commented on SPARK-10779: --- Sounds good, thanks! > Set initialModel for KMeans model in PySpark (spark.mllib) > -- > > Key: SPARK-10779 > URL: https://issues.apache.org/jira/browse/SPARK-10779 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Joseph K. Bradley > > Provide initialModel param for pyspark.mllib.clustering.KMeans
[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940133#comment-14940133 ] Gayathri Murali commented on SPARK-7448: Is anyone working on this? If not, I would like to work on it. > Implement custom byte array serializer for use in PySpark shuffle > > > Key: SPARK-7448 > URL: https://issues.apache.org/jira/browse/SPARK-7448 > Project: Spark > Issue Type: Improvement > Components: PySpark, Shuffle >Reporter: Josh Rosen >Priority: Minor > > PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We > should implement a custom Serializer for use in these shuffles. This will > allow us to take advantage of shuffle optimizations like SPARK-7311 for > PySpark without requiring users to change the default serializer to > KryoSerializer (this is useful for JobServer-type applications).
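The core idea of the ticket, that byte arrays need no generic object serialization and can be framed as raw length-prefixed bytes, can be sketched in a few lines of plain Python. This is a hypothetical illustration of the framing, not Spark's actual Serializer interface; the class and method names are made up for the example.

```python
import io
import struct

class ByteArraySerializer:
    """Minimal length-prefixed framing for byte arrays.

    A dedicated byte-array serializer can copy the raw bytes straight
    through, skipping generic object-serialization machinery entirely.
    """

    def dump(self, payload: bytes, stream) -> None:
        # 4-byte big-endian length header, then the raw payload
        stream.write(struct.pack(">i", len(payload)))
        stream.write(payload)

    def load(self, stream) -> bytes:
        (length,) = struct.unpack(">i", stream.read(4))
        return stream.read(length)

ser = ByteArraySerializer()
buf = io.BytesIO()
for chunk in (b"key-1", b"value-1", b"key-2"):
    ser.dump(chunk, buf)
buf.seek(0)
out = [ser.load(buf) for _ in range(3)]
print(out)  # the three byte strings round-trip unchanged
```

Because the payload is already bytes, there is nothing to reflect over or encode, which is why such a serializer can be faster than a general-purpose default without users having to switch to Kryo.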
[jira] [Created] (SPARK-10904) select(df, c("col1", "col2")) fails
Weiqiang Zhuang created SPARK-10904: --- Summary: select(df, c("col1", "col2")) fails Key: SPARK-10904 URL: https://issues.apache.org/jira/browse/SPARK-10904 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Weiqiang Zhuang The help page for 'select' gives an example of select(df, c("col1", "col2")). However, this fails with an assertion error: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92) at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99) at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63) at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52) at org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182) at org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181) After that, none of the functions work, failing with the following error: > head(df) Error in if (returnStatus != 0) { : argument is of length zero