[jira] [Commented] (SPARK-16037) use by-position resolution when insert into hive table
[ https://issues.apache.org/jira/browse/SPARK-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338367#comment-15338367 ] Apache Spark commented on SPARK-16037: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13766 > use by-position resolution when insert into hive table > -- > > Key: SPARK-16037 > URL: https://issues.apache.org/jira/browse/SPARK-16037 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b; > The result is 1, 3, 2 for hive table, which is wrong -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
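The mis-ordering described in SPARK-16037 can be illustrated with a spark-shell session along these lines. This is a sketch (the table definition is assumed, and the commented results restate the behaviour described in the issue, not output I have run):

```scala
// Assumed Hive table: src(a INT, b INT, c INT)
sql("INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b")
// With by-name resolution, the aliases c and b steer the values into columns
// c and b, storing the row as (a=1, b=3, c=2) -- the wrong result reported above.
// With by-position resolution (the fix), values map to columns in SELECT-clause
// order, storing (a=1, b=2, c=3), which matches standard SQL semantics.
```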
[jira] [Commented] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338368#comment-15338368 ] Apache Spark commented on SPARK-16034: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13766 > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Suppose we have defined a partitioned table: > {code} > CREATE TABLE src (a INT, b INT, c INT) > USING PARQUET > PARTITIONED BY (a, b); > {code} > We should check the partition columns when appending DataFrame data to > existing table: > {code} > val df = Seq((1, 2, 3)).toDF("a", "b", "c") > df.write.partitionBy("b", "a").mode("append").saveAsTable("src") > {code}
[jira] [Commented] (SPARK-16036) better error message if the number of columns in SELECT clause doesn't match the table schema
[ https://issues.apache.org/jira/browse/SPARK-16036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338366#comment-15338366 ] Apache Spark commented on SPARK-16036: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13766 > better error message if the number of columns in SELECT clause doesn't match > the table schema > - > > Key: SPARK-16036 > URL: https://issues.apache.org/jira/browse/SPARK-16036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6;
[jira] [Updated] (SPARK-15722) Wrong data when CTAS specifies schema
[ https://issues.apache.org/jira/browse/SPARK-15722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-15722: Description:
{code}
scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
scala> (1 to 3).map { i => (i, i * 2, i * 3) }.toDF("height", "length", "width").write.insertInto("boxes")
scala> spark.table("boxes").show()
+-----+------+------+
|width|length|height|
+-----+------+------+
|    1|     2|     3|
|    2|     4|     6|
|    3|     6|     9|
+-----+------+------+

scala> sql("CREATE TABLE blocks (name STRING, age INT) AS SELECT * FROM boxes")
scala> spark.table("blocks").show()
+----+---+
|name|age|
+----+---+
|   1|  2|
|   2|  4|
|   3|  6|
+----+---+
{code}
The columns don't even match in types.

was: the same reproduction, except the final query read {{spark.table("students").show()}}, a typo for the {{blocks}} table created by the CTAS.
> Wrong data when CTAS specifies schema > -- > > Key: SPARK-15722 > URL: https://issues.apache.org/jira/browse/SPARK-15722 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0
[jira] [Assigned] (SPARK-16052) Add CollapseRepartitionBy optimizer
[ https://issues.apache.org/jira/browse/SPARK-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16052: Assignee: (was: Apache Spark) > Add CollapseRepartitionBy optimizer > --- > > Key: SPARK-16052 > URL: https://issues.apache.org/jira/browse/SPARK-16052 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-16052) Add CollapseRepartitionBy optimizer
[ https://issues.apache.org/jira/browse/SPARK-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16052: Assignee: Apache Spark > Add CollapseRepartitionBy optimizer > --- > > Key: SPARK-16052 > URL: https://issues.apache.org/jira/browse/SPARK-16052 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun >Assignee: Apache Spark
[jira] [Commented] (SPARK-16052) Add CollapseRepartitionBy optimizer
[ https://issues.apache.org/jira/browse/SPARK-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338309#comment-15338309 ] Apache Spark commented on SPARK-16052: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13765 > Add CollapseRepartitionBy optimizer > --- > > Key: SPARK-16052 > URL: https://issues.apache.org/jira/browse/SPARK-16052 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-16052) Add CollapseRepartitionBy optimizer
Dongjoon Hyun created SPARK-16052: - Summary: Add CollapseRepartitionBy optimizer Key: SPARK-16052 URL: https://issues.apache.org/jira/browse/SPARK-16052 Project: Spark Issue Type: Improvement Components: Optimizer Reporter: Dongjoon Hyun

This issue adds a new optimizer, `CollapseRepartitionBy`.

**Before**
{code}
scala> spark.range(10).repartition(1, $"id").repartition(1, $"id").explain
== Physical Plan ==
Exchange hashpartitioning(id#0L, 1)
+- Exchange hashpartitioning(id#0L, 1)
   +- *Range (0, 10, splits=8)
{code}

**After**
{code}
scala> spark.range(10).repartition(1, $"id").repartition(1, $"id").explain
== Physical Plan ==
Exchange hashpartitioning(id#0L, 1)
+- *Range (0, 10, splits=8)
{code}
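A Catalyst rule with the described effect could be sketched roughly as below. This is not the code from the linked PR; the pattern (an optimizer rule collapsing adjacent `RepartitionByExpression` nodes) is an assumption based on the plans shown above:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RepartitionByExpression}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: when a repartition-by-expression sits directly on top of another,
// the outer shuffle alone determines the output partitioning, so the inner
// Exchange can be dropped from the plan.
object CollapseRepartitionBy extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case outer @ RepartitionByExpression(_, inner: RepartitionByExpression, _) =>
      outer.copy(child = inner.child)
  }
}
```

Using `copy` keeps the outer node's partition expressions and partition count, so only the redundant inner shuffle is removed.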
[jira] [Assigned] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16024: Assignee: Apache Spark > column comment is ignored for datasource table > -- > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > > CREATE TABLE src(a INT COMMENT 'bla') USING parquet. > When we describe table, the column comment is not there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16024: Assignee: (was: Apache Spark) > column comment is ignored for datasource table > -- > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan
[jira] [Commented] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338286#comment-15338286 ] Apache Spark commented on SPARK-16024: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/13764 > column comment is ignored for datasource table > -- > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan
[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15973: Assignee: Josh Howes > Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Assignee: Josh Howes >Priority: Trivial > Fix For: 2.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > (1) > {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest > Python comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > PySpark. > (2) > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. > (3) > The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps > as the member variable {{self._jdf}}, exactly the same name that > {{pyspark.sql.DataFrame}} uses for its wrapped object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}}; in fact, in the {{DataFrame.groupBy}} implementation, the > Java object is referred to as exactly {{jgd}}.
[jira] [Assigned] (SPARK-16051) Add `read.orc/write.orc` to SparkR
[ https://issues.apache.org/jira/browse/SPARK-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16051: Assignee: Apache Spark > Add `read.orc/write.orc` to SparkR > -- > > Key: SPARK-16051 > URL: https://issues.apache.org/jira/browse/SPARK-16051 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue adds `read.orc/write.orc` to SparkR for API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16051) Add `read.orc/write.orc` to SparkR
[ https://issues.apache.org/jira/browse/SPARK-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16051: Assignee: (was: Apache Spark) > Add `read.orc/write.orc` to SparkR > -- > > Key: SPARK-16051 > URL: https://issues.apache.org/jira/browse/SPARK-16051 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Dongjoon Hyun
[jira] [Commented] (SPARK-16051) Add `read.orc/write.orc` to SparkR
[ https://issues.apache.org/jira/browse/SPARK-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338187#comment-15338187 ] Apache Spark commented on SPARK-16051: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13763 > Add `read.orc/write.orc` to SparkR > -- > > Key: SPARK-16051 > URL: https://issues.apache.org/jira/browse/SPARK-16051 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-16051) Add `read.orc/write.orc` to SparkR
Dongjoon Hyun created SPARK-16051: - Summary: Add `read.orc/write.orc` to SparkR Key: SPARK-16051 URL: https://issues.apache.org/jira/browse/SPARK-16051 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Dongjoon Hyun This issue adds `read.orc/write.orc` to SparkR for API parity.
[jira] [Commented] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338148#comment-15338148 ] Xiao Li commented on SPARK-16024:
{noformat}
test("desc table for parquet data source table") {
  val tabName = "tab1"
  withTable(tabName) {
    sql(s"CREATE TABLE $tabName(a int comment 'test') USING parquet ")
    checkAnswer(
      sql(s"DESC $tabName").select("comment"),
      Row("test")
    )
  }
}
{noformat}
I tried both catalogs (in-memory catalog and hive metastore). The above test case can pass in both cases. Could you explain a little bit more about the exact scenario? Thanks!
[jira] [Commented] (SPARK-6814) Support sorting for any data type in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338147#comment-15338147 ] Dongjoon Hyun commented on SPARK-6814: -- Hi, [~shivaram]. Since SparkR RDD is hiding from users now, can we simply close this issue? > Support sorting for any data type in SparkR > --- > > Key: SPARK-6814 > URL: https://issues.apache.org/jira/browse/SPARK-6814 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Critical > > I get various "return status == 0 is false" and "unimplemented type" errors > trying to get data out of any rdd with top() or collect(). The errors are not > consistent. I think spark is installed properly because some operations do > work. I apologize if I'm missing something easy or not providing the right > diagnostic info – I'm new to SparkR, and this seems to be the only resource > for SparkR issues. > Some logs: > {code} > Browse[1]> top(estep.rdd, 1L) > Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : > unimplemented type 'list' in 'orderVector1' > Calls: do.call ... Reduce -> -> func -> FUN -> FUN -> order > Execution halted > 15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14) > org.apache.spark.SparkException: R computation failed with > Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : > unimplemented type 'list' in 'orderVector1' > Calls: do.call ... 
Reduce -> -> func -> FUN -> FUN -> order > Execution halted > at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, > localhost): org.apache.spark.SparkException: R computation failed with > Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : > unimplemented type 'list' in 'orderVector1' > Calls: do.call ... Reduce -> -> func -> FUN -> FUN -> order > Execution halted > edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338128#comment-15338128 ] Xiao Li commented on SPARK-16024: - Thanks! In Spark 2.0, the simplest solution is to put {{comment}} into the {{metadata}} of {{StructField}}. However, in the long term, I think we need to consolidate {{StructField}} and {{CatalogColumn}}.
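The {{StructField}} metadata approach mentioned above can be sketched like this; the metadata key name "comment" is an assumption for illustration, not necessarily what the eventual PR uses:

```scala
import org.apache.spark.sql.types.{IntegerType, MetadataBuilder, StructField, StructType}

// Attach the column comment to the field's metadata so it survives the round
// trip through the data source table's schema (key name "comment" is assumed):
val meta = new MetadataBuilder().putString("comment", "bla").build()
val schema = StructType(Seq(StructField("a", IntegerType, nullable = true, metadata = meta)))

// DESC could then surface the comment by reading the metadata back:
val comment = schema("a").metadata.getString("comment")  // "bla"
```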
[jira] [Commented] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338104#comment-15338104 ] Wenchen Fan commented on SPARK-16024: - yea go ahead, thanks!
[jira] [Commented] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338096#comment-15338096 ] Xiao Li commented on SPARK-16024: - : ) Found a more serious bug in Json when reading the related code.
[jira] [Assigned] (SPARK-14926) OneVsRest labelMetadata uses incorrect name
[ https://issues.apache.org/jira/browse/SPARK-14926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14926: Assignee: Apache Spark > OneVsRest labelMetadata uses incorrect name > --- > > Key: SPARK-14926 > URL: https://issues.apache.org/jira/browse/SPARK-14926 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Trivial > > OneVsRestModel applies {{labelMetadata}} to the output column, but the > metadata could contain the wrong name. The attribute name should be modified > to match {{predictionCol}}. > Here is the relevant location: > [[https://github.com/apache/spark/blob/2a3d39f48b1a7bb462e17e80e243bbc0a94d802e/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L200]] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14926) OneVsRest labelMetadata uses incorrect name
[ https://issues.apache.org/jira/browse/SPARK-14926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14926: Assignee: (was: Apache Spark) > OneVsRest labelMetadata uses incorrect name > --- > > Key: SPARK-14926 > URL: https://issues.apache.org/jira/browse/SPARK-14926 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Priority: Trivial
[jira] [Commented] (SPARK-14926) OneVsRest labelMetadata uses incorrect name
[ https://issues.apache.org/jira/browse/SPARK-14926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338092#comment-15338092 ] Apache Spark commented on SPARK-14926: -- User 'josh-howes' has created a pull request for this issue: https://github.com/apache/spark/pull/13762 > OneVsRest labelMetadata uses incorrect name > --- > > Key: SPARK-14926 > URL: https://issues.apache.org/jira/browse/SPARK-14926 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.4.1, 1.5.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Priority: Trivial
[jira] [Created] (SPARK-16050) Flaky Test: Complete aggregation with Console sink
Burak Yavuz created SPARK-16050: --- Summary: Flaky Test: Complete aggregation with Console sink Key: SPARK-16050 URL: https://issues.apache.org/jira/browse/SPARK-16050 Project: Spark Issue Type: Test Components: SQL, Streaming Reporter: Burak Yavuz Priority: Critical Please refer to the multiple failures in the last day: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/1018/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/1017/consoleFull
[jira] [Commented] (SPARK-16006) Attempting to write empty DataFrame with no fields throws non-intuitive exception
[ https://issues.apache.org/jira/browse/SPARK-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338083#comment-15338083 ] Dongjoon Hyun commented on SPARK-16006: --- Hi, [~tdas]. The PR is updated, could you review again? > Attempting to write empty DataFrame with no fields throws non-intuitive > exception > --- > > Key: SPARK-16006 > URL: https://issues.apache.org/jira/browse/SPARK-16006 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tathagata Das >Priority: Minor > > Attempting to write an empty DataFrame created with > {{sparkSession.emptyDataFrame.write.text("p")}} fails with the following > exception > {code} > org.apache.spark.sql.AnalysisException: Cannot use all columns for partition > columns; > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:355) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:435) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:213) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:196) > at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:525) > ... 48 elided > {code} > This is because # fields == # partitioning columns == 0 at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:355). > This is a non-intuitive error message; a better one would be "Cannot write > dataset with no fields".
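The suggested improvement amounts to checking for an empty schema before the all-columns-are-partition-columns check. A rough sketch, using a hypothetical method shape rather than Spark's actual {{validatePartitionColumn}} signature:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.StructType

// Hypothetical guard: report the empty-schema case first, so a DataFrame with
// zero fields gets a direct message instead of the misleading partition error.
def validatePartitionColumn(schema: StructType, partitionColumns: Seq[String]): Unit = {
  if (schema.isEmpty) {
    throw new AnalysisException("Cannot write dataset with no fields")
  }
  if (partitionColumns.size == schema.size) {
    throw new AnalysisException("Cannot use all columns for partition columns")
  }
}
```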
[jira] [Resolved] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16034. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13749 [https://github.com/apache/spark/pull/13749] > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong >Assignee: Sean Zhong > Fix For: 2.0.0
[jira] [Resolved] (SPARK-16037) use by-position resolution when insert into hive table
[ https://issues.apache.org/jira/browse/SPARK-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16037. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13754 [https://github.com/apache/spark/pull/13754] > use by-position resolution when insert into hive table > -- > > Key: SPARK-16037 > URL: https://issues.apache.org/jira/browse/SPARK-16037 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0
[jira] [Resolved] (SPARK-16036) better error message if the number of columns in SELECT clause doesn't match the table schema
[ https://issues.apache.org/jira/browse/SPARK-16036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16036. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13754 [https://github.com/apache/spark/pull/13754] > better error message if the number of columns in SELECT clause doesn't match > the table schema > - > > Key: SPARK-16036 > URL: https://issues.apache.org/jira/browse/SPARK-16036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16034) Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-16034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16034: - Assignee: Sean Zhong > Checks the partition columns when calling > dataFrame.write.mode("append").saveAsTable > > > Key: SPARK-16034 > URL: https://issues.apache.org/jira/browse/SPARK-16034 > Project: Spark > Issue Type: Sub-task >Reporter: Sean Zhong >Assignee: Sean Zhong > > Suppose we have defined a partitioned table: > {code} > CREATE TABLE src (a INT, b INT, c INT) > USING PARQUET > PARTITIONED BY (a, b); > {code} > We should check the partition columns when appending DataFrame data to > existing table: > {code} > val df = Seq((1, 2, 3)).toDF("a", "b", "c") > df.write.partitionBy("b", "a").mode("append").saveAsTable("src") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338066#comment-15338066 ] Yin Huai commented on SPARK-16032: -- We will attach the report here. > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Blocker > > We found that semantics of various insertion operations related to partitioned > tables can be inconsistent. This is an umbrella ticket for all related > tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16049) Make InsertIntoTable's expectedColumns support case-insensitive resolution properly
Yin Huai created SPARK-16049: Summary: Make InsertIntoTable's expectedColumns support case-insensitive resolution properly Key: SPARK-16049 URL: https://issues.apache.org/jira/browse/SPARK-16049 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Right now, InsertIntoTable's expectedColumns uses the {{contains}} method to find static partitioning columns. When the analyzer is case-insensitive, the initialization of this lazy val will not work as expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
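The pitfall behind SPARK-16049 is easy to reproduce outside Spark: a plain membership test is case-sensitive, so a lookup that should honor the analyzer's case-insensitive mode has to normalize both sides first. A hedged sketch with illustrative names (not the actual {{expectedColumns}} code):

```python
def contains_column(columns, name, case_sensitive):
    """Membership test that mirrors a case-(in)sensitive analyzer.

    Illustrative only; names and structure are not Spark's.
    """
    if case_sensitive:
        return name in columns
    # Case-insensitive mode: compare lower-cased names on both sides.
    return name.lower() in {c.lower() for c in columns}

static_partitions = ["partCol"]
contains_column(static_partitions, "PARTCOL", case_sensitive=True)   # False: missed
contains_column(static_partitions, "PARTCOL", case_sensitive=False)  # True: found
```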
[jira] [Commented] (SPARK-16024) column comment is ignored for datasource table
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338024#comment-15338024 ] Xiao Li commented on SPARK-16024: - Can I work on this? > column comment is ignored for datasource table > -- > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > > CREATE TABLE src(a INT COMMENT 'bla') USING parquet. > When we describe the table, the column comment is not there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16047) Sort by status and id fields in Executors table
[ https://issues.apache.org/jira/browse/SPARK-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-16047: Description: With multiple executors with the same ID the default sorting *seems* to be by ID (descending) first and status (alphabetically ascending). I'd like webUI to sort the Executors table by status first (with Active first) followed by ID (ascending with driver being the last one). was: With multiple executors with the same ID the default sorting *seems* to be by ID (descending) first and status (alphabetically ascending). I'd like to sort the table by status first (with Active first) followed by ID (ascending with driver being the last one). > Sort by status and id fields in Executors table > --- > > Key: SPARK-16047 > URL: https://issues.apache.org/jira/browse/SPARK-16047 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors.png > > > With multiple executors with the same ID the default sorting *seems* to be by > ID (descending) first and status (alphabetically ascending). > I'd like webUI to sort the Executors table by status first (with Active > first) followed by ID (ascending with driver being the last one). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16048) spark-shell unresponsive after "FetchFailedException: java.lang.UnsupportedOperationException: Unsupported shuffle manager" with YARN and spark.shuffle.service.enabled
[ https://issues.apache.org/jira/browse/SPARK-16048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-16048: Description: With Spark on YARN with external shuffle service {{java.lang.UnsupportedOperationException: Unsupported shuffle manager: org.apache.spark.shuffle.sort.SortShuffleManager}} exception makes spark-shell unresponsive. {code} $ YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c spark.shuffle.service.enabled=true --deploy-mode client -c spark.scheduler.mode=FAIR --num-executors 2 ... Spark context Web UI available at http://192.168.1.9:4040 Spark context available as 'sc' (master = yarn, app id = application_1466255040841_0002). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92) Type in expressions to have them evaluated. Type :help for more information. scala> sc.parallelize(0 to 4, 1).map(n => (n % 2, n)).groupByKey.map(n => { Thread.sleep(5 * 1000); n }).count org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1 (count at :25) has failed the maximum allowable number of times: 4. 
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException: Unsupported shuffle manager: org.apache.spark.shuffle.sort.SortShuffleManager at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:191) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:159) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at
[jira] [Created] (SPARK-16048) spark-shell unresponsive after "FetchFailedException: java.lang.UnsupportedOperationException: Unsupported shuffle manager" with YARN and spark.shuffle.service.enabled
Jacek Laskowski created SPARK-16048: --- Summary: spark-shell unresponsive after "FetchFailedException: java.lang.UnsupportedOperationException: Unsupported shuffle manager" with YARN and spark.shuffle.service.enabled Key: SPARK-16048 URL: https://issues.apache.org/jira/browse/SPARK-16048 Project: Spark Issue Type: Bug Components: Shuffle, Spark Shell, YARN Affects Versions: 2.0.0 Reporter: Jacek Laskowski With Spark on YARN with external shuffle service {{java.lang.UnsupportedOperationException: Unsupported shuffle manager: org.apache.spark.shuffle.sort.SortShuffleManager}} exception makes spark-shell unresponsive. {quote} $ YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c spark.shuffle.service.enabled=true --deploy-mode client -c spark.scheduler.mode=FAIR --num-executors 2 ... Spark context Web UI available at http://192.168.1.9:4040 Spark context available as 'sc' (master = yarn, app id = application_1466255040841_0002). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92) Type in expressions to have them evaluated. Type :help for more information. scala> sc.parallelize(0 to 4, 1).map(n => (n % 2, n)).groupByKey.map(n => { Thread.sleep(5 * 1000); n }).count org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1 (count at :25) has failed the maximum allowable number of times: 4. 
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException: Unsupported shuffle manager: org.apache.spark.shuffle.sort.SortShuffleManager at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:191) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:159) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at
[jira] [Updated] (SPARK-16047) Sort by status and id fields in Executors table
[ https://issues.apache.org/jira/browse/SPARK-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-16047: Attachment: spark-webui-executors.png Current default sorting > Sort by status and id fields in Executors table > --- > > Key: SPARK-16047 > URL: https://issues.apache.org/jira/browse/SPARK-16047 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors.png > > > With multiple executors with the same ID the default sorting *seems* to be by > ID (descending) first and status (alphabetically ascending). > I'd like to sort the table by status first (with Active first) followed by ID > (ascending with driver being the last one). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16047) Sort by status and id fields in Executors table
Jacek Laskowski created SPARK-16047: --- Summary: Sort by status and id fields in Executors table Key: SPARK-16047 URL: https://issues.apache.org/jira/browse/SPARK-16047 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.0.0 Reporter: Jacek Laskowski Priority: Minor With multiple executors with the same ID the default sorting *seems* to be by ID (descending) first and status (alphabetically ascending). I'd like to sort the table by status first (with Active first) followed by ID (ascending with driver being the last one). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16046) Add Spark SQL Dataset Tutorial
[ https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pedro Rodriguez updated SPARK-16046: Description: Issue to update the Spark SQL guide to provide more content around using Datasets. This would expand the Creating Datasets section of the Spark SQL documentation. Goals 1. Add more examples of column access via $ and ` 2. Add examples of aggregates 3. Add examples of using Spark SQL functions What else would be useful to have? was: Issue to update the Spark SQL guide to provide more content around using Datasets. This would expand the Creating Datasets section of the Spark SQL documentation. Goals 1. Add more examples of column access via $ and ` 2. Add examples of aggregates 3. Add examples of using Spark SQL functions What else would be useful to have > Add Spark SQL Dataset Tutorial > -- > > Key: SPARK-16046 > URL: https://issues.apache.org/jira/browse/SPARK-16046 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Pedro Rodriguez > > Issue to update the Spark SQL guide to provide more content around using > Datasets. This would expand the Creating Datasets section of the Spark SQL > documentation. > Goals > 1. Add more examples of column access via $ and ` > 2. Add examples of aggregates > 3. Add examples of using Spark SQL functions > What else would be useful to have? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16046) Add Spark SQL Dataset Tutorial
[ https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337869#comment-15337869 ] Pedro Rodriguez commented on SPARK-16046: - I would like to take on this issue and will base work off of https://issues.apache.org/jira/browse/SPARK-15863 > Add Spark SQL Dataset Tutorial > -- > > Key: SPARK-16046 > URL: https://issues.apache.org/jira/browse/SPARK-16046 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Pedro Rodriguez > > Issue to update the Spark SQL guide to provide more content around using > Datasets. This would expand the Creating Datasets section of the Spark SQL > documentation. > Goals > 1. Add more examples of column access via $ and ` > 2. Add examples of aggregates > 3. Add examples of using Spark SQL functions > What else would be useful to have -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16046) Add Spark SQL Dataset Tutorial
[ https://issues.apache.org/jira/browse/SPARK-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pedro Rodriguez updated SPARK-16046: Component/s: SQL Documentation > Add Spark SQL Dataset Tutorial > -- > > Key: SPARK-16046 > URL: https://issues.apache.org/jira/browse/SPARK-16046 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Pedro Rodriguez > > Issue to update the Spark SQL guide to provide more content around using > Datasets. This would expand the Creating Datasets section of the Spark SQL > documentation. > Goals > 1. Add more examples of column access via $ and ` > 2. Add examples of aggregates > 3. Add examples of using Spark SQL functions > What else would be useful to have -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16046) Add Spark SQL Dataset Tutorial
Pedro Rodriguez created SPARK-16046: --- Summary: Add Spark SQL Dataset Tutorial Key: SPARK-16046 URL: https://issues.apache.org/jira/browse/SPARK-16046 Project: Spark Issue Type: Documentation Affects Versions: 2.0.0 Reporter: Pedro Rodriguez Issue to update the Spark SQL guide to provide more content around using Datasets. This would expand the Creating Datasets section of the Spark SQL documentation. Goals 1. Add more examples of column access via $ and ` 2. Add examples of aggregates 3. Add examples of using Spark SQL functions What else would be useful to have -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12197) Kryo's Avro Serializer add support for dynamic schemas using SchemaRepository
[ https://issues.apache.org/jira/browse/SPARK-12197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337838#comment-15337838 ] Apache Spark commented on SPARK-12197: -- User 'RotemShaul' has created a pull request for this issue: https://github.com/apache/spark/pull/13761 > Kryo's Avro Serializer add support for dynamic schemas using SchemaRepository > - > > Key: SPARK-12197 > URL: https://issues.apache.org/jira/browse/SPARK-12197 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Rotem Shaul > Labels: avro, kryo, schema, serialization > Original Estimate: 72h > Remaining Estimate: 72h > > The original problem: Serializing GenericRecords in Spark Core results in a > very high overhead, as the schema is serialized per record. (When in the > actual input data of HDFS it's stored once per file. ) > The extended problem: Spark 1.5 introduced the ability to register Avro > schemas ahead of time using SparkConf. This solution is partial as some > applications may not know exactly which schemas they're going to read ahead > of time. > Extended solution: > Adding a schema repository to the Serializer. Assuming the generic record has > schemaId on them, it's possible to extract them dynamically from the read > records and serialize only the schemaId. > Upon deserialization the schemaRepo will be queried once again. > The local caching mechanism will remain in tact - so in fact each Task will > query the schema repo only once per schemaId. > The previous static registering of schemas will remain in place, as it is > more efficient when the schemas are known ahead of time. > New flow of serializing generic record: > 1) check the pre-registered schema list, if found the schema, serialize only > its finger print > 2) if not found, and schema repo has been set, attempt to extract the > schemaId from record and check if repo contains the id. 
If so - serialize > only the schema id > 3) if no schema repo set or didn't find the schemaId in repo - compress and > send the entire schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
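The three-step flow described in SPARK-12197 can be sketched in a few lines. This is an illustrative Python model, with plain dicts standing in for the pre-registered schema table and the schema repository; the function name, the {{("kind", payload)}} wire token, and the use of zlib are all assumptions for the sketch, not the proposal's actual code:

```python
import zlib

def serialize_schema(schema_json, schema_id, registered, repo):
    """Return a compact wire token for a schema, trying cheapest paths first."""
    # 1) Pre-registered ahead of time: send only the fingerprint.
    if schema_json in registered:
        return ("fingerprint", registered[schema_json])
    # 2) Dynamic: the schema repo knows this id, so send only the id.
    if repo is not None and schema_id is not None and schema_id in repo:
        return ("schema-id", schema_id)
    # 3) Fall back: compress and send the entire schema.
    return ("full", zlib.compress(schema_json.encode("utf-8")))

registered = {'{"type":"record","name":"A"}': 17}   # schema -> fingerprint
repo = {42: '{"type":"record","name":"B"}'}         # schema id -> schema
serialize_schema('{"type":"record","name":"A"}', None, registered, repo)
serialize_schema('{"type":"record","name":"B"}', 42, registered, repo)
serialize_schema('{"type":"record","name":"C"}', 7, registered, repo)
```

The deserializer would reverse each branch, querying the repo at most once per schema id thanks to the local cache the ticket mentions.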
[jira] [Commented] (SPARK-12947) Spark with Swift throws EOFException when reading parquet file
[ https://issues.apache.org/jira/browse/SPARK-12947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337808#comment-15337808 ] Ovidiu Marcu commented on SPARK-12947: -- Hi, did you file an issue with Ceph for the errors you point out here? > Spark with Swift throws EOFException when reading parquet file > -- > > Key: SPARK-12947 > URL: https://issues.apache.org/jira/browse/SPARK-12947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Spark 1.6.0-SNAPSHOT >Reporter: Sam Stoelinga > > I'm using Swift as underlying storage for my spark jobs but it sometimes > throws EOFExceptions for some parts of the data. > Another user has hit the same issue: > http://stackoverflow.com/questions/32400137/spark-swift-integration-parquet > Code to reproduce: > ``` > val features = sqlContext.read.parquet(featurePath) > // Flatten the features into the array exploded > val exploded = > features.select(explode(features("features"))).toDF("features") > val kmeans = new KMeans() > .setK(k) > .setFeaturesCol("features") > .setPredictionCol("prediction") > val model = kmeans.fit(exploded) > ``` > val features is a dataframe with 2 columns: > image: String, features: Array[Vector] > val exploded is a dataframe with a single column: > features: Vector > The following exception is shown when running takeSample on a large dataset > saved as parquet file (~1+GB): > java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readFully(DataInputStream.java:169) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:756) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) > 
at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$30$$anon$1.hasNext(RDD.scala:827) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1563) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1119) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1119) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1840) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1840) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14533) RowMatrix.computeCovariance inaccurate when values are very large
[ https://issues.apache.org/jira/browse/SPARK-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14533: -- Target Version/s: (was: 2.0.0) > RowMatrix.computeCovariance inaccurate when values are very large > - > > Key: SPARK-14533 > URL: https://issues.apache.org/jira/browse/SPARK-14533 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > The following code will produce a Pearson correlation that's quite different > from 0, sometimes outside [-1,1] or even NaN: > {code} > val a = RandomRDDs.normalRDD(sc, 10, 10).map(_ + 10.0) > val b = RandomRDDs.normalRDD(sc, 10, 10).map(_ + 10.0) > val p = Statistics.corr(a, b, method = "pearson") > {code} > This is a "known issue" to some degree, given how Cov(X,Y) is calculated in > {{RowMatrix.getCovariance}}, as Cov(X,Y) = E[XY] - E[X]E[Y]. The easier and > more accurate approach involves just centering the input before computing the > Gramian, but this would be inefficient for sparse data. > However, for dense data -- which includes the code paths that compute > correlations -- this approach is quite sensible. This would improve accuracy > for the dense row case, at least. > Also, the mean column values computed in this method can be computed more > simply and accurately from {{computeColumnSummaryStatistics()}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
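The inaccuracy described in SPARK-14533 is ordinary catastrophic cancellation, and it is easy to demonstrate without Spark: with values near 1e9, E[XY] and E[X]E[Y] agree in most of their leading digits, so their difference retains almost no signal, while centering the data first stays accurate. A pure-Python illustration (not MLlib code; the shift size and sample count are arbitrary choices for the demo):

```python
import random

random.seed(0)
n, shift = 10000, 1e9
xs = [shift + random.gauss(0.0, 1.0) for _ in range(n)]
ys = [shift + random.gauss(0.0, 1.0) for _ in range(n)]

ex, ey = sum(xs) / n, sum(ys) / n

# One-pass textbook form, as in RowMatrix: Cov(X, Y) = E[XY] - E[X]E[Y]
cov_onepass = sum(x * y for x, y in zip(xs, ys)) / n - ex * ey

# Centered form: subtract the means first, then average the products
cov_centered = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n

# The draws are independent, so the true covariance is ~0; the centered
# estimate stays near 0, while the one-pass form is typically off by
# orders of magnitude at this scale.
```

The same cancellation is what pushes the Pearson correlation in the ticket's {{Statistics.corr}} example outside [-1, 1] or to NaN, since correlation divides this covariance by the standard deviations.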
[jira] [Commented] (SPARK-15909) PySpark classpath uri incorrectly set
[ https://issues.apache.org/jira/browse/SPARK-15909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337774#comment-15337774 ] Liam Fisk commented on SPARK-15909: --- Cluster mode isn't used here, I have a mesos cluster (and therefore am in client mode, as you said). In client mode, the remote mesos executors need to be able to retrieve any dependencies, and they can't do that if they are attempting to contact localhost. The bug here is that there is completely different behaviour on startup vs within the REPL. If I stop the spark context, clone the config, and construct a new spark context it will no longer work. > PySpark classpath uri incorrectly set > - > > Key: SPARK-15909 > URL: https://issues.apache.org/jira/browse/SPARK-15909 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Liam Fisk > > PySpark behaves differently if the SparkContext is created within the REPL > (vs initialised by the shell). > My conf/spark-env.sh file contains: > {code} > #!/bin/bash > export SPARK_LOCAL_IP=172.20.30.158 > export LIBPROCESS_IP=172.20.30.158 > export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so > {code} > And when running pyspark it will correctly initialize my SparkContext. > However, when I run: > {code} > from pyspark import SparkContext, SparkConf > sc.stop() > conf = ( > SparkConf() > .setMaster("mesos://zk://foo:2181/mesos") > .setAppName("Jupyter PySpark") > ) > sc = SparkContext(conf=conf) > {code} > my _spark.driver.uri_ and URL classpath will point to localhost (preventing > my mesos cluster from accessing the appropriate files) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14533) RowMatrix.computeCovariance inaccurate when values are very large
[ https://issues.apache.org/jira/browse/SPARK-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14533: -- Target Version/s: 2.0.0 (was: 1.6.2, 2.0.0) > RowMatrix.computeCovariance inaccurate when values are very large > - > > Key: SPARK-14533 > URL: https://issues.apache.org/jira/browse/SPARK-14533 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > The following code will produce a Pearson correlation that's quite different > from 0, sometimes outside [-1,1] or even NaN: > {code} > val a = RandomRDDs.normalRDD(sc, 10, 10).map(_ + 10.0) > val b = RandomRDDs.normalRDD(sc, 10, 10).map(_ + 10.0) > val p = Statistics.corr(a, b, method = "pearson") > {code} > This is a "known issue" to some degree, given how Cov(X,Y) is calculated in > {{RowMatrix.getCovariance}}, as Cov(X,Y) = E[XY] - E[X]E[Y]. The easier and > more accurate approach involves just centering the input before computing the > Gramian, but this would be inefficient for sparse data. > However, for dense data -- which includes the code paths that compute > correlations -- this approach is quite sensible. This would improve accuracy > for the dense row case, at least. > Also, the mean column values computed in this method can be computed more > simply and accurately from {{computeColumnSummaryStatistics()}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
[ https://issues.apache.org/jira/browse/SPARK-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15893. --- Resolution: Duplicate Target Version/s: (was: 2.0.0) Same issue; there's a bit broader discussion in the other JIRA. > spark.createDataFrame raises an exception in Spark 2.0 tests on Windows > --- > > Key: SPARK-15893 > URL: https://issues.apache.org/jira/browse/SPARK-15893 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.0.0 >Reporter: Alexander Ulanov > > spark.createDataFrame raises an exception in Spark 2.0 tests on Windows > For example, LogisticRegressionSuite fails at Line 46: > Exception encountered when invoking run on a nested suite - > java.net.URISyntaxException: Relative path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.<init>(Path.java:172) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109) > Another example, DataFrameSuite raises: > java.net.URISyntaxException: Relative path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > file:C:/dev/spark/external/flume-assembly/spark-warehouse > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.<init>(Path.java:172) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6817. -- Resolution: Done > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15521) Add high level APIs based on dapply and gapply for easier usage
[ https://issues.apache.org/jira/browse/SPARK-15521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15521: -- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-6817) > Add high level APIs based on dapply and gapply for easier usage > --- > > Key: SPARK-15521 > URL: https://issues.apache.org/jira/browse/SPARK-15521 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Sun Rui > > dapply() and gapply() of SparkDataFrame are two basic functions. For easier > usage to users in the R community, some high level functions can be added > based on them. > Candidates are: > http://exposurescience.org/heR.doc/library/heR.Misc/html/dapply.html > http://exposurescience.org/heR.doc/library/stats/html/aggregate.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16012) add gapplyCollect() for SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16012: -- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-6817) > add gapplyCollect() for SparkDataFrame > -- > > Key: SPARK-16012 > URL: https://issues.apache.org/jira/browse/SPARK-16012 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > Add a new API method called gapplyCollect() for SparkDataFrame. It does > gapply on a SparkDataFrame and collect the result back to R. Compared to > gapply() + collect(), gapplyCollect() offers performance optimization as well > as programming convenience, as no schema is needed to be provided. > This is similar to dapplyCollect(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337728#comment-15337728 ] Sean Owen commented on SPARK-6817: -- No, the best thing is just bulk-changing the issues to stand-alone issues. I can do that. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12923) Optimize successive dapply() calls in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12923: -- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-6817) > Optimize successive dapply() calls in SparkR > > > Key: SPARK-12923 > URL: https://issues.apache.org/jira/browse/SPARK-12923 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui > > For consecutive dapply() calls on a same DataFrame, optimize them to launch R > worker once instead of multiple times for performance improvement -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15984) WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 8 is invalid" when starting application on YARN
[ https://issues.apache.org/jira/browse/SPARK-15984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337693#comment-15337693 ] Jacek Laskowski commented on SPARK-15984: - The problem is that I am *not* changing Spark at all and so by default it gives the warning. If it's a warning and Spark does it, it'd be better (?) to play nicer with YARN. I could fix it easily if I was told it changes nothing else in Spark. I don't know so that's why I reported it (since it's a warning anyway). > WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max > attempts: 0 for application: 8 is invalid" when starting application on YARN > -- > > Key: SPARK-15984 > URL: https://issues.apache.org/jira/browse/SPARK-15984 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > When executing {{spark-shell}} on Spark on YARN 2.7.2 on Mac OS as follows: > {code} > YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c > spark.shuffle.service.enabled=true --deploy-mode client -c > spark.scheduler.mode=FAIR > {code} > it ends up with the following WARN in the logs: > {code} > 2016-06-16 08:33:05,308 INFO > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new > applicationId: 8 > 2016-06-16 08:33:07,305 WARN > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific > max attempts: 0 for application: 8 is invalid, because it is out of the range > [1, 2]. Use the global max attempts instead. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16012) add gapplyCollect() for SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16012: Assignee: (was: Apache Spark) > add gapplyCollect() for SparkDataFrame > -- > > Key: SPARK-16012 > URL: https://issues.apache.org/jira/browse/SPARK-16012 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > Add a new API method called gapplyCollect() for SparkDataFrame. It does > gapply on a SparkDataFrame and collect the result back to R. Compared to > gapply() + collect(), gapplyCollect() offers performance optimization as well > as programming convenience, as no schema is needed to be provided. > This is similar to dapplyCollect(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16012) add gapplyCollect() for SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16012: Assignee: Apache Spark > add gapplyCollect() for SparkDataFrame > -- > > Key: SPARK-16012 > URL: https://issues.apache.org/jira/browse/SPARK-16012 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Assignee: Apache Spark > > Add a new API method called gapplyCollect() for SparkDataFrame. It does > gapply on a SparkDataFrame and collect the result back to R. Compared to > gapply() + collect(), gapplyCollect() offers performance optimization as well > as programming convenience, as no schema is needed to be provided. > This is similar to dapplyCollect(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16012) add gapplyCollect() for SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337669#comment-15337669 ] Apache Spark commented on SPARK-16012: -- User 'NarineK' has created a pull request for this issue: https://github.com/apache/spark/pull/13760 > add gapplyCollect() for SparkDataFrame > -- > > Key: SPARK-16012 > URL: https://issues.apache.org/jira/browse/SPARK-16012 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > Add a new API method called gapplyCollect() for SparkDataFrame. It does > gapply on a SparkDataFrame and collect the result back to R. Compared to > gapply() + collect(), gapplyCollect() offers performance optimization as well > as programming convenience, as no schema is needed to be provided. > This is similar to dapplyCollect(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer
[ https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16045: Assignee: Apache Spark > Spark 2.0 ML.feature: doc update for stopwords and binarizer > > > Key: SPARK-16045 > URL: https://issues.apache.org/jira/browse/SPARK-16045 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: Apache Spark >Priority: Minor > > 2.0 Audit: Update document for StopWordsRemover (load stop words) and > Binarizer (support of Vector) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer
[ https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16045: Assignee: (was: Apache Spark) > Spark 2.0 ML.feature: doc update for stopwords and binarizer > > > Key: SPARK-16045 > URL: https://issues.apache.org/jira/browse/SPARK-16045 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > 2.0 Audit: Update document for StopWordsRemover (load stop words) and > Binarizer (support of Vector) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer
[ https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337625#comment-15337625 ] Apache Spark commented on SPARK-16045: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/13375 > Spark 2.0 ML.feature: doc update for stopwords and binarizer > > > Key: SPARK-16045 > URL: https://issues.apache.org/jira/browse/SPARK-16045 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > 2.0 Audit: Update document for StopWordsRemover (load stop words) and > Binarizer (support of Vector) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer
yuhao yang created SPARK-16045: -- Summary: Spark 2.0 ML.feature: doc update for stopwords and binarizer Key: SPARK-16045 URL: https://issues.apache.org/jira/browse/SPARK-16045 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Priority: Minor 2.0 Audit: Update document for StopWordsRemover (load stop words) and Binarizer (support of Vector) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.
[ https://issues.apache.org/jira/browse/SPARK-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16044: Assignee: Apache Spark > input_file_name() returns empty strings in data sources based on NewHadoopRDD. > -- > > Key: SPARK-16044 > URL: https://issues.apache.org/jira/browse/SPARK-16044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > The issue is, {{input_file_name()}} function does not contain file paths when > data sources use {{NewHadoopRDD}}. This is currently only supported for > {{FileScanRDD}} and {{HadoopRDD}}. > To be clear, this does not affect Spark's internal data sources because > currently they all do not use {{NewHadoopRDD}}. > However, there are several datasources using this. For example, > > spark-redshift - > [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149] > spark-xml - > [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47] > Currently, using this functions shows the output below: > {code} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > | | > | | > | | > | | > | | > | | > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.
[ https://issues.apache.org/jira/browse/SPARK-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16044: Assignee: (was: Apache Spark) > input_file_name() returns empty strings in data sources based on NewHadoopRDD. > -- > > Key: SPARK-16044 > URL: https://issues.apache.org/jira/browse/SPARK-16044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > The issue is, {{input_file_name()}} function does not contain file paths when > data sources use {{NewHadoopRDD}}. This is currently only supported for > {{FileScanRDD}} and {{HadoopRDD}}. > To be clear, this does not affect Spark's internal data sources because > currently they all do not use {{NewHadoopRDD}}. > However, there are several datasources using this. For example, > > spark-redshift - > [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149] > spark-xml - > [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47] > Currently, using this functions shows the output below: > {code} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > | | > | | > | | > | | > | | > | | > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.
[ https://issues.apache.org/jira/browse/SPARK-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337624#comment-15337624 ] Apache Spark commented on SPARK-16044: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/13759 > input_file_name() returns empty strings in data sources based on NewHadoopRDD. > -- > > Key: SPARK-16044 > URL: https://issues.apache.org/jira/browse/SPARK-16044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > The issue is, {{input_file_name()}} function does not contain file paths when > data sources use {{NewHadoopRDD}}. This is currently only supported for > {{FileScanRDD}} and {{HadoopRDD}}. > To be clear, this does not affect Spark's internal data sources because > currently they all do not use {{NewHadoopRDD}}. > However, there are several datasources using this. For example, > > spark-redshift - > [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149] > spark-xml - > [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47] > Currently, using this functions shows the output below: > {code} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > | | > | | > | | > | | > | | > | | > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16043: Assignee: (was: Apache Spark) > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a ToDo of GenericArrayData class, which is to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]) > It would be good to prepare GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing from the view of runtime memory > footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.
Hyukjin Kwon created SPARK-16044: Summary: input_file_name() returns empty strings in data sources based on NewHadoopRDD. Key: SPARK-16044 URL: https://issues.apache.org/jira/browse/SPARK-16044 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon The issue is that the {{input_file_name()}} function does not return file paths when data sources use {{NewHadoopRDD}}. It is currently only supported for {{FileScanRDD}} and {{HadoopRDD}}. To be clear, this does not affect Spark's internal data sources, because currently none of them use {{NewHadoopRDD}}. However, there are several data sources that do. For example, spark-redshift - [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149] spark-xml - [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47] Currently, using this function shows the output below: {code} +-+ |input_file_name()| +-+ | | | | | | | | | | | | | | | | | | | | | | +-+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
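For intuition about why only some scan paths populate {{input_file_name()}}, here is a toy Python model (purely illustrative; the names below are hypothetical and this is not Spark's Scala implementation) of the pattern at play: a per-thread "current file" holder that supported RDDs update as they open each file, while an RDD that never touches it makes the function return empty strings, matching the output above.

```python
import threading

# Hypothetical stand-in for a per-thread "current input file" holder;
# not a real Spark API.
_current_file = threading.local()

def input_file_name():
    # Returns whatever the scanning RDD last recorded for this thread.
    return getattr(_current_file, "name", "")

def scan_setting_holder(files):
    # Mimics the supported paths (FileScanRDD/HadoopRDD-style): record the
    # path before emitting each record, clear it afterwards.
    names = []
    for path in files:
        _current_file.name = path
        names.append(input_file_name())
        _current_file.name = ""
    return names

def scan_not_setting_holder(files):
    # Mimics the reported NewHadoopRDD behaviour: the holder is never set,
    # so every record sees an empty string.
    return [input_file_name() for _ in files]

print(scan_setting_holder(["a.json", "b.json"]))      # ['a.json', 'b.json']
print(scan_not_setting_holder(["a.json", "b.json"]))  # ['', '']
```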
[jira] [Commented] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337617#comment-15337617 ] Apache Spark commented on SPARK-16043: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/13758 > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a ToDo of GenericArrayData class, which is to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]) > It would be good to prepare GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing from the view of runtime memory > footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16043: Assignee: Apache Spark > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > There is a ToDo of GenericArrayData class, which is to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]) > It would be good to prepare GenericArrayData implementation specialized for a > primitive array to eliminate boxing/unboxing from the view of runtime memory > footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16022) Input size is different when I use 1 or 3 nodes but the shuffle size remains roughly equal, do you know why?
[ https://issues.apache.org/jira/browse/SPARK-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337615#comment-15337615 ] Sean Owen commented on SPARK-16022: --- The u...@spark.apache.org mailing list http://spark.apache.org/community.html > Input size is different when I use 1 or 3 nodes but the shuffle size remains > roughly equal, do you know why? > -- > > Key: SPARK-16022 > URL: https://issues.apache.org/jira/browse/SPARK-16022 > Project: Spark > Issue Type: Test >Reporter: jon > > I ran some queries on Spark with just one node and then with 3 nodes, and in > the Spark web UI (port 4040) I see something I don't understand. > For example, after executing a query with 3 nodes, the "Input" tab in the > Spark UI shows 2.8 GB, so Spark read 2.8 GB from Hadoop. > The same query with just one node in local mode shows 7.3 GB, so Spark > read 7.3 GB from Hadoop. But shouldn't these values be equal? > By contrast, the shuffle size stays roughly equal with one node vs. 3. Why doesn't the > input value stay the same? The same amount of data must be read from > HDFS, so I don't understand. > Do you know? > Single node: > Input: 7.3 GB > Shuffle read: 208.1 KB > Shuffle write: 208.1 KB > 3 nodes: > Input: 2.8 GB > Shuffle read: 193.3 KB > Shuffle write: 208.1 KB -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16040) spark.mllib PIC document extra line of reference
[ https://issues.apache.org/jira/browse/SPARK-16040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16040: -- Priority: Trivial (was: Minor) OK, this does not need a JIRA > spark.mllib PIC document extra line of reference > > > Key: SPARK-16040 > URL: https://issues.apache.org/jira/browse/SPARK-16040 > Project: Spark > Issue Type: Documentation >Reporter: Miao Wang >Priority: Trivial > > In the 2.0 document, the line "A full example that produces the experiment > described in the PIC paper can be found under examples/." is redundant. > There is already "Find full example code at > "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" > in the Spark repo.". > We should remove the first line, to be consistent with other documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
Kazuaki Ishizaki created SPARK-16043: Summary: Prepare GenericArrayData implementation specialized for a primitive array Key: SPARK-16043 URL: https://issues.apache.org/jira/browse/SPARK-16043 Project: Spark Issue Type: Improvement Components: SQL Reporter: Kazuaki Ishizaki There is a ToDo of GenericArrayData class, which is to eliminate boxing/unboxing for a primitive array (described [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]) It would be good to prepare GenericArrayData implementation specialized for a primitive array to eliminate boxing/unboxing from the view of runtime memory footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
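The footprint cost of boxing that SPARK-16043 targets is easy to see by analogy in any managed runtime. The sketch below uses Python purely as an illustration (GenericArrayData itself is Scala/JVM code, where the analogous comparison is an Object[] of java.lang.Double versus a double[]): a list of individually allocated float objects versus one flat primitive buffer.

```python
import array
import sys

n = 1000
boxed = [float(i) for i in range(n)]    # one heap object per element ("boxed")
primitive = array.array('d', range(n))  # one contiguous buffer of C doubles

# Size of the container plus every element object, versus the flat buffer.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
primitive_bytes = sys.getsizeof(primitive)

# On CPython the boxed layout is several times larger; the JVM gap between
# Double[] and double[] is similar in spirit (object headers + pointers).
print(boxed_bytes > 2 * primitive_bytes)  # True
```

A specialized primitive-array implementation avoids both this overhead and the box/unbox work on every element access.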
[jira] [Resolved] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15973. - Resolution: Fixed Fix Version/s: 2.0.0 > Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Fix For: 2.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > (1) > {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for Python > doctest comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. > (2) > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. > (3) > The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps > as the member variable {{self._jdf}}, exactly the same name that > {{pyspark.sql.DataFrame}} uses for its wrapped object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
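Point (1) of SPARK-15973 can be illustrated concretely: in Python source, only # starts a comment, while // is the floor-division operator, so a // "comment" inside a doctest becomes part of the expression instead of being ignored. A small self-contained illustration (a hypothetical example, not the actual pivot docstring):

```python
import doctest

def good_example():
    """Doctest comments must use '#', as here.

    >>> 7 // 2  # this '#' comment is ignored; the '//' is floor division
    3
    """

# Using '//' as a comment marker instead, e.g. ">>> 7 // 2 // this is not a
# comment", would change the expression itself: it would try to floor-divide
# by the name 'this' and raise a NameError at run time.
failures = doctest.testmod(verbose=False).failed
print(failures)  # 0
```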
[jira] [Updated] (SPARK-16025) Document OFF_HEAP storage level in 2.0
[ https://issues.apache.org/jira/browse/SPARK-16025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16025: -- Priority: Minor (was: Major) > Document OFF_HEAP storage level in 2.0 > -- > > Key: SPARK-16025 > URL: https://issues.apache.org/jira/browse/SPARK-16025 > Project: Spark > Issue Type: Documentation >Reporter: Eric Liang >Priority: Minor >
[jira] [Updated] (SPARK-16023) Move InMemoryRelation to its own file
[ https://issues.apache.org/jira/browse/SPARK-16023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16023: -- Issue Type: Improvement (was: Bug) > Move InMemoryRelation to its own file > - > > Key: SPARK-16023 > URL: https://issues.apache.org/jira/browse/SPARK-16023 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > Just to make InMemoryTableScanExec a little smaller and more readable.
[jira] [Resolved] (SPARK-16023) Move InMemoryRelation to its own file
[ https://issues.apache.org/jira/browse/SPARK-16023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16023. - Resolution: Fixed Fix Version/s: 2.0.0 > Move InMemoryRelation to its own file > - > > Key: SPARK-16023 > URL: https://issues.apache.org/jira/browse/SPARK-16023 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > Just to make InMemoryTableScanExec a little smaller and more readable.
[jira] [Assigned] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16042: Assignee: Apache Spark > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > When we run a Spark program with a projection over an array type, a null check > is generated at each call that writes an array element. If we know at compile > time that none of the elements can be {{null}}, the null-check code can be > eliminated. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code}
[jira] [Assigned] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16042: Assignee: (was: Apache Spark) > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection over an array type, a null check > is generated at each call that writes an array element. If we know at compile > time that none of the elements can be {{null}}, the null-check code can be > eliminated. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code}
[jira] [Commented] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337594#comment-15337594 ] Apache Spark commented on SPARK-16042: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/13757 > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection over an array type, a null check > is generated at each call that writes an array element. If we know at compile > time that none of the elements can be {{null}}, the null-check code can be > eliminated. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code}
[jira] [Created] (SPARK-16042) Eliminate nullcheck code at projection for an array type
Kazuaki Ishizaki created SPARK-16042: Summary: Eliminate nullcheck code at projection for an array type Key: SPARK-16042 URL: https://issues.apache.org/jira/browse/SPARK-16042 Project: Spark Issue Type: Improvement Components: SQL Reporter: Kazuaki Ishizaki When we run a Spark program with a projection over an array type, a null check is generated at each call that writes an array element. If we know at compile time that none of the elements can be {{null}}, the null-check code can be eliminated. {code} val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") df.selectExpr("Array(v + 2.2, v + 3.3)").collect {code}
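The optimization described in SPARK-16042 can be sketched with a toy code generator (plain Python, only loosely analogous to Catalyst's Java codegen; {{gen_array_writer}} and its {{contains_null}} flag are hypothetical names): when the array type is statically known not to contain nulls, the generated element-writing loop simply omits the per-element null branch.

```python
def gen_array_writer(contains_null):
    """Generate an array-writer function, with or without null checks."""
    src = ["def write_array(out, arr):", "    for v in arr:"]
    if contains_null:
        # Null-safe path: emit a check before touching each element.
        src += [
            "        if v is None:",
            "            out.append(None)",
            "            continue",
        ]
    # Common path: write the element (would fail on None, so the
    # check-free variant is only valid when nulls are impossible).
    src.append("        out.append(float(v))")
    namespace = {}
    exec("\n".join(src), namespace)
    return namespace["write_array"]

# Array(v + 2.2, v + 3.3) can never contain null, so skip the checks.
writer = gen_array_writer(contains_null=False)
out = []
writer(out, [3.2, 4.3])
print(out)
```

The branch-free generated loop is shorter and avoids one conditional per element, which is the kind of saving the JIRA targets in Spark's generated projection code.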