[jira] [Assigned] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition
[ https://issues.apache.org/jira/browse/SPARK-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17660:
------------------------------------

    Assignee:     (was: Apache Spark)

> DESC FORMATTED for VIEW Lacks View Definition
> ---------------------------------------------
>
>                 Key: SPARK-17660
>                 URL: https://issues.apache.org/jira/browse/SPARK-17660
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Xiao Li
>
> Currently, DESC FORMATTED does not have a section for the view definition. We
> should add it for permanent views, like what Hive does. Below is an example
> with the desired view definition.
> {noformat}
> +----------------------------+----------------------------------------------------------+-------+
> |col_name                    |data_type                                                 |comment|
> +----------------------------+----------------------------------------------------------+-------+
> |a                           |int                                                       |null   |
> |                            |                                                          |       |
> |# Detailed Table Information|                                                          |       |
> |Database:                   |default                                                   |       |
> |Owner:                      |xiaoli                                                    |       |
> |Create Time:                |Sat Sep 24 21:46:19 PDT 2016                              |       |
> |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                              |       |
> |Location:                   |                                                          |       |
> |Table Type:                 |VIEW                                                      |       |
> |Table Parameters:           |                                                          |       |
> |  transient_lastDdlTime     |1474778779                                                |       |
> |                            |                                                          |       |
> |# Storage Information       |                                                          |       |
> |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
> |InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat          |       |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |       |
> |Compressed:                 |No                                                        |       |
> |Storage Desc Parameters:    |                                                          |       |
> |  serialization.format      |1                                                         |       |
> |                            |                                                          |       |
> |# View Information          |
[jira] [Assigned] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition
[ https://issues.apache.org/jira/browse/SPARK-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17660:
------------------------------------

    Assignee: Apache Spark

> DESC FORMATTED for VIEW Lacks View Definition
> ---------------------------------------------
>
>                 Key: SPARK-17660
>                 URL: https://issues.apache.org/jira/browse/SPARK-17660
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Xiao Li
>            Assignee: Apache Spark
>
> Currently, DESC FORMATTED does not have a section for the view definition. We
> should add it for permanent views, like what Hive does. Below is an example
> with the desired view definition.
> {noformat}
> +----------------------------+----------------------------------------------------------+-------+
> |col_name                    |data_type                                                 |comment|
> +----------------------------+----------------------------------------------------------+-------+
> |a                           |int                                                       |null   |
> |                            |                                                          |       |
> |# Detailed Table Information|                                                          |       |
> |Database:                   |default                                                   |       |
> |Owner:                      |xiaoli                                                    |       |
> |Create Time:                |Sat Sep 24 21:46:19 PDT 2016                              |       |
> |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                              |       |
> |Location:                   |                                                          |       |
> |Table Type:                 |VIEW                                                      |       |
> |Table Parameters:           |                                                          |       |
> |  transient_lastDdlTime     |1474778779                                                |       |
> |                            |                                                          |       |
> |# Storage Information       |                                                          |       |
> |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
> |InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat          |       |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |       |
> |Compressed:                 |No                                                        |       |
> |Storage Desc Parameters:    |                                                          |       |
> |  serialization.format      |1                                                         |       |
> |                            |                                                          |       |
> |# View Information          |
[jira] [Commented] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition
[ https://issues.apache.org/jira/browse/SPARK-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15520189#comment-15520189 ]

Apache Spark commented on SPARK-17660:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15234

> DESC FORMATTED for VIEW Lacks View Definition
> ---------------------------------------------
>
>                 Key: SPARK-17660
>                 URL: https://issues.apache.org/jira/browse/SPARK-17660
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Xiao Li
>
> Currently, DESC FORMATTED does not have a section for the view definition. We
> should add it for permanent views, like what Hive does. Below is an example
> with the desired view definition.
> {noformat}
> +----------------------------+----------------------------------------------------------+-------+
> |col_name                    |data_type                                                 |comment|
> +----------------------------+----------------------------------------------------------+-------+
> |a                           |int                                                       |null   |
> |                            |                                                          |       |
> |# Detailed Table Information|                                                          |       |
> |Database:                   |default                                                   |       |
> |Owner:                      |xiaoli                                                    |       |
> |Create Time:                |Sat Sep 24 21:46:19 PDT 2016                              |       |
> |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                              |       |
> |Location:                   |                                                          |       |
> |Table Type:                 |VIEW                                                      |       |
> |Table Parameters:           |                                                          |       |
> |  transient_lastDdlTime     |1474778779                                                |       |
> |                            |                                                          |       |
> |# Storage Information       |                                                          |       |
> |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
> |InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat          |       |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |       |
> |Compressed:                 |No                                                        |       |
> |Storage Desc Parameters:    |                                                          |       |
> |  serialization.format      |1                                                         |       |
> |                            |                                                          |       |
[jira] [Created] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition
Xiao Li created SPARK-17660:
-------------------------------

             Summary: DESC FORMATTED for VIEW Lacks View Definition
                 Key: SPARK-17660
                 URL: https://issues.apache.org/jira/browse/SPARK-17660
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0, 2.1.0
            Reporter: Xiao Li


Currently, DESC FORMATTED does not have a section for the view definition. We should add it for permanent views, like what Hive does. Below is an example with the desired view definition.

{noformat}
+----------------------------+----------------------------------------------------------+-------+
|col_name                    |data_type                                                 |comment|
+----------------------------+----------------------------------------------------------+-------+
|a                           |int                                                       |null   |
|                            |                                                          |       |
|# Detailed Table Information|                                                          |       |
|Database:                   |default                                                   |       |
|Owner:                      |xiaoli                                                    |       |
|Create Time:                |Sat Sep 24 21:46:19 PDT 2016                              |       |
|Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                              |       |
|Location:                   |                                                          |       |
|Table Type:                 |VIEW                                                      |       |
|Table Parameters:           |                                                          |       |
|  transient_lastDdlTime     |1474778779                                                |       |
|                            |                                                          |       |
|# Storage Information       |                                                          |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
|InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat          |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |       |
|Compressed:                 |No                                                        |       |
|Storage Desc Parameters:    |                                                          |       |
|  serialization.format      |1                                                         |       |
|                            |                                                          |       |
|# View Information          |                                                          |       |
|View Original Text:         |SELECT * FROM tbl                                         |       |
|View Expanded Text:         |SELECT
[jira] [Commented] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-17659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15520095#comment-15520095 ]

Apache Spark commented on SPARK-17659:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15233

> Partitioned View is Not Supported In SHOW CREATE TABLE
> ------------------------------------------------------
>
>                 Key: SPARK-17659
>                 URL: https://issues.apache.org/jira/browse/SPARK-17659
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Xiao Li
>
> `Partitioned View` is not supported by Spark SQL. For a Hive partitioned
> view, SHOW CREATE TABLE is unable to generate the right DDL. Thus, like the
> other Hive-only features, SHOW CREATE TABLE should not support it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-17659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17659:
------------------------------------

    Assignee: Apache Spark

> Partitioned View is Not Supported In SHOW CREATE TABLE
> ------------------------------------------------------
>
>                 Key: SPARK-17659
>                 URL: https://issues.apache.org/jira/browse/SPARK-17659
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Xiao Li
>            Assignee: Apache Spark
>
> `Partitioned View` is not supported by Spark SQL. For a Hive partitioned
> view, SHOW CREATE TABLE is unable to generate the right DDL. Thus, like the
> other Hive-only features, SHOW CREATE TABLE should not support it.
[jira] [Assigned] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-17659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17659:
------------------------------------

    Assignee:     (was: Apache Spark)

> Partitioned View is Not Supported In SHOW CREATE TABLE
> ------------------------------------------------------
>
>                 Key: SPARK-17659
>                 URL: https://issues.apache.org/jira/browse/SPARK-17659
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Xiao Li
>
> `Partitioned View` is not supported by Spark SQL. For a Hive partitioned
> view, SHOW CREATE TABLE is unable to generate the right DDL. Thus, like the
> other Hive-only features, SHOW CREATE TABLE should not support it.
[jira] [Created] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE
Xiao Li created SPARK-17659:
-------------------------------

             Summary: Partitioned View is Not Supported In SHOW CREATE TABLE
                 Key: SPARK-17659
                 URL: https://issues.apache.org/jira/browse/SPARK-17659
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0, 2.1.0
            Reporter: Xiao Li


`Partitioned View` is not supported by Spark SQL. For a Hive partitioned view, SHOW CREATE TABLE is unable to generate the right DDL. Thus, like the other Hive-only features, SHOW CREATE TABLE should not support it.
[jira] [Commented] (SPARK-17631) Structured Streaming - Add Http Stream Sink
[ https://issues.apache.org/jira/browse/SPARK-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519970#comment-15519970 ]

zhangxinyu commented on SPARK-17631:
------------------------------------

h4. A short design for this feature

h5. Goal

Build an HTTP sink for Structured Streaming, so that streaming query results can be written out to HTTP servers.

h5. Usage

# The streaming query result should have a single string column.
# Configure {{.format("http").option("url", yourHttpUrl)}} in the program to create an HTTP sink, e.g.
{code}
val query = counts.writeStream
  .outputMode("complete")
  .format("http")
  .option("url", "yourHttpUrl")
  .start()
{code}

h5. Design

# Add a class {{HttpSink}} that extends the trait {{Sink}} and overrides {{addBatch}}: each Row in the DataFrame is written out by sending an HTTP POST request.
# Add a class {{HttpStreamSink}} that extends both the trait {{StreamSinkProvider}} and the trait {{DataSourceRegister}}. It overrides two functions:
- {{shortName}}: returns the string "http"
- {{createSink}}: returns an {{HttpSink}} instance

h5. Other features to debate

# Should we support HTTPS too?
# Do we need to set any headers (e.g. the batch id)?

> Structured Streaming - Add Http Stream Sink
> -------------------------------------------
>
>                 Key: SPARK-17631
>                 URL: https://issues.apache.org/jira/browse/SPARK-17631
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL, Streaming
>    Affects Versions: 2.0.0
>            Reporter: zhangxinyu
>            Priority: Minor
>
> Streaming query results can be written to an HTTP server through HTTP POST
> requests.
> github: https://github.com/apache/spark/pull/15194
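The two classes described in the design could be sketched roughly as follows. This is a hypothetical sketch against Spark 2.0's internal streaming-sink APIs ({{Sink}}, {{StreamSinkProvider}}, {{DataSourceRegister}}); the class names follow the proposal, but the HTTP plumbing and error handling are illustrative, not the actual code in the linked PR.

```scala
// Hypothetical sketch of the proposed HTTP sink, against Spark 2.0's
// internal streaming-sink APIs. The HTTP details are illustrative only.
import java.io.OutputStream
import java.net.{HttpURLConnection, URL}

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class HttpSink(url: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // One POST per row; assumes the query result has a single string column.
    data.collect().foreach { row =>
      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setDoOutput(true)
      val out: OutputStream = conn.getOutputStream
      try out.write(row.getString(0).getBytes("UTF-8")) finally out.close()
      conn.getResponseCode // force the request; response body is ignored
      conn.disconnect()
    }
  }
}

class HttpStreamSink extends StreamSinkProvider with DataSourceRegister {
  // Lets .format("http") resolve to this provider once registered.
  override def shortName(): String = "http"

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new HttpSink(parameters("url"))
}
```

Note that the provider would be discovered via Java's ServiceLoader, i.e. by listing {{HttpStreamSink}} in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file; that registration step is what makes the {{.format("http")}} shorthand in the usage example work.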
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519953#comment-15519953 ]

Xiao Li commented on SPARK-17653:
---------------------------------

Yeah, you are right. It does not work. : )

> Optimizer should remove unnecessary distincts (in multiple unions)
> ------------------------------------------------------------------
>
>                 Key: SPARK-17653
>                 URL: https://issues.apache.org/jira/browse/SPARK-17653
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>    +- *HashAggregate(keys=[a#13], functions=[])
>       +- Union
>          :- *HashAggregate(keys=[a#13], functions=[])
>          :  +- Exchange hashpartitioning(a#13, 200)
>          :     +- *HashAggregate(keys=[a#13], functions=[])
>          :        +- Union
>          :           :- *Project [1 AS a#13]
>          :           :  +- Scan OneRowRelation[]
>          :           +- *Project [2 AS b#14]
>          :              +- Scan OneRowRelation[]
>          +- *Project [3 AS c#15]
>             +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower
> than a bunch of union alls followed by a distinct.
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519950#comment-15519950 ]

Xiao Li commented on SPARK-17653:
---------------------------------

I see. After rethinking it, Union is special; my PR is not applicable to it, and we are unable to eliminate the Distinct in this pattern. I think what you said is correct: we can do it for UNION. Do you want me to try it? Or has somebody else already started it? Thanks!

BTW, in traditional RDBMS, many optimizer rules are based on unique constraints. However, Spark SQL does not have the concept of primary key or unique constraints. If we allow users to specify unique constraints using hints, we could further optimize the plan and the execution. Do you think adding such a HINT is OK for Spark SQL?

> Optimizer should remove unnecessary distincts (in multiple unions)
> ------------------------------------------------------------------
>
>                 Key: SPARK-17653
>                 URL: https://issues.apache.org/jira/browse/SPARK-17653
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>    +- *HashAggregate(keys=[a#13], functions=[])
>       +- Union
>          :- *HashAggregate(keys=[a#13], functions=[])
>          :  +- Exchange hashpartitioning(a#13, 200)
>          :     +- *HashAggregate(keys=[a#13], functions=[])
>          :        +- Union
>          :           :- *Project [1 AS a#13]
>          :           :  +- Scan OneRowRelation[]
>          :           +- *Project [2 AS b#14]
>          :              +- Scan OneRowRelation[]
>          +- *Project [3 AS c#15]
>             +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower
> than a bunch of union alls followed by a distinct.
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519923#comment-15519923 ]

Reynold Xin commented on SPARK-17653:
-------------------------------------

[~smilegator] - I just took a quick look at #11930. It looks to me like it mainly propagates the uniqueness property up. In this case we want to remove distincts down a subtree. How would it work in your case?

> Optimizer should remove unnecessary distincts (in multiple unions)
> ------------------------------------------------------------------
>
>                 Key: SPARK-17653
>                 URL: https://issues.apache.org/jira/browse/SPARK-17653
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>    +- *HashAggregate(keys=[a#13], functions=[])
>       +- Union
>          :- *HashAggregate(keys=[a#13], functions=[])
>          :  +- Exchange hashpartitioning(a#13, 200)
>          :     +- *HashAggregate(keys=[a#13], functions=[])
>          :        +- Union
>          :           :- *Project [1 AS a#13]
>          :           :  +- Scan OneRowRelation[]
>          :           +- *Project [2 AS b#14]
>          :              +- Scan OneRowRelation[]
>          +- *Project [3 AS c#15]
>             +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower
> than a bunch of union alls followed by a distinct.
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519710#comment-15519710 ]

Reynold Xin commented on SPARK-17653:
-------------------------------------

There are different ways to fix this, from fairly general ones to more surgical ones. The most surgical fix I can think of is to just match a bunch of Distinct(Union(Distinct(Union(...)))) and combine them into a single Distinct(Union(...)).

If the more general fix is simple enough, that could be a good idea too.

cc [~vssrinath]

> Optimizer should remove unnecessary distincts (in multiple unions)
> ------------------------------------------------------------------
>
>                 Key: SPARK-17653
>                 URL: https://issues.apache.org/jira/browse/SPARK-17653
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>    +- *HashAggregate(keys=[a#13], functions=[])
>       +- Union
>          :- *HashAggregate(keys=[a#13], functions=[])
>          :  +- Exchange hashpartitioning(a#13, 200)
>          :     +- *HashAggregate(keys=[a#13], functions=[])
>          :        +- Union
>          :           :- *Project [1 AS a#13]
>          :           :  +- Scan OneRowRelation[]
>          :           +- *Project [2 AS b#14]
>          :              +- Scan OneRowRelation[]
>          +- *Project [3 AS c#15]
>             +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower
> than a bunch of union alls followed by a distinct.
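The surgical fix suggested in the comment could be sketched as a Catalyst optimizer rule along the following lines. This is an illustrative sketch against the Catalyst logical-plan API (where {{Union}} takes a Seq of children in Spark 2.0); the rule name is made up and this is not the committed implementation.

```scala
// Hypothetical sketch of the "surgical" rewrite: collapse nested
// Distinct(Union(Distinct(Union(...)))) trees into a single Distinct
// over one flattened Union. Rule name is illustrative.
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

object CombineUnionDistincts extends Rule[LogicalPlan] {
  // Recursively gather the leaf children of Union / Distinct(Union) chains.
  private def flatten(plan: LogicalPlan): Seq[LogicalPlan] = plan match {
    case Union(children)           => children.flatMap(flatten)
    case Distinct(Union(children)) => children.flatMap(flatten)
    case other                     => Seq(other)
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // A single top-level Distinct suffices; inner ones are redundant.
    case Distinct(u: Union) => Distinct(Union(flatten(u)))
  }
}
```

On the example query, such a rule would turn Distinct(Union(Distinct(Union(a, b)), c)) into Distinct(Union(a, b, c)), leaving one aggregate/exchange pair instead of two.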
[jira] [Updated] (SPARK-17609) SessionCatalog.tableExists should not check temp view
[ https://issues.apache.org/jira/browse/SPARK-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-17609:
--------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> SessionCatalog.tableExists should not check temp view
> -----------------------------------------------------
>
>                 Key: SPARK-17609
>                 URL: https://issues.apache.org/jira/browse/SPARK-17609
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.0.1, 2.1.0
[jira] [Updated] (SPARK-17640) Avoid using -1 as the default batchId for FileStreamSource.FileEntry
[ https://issues.apache.org/jira/browse/SPARK-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-17640:
--------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> Avoid using -1 as the default batchId for FileStreamSource.FileEntry
> --------------------------------------------------------------------
>
>                 Key: SPARK-17640
>                 URL: https://issues.apache.org/jira/browse/SPARK-17640
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Minor
>             Fix For: 2.0.1, 2.1.0
[jira] [Updated] (SPARK-17502) Multiple Bugs in DDL Statements on Temporary Views
[ https://issues.apache.org/jira/browse/SPARK-17502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-17502:
--------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> Multiple Bugs in DDL Statements on Temporary Views
> --------------------------------------------------
>
>                 Key: SPARK-17502
>                 URL: https://issues.apache.org/jira/browse/SPARK-17502
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>             Fix For: 2.0.1, 2.1.0
>
>
> - When the permanent tables/views do not exist but the temporary view exists,
> the expected error should be `NoSuchTableException` for partition-related
> ALTER TABLE commands. However, it always reports a confusing error message.
> For example,
> {noformat}
> Partition spec is invalid. The spec (a, b) must match the partition spec ()
> defined in table '`testview`';
> {noformat}
> - When the permanent tables/views do not exist but the temporary view exists,
> the expected error should also be `NoSuchTableException` for `ALTER TABLE ...
> UNSET TBLPROPERTIES`. However, it reports a missing table property. For
> example,
> {noformat}
> Attempted to unset non-existent property 'p' in table '`testView`';
> {noformat}
> - When `ANALYZE TABLE` is called on a view or a temporary view, we should
> issue an error message. However, it reports a strange error:
> {noformat}
> ANALYZE TABLE is not supported for Project
> {noformat}
[jira] [Updated] (SPARK-17210) sparkr.zip is not distributed to executors when run sparkr in RStudio
[ https://issues.apache.org/jira/browse/SPARK-17210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-17210:
--------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> sparkr.zip is not distributed to executors when run sparkr in RStudio
> ---------------------------------------------------------------------
>
>                 Key: SPARK-17210
>                 URL: https://issues.apache.org/jira/browse/SPARK-17210
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>             Fix For: 2.0.1, 2.1.0
>
>
> Here's the code to reproduce this issue.
> {code}
> Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sparkR.session(master="yarn-client", sparkConfig =
> list(spark.executor.instances="1"))
> df <- as.DataFrame(mtcars)
> head(df)
> {code}
> And this is the exception in the executor log.
> {noformat}
> 16/08/24 15:33:45 INFO BufferedStreamThread: Fatal error: cannot open file
> '/Users/jzhang/Temp/hadoop_tmp/nm-local-dir/usercache/jzhang/appcache/application_1471846125517_0022/container_1471846125517_0022_01_02/sparkr/SparkR/worker/daemon.R':
> No such file or directory
> 16/08/24 15:33:55 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 6)
> java.net.SocketTimeoutException: Accept timed out
> 	at java.net.PlainSocketImpl.socketAccept(Native Method)
> 	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
> 	at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> 	at java.net.ServerSocket.accept(ServerSocket.java:513)
> 	at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:367)
> 	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
> 	at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
[jira] [Updated] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16240:
--------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> model loading backward compatibility for ml.clustering.LDA
> ----------------------------------------------------------
>
>                 Key: SPARK-16240
>                 URL: https://issues.apache.org/jira/browse/SPARK-16240
>             Project: Spark
>          Issue Type: Bug
>            Reporter: yuhao yang
>            Assignee: Gayathri Murali
>             Fix For: 2.0.1, 2.1.0
>
>
> After resolving the matrix conversion issue, the LDA model still cannot load
> 1.6 models, as one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.
[jira] [Updated] (SPARK-17651) Automate Spark version update for documentations
[ https://issues.apache.org/jira/browse/SPARK-17651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-17651:
--------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> Automate Spark version update for documentations
> ------------------------------------------------
>
>                 Key: SPARK-17651
>                 URL: https://issues.apache.org/jira/browse/SPARK-17651
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, Documentation
>            Reporter: Reynold Xin
>            Assignee: Shivaram Venkataraman
>             Fix For: 2.0.1, 2.1.0
>
>
> Both the Jekyll generated docs and SparkR API reference docs have a version
> number in them. It would be great to automate those in the release script
> without having to manually update using a commit.
> cc [~shivaram]
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519497#comment-15519497 ]

Yan commented on SPARK-17556:
-----------------------------

For 2), I think BitTorrent won't help in the case of all-to-all transfers, unlike one-to-all transfers (such as the driver-to-cluster broadcast) or few-to-all transfers. Thanks.

> Executor side broadcast for broadcast joins
> -------------------------------------------
>
>                 Key: SPARK-17556
>                 URL: https://issues.apache.org/jira/browse/SPARK-17556
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core, SQL
>            Reporter: Reynold Xin
>         Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must
> collect the result of an RDD and then broadcast it. This introduces some
> extra latency. It might be possible to broadcast directly from executors.
[jira] [Updated] (SPARK-17631) Structured Streaming - Add Http Stream Sink
[ https://issues.apache.org/jira/browse/SPARK-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17631:
------------------------------
    Fix Version/s:     (was: 2.0.0)

> Structured Streaming - Add Http Stream Sink
> -------------------------------------------
>
>                 Key: SPARK-17631
>                 URL: https://issues.apache.org/jira/browse/SPARK-17631
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL, Streaming
>    Affects Versions: 2.0.0
>            Reporter: zhangxinyu
>            Priority: Minor
>
> Streaming query results can be written to an HTTP server through HTTP POST
> requests.
> github: https://github.com/apache/spark/pull/15194
[jira] [Commented] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518938#comment-15518938 ]

Apache Spark commented on SPARK-17499:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15232

> make the default params in sparkR spark.mlp consistent with
> MultilayerPerceptronClassifier
> -------------------------------------------------------------
>
>                 Key: SPARK-17499
>                 URL: https://issues.apache.org/jira/browse/SPARK-17499
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>             Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Several default params in SparkR spark.mlp are wrong:
> layers should be null
> tol should be 1e-6
> stepSize should be 0.03
> seed should be -763139545
[jira] [Assigned] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17658:
------------------------------------

    Assignee:     (was: Apache Spark)

> write.df API requires path which is not actually always necessary in SparkR
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-17658
>                 URL: https://issues.apache.org/jira/browse/SPARK-17658
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It seems {{write.df}} in SparkR always requires taking {{path}}. This is
> actually not always necessary.
> For example, if we have a datasource extending {{CreatableRelationProvider}},
> it might not request {{path}}.
> FWIW, Python/Scala already do not require this in the API.
[jira] [Commented] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518856#comment-15518856 ]

Apache Spark commented on SPARK-17658:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15231

> write.df API requires path which is not actually always necessary in SparkR
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-17658
>                 URL: https://issues.apache.org/jira/browse/SPARK-17658
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It seems {{write.df}} in SparkR always requires taking {{path}}. This is
> actually not always necessary.
> For example, if we have a datasource extending {{CreatableRelationProvider}},
> it might not request {{path}}.
> FWIW, Python/Scala already do not require this in the API.
[jira] [Assigned] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17658: Assignee: Apache Spark > write.df API requires path which is not actually always necessary in SparkR > --- > > Key: SPARK-17658 > URL: https://issues.apache.org/jira/browse/SPARK-17658 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > It seems {{write.df}} in SparkR always requires taking {{path}}. This is > actually not always necessary. > For example, if we have a datasource extending {{CreatableRelationProvider}}, > it might not request {{path}}. > FWIW, Python/Scala do not require this in the API already. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR
Hyukjin Kwon created SPARK-17658: Summary: write.df API requires path which is not actually always necessary in SparkR Key: SPARK-17658 URL: https://issues.apache.org/jira/browse/SPARK-17658 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Hyukjin Kwon Priority: Minor It seems {{write.df}} in SparkR always requires taking {{path}}. This is actually not always necessary. For example, if we have a datasource extending {{CreatableRelationProvider}}, it might not request {{path}}. FWIW, Python/Scala do not require this in the API already. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
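The fix being requested can be sketched outside of Spark in plain Python: make `path` optional and let each data source declare whether it actually needs one. The `save_df`, `CreatableRelationSource`, and `FileSource` names below are hypothetical stand-ins for illustration, not SparkR's actual internals.

```python
# Hypothetical sketch: `path` is only required when the data source needs one.
class CreatableRelationSource:
    """A source that creates its own relation (like CreatableRelationProvider);
    it does not need a path."""
    requires_path = False

    def save(self, df, options):
        return f"saved {len(df)} rows via relation provider"


class FileSource:
    """A file-based source; a path is mandatory."""
    requires_path = True

    def save(self, df, options):
        return f"saved {len(df)} rows to {options['path']}"


def save_df(df, source, path=None, **options):
    # Mirror the proposed behaviour: only reject a missing path
    # when the chosen source actually requires one.
    if source.requires_path and path is None:
        raise ValueError("path is required for this data source")
    if path is not None:
        options["path"] = path
    return source.save(df, options)


df = [(1, "a"), (2, "b")]
print(save_df(df, CreatableRelationSource()))          # no path needed
print(save_df(df, FileSource(), path="/tmp/out.csv"))  # path required here
```

Under this sketch, only a file-backed source raises when `path` is omitted, which is the behaviour Python/Scala already exhibit per the report.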
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518790#comment-15518790 ] Liang-Chi Hsieh commented on SPARK-17556: - For 1), it is true only if your driver is outside the cluster, so you can avoid uploading data from the driver to the cluster. In cluster mode, I think there is no obvious difference between uploading data from the driver and from any executor. For 2), I think it is not exactly correct. We basically use a BitTorrent-like approach to fetch blocks, so the slaves do need to connect to all the others in the end. > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-17556: Attachment: executor-side-broadcast.pdf > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518598#comment-15518598 ] Yan commented on SPARK-17556: - A few comments of mine are as follows: 1) The "one-executor collection" approach differs from driver-side collection and broadcasting in that it avoids uploading data from the driver back to the cluster. The primary concern with the "one-executor collection" approach, as pointed out, is that the sole executor could become a bottleneck, much like the latency issue with the "driver-side collection" approach; 2) The "all-executor collection" approach is more balanced and scalable, but it might suffer from a network storm since all slaves need to connect to all the others. 3) The real issue is the repeated, and thus wasted, work of collecting pieces of the broadcast data by multiple collectors/broadcasters, weighed against the extended latency if the collection/broadcasting is performed once and for all. This is actually not very different from the multiple- vs. single-reducer scenario in a map/reduce execution: the final output from a single reducer is ready to use, while the output from multiple reducers requires final assembly by the end users, particularly if the final result needs to be organized, e.g., totally ordered. But using multiple reducers is more scalable, more balanced, and likely faster. 4) It's probably good to have a configurable number of executors acting as collectors/broadcasters, each of which collects and broadcasts just a portion of the broadcast table for the final join executions. 
> Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin > Attachments: executor broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
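The "configurable number of collectors" idea in point 4) above can be illustrated with a toy simulation in plain Python (no Spark): k hypothetical collector executors each gather a disjoint portion of the broadcast table's partitions, and a join executor reassembles the k broadcast pieces. All names here are illustrative, not Spark APIs.

```python
def assign_collectors(partitions, k):
    """Round-robin the table's partitions across k collector executors."""
    buckets = [[] for _ in range(k)]
    for i, part in enumerate(partitions):
        buckets[i % k].append(part)
    return buckets


def collect_and_broadcast(buckets):
    """Each collector flattens its own portion; the result models the set of
    per-collector broadcast pieces that join executors would fetch."""
    return [[row for part in bucket for row in part] for bucket in buckets]


def assemble(pieces):
    """A join executor reassembles the full broadcast table from the pieces."""
    table = []
    for piece in pieces:
        table.extend(piece)
    return sorted(table)


partitions = [[1, 4], [2, 5], [3, 6]]
pieces = collect_and_broadcast(assign_collectors(partitions, k=2))
print(assemble(pieces))  # the full table, regardless of k
```

The point of the sketch: the assembled table is identical for any k, so k becomes a tuning knob trading per-collector load against the per-join-executor assembly work, much as with the reducer count in map/reduce.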
[jira] [Resolved] (SPARK-17057) ProbabilisticClassifierModels' thresholds should have at most one 0
[ https://issues.apache.org/jira/browse/SPARK-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17057. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15149 [https://github.com/apache/spark/pull/15149] > ProbabilisticClassifierModels' thresholds should have at most one 0 > --- > > Key: SPARK-17057 > URL: https://issues.apache.org/jira/browse/SPARK-17057 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: zhengruifeng >Assignee: Sean Owen >Priority: Minor > Fix For: 2.1.0 > > > {code} > val path = "./data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rfm = rf.fit(data) > scala> rfm.setThresholds(Array(0.0,0.0,0.0)) > res4: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees > scala> rfm.transform(data).show(5) > +-++--+-+--+ > |label|features| rawPrediction| probability|prediction| > +-++--+-+--+ > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]| 0.0| > +-++--+-+--+ > only showing top 5 rows > {code} > If multiple thresholds are set to zero, the prediction of > {{ProbabilisticClassificationModel}} is the first index whose corresponding > threshold is 0. > However, in this case, it would be more reasonable to mark as {{prediction}} > the index with the max {{probability}} among the indices with a 0 threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
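The behaviour described in the report can be sketched in plain Python; only the index arithmetic is reproduced here, the model internals are Spark's. Per the report, the current prediction is the first index whose threshold is 0, while the proposed behaviour takes the max-probability index among the zero-threshold indices.

```python
def predict_current(probability, thresholds):
    """Reported behaviour: the first zero-threshold index wins outright."""
    for i, t in enumerate(thresholds):
        if t == 0.0:
            return i
    # No zero thresholds: pick the index maximizing probability / threshold.
    scaled = [p / t for p, t in zip(probability, thresholds)]
    return scaled.index(max(scaled))


def predict_proposed(probability, thresholds):
    """Proposed behaviour: among zero-threshold indices, take the one with
    the highest probability."""
    zero = [i for i, t in enumerate(thresholds) if t == 0.0]
    if zero:
        return max(zero, key=lambda i: probability[i])
    scaled = [p / t for p, t in zip(probability, thresholds)]
    return scaled.index(max(scaled))


prob = [0.0, 1.0, 0.0]   # probability vector from the report's example rows
ths = [0.0, 0.0, 0.0]    # setThresholds(Array(0.0, 0.0, 0.0))
print(predict_current(prob, ths))   # 0: first zero threshold
print(predict_proposed(prob, ths))  # 1: max probability among zeros
```

The eventual resolution (allowing at most one 0 in `thresholds`) sidesteps the ambiguity entirely by making the zero-threshold index unique.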
[jira] [Updated] (SPARK-17057) ProbabilisticClassifierModels' thresholds should have at most one 0
[ https://issues.apache.org/jira/browse/SPARK-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17057: -- Issue Type: Improvement (was: Bug) Summary: ProbabilisticClassifierModels' thresholds should have at most one 0 (was: ProbabilisticClassifierModels' thresholds should be > 0) > ProbabilisticClassifierModels' thresholds should have at most one 0 > --- > > Key: SPARK-17057 > URL: https://issues.apache.org/jira/browse/SPARK-17057 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: zhengruifeng >Assignee: Sean Owen >Priority: Minor > > {code} > val path = "./data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rfm = rf.fit(data) > scala> rfm.setThresholds(Array(0.0,0.0,0.0)) > res4: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees > scala> rfm.transform(data).show(5) > +-++--+-+--+ > |label|features| rawPrediction| probability|prediction| > +-++--+-+--+ > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]| 0.0| > +-++--+-+--+ > only showing top 5 rows > {code} > If multiple thresholds are set to zero, the prediction of > {{ProbabilisticClassificationModel}} is the first index whose corresponding > threshold is 0. > However, in this case, it would be more reasonable to mark as {{prediction}} > the index with the max {{probability}} among the indices with a 0 threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17656) Decide on the variant of @scala.annotation.varargs and use consistently
[ https://issues.apache.org/jira/browse/SPARK-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17656: -- Affects Version/s: (was: 2.0.2) 2.0.0 Priority: Trivial (was: Major) (Not Major, can't affect unreleased 2.0.2) There is only one annotation, it's a question of how to import it. The normal thing to do is {{import scala.annotation.varargs}} and then {{@varargs}}. The {{_root_}} prefix has to be used where necessary to disambiguate the import, but isn't apparently needed in any case in the code right now. > Decide on the variant of @scala.annotation.varargs and use consistently > --- > > Key: SPARK-17656 > URL: https://issues.apache.org/jira/browse/SPARK-17656 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Trivial > > After the [discussion at > dev@spark|http://apache-spark-developers-list.1001551.n3.nabble.com/scala-annotation-varargs-or-root-scala-annotation-varargs-td18898.html] > it appears there's a consensus to review the use of > {{@scala.annotation.varargs}} throughout the codebase and use one variant and > use it consistently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10835) Word2Vec should accept non-null string array, in addition to existing null string array
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10835. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 Issue resolved by pull request 15179 [https://github.com/apache/spark/pull/15179] > Word2Vec should accept non-null string array, in addition to existing null > string array > --- > > Key: SPARK-10835 > URL: https://issues.apache.org/jira/browse/SPARK-10835 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Sumit Chawla >Assignee: yuhao yang >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > Currently output type of NGram is Array(String, false), which is not > compatible with LDA since their input type is Array(String, true). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)
[ https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518577#comment-15518577 ] Xiao Li commented on SPARK-17653: - [~rxin] I submitted a PR https://github.com/apache/spark/pull/11930 for resolving a related issue. If you think that is a right direction, I will continue/enhance it and write the design doc. > Optimizer should remove unnecessary distincts (in multiple unions) > -- > > Key: SPARK-17653 > URL: https://issues.apache.org/jira/browse/SPARK-17653 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Reynold Xin > > Query: > {code} > select 1 a union select 2 b union select 3 c > {code} > Explain plan: > {code} > == Physical Plan == > *HashAggregate(keys=[a#13], functions=[]) > +- Exchange hashpartitioning(a#13, 200) >+- *HashAggregate(keys=[a#13], functions=[]) > +- Union > :- *HashAggregate(keys=[a#13], functions=[]) > : +- Exchange hashpartitioning(a#13, 200) > : +- *HashAggregate(keys=[a#13], functions=[]) > :+- Union > : :- *Project [1 AS a#13] > : : +- Scan OneRowRelation[] > : +- *Project [2 AS b#14] > : +- Scan OneRowRelation[] > +- *Project [3 AS c#15] > +- Scan OneRowRelation[] > {code} > Only one distinct should be necessary. This makes a bunch of unions slower > than a bunch of union alls followed by a distinct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
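The optimization the ticket asks for can be checked with plain Python sets standing in for DISTINCT: a distinct over nested distinct-unions yields the same rows as a single distinct over one flattened union-all, so the inner HashAggregate/Exchange pairs in the plan above are redundant.

```python
# Toy relations standing in for the three one-row projections.
a, b, c = [1], [2], [3, 3]

# Plan as written: distinct(union(distinct(union(a, b)), c))
nested = set(set(a) | set(b)) | set(c)

# Optimized plan: a single distinct over one flattened union-all.
flattened = set(a + b + c)

print(sorted(nested) == sorted(flattened))  # True: one distinct suffices
```

Since set union is associative and idempotent, the equivalence holds for any number of chained unions, which is why the optimizer can safely collapse the distincts.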
[jira] [Updated] (SPARK-10835) Word2Vec should accept non-null string array, in addition to existing null string array
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10835: -- Summary: Word2Vec should accept non-null string array, in addition to existing null string array (was: [SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition to existing null string array) > Word2Vec should accept non-null string array, in addition to existing null > string array > --- > > Key: SPARK-10835 > URL: https://issues.apache.org/jira/browse/SPARK-10835 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Sumit Chawla >Assignee: yuhao yang >Priority: Minor > > Currently output type of NGram is Array(String, false), which is not > compatible with LDA since their input type is Array(String, true). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10835) [SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition to existing null string array
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10835: -- Summary: [SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition to existing null string array (was: Change Output of NGram to Array(String, True)) > [SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition > to existing null string array > -- > > Key: SPARK-10835 > URL: https://issues.apache.org/jira/browse/SPARK-10835 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Sumit Chawla >Assignee: yuhao yang >Priority: Minor > > Currently output type of NGram is Array(String, false), which is not > compatible with LDA since their input type is Array(String, true). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10835) Word2Vec should accept non-null string array, in addition to existing null string array
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10835: -- Shepherd: Sean Owen (was: Joseph K. Bradley) > Word2Vec should accept non-null string array, in addition to existing null > string array > --- > > Key: SPARK-10835 > URL: https://issues.apache.org/jira/browse/SPARK-10835 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Sumit Chawla >Assignee: yuhao yang >Priority: Minor > > Currently output type of NGram is Array(String, false), which is not > compatible with LDA since their input type is Array(String, true). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17657) Disallow Users to Change Table Type
[ https://issues.apache.org/jira/browse/SPARK-17657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518556#comment-15518556 ] Apache Spark commented on SPARK-17657: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15230 > Disallow Users to Change Table Type > > > Key: SPARK-17657 > URL: https://issues.apache.org/jira/browse/SPARK-17657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Xiao Li > > Hive allows users to change the table type from `Managed` to `External` or > from `External` to `Managed` by altering table's property `EXTERNAL`. See the > JIRA: https://issues.apache.org/jira/browse/HIVE-1329 > So far, Spark SQL does not correctly support it, although users can do it. > Many assumptions are broken in the implementation. Thus, this PR is to > disallow users to do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17657) Disallow Users to Change Table Type
[ https://issues.apache.org/jira/browse/SPARK-17657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17657: Assignee: (was: Apache Spark) > Disallow Users to Change Table Type > > > Key: SPARK-17657 > URL: https://issues.apache.org/jira/browse/SPARK-17657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Xiao Li > > Hive allows users to change the table type from `Managed` to `External` or > from `External` to `Managed` by altering table's property `EXTERNAL`. See the > JIRA: https://issues.apache.org/jira/browse/HIVE-1329 > So far, Spark SQL does not correctly support it, although users can do it. > Many assumptions are broken in the implementation. Thus, this PR is to > disallow users to do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17657) Disallow Users to Change Table Type
[ https://issues.apache.org/jira/browse/SPARK-17657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17657: Assignee: Apache Spark > Disallow Users to Change Table Type > > > Key: SPARK-17657 > URL: https://issues.apache.org/jira/browse/SPARK-17657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Hive allows users to change the table type from `Managed` to `External` or > from `External` to `Managed` by altering table's property `EXTERNAL`. See the > JIRA: https://issues.apache.org/jira/browse/HIVE-1329 > So far, Spark SQL does not correctly support it, although users can do it. > Many assumptions are broken in the implementation. Thus, this PR is to > disallow users to do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17657) Disallow Users to Change Table Type
Xiao Li created SPARK-17657: --- Summary: Disallow Users to Change Table Type Key: SPARK-17657 URL: https://issues.apache.org/jira/browse/SPARK-17657 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 2.1.0 Reporter: Xiao Li Hive allows users to change the table type from `Managed` to `External` or from `External` to `Managed` by altering table's property `EXTERNAL`. See the JIRA: https://issues.apache.org/jira/browse/HIVE-1329 So far, Spark SQL does not correctly support it, although users can do it. Many assumptions are broken in the implementation. Thus, this PR is to disallow users to do it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17656) Decide on the variant of @scala.annotation.varargs and use consistently
Jacek Laskowski created SPARK-17656: --- Summary: Decide on the variant of @scala.annotation.varargs and use consistently Key: SPARK-17656 URL: https://issues.apache.org/jira/browse/SPARK-17656 Project: Spark Issue Type: Improvement Affects Versions: 2.0.2 Reporter: Jacek Laskowski After the [discussion at dev@spark|http://apache-spark-developers-list.1001551.n3.nabble.com/scala-annotation-varargs-or-root-scala-annotation-varargs-td18898.html] it appears there's a consensus to review the use of {{@scala.annotation.varargs}} throughout the codebase and use one variant and use it consistently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8987) Increase test coverage of DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518537#comment-15518537 ] OuyangJin commented on SPARK-8987: -- I'd like to work on this > Increase test coverage of DAGScheduler > -- > > Key: SPARK-8987 > URL: https://issues.apache.org/jira/browse/SPARK-8987 > Project: Spark > Issue Type: Umbrella > Components: Scheduler, Tests >Affects Versions: 1.0.0 >Reporter: Andrew Or > > DAGScheduler is one of the most monstrous pieces of code in Spark. Every time > someone changes something, something like the following happens: > (1) Someone pings a committer > (2) The committer pings a scheduler maintainer > (3) The scheduler maintainer correctly points out bugs in the patch > (4) The author of the patch fixes the bugs but introduces more bugs > (5) Repeat steps 3 - 4 N times > (6) Other committers / contributors jump in and start debating > (7) The patch goes stale for months > All of this happens because no one, including the committers, has high > confidence that a particular change doesn't break some corner case in the > scheduler. I believe one of the main issues is the lack of sufficient test > coverage, which is not a luxury but a necessity for logic as complex as the > DAGScheduler. > As of the writing of this JIRA, DAGScheduler has ~1500 lines, while the > DAGSchedulerSuite only has ~900 lines. I would argue that the suite line > count should actually be many multiples of that of the original code. > If you wish to work on this, let me know and I will assign it to you. Anyone > is welcome. :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17210) sparkr.zip is not distributed to executors when run sparkr in RStudio
[ https://issues.apache.org/jira/browse/SPARK-17210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518507#comment-15518507 ] Felix Cheung commented on SPARK-17210: -- Got it, sorry about that, I should have noticed. > sparkr.zip is not distributed to executors when run sparkr in RStudio > - > > Key: SPARK-17210 > URL: https://issues.apache.org/jira/browse/SPARK-17210 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Fix For: 2.0.2, 2.1.0 > > > Here's the code to reproduce this issue. > {code} > Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark") > .libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths())) > library(SparkR) > sparkR.session(master="yarn-client", sparkConfig = > list(spark.executor.instances="1")) > df <- as.DataFrame(mtcars) > head(df) > {code} > And this is the exception in executor log. > {noformat} > 16/08/24 15:33:45 INFO BufferedStreamThread: Fatal error: cannot open file > '/Users/jzhang/Temp/hadoop_tmp/nm-local-dir/usercache/jzhang/appcache/application_1471846125517_0022/container_1471846125517_0022_01_02/sparkr/SparkR/worker/daemon.R': > No such file or directory > 16/08/24 15:33:55 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 6) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) > at java.net.ServerSocket.implAccept(ServerSocket.java:545) > at java.net.ServerSocket.accept(ServerSocket.java:513) > at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:367) > at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org