[jira] [Created] (SPARK-20273) No non-deterministic Filter push-down into Join Conditions
Xiao Li created SPARK-20273: --- Summary: No non-deterministic Filter push-down into Join Conditions Key: SPARK-20273 URL: https://issues.apache.org/jira/browse/SPARK-20273 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Assignee: Xiao Li
{noformat}
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
{noformat}
We will get the following error:
{noformat}
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
{noformat}
Filters can be pushed down into join conditions by the optimizer rule {{PushPredicateThroughJoin}}. However, the analyzer blocks users from adding non-deterministic join conditions (for details, see the PR https://github.com/apache/spark/pull/7535). We should not push down non-deterministic conditions; otherwise, we should allow users to do so by explicitly initializing the non-deterministic expressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions
[ https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20273: Summary: Disallow Non-deterministic Filter push-down into Join Conditions (was: No non-deterministic Filter push-down into Join Conditions) > Disallow Non-deterministic Filter push-down into Join Conditions > > > Key: SPARK-20273 > URL: https://issues.apache.org/jira/browse/SPARK-20273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b > having r > 0.5").show() > {noformat} > We will get the following error: > {noformat} > Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > {noformat} > Filters can be pushed down into join conditions by the optimizer rule > {{PushPredicateThroughJoin}}. However, the analyzer blocks users from adding > non-deterministic conditions (for details, see the PR > https://github.com/apache/spark/pull/7535). 
> We should not push down non-deterministic conditions; otherwise, we should > allow users to do so by explicitly initializing the non-deterministic > expressions
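To illustrate why the optimizer must not push a non-deterministic filter below the join, here is a plain-Python sketch (not Spark code; the tables and the deterministic stand-in for {{rand(0)}} are invented for illustration). Pushing the predicate down changes how many times the expression is evaluated, so a seeded generator yields a different result set:

```python
class FakeRand:
    """Deterministic stand-in for a seeded rand(0): returns a fixed sequence."""
    SEQ = [0.9, 0.1, 0.6, 0.2, 0.8, 0.3, 0.7, 0.4]

    def __init__(self):
        self.i = 0

    def random(self):
        v = self.SEQ[self.i % len(self.SEQ)]
        self.i += 1
        return v

left = [1, 2, 3, 4]
right = ["a", "b"]

# Plan A: join first, then filter -- the non-deterministic predicate is
# evaluated once per joined row (8 draws here).
rng = FakeRand()
after_join = [(l, r) for l in left for r in right if rng.random() > 0.5]

# Plan B: the "optimized" plan that pushes the same predicate below the
# join -- now it is evaluated once per left row (only 4 draws), so a
# different set of rows survives even with the identical seed.
rng = FakeRand()
pushed = [l for l in left if rng.random() > 0.5]
after_push = [(l, r) for l in pushed for r in right]

print(after_join)  # [(1, 'a'), (2, 'a'), (3, 'a'), (4, 'a')]
print(after_push)  # [(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b')]
```

With the predicate kept above the join it is drawn once per joined row; pushed below, once per left row, so the two plans disagree even though the seed is the same. This is why non-deterministic conditions must stay where the user wrote them.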
[jira] [Created] (SPARK-20272) about graph shortestPath algorithm question
huangjunjun created SPARK-20272: --- Summary: about graph shortestPath algorithm question Key: SPARK-20272 URL: https://issues.apache.org/jira/browse/SPARK-20272 Project: Spark Issue Type: Question Components: GraphX Affects Versions: 2.1.0 Reporter: huangjunjun We all know that a shortest-path algorithm should compute the distance between a source vertex and a destination vertex. In fact, the ShortestPaths algorithm in GraphX computes the smallest number of vertices passed through (the hop count) from source to destination, ignoring edge weights.
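The difference the reporter describes can be shown outside Spark. A minimal sketch (plain Python rather than GraphX; the graph and its weights are invented) contrasting hop counting, which is what GraphX's ShortestPaths returns, with weight-aware Dijkstra:

```python
from collections import deque
import heapq

# Directed graph: vertex -> [(neighbor, weight)]. The direct A->C edge
# is "short" in hops but "long" in weight.
graph = {
    "A": [("C", 10.0), ("B", 1.0)],
    "B": [("C", 1.0)],
    "C": [],
}

def hop_count(src, dst):
    """BFS: fewest edges traversed, ignoring weights (what ShortestPaths computes)."""
    seen, q = {src: 0}, deque([src])
    while q:
        v = q.popleft()
        for n, _w in graph[v]:
            if n not in seen:
                seen[n] = seen[v] + 1
                q.append(n)
    return seen.get(dst)

def weighted_distance(src, dst):
    """Dijkstra: minimum total edge weight."""
    dist, heap = {}, [(0.0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in dist:
            continue
        dist[v] = d
        for n, w in graph[v]:
            if n not in dist:
                heapq.heappush(heap, (d + w, n))
    return dist.get(dst)

print(hop_count("A", "C"))          # 1 hop, via the direct A->C edge
print(weighted_distance("A", "C"))  # 2.0, via A -> B -> C
```

The two notions of "shortest" disagree on this graph, which is exactly the surprise the reporter hit: GraphX's ShortestPaths answers the first question, not the second.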
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962455#comment-15962455 ] Saisai Shao commented on SPARK-16742: - Hi [~mgummelt], I'm working on the design of SPARK-19143. Looking at your comments, I think parts of the work overlap, especially the RPC part to propagate credentials. Here is my current WIP design (https://docs.google.com/document/d/1Y8CY3XViViTYiIQO9ySoid0t9q3H163fmroCV1K3NTk/edit?usp=sharing). In my current design I offer a standard RPC solution to support different cluster managers. It would be great if we could collaborate to meet the same goal. My main concern is that if Mesos's implementation is quite different from YARN's, it will require more effort to align the different cluster managers; if your proposal is similar to what I proposed here, then my work can be based on yours. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Resolved] (SPARK-20229) add semanticHash to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-20229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20229. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17541 [https://github.com/apache/spark/pull/17541] > add semanticHash to QueryPlan > - > > Key: SPARK-20229 > URL: https://issues.apache.org/jira/browse/SPARK-20229 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962450#comment-15962450 ] Michael Gummelt commented on SPARK-16742: - Also, note that the above Mesos implementation is not dependent on Mesos in any way. It just uses Spark's existing RPC mechanisms to transmit delegation tokens. I see that there's a related effort here to standardize this RPC mechanism: https://issues.apache.org/jira/browse/SPARK-19143. We'd be more than happy to adopt that standard once it exists. But hopefully our one-off RPC that we're currently using is acceptable in the interim. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962440#comment-15962440 ] Michael Gummelt edited comment on SPARK-16742 at 4/10/17 5:28 AM: -- Hi [~vanzin], [~ganger85] and Strat.io are pulling back their Mesos Kerberos implementation for now, and we at Mesosphere are about to submit a PR to upstream our implementation. I have a few questions I'd like to run by you to make sure that PR goes smoothly. 1) I've been following your comments on this Spark Standalone Kerberos PR: https://github.com/apache/spark/pull/17530. It looks like your concern is that in *cluster mode*, the keytab is written to a file on the host running the driver, and is owned by the user of the Spark Worker, which will be the same for each job. So jobs submitted by multiple users will be able to read each other's keytabs. In *client mode*, it looks like the delegation tokens are written to a file (HADOOP_TOKEN_FILE_LOCATION) on the host running the executor, which suffers from the same problem as the keytab in cluster mode. The problem is then that a kerberos-authenticated user submitting their job would be unaware that their credentials are being leaked to other users. Is this an accurate description of the issue? 2) I understand that YARN writes delegation tokens via {{amContainer.setTokens()}}, which ultimately results in the delegation token being written to a file owned by the submitting user. However, since the "submitting user" is a Kerberos user, not a Unix user, I'm assuming that {{hadoop.security.auth_to_local}} is what maps the Kerberos user to the Unix user who runs the ApplicationMaster and owns that file. Is that correct? To avoid the shared-file problem for delegation tokens, our Mesos implementation currently has the Executor issue an RPC call to fetch the delegation token from the driver. 
There therefore isn't any need for at-rest access control, and if in-motion interception is in the user's threat model, then users can be sure to run Spark with SSL. We avoid the shared-file problem for keytabs entirely, because there's no need to distribute the keytab, at least in client mode. Unlike YARN, the driver and the equivalent of the "ApplicationMaster" in Mesos are one and the same. They both exist in the same process, the {{spark-submit}} process. We're probably going to punt on cluster mode for now, just for simplicity, but we should be able to solve this in cluster mode as well, because unlike standalone, and much like YARN, Mesos controls what user the driver runs as. What do you think of the above approach? If you see any blockers, I would very much appreciate teasing those out now rather than during the PR. Thanks! was (Author: mgummelt): Hi [~vanzin], [~ganger85] and Strat.io are pulling back their Mesos Kerberos implementation for now, and we at Mesosphere are about to submit a PR to upstream our implementation. I have a few questions I'd like to run by you to make sure that PR goes smoothly. 1) I've been following your comments on this Spark Standalone Kerberos PR: https://github.com/apache/spark/pull/17530. It looks like your concern is that in *cluster mode*, the keytab is written to a file on the host running the driver, and is owned by the user of the Spark Worker, which will be the same for each job. So jobs submitted by multiple users will be able to read each other's keytabs. In *client mode*, it looks like the delegation tokens are written to a file (HADOOP_TOKEN_FILE_LOCATION) on the host running the executor, which suffers from the same problem as the keytab in cluster mode. The problem is then that a kerberos-authenticated user submitting their job would be unaware that their credentials are being leaked to other users. Is this an accurate description of the issue? 
2) I understand that YARN writes delegation tokens via {{amContainer.setTokens()}}, which ultimately results in the delegation token being written to a file owned by the submitting user. However, since the "submitting user" is a Kerberos user, not a Unix user, I'm assuming that {{hadoop.security.auth_to_local}} is what maps the Kerberos user to the Unix user who runs the ApplicationMaster and owns that file. Is that correct? To avoid the shared-file problem for delegation tokens, our Mesos implementation currently has the Executor issue an RPC call to fetch the delegation token from the driver. There therefore isn't any need for at-rest encryption, and if in-motion encryption is in the user's threat model, then users can be sure to run Spark with SSL. We avoid the shared-file problem for keytabs entirely, because there's no need to distribute the keytab, at least in client mode. Unlike YARN, the driver and the equivalent of the "ApplicationMaster" in Mesos are one and the same. They both exist in the same process,
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962440#comment-15962440 ] Michael Gummelt commented on SPARK-16742: - Hi [~vanzin], [~ganger85] and Strat.io are pulling back their Mesos Kerberos implementation for now, and we at Mesosphere are about to submit a PR to upstream our implementation. I have a few questions I'd like to run by you to make sure that PR goes smoothly. 1) I've been following your comments on this Spark Standalone Kerberos PR: https://github.com/apache/spark/pull/17530. It looks like your concern is that in *cluster mode*, the keytab is written to a file on the host running the driver, and is owned by the user of the Spark Worker, which will be the same for each job. So jobs submitted by multiple users will be able to read each other's keytabs. In *client mode*, it looks like the delegation tokens are written to a file (HADOOP_TOKEN_FILE_LOCATION) on the host running the executor, which suffers from the same problem as the keytab in cluster mode. The problem is then that a kerberos-authenticated user submitting their job would be unaware that their credentials are being leaked to other users. Is this an accurate description of the issue? 2) I understand that YARN writes delegation tokens via {{amContainer.setTokens()}}, which ultimately results in the delegation token being written to a file owned by the submitting user. However, since the "submitting user" is a Kerberos user, not a Unix user, I'm assuming that {{hadoop.security.auth_to_local}} is what maps the Kerberos user to the Unix user who runs the ApplicationMaster and owns that file. Is that correct? To avoid the shared-file problem for delegation tokens, our Mesos implementation currently has the Executor issue an RPC call to fetch the delegation token from the driver. 
There therefore isn't any need for at-rest encryption, and if in-motion encryption is in the user's threat model, then users can be sure to run Spark with SSL. We avoid the shared-file problem for keytabs entirely, because there's no need to distribute the keytab, at least in client mode. Unlike YARN, the driver and the equivalent of the "ApplicationMaster" in Mesos are one and the same. They both exist in the same process, the {{spark-submit}} process. We're probably going to punt on cluster mode for now, just for simplicity, but we should be able to solve this in cluster mode as well, because unlike standalone, and much like YARN, Mesos controls what user the driver runs as. What do you think of the above approach? If you see any blockers, I would very much appreciate teasing those out now rather than during the PR. Thanks! > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
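The token-over-RPC idea discussed above can be sketched in miniature (plain Python sockets, not Spark's RPC layer; the token bytes and the single-connection setup are invented for illustration): the driver keeps the delegation token in memory and hands it out over a connection on request, so nothing sensitive ever rests in a file shared between users.

```python
import socket
import threading

# The delegation token lives only in driver memory (value is made up).
DELEGATION_TOKEN = b"HDFS_DELEGATION_TOKEN:example"

def driver(server_sock):
    # Accept one "executor" connection and send the token over the wire.
    conn, _addr = server_sock.accept()
    with conn:
        conn.sendall(DELEGATION_TOKEN)

# Driver side: listen on an ephemeral localhost port.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=driver, args=(srv,), daemon=True).start()

# Executor side: fetch the token over the connection instead of reading
# a shared on-disk file such as HADOOP_TOKEN_FILE_LOCATION.
with socket.create_connection(("127.0.0.1", port)) as c:
    token = c.recv(1024)

print(token == DELEGATION_TOKEN)
```

In the real design the channel would of course be authenticated and, per the comment above, encrypted with SSL if in-motion interception is in the threat model; the point of the sketch is only that a fetch-on-demand token removes the at-rest shared-file problem.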
[jira] [Resolved] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-20270. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17577 [https://github.com/apache/spark/pull/17577] > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Critical > Fix For: 2.2.0 > > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes Long member > ids in our application when the id is so big that it cannot be represented > losslessly by a Double. > Here is an example of how this happens. With > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > Note that even when the value is not null, Spark will cast the Long into > Double first; then, if it's not null, Spark will cast it back to Long, which > results in losing precision. > The original value should not be changed if it's not null, but Spark > changes the value, which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
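The precision loss behind this bug is visible with plain Python floats, which are the same 64-bit IEEE-754 doubles Spark casts through (the id below is one of the values from the report). A double has a 53-bit significand, so integers near 2^63 are only representable in steps of 2^10 = 1024:

```python
# A 63-bit id cannot be represented exactly by a 64-bit double, so the
# Long -> Double -> Long round trip that the old na.fill plan performed
# silently changes the value.
member_id = 9123146099426677101          # id taken from the bug report
round_tripped = int(float(member_id))    # what cast(cast(a as double) as bigint) does

print(member_id)
print(round_tripped)
print(member_id == round_tripped)        # False: the id was silently altered
```

This is exactly why the fixed plan casts the *fill value* to bigint instead of casting the column to double: the non-null Long values are then never touched.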
[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962422#comment-15962422 ] Hyukjin Kwon commented on SPARK-20193: -- Maybe, yeah, but I guess we can't change the method signature as it breaks binary compatibility. > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Priority: Minor > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at 
org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 39 elided > {quote}
[jira] [Closed] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry.X.He closed SPARK-20266. -- > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Commented] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962415#comment-15962415 ] Jerry.X.He commented on SPARK-20266: [~hyukjin.kwon] thank you. I don't think it is Spark's problem either, so tomorrow I will reinstall another version of Spark; if the same question is still here, there may be some problem in my environment ... I will report this question in the appropriate channel, thank you .. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Commented] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962407#comment-15962407 ] Hyukjin Kwon commented on SPARK-20266: -- Please refer to "Mailing Lists" in http://spark.apache.org/community.html. Subscribe to the mailing list and send an email to the address. I resolved this JIRA as it is apparently a question and it does not look like a Spark problem. It might be an issue, but judging from the details in the current JIRA, nothing indicates it is an issue within Spark. If you are pretty sure that it is an issue in Spark, please reopen with more details. Otherwise, I guess asking first is better. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Resolved] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20259. -- Resolution: Duplicate Actually, the title refers to pushing down the join. I am resolving this. > Support push down join optimizations in DataFrameReader when loading from JDBC > -- > > Key: SPARK-20259 > URL: https://issues.apache.org/jira/browse/SPARK-20259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: John Muller >Priority: Minor > > Given two dataframes loaded from the same JDBC connection: > {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} > val ordersDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.orders") > .option("user", "username") > .option("password", "password") > .load().toDS > > val productDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.product") > .option("user", "username") > .option("password", "password") > .load().toDS > > ordersDF.createOrReplaceTempView("orders") > productDF.createOrReplaceTempView("product") > // Followed by a join between them: > val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o > INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") > {code} > Catalyst should optimize the query to be: > SELECT northwind.product.name, SUM(northwind.orders.qty) > FROM northwind.orders > INNER JOIN northwind.product ON > northwind.orders.product_id = northwind.product.product_id > GROUP BY p.name
[jira] [Commented] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962404#comment-15962404 ] Hyukjin Kwon commented on SPARK-20259: -- If so, I guess it is a duplicate of SPARK-12449. I'd close this if it does not get updated for a while, say a couple of weeks, assuming it refers to pushing down the join. > Support push down join optimizations in DataFrameReader when loading from JDBC > -- > > Key: SPARK-20259 > URL: https://issues.apache.org/jira/browse/SPARK-20259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: John Muller >Priority: Minor > > Given two dataframes loaded from the same JDBC connection: > {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} > val ordersDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.orders") > .option("user", "username") > .option("password", "password") > .load().toDS > > val productDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.product") > .option("user", "username") > .option("password", "password") > .load().toDS > > ordersDF.createOrReplaceTempView("orders") > productDF.createOrReplaceTempView("product") > // Followed by a join between them: > val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o > INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") > {code} > Catalyst should optimize the query to be: > SELECT northwind.product.name, SUM(northwind.orders.qty) > FROM northwind.orders > INNER JOIN northwind.product ON > northwind.orders.product_id = northwind.product.product_id > GROUP BY p.name
[jira] [Commented] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962401#comment-15962401 ] Jerry.X.He commented on SPARK-20266: [~hyukjin.kwon] I'm sorry, I didn't know how to ask questions; I saw that feedback could be submitted here, so I submitted it here. Sorry, could you tell me where the "user mailing list" is? I'm a green hand, thank you. I had searched those posts before, and they did not fix this problem. Here are some logs of tests in my cluster; or maybe I am thinking about this wrongly, please help me check them.
1. ufw status and ssh connectivity
root@master:/usr/local/ProgramFiles# ufw status
Status: inactive
root@master:/usr/local/ProgramFiles# ssh slave01
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-62-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Last login: Sat Apr 8 21:33:44 2017 from 192.168.0.119
root@slave01:~# ufw status
Status: inactive
root@slave01:~# ssh slave02
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-62-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Last login: Sat Apr 8 21:10:33 2017 from 192.168.0.119
root@slave02:~# ufw status
Status: inactive
root@slave02:~#
2. network connectivity by ip or FQDN
2.1. nc in master
root@master:/usr/local/ProgramFiles# netcat -l 12306
root@master:/usr/local/ProgramFiles# nc -l 12306
root@master:/usr/local/ProgramFiles# nc -l 12306
root@master:/usr/local/ProgramFiles# nc -l 12306
2.2. nc in slave01
root@slave01:~# nc -vz 192.168.0.180 12306
Connection to 192.168.0.180 12306 port [tcp/*] succeeded!
root@slave01:~# nc -vz master 12306
Connection to master 12306 port [tcp/*] succeeded!
2.3. nc in slave02
root@slave02:/usr/local/ProgramFiles# nc -vz 192.168.0.180 12306
Connection to 192.168.0.180 12306 port [tcp/*] succeeded!
root@slave02:/usr/local/ProgramFiles# nc -vz master 12306
Connection to master 12306 port [tcp/*] succeeded!
root@slave02:/usr/local/ProgramFiles#
> ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Resolved] (SPARK-20264) asm should be non-test dependency in sql/core
[ https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20264. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 > asm should be non-test dependency in sql/core > - > > Key: SPARK-20264 > URL: https://issues.apache.org/jira/browse/SPARK-20264 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.2, 2.2.0 > > > The sql/core module currently declares asm as a test-scope dependency. > Transitively it should actually be a normal dependency, since the core > module itself defines it. This occasionally confuses IntelliJ. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
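The fix described above amounts to a one-line scope change in the Maven build. A hypothetical sketch (the artifact coordinates below are assumptions for illustration, not copied from the actual sql/core POM):

```xml
<!-- sql/core/pom.xml (sketch): drop the test scope so asm becomes a normal
     compile-scope dependency, matching how the core module already uses it. -->
<dependency>
  <groupId>org.apache.xbean</groupId>
  <artifactId>xbean-asm5-shaded</artifactId>
  <!-- was: <scope>test</scope> — removed per SPARK-20264 -->
</dependency>
```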
[jira] [Commented] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962400#comment-15962400 ] Xiao Li commented on SPARK-20259: - Pushing join into JDBC data sources? > Support push down join optimizations in DataFrameReader when loading from JDBC > -- > > Key: SPARK-20259 > URL: https://issues.apache.org/jira/browse/SPARK-20259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: John Muller >Priority: Minor > > Given two DataFrames loaded from the same JDBC connection: > {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} > val ordersDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.orders") > .option("user", "username") > .option("password", "password") > .load() > > val productDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.product") > .option("user", "username") > .option("password", "password") > .load() > > ordersDF.createOrReplaceTempView("orders") > productDF.createOrReplaceTempView("product") > // Followed by a join between them: > val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o > INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") > {code} > Catalyst should optimize the query to be: > SELECT northwind.product.name, SUM(northwind.orders.qty) > FROM northwind.orders > INNER JOIN northwind.product ON > northwind.orders.product_id = northwind.product.product_id > GROUP BY northwind.product.name -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
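Until such a pushdown exists, the join can be pushed to the database by hand, since the JDBC source accepts a subquery as the {{dbtable}} option. A sketch under that approach (connection details are the placeholders from the example above):

```scala
// Manual workaround: let the database execute the join by wrapping the whole
// query in the "dbtable" option. Spark then reads only the aggregated rows.
val ordersByProduct = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable",
    """(SELECT p.name, SUM(o.qty) AS qty
      |   FROM northwind.orders AS o
      |   INNER JOIN northwind.product AS p ON o.product_id = p.product_id
      |  GROUP BY p.name) AS t""".stripMargin)
  .option("user", "username")
  .option("password", "password")
  .load()
```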
[jira] [Updated] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20271: --- Description: Just to share some code I implemented to help easily create a custom Transformer in one line of code. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded like other transformers and can be integrated into a pipeline normally. It can be used widely in many use cases like conditional conversion (if...else...), type conversion, to/from Array, to/from Vector, and many string ops. was: Just to share some code I implemented to help easily create a custom Transformer in one line of code. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded like other transformers and can be integrated into a pipeline normally. It can be used widely in many use cases and you can find some examples in the PR. > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > like conditional conversion (if...else...), type conversion, to/from Array, > to/from Vector, and many string ops.
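For context, a sketch of how such a transformer might sit in a pipeline. Only {{FuncTransformer}} and its constructor come from the proposal; the {{setInputCol}}/{{setOutputCol}} setters and column names are assumed here for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical usage of the proposed one-line transformer.
val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1.0 else 0.0)
  .setInputCol("rawLabel")
  .setOutputCol("label")

// It would participate in a Pipeline like any other stage, and be
// persisted along with the fitted PipelineModel.
val pipeline = new Pipeline()
  .setStages(Array(labelConverter, new LogisticRegression()))
```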
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20271: Assignee: Apache Spark > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Apache Spark > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20271: Assignee: (was: Apache Spark) > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962398#comment-15962398 ] Apache Spark commented on SPARK-20271: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/17583 > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20239) Improve HistoryServer ACL mechanism
[ https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20239: Assignee: Apache Spark > Improve HistoryServer ACL mechanism > --- > > Key: SPARK-20239 > URL: https://issues.apache.org/jira/browse/SPARK-20239 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Apache Spark > > The current SHS (Spark History Server) has two different ACLs: > * The ACL of the base URL, controlled by "spark.acls.enabled" or > "spark.ui.acls.enabled". With this enabled, only users configured in > "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user > who started the SHS, can list all the applications; otherwise none of them can > be listed. This also affects the REST APIs that list the summary of all > apps or of one app. > * The per-application ACL, controlled by "spark.history.ui.acls.enabled". > With this enabled, only the history admin user and the user/group who ran an app can > access the details of that app. > With these two ACLs, we may encounter several unexpected behaviors: > 1. If the base URL's ACL is enabled but user "A" has no view permission, user "A" > cannot see the app list but can still access the details of their own app. > 2. If the base URL's ACL is disabled, user "A" can see the summary of > all the apps, even ones not run by user "A", but cannot access the details. > 3. The history admin ACL grants no permission to list all apps if the admin user is > not added to the base URL's ACL. > These unexpected behaviors arise mainly because we have two different ACLs; > ideally we should have only one to manage everything. > So to improve the SHS's ACL mechanism, we should: > * Unify the two different ACLs into one, and always honor it (both in the base > URL and app details). > * Let a user partially list and display only the apps they ran, according to > the ACLs in the event log.
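The two ACL families can be seen side by side in a spark-defaults.conf sketch; the property names are the ones quoted above, and the user names are illustrative:

```properties
# Base-URL ACL: gates the application list and the listing REST APIs.
spark.acls.enabled            true
spark.admin.acls              admin_user
spark.ui.view.acls            user_a

# Per-application ACL: gates the detail pages of each application.
spark.history.ui.acls.enabled true
```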
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20239) Improve HistoryServer ACL mechanism
[ https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962394#comment-15962394 ] Apache Spark commented on SPARK-20239: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/17582 > Improve HistoryServer ACL mechanism > --- > > Key: SPARK-20239 > URL: https://issues.apache.org/jira/browse/SPARK-20239 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao > > The current SHS (Spark History Server) has two different ACLs: > * The ACL of the base URL, controlled by "spark.acls.enabled" or > "spark.ui.acls.enabled". With this enabled, only users configured in > "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user > who started the SHS, can list all the applications; otherwise none of them can > be listed. This also affects the REST APIs that list the summary of all > apps or of one app. > * The per-application ACL, controlled by "spark.history.ui.acls.enabled". > With this enabled, only the history admin user and the user/group who ran an app can > access the details of that app. > With these two ACLs, we may encounter several unexpected behaviors: > 1. If the base URL's ACL is enabled but user "A" has no view permission, user "A" > cannot see the app list but can still access the details of their own app. > 2. If the base URL's ACL is disabled, user "A" can see the summary of > all the apps, even ones not run by user "A", but cannot access the details. > 3. The history admin ACL grants no permission to list all apps if the admin user is > not added to the base URL's ACL. > These unexpected behaviors arise mainly because we have two different ACLs; > ideally we should have only one to manage everything. > So to improve the SHS's ACL mechanism, we should: > * Unify the two different ACLs into one, and always honor it (both in the base > URL and app details).
> * Let a user partially list and display only the apps they ran, according to > the ACLs in the event log. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20239) Improve HistoryServer ACL mechanism
[ https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20239: Assignee: (was: Apache Spark) > Improve HistoryServer ACL mechanism > --- > > Key: SPARK-20239 > URL: https://issues.apache.org/jira/browse/SPARK-20239 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao > > The current SHS (Spark History Server) has two different ACLs: > * The ACL of the base URL, controlled by "spark.acls.enabled" or > "spark.ui.acls.enabled". With this enabled, only users configured in > "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user > who started the SHS, can list all the applications; otherwise none of them can > be listed. This also affects the REST APIs that list the summary of all > apps or of one app. > * The per-application ACL, controlled by "spark.history.ui.acls.enabled". > With this enabled, only the history admin user and the user/group who ran an app can > access the details of that app. > With these two ACLs, we may encounter several unexpected behaviors: > 1. If the base URL's ACL is enabled but user "A" has no view permission, user "A" > cannot see the app list but can still access the details of their own app. > 2. If the base URL's ACL is disabled, user "A" can see the summary of > all the apps, even ones not run by user "A", but cannot access the details. > 3. The history admin ACL grants no permission to list all apps if the admin user is > not added to the base URL's ACL. > These unexpected behaviors arise mainly because we have two different ACLs; > ideally we should have only one to manage everything. > So to improve the SHS's ACL mechanism, we should: > * Unify the two different ACLs into one, and always honor it (both in the base > URL and app details). > * Let a user partially list and display only the apps they ran, according to > the ACLs in the event log.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
yuhao yang created SPARK-20271: -- Summary: Add FuncTransformer to simplify custom transformer creation Key: SPARK-20271 URL: https://issues.apache.org/jira/browse/SPARK-20271 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Just to share some code I implemented to help easily create a custom Transformer in one line of code. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded like other transformers and can be integrated into a pipeline normally. It can be used widely in many use cases and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
[ https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20253. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17569 [https://github.com/apache/spark/pull/17569] > Remove unnecessary nullchecks of a return value from Spark runtime routines > in generated Java code > -- > > Key: SPARK-20253 > URL: https://issues.apache.org/jira/browse/SPARK-20253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > While we know several Spark runtime routines never return null (e.g., > {{UnsafeArrayData.toDoubleArray()}}), the generated code by Catalyst always > checks whether the return value is null or not. > Removing this null check reduces both the Java bytecode size and > the native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
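The shape of the redundant check, sketched in plain Java (illustrative only, not actual Catalyst output; {{toDoubleArray}} below is a stand-in that, like the real runtime routine, never returns null):

```java
public class NullcheckSketch {
    // Stand-in for a Spark runtime routine that never returns null.
    static double[] toDoubleArray() { return new double[] {1.0, 2.0}; }

    public static void main(String[] args) {
        // Before: generated code guards the call with a null check that can
        // never fire, inflating both the bytecode and the JIT-compiled code.
        double[] values = toDoubleArray();
        boolean isNull = (values == null); // provably always false
        double[] guarded = isNull ? null : values;

        // After SPARK-20253: call the routine directly, no check emitted.
        double[] direct = toDoubleArray();

        System.out.println(guarded.length == direct.length);
    }
}
```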
[jira] [Assigned] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
[ https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20253: --- Assignee: Kazuaki Ishizaki > Remove unnecessary nullchecks of a return value from Spark runtime routines > in generated Java code > -- > > Key: SPARK-20253 > URL: https://issues.apache.org/jira/browse/SPARK-20253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > While we know several Spark runtime routines never return null (e.g. > {{UnsafeArrayData.toDoubleArray()}}, the generated code by Catalyst always > checks whether the return value is null or not. > It is good to remove this nullcheck for reducing Java bytecode size and > reducing the native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962359#comment-15962359 ] Apache Spark commented on SPARK-20248: -- User 'shaolinliu' has created a pull request for this issue: https://github.com/apache/spark/pull/17581 > Spark SQL add limit parameter to enhance the reliability. > - > > Key: SPARK-20248 > URL: https://issues.apache.org/jira/browse/SPARK-20248 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 > Environment: 2.1.0 >Reporter: shaolinliu >Priority: Minor > > When using the Thrift server, it is difficult to constrain users' SQL > statements. > When a user queries a large table without a LIMIT, the Thrift server > process's memory consumption can make the service unstable. > In general, such queries are not correct usage, because if you really need to > return the whole table: > 1. if you use the data for computation, you can complete the computation in > the cluster and return only the result; > 2. if you want to obtain the data, you can store it in HDFS. > For this scenario, it is recommended to add a > "spark.sql.thriftserver.retainedResults" parameter: > 1. when it is 0, the user's operation is not restricted; > 2. when it is greater than 0, a query's own LIMIT is used if present; > otherwise this value limits the query's result. > The user's LIMIT takes priority because a user who specifies a LIMIT is, > in general, aware of the exact meaning of the query. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
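The proposed priority rule can be sketched as follows; only the parameter name comes from this issue, and the function itself is hypothetical:

```scala
// Hypothetical decision logic for the proposed
// "spark.sql.thriftserver.retainedResults" cap.
def effectiveLimit(userLimit: Option[Int], retainedResults: Int): Option[Int] =
  userLimit match {
    case Some(l)                     => Some(l)               // user's LIMIT wins
    case None if retainedResults > 0 => Some(retainedResults) // cap unbounded queries
    case None                        => None                  // 0 = no restriction
  }
```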
[jira] [Comment Edited] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962318#comment-15962318 ] Nan Zhu edited comment on SPARK-20251 at 4/10/17 12:16 AM: --- more details here: it is expected that the compute() method for the next batch is executed before the app shuts down; however, the app should eventually shut down, since we have signalled the awaiting condition set in awaitTermination(). However, this "eventual shutdown" did not happen... (this issue did not happen consistently) was (Author: codingcat): more details here, by "be proceeding", I mean it is expected that the compute() method for the next batch is executed before the app shuts down; however, the app should eventually shut down, since we have signalled the awaiting condition set in awaitTermination(). However, this "eventual shutdown" did not happen... (this issue did not happen consistently) > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params? > Screenshots from the Spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962318#comment-15962318 ] Nan Zhu commented on SPARK-20251: - more details here, by "be proceeding", I mean it is expected that the compute() method for the next batch is executed before the app shuts down; however, the app should eventually shut down, since we have signalled the awaiting condition set in awaitTermination(). However, this "eventual shutdown" did not happen... (this issue did not happen consistently) > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params? > Screenshots from the Spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962313#comment-15962313 ] Nan Zhu edited comment on SPARK-20251 at 4/9/17 11:57 PM: -- Why is this an invalid report? I have been observing the same behavior recently after upgrading to Spark 2.1. The basic idea (on my side) is that an exception thrown from the DStream.compute() method should shut down the app instead of proceeding (as the error handling in Spark Streaming releases the await lock set in awaitTermination()). I am still looking at the threads within Spark Streaming to see what is happening; can we change it back to a valid case and give me more time to investigate? was (Author: codingcat): Why is this an invalid report? I have been observing the same behavior recently after upgrading to Spark 2.1. The basic idea (on my side): an exception thrown from the DStream.compute() method should shut down the app instead of proceeding (as the error handling in Spark Streaming releases the await lock set in awaitTermination()). I am still looking at the threads within Spark Streaming to see what is happening; can we change it back to a valid case and give me more time to investigate? > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params?
> Screenshots from spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962313#comment-15962313 ] Nan Zhu commented on SPARK-20251: - Why is this an invalid report? I have been observing the same behavior recently after upgrading to Spark 2.1. The basic idea (on my side): an exception thrown from the DStream.compute() method should shut down the app instead of proceeding (as the error handling in Spark Streaming releases the await lock set in awaitTermination()). I am still looking at the threads within Spark Streaming to see what is happening; can we change it back to a valid case and give me more time to investigate? > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params? > Screenshots from the Spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
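For context, a sketch of the driver pattern under discussion (standard streaming API; the failing {{compute()}} lives inside the DStream graph, which is elided here):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: if a DStream's compute() throws, the streaming error handler
// signals the condition that awaitTermination() waits on, so the app is
// expected to shut down rather than silently skip the failed batch.
val ssc = new StreamingContext(conf, Seconds(10))
// ... build the DStream graph here ...
ssc.start()
try {
  ssc.awaitTermination() // released once an error is reported
} finally {
  ssc.stop(stopSparkContext = true, stopGracefully = false)
}
```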
[jira] [Assigned] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
[ https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20260: - Assignee: Vijay Krishna Ramesh Priority: Minor (was: Trivial) > MLUtils parseLibSVMRecord has incorrect string interpolation for error message > -- > > Key: SPARK-20260 > URL: https://issues.apache.org/jira/browse/SPARK-20260 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Vijay Krishna Ramesh >Assignee: Vijay Krishna Ramesh >Priority: Minor > Fix For: 2.1.2, 2.2.0 > > > There is missing string interpolation for the error message, which causes it > to not actually display the line that failed. See > https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
[ https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20260. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 17572 [https://github.com/apache/spark/pull/17572] > MLUtils parseLibSVMRecord has incorrect string interpolation for error message > -- > > Key: SPARK-20260 > URL: https://issues.apache.org/jira/browse/SPARK-20260 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Vijay Krishna Ramesh >Priority: Trivial > Fix For: 2.2.0, 2.1.2 > > > There is missing string interpolation for the error message, which causes it > to not actually display the line that failed. See > https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
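The bug class in miniature (illustrative Scala, not the actual MLUtils source): without the {{s}} prefix, Scala performs no interpolation, so the failing input never appears in the message.

```scala
val line = "1 1:0.5 bad-token"
// Missing interpolator: the message contains the literal text "$line".
val wrong = "Failed to parse line: $line"
// With the s-prefix, the offending input is actually shown.
val right = s"Failed to parse line: $line"
// wrong == "Failed to parse line: $line"
// right == "Failed to parse line: 1 1:0.5 bad-token"
```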
[jira] [Comment Edited] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962160#comment-15962160 ] Hyukjin Kwon edited comment on SPARK-20266 at 4/9/17 3:21 PM: -- I am resolving this as it sounds like a question, and questions should be asked on the user mailing list first. Maybe http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically is helpful. was (Author: hyukjin.kwon): I am resolving this as it sounds like a question, and questions should be asked on the user mailing list first. Maybe http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically is helpful. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20266. -- Resolution: Invalid I am resolving this as it sounds like a question, and questions should be asked on the user mailing list first. Maybe http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically is helpful. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962141#comment-15962141 ] Maciej Szymkiewicz commented on SPARK-10931: [~vlad.feinberg] It is worth noting that without {{parent}} some features (like {{CrossValidator}} or {{TrainValidationSplit}}) are crippled. > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests.
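The fix sketched in SPARK-10931 ("modifying Estimator.fit to copy Param values") and the {{parent}} dependency Maciej mentions can be illustrated with a minimal mock, not actual PySpark code: the `Params`, `Estimator`, and `Model` classes below are simplified stand-ins, and `copyParamsTo` is a hypothetical helper, but the shape of the fix is the same: `fit` copies the estimator's Param values onto the returned model and records itself as the model's parent, which is what meta-algorithms like CrossValidator rely on.

```python
# Illustrative sketch (NOT actual PySpark source) of copying Param values
# from an Estimator to the Model it produces, as proposed in SPARK-10931.

class Params:
    """Minimal stand-in for pyspark.ml.param.Params."""
    def __init__(self):
        self._paramMap = {}   # explicitly-set Param values
        self.parent = None    # the Estimator that produced this object

    def set(self, name, value):
        self._paramMap[name] = value
        return self

    def getOrDefault(self, name):
        return self._paramMap[name]

    def copyParamsTo(self, other):
        # Hypothetical helper: propagate every set Param value to `other`.
        other._paramMap.update(self._paramMap)
        return other


class Model(Params):
    pass


class Estimator(Params):
    def fit(self, dataset):
        model = self._fit(dataset)
        # The proposed fix: the model carries the estimator's Params and
        # knows its parent, so CrossValidator-style code can inspect both.
        self.copyParamsTo(model)
        model.parent = self
        return model

    def _fit(self, dataset):
        return Model()  # actual training elided


est = Estimator().set("maxIter", 10).set("regParam", 0.01)
model = est.fit(dataset=[])
print(model.getOrDefault("maxIter"), model.parent is est)  # → 10 True
```

Without the `copyParamsTo` call in `fit`, the returned model would have an empty param map and a `None` parent, which is exactly the crippled state described in the comment above.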
[jira] [Updated] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20269: --- Description: 1. When running the streaming Kafka examples, a Java word-count producer is currently missing, which makes it harder for Java developers to learn and test; add a class JavaKafkaWordCountProducer. 2. When running the JavaKafkaWordCount example, I find there is no Java word-count producer, while the KafkaWordCount example does have a Scala word-count producer. I think we should provide the corresponding example code to help Java developers learn and test. 3. My project team develops Spark applications mostly with Java and the Java API. was: When running the streaming Kafka examples, a Java word-count producer is currently missing, which makes it harder for Java developers to learn and test. > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > 1. When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test; add a class JavaKafkaWordCountProducer. > 2. When running the JavaKafkaWordCount example, I find there is no Java word-count producer, > while the KafkaWordCount example does have a Scala word-count producer. > I think we should provide the corresponding example code to help Java > developers learn and test. > 3. My project team develops Spark applications mostly with Java > and the Java API.
[jira] [Updated] (SPARK-20269) add java class 'JavaWordCountProducer' to provide a java word count producer
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20269: --- Summary: add java class 'JavaWordCountProducer' to provide a java word count producer (was: add JavaWordCountProducer in streaming examples) > add java class 'JavaWordCountProducer' to provide a java word count producer > - > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > 1. When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test; add a class JavaKafkaWordCountProducer. > 2. When running the JavaKafkaWordCount example, I find there is no Java word-count producer, > while the KafkaWordCount example does have a Scala word-count producer. > I think we should provide the corresponding example code to help Java > developers learn and test. > 3. My project team develops Spark applications mostly with Java > and the Java API.
[jira] [Resolved] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20268. --- Resolution: Not A Problem > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Reopened] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-20268: --- > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Commented] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962101#comment-15962101 ] guoxiaolongzte commented on SPARK-20269: https://github.com/apache/spark/pull/17578 is invalid; I have closed that PR. Please see https://github.com/apache/spark/pull/17580. Thank you. > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Closed] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hayri Volkan Agun closed SPARK-20268. - Resolution: Fixed > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Commented] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962098#comment-15962098 ] Hayri Volkan Agun commented on SPARK-20268: --- Hi Owen, if the first element is the fastest, let's close it. > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Commented] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962088#comment-15962088 ] Apache Spark commented on SPARK-20269: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17578 > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Assigned] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20269: Assignee: Apache Spark > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Assigned] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20269: Assignee: (was: Apache Spark) > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Assigned] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20270: Assignee: Apache Spark (was: DB Tsai) > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Critical > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes the member id > in Long in our application if the member id can not be represented by Double > losslessly when the member id is very big. > Here is an example how this happens, with > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code}. > Note that even the value is not null, Spark will cast the Long into Double > first. Then if it's not null, Spark will cast it back to Long which results > in losing precision. > The behavior should be that the original value should not be changed if it's > not null, but Spark will change the value which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
[jira] [Assigned] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20270: Assignee: DB Tsai (was: Apache Spark) > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Critical > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes the member id > in Long in our application if the member id can not be represented by Double > losslessly when the member id is very big. > Here is an example how this happens, with > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code}. > Note that even the value is not null, Spark will cast the Long into Double > first. Then if it's not null, Spark will cast it back to Long which results > in losing precision. > The behavior should be that the original value should not be changed if it's > not null, but Spark will change the value which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
[jira] [Commented] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962065#comment-15962065 ] Apache Spark commented on SPARK-20270: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/17577 > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Critical > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes the member id > in Long in our application if the member id can not be represented by Double > losslessly when the member id is very big. > Here is an example how this happens, with > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code}. > Note that even the value is not null, Spark will cast the Long into Double > first. Then if it's not null, Spark will cast it back to Long which results > in losing precision. > The behavior should be that the original value should not be changed if it's > not null, but Spark will change the value which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
[jira] [Created] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
DB Tsai created SPARK-20270: --- Summary: na.fill will change the values in long or integer when the default value is in double Key: SPARK-20270 URL: https://issues.apache.org/jira/browse/SPARK-20270 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: DB Tsai Assignee: DB Tsai Priority: Critical This bug was partially addressed in SPARK-18555, but the root cause isn't completely solved. This bug is pretty critical since it changes the Long member ids in our application whenever a member id is too large to be represented losslessly by a Double. Here is an example of how this happens; with {code} Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null), (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2), {code} the logical plan will be {code} == Analyzed Logical Plan == a: bigint, b: double Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241] +- Project [_1#229L AS a#232L, _2#230 AS b#233] +- LocalRelation [_1#229L, _2#230] {code} Note that even when the value is not null, Spark will first cast the Long to Double; then, if it is not null, Spark casts it back to Long, which loses precision. The original value should not be changed when it is not null, but Spark changes it, which is wrong. With the PR, the logical plan will be {code} == Analyzed Logical Plan == a: bigint, b: double Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] +- Project [_1#229L AS a#232L, _2#230 AS b#233] +- LocalRelation [_1#229L, _2#230] {code} which behaves correctly without changing the original Long values.
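The precision loss at the heart of SPARK-20270 can be demonstrated outside Spark with a minimal sketch: a 64-bit integer above 2**53 cannot round-trip through an IEEE-754 double, which is exactly what the pre-fix plan's `cast(a as double)` followed by `cast(... as bigint)` does. Plain Python floats are IEEE-754 doubles, so they show the same effect; this is an illustration of the arithmetic, not Spark code.

```python
# Round-tripping a large Long through a double, as the pre-fix na.fill
# plan does, silently changes the value: doubles have a 53-bit mantissa,
# so integers above 2**53 are rounded to the nearest representable value.

member_id = 9123146099426677101  # one of the Long values from the report

as_double = float(member_id)     # cast(a as double)
round_tripped = int(as_double)   # cast(... as bigint)

print(member_id == round_tripped)  # False: the id silently changed
print(round_tripped - member_id)   # non-zero drift introduced by the cast
print(member_id > 2**53)           # True: beyond double's exact-int range
```

The fixed plan avoids this by casting the *fill value* (0.2) to bigint once, and leaving non-null Long column values untouched.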
[jira] [Assigned] (SPARK-19991) FileSegmentManagedBuffer performance improvement.
[ https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19991: - Assignee: Sean Owen > FileSegmentManagedBuffer performance improvement. > - > > Key: SPARK-19991 > URL: https://issues.apache.org/jira/browse/SPARK-19991 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.0.2, 2.1.0 >Reporter: Guoqiang Li >Assignee: Sean Owen >Priority: Minor > Fix For: 2.2.0 > > > When we do not set the value of the configuration items > {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, > each call to the FileSegmentManagedBuffer.nioByteBuffer or > FileSegmentManagedBuffer.createInputStream method creates a > NoSuchElementException instance. This is a relatively expensive operation. > The shuffle-server thread's stack: > {noformat} > "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 > nid=0x28d12 runnable [0x7f71af93e000] >java.lang.Thread.State: RUNNABLE > at java.lang.Throwable.fillInStackTrace(Native Method) > at java.lang.Throwable.fillInStackTrace(Throwable.java:783) > - locked <0x0007a930f080> (a java.util.NoSuchElementException) > at java.lang.Throwable.(Throwable.java:265) > at java.lang.Exception.(Exception.java:66) > at java.lang.RuntimeException.(RuntimeException.java:62) > at > java.util.NoSuchElementException.(NoSuchElementException.java:57) > at > org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38) > at > org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31) > at > org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50) > at > org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157) > at > org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54) > at > 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728) > at > org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835) > at > org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017) > at > org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256) > at > org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) > at > org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) > at > 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at >
[jira] [Resolved] (SPARK-19991) FileSegmentManagedBuffer performance improvement.
[ https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19991. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17567 [https://github.com/apache/spark/pull/17567] > FileSegmentManagedBuffer performance improvement. > - > > Key: SPARK-19991 > URL: https://issues.apache.org/jira/browse/SPARK-19991 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.0.2, 2.1.0 >Reporter: Guoqiang Li >Priority: Minor > Fix For: 2.2.0 > > > When we do not set the value of the configuration items > {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, > each call to the FileSegmentManagedBuffer.nioByteBuffer or > FileSegmentManagedBuffer.createInputStream method creates a > NoSuchElementException instance. This is a relatively expensive operation. > The shuffle-server thread's stack: > {noformat} > "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 > nid=0x28d12 runnable [0x7f71af93e000] >java.lang.Thread.State: RUNNABLE > at java.lang.Throwable.fillInStackTrace(Native Method) > at java.lang.Throwable.fillInStackTrace(Throwable.java:783) > - locked <0x0007a930f080> (a java.util.NoSuchElementException) > at java.lang.Throwable.(Throwable.java:265) > at java.lang.Exception.(Exception.java:66) > at java.lang.RuntimeException.(RuntimeException.java:62) > at > java.util.NoSuchElementException.(NoSuchElementException.java:57) > at > org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38) > at > org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31) > at > org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50) > at > org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157) > at > org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132) > at > 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728) > at > org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835) > at > org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017) > at > org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256) > at > org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) > at > org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) > at > 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
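The performance pattern behind SPARK-19991 (in Java, the dominant cost is `Throwable.fillInStackTrace` running on every exception construction, as the stack above shows) can be sketched generically: a config lookup that signals "key absent" by throwing pays for an exception object on every call on the hot path, while an exception-free lookup with a default does not. The sketch below is not Spark's code; `get_with_exception` and `get_with_lookup` are hypothetical names illustrating the two designs, here in Python where exceptions are cheaper than in Java but the shape of the fix is the same.

```python
# Two ways to read an unset config key with a default. The first mirrors
# a ConfigProvider.get() that throws for missing keys (an exception object
# is built on every call); the second never constructs an exception.

import timeit

config = {}  # the keys in question are unset, as in the report

def get_with_exception(key, default):
    # throw-and-catch per lookup: the pattern the JIRA flags as costly
    try:
        return config[key]
    except KeyError:
        return default

def get_with_lookup(key, default):
    # exception-free path: a plain presence check with a default
    return config.get(key, default)

slow = timeit.timeit(lambda: get_with_exception("spark.shuffle.io.lazyFD", True), number=100_000)
fast = timeit.timeit(lambda: get_with_lookup("spark.shuffle.io.lazyFD", True), number=100_000)
print(f"exception path: {slow:.3f}s, plain lookup: {fast:.3f}s")
```

Both functions return the same value; the fix merged in PR 17567 amounts to moving the shuffle server off the throwing path so no throwable (and no stack capture) is created per fetch.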
[jira] [Created] (SPARK-20269) add JavaWordCountProducer in streaming examples
guoxiaolongzte created SPARK-20269: -- Summary: add JavaWordCountProducer in streaming examples Key: SPARK-20269 URL: https://issues.apache.org/jira/browse/SPARK-20269 Project: Spark Issue Type: Improvement Components: Examples, Structured Streaming Affects Versions: 2.1.0 Reporter: guoxiaolongzte Priority: Minor When running the streaming Kafka examples, a Java word-count producer is currently missing, which makes it harder for Java developers to learn and test.