[jira] [Created] (SPARK-20273) No non-deterministic Filter push-down into Join Conditions
Xiao Li created SPARK-20273: --- Summary: No non-deterministic Filter push-down into Join Conditions Key: SPARK-20273 URL: https://issues.apache.org/jira/browse/SPARK-20273 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Assignee: Xiao Li
{noformat}
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
{noformat}
We will get the following error:
{noformat}
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
{noformat}
Filters can be pushed down into join conditions by the optimizer rule {{PushPredicateThroughJoin}}. However, the analyzer blocks users from adding non-deterministic join conditions (for details, see the PR https://github.com/apache/spark/pull/7535). We should not push down non-deterministic conditions; otherwise, we should allow users to do so by explicitly initializing the non-deterministic expressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20273) Disallow Non-deterministic Filter push-down into Join Conditions
[ https://issues.apache.org/jira/browse/SPARK-20273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20273: Summary: Disallow Non-deterministic Filter push-down into Join Conditions (was: No non-deterministic Filter push-down into Join Conditions) > Disallow Non-deterministic Filter push-down into Join Conditions > > > Key: SPARK-20273 > URL: https://issues.apache.org/jira/browse/SPARK-20273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b > having r > 0.5").show() > {noformat} > We will get the following error: > {noformat} > Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > {noformat} > Filters can be pushed down into join conditions by the optimizer rule > {{PushPredicateThroughJoin}}. However, the analyzer blocks users from adding > non-deterministic conditions (for details, see the PR > https://github.com/apache/spark/pull/7535). 
> We should not push down non-deterministic conditions; otherwise, we should > allow users to do so by explicitly initializing the non-deterministic > expressions
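To illustrate why the optimizer must not push a non-deterministic filter below the join, here is a plain-Python sketch (not Spark code; the tables and the deterministic stand-in for {{rand(0)}} are invented for illustration). Pushing the predicate down changes how many times the expression is evaluated, so a seeded generator yields a different result set:

```python
class FakeRand:
    """Deterministic stand-in for a seeded rand(0): returns a fixed sequence."""
    SEQ = [0.9, 0.1, 0.6, 0.2, 0.8, 0.3, 0.7, 0.4]

    def __init__(self):
        self.i = 0

    def random(self):
        v = self.SEQ[self.i % len(self.SEQ)]
        self.i += 1
        return v

left = [1, 2, 3, 4]
right = ["a", "b"]

# Plan A: join first, then filter -- the non-deterministic predicate is
# evaluated once per joined row (8 draws here).
rng = FakeRand()
after_join = [(l, r) for l in left for r in right if rng.random() > 0.5]

# Plan B: the "optimized" plan that pushes the same predicate below the
# join -- now it is evaluated once per left row (only 4 draws), so a
# different set of rows survives even with the identical seed.
rng = FakeRand()
pushed = [l for l in left if rng.random() > 0.5]
after_push = [(l, r) for l in pushed for r in right]

print(after_join)  # [(1, 'a'), (2, 'a'), (3, 'a'), (4, 'a')]
print(after_push)  # [(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b')]
```

With the predicate kept above the join it is drawn once per joined row; pushed below, once per left row, so the two plans disagree even though the seed is the same. This is why non-deterministic conditions must stay where the user wrote them.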
[jira] [Created] (SPARK-20272) about graph shortestPath algorithm question
huangjunjun created SPARK-20272: --- Summary: about graph shortestPath algorithm question Key: SPARK-20272 URL: https://issues.apache.org/jira/browse/SPARK-20272 Project: Spark Issue Type: Question Components: GraphX Affects Versions: 2.1.0 Reporter: huangjunjun We all know that a shortest-path algorithm should compute the distance between a source vertex and a destination vertex. In fact, the ShortestPaths algorithm in GraphX computes the smallest number of vertices passed through (the hop count) from source to destination, ignoring edge weights.
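The difference the reporter describes can be shown outside Spark. A minimal sketch (plain Python rather than GraphX; the graph and its weights are invented) contrasting hop counting, which is what GraphX's ShortestPaths returns, with weight-aware Dijkstra:

```python
from collections import deque
import heapq

# Directed graph: vertex -> [(neighbor, weight)]. The direct A->C edge
# is "short" in hops but "long" in weight.
graph = {
    "A": [("C", 10.0), ("B", 1.0)],
    "B": [("C", 1.0)],
    "C": [],
}

def hop_count(src, dst):
    """BFS: fewest edges traversed, ignoring weights (what ShortestPaths computes)."""
    seen, q = {src: 0}, deque([src])
    while q:
        v = q.popleft()
        for n, _w in graph[v]:
            if n not in seen:
                seen[n] = seen[v] + 1
                q.append(n)
    return seen.get(dst)

def weighted_distance(src, dst):
    """Dijkstra: minimum total edge weight."""
    dist, heap = {}, [(0.0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in dist:
            continue
        dist[v] = d
        for n, w in graph[v]:
            if n not in dist:
                heapq.heappush(heap, (d + w, n))
    return dist.get(dst)

print(hop_count("A", "C"))          # 1 hop, via the direct A->C edge
print(weighted_distance("A", "C"))  # 2.0, via A -> B -> C
```

The two notions of "shortest" disagree on this graph, which is exactly the surprise the reporter hit: GraphX's ShortestPaths answers the first question, not the second.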
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962455#comment-15962455 ] Saisai Shao commented on SPARK-16742: - Hi [~mgummelt], I'm working on the design of SPARK-19143. Looking at your comments, I think parts of the work overlap, especially the RPC part to propagate credentials. Here is my current WIP design (https://docs.google.com/document/d/1Y8CY3XViViTYiIQO9ySoid0t9q3H163fmroCV1K3NTk/edit?usp=sharing). In my current design I offer a standard RPC solution to support different cluster managers. It would be great if we could collaborate to meet the same goal. My main concern is that if Mesos's implementation is quite different from YARN's, it will require more effort to align the different cluster managers; if your proposal is similar to what I proposed here, then my work can be based on yours. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Resolved] (SPARK-20229) add semanticHash to QueryPlan
[ https://issues.apache.org/jira/browse/SPARK-20229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20229. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17541 [https://github.com/apache/spark/pull/17541] > add semanticHash to QueryPlan > - > > Key: SPARK-20229 > URL: https://issues.apache.org/jira/browse/SPARK-20229 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962450#comment-15962450 ] Michael Gummelt commented on SPARK-16742: - Also, note that the above Mesos implementation is not dependent on Mesos in any way. It just uses Spark's existing RPC mechanisms to transmit delegation tokens. I see that there's a related effort here to standardize this RPC mechanism: https://issues.apache.org/jira/browse/SPARK-19143. We'd be more than happy to adopt that standard once it exists. But hopefully our one-off RPC that we're currently using is acceptable in the interim. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962440#comment-15962440 ] Michael Gummelt edited comment on SPARK-16742 at 4/10/17 5:28 AM: -- Hi [~vanzin], [~ganger85] and Strat.io are pulling back their Mesos Kerberos implementation for now, and we at Mesosphere are about to submit a PR to upstream our implementation. I have a few questions I'd like to run by you to make sure that PR goes smoothly. 1) I've been following your comments on this Spark Standalone Kerberos PR: https://github.com/apache/spark/pull/17530. It looks like your concern is that in *cluster mode*, the keytab is written to a file on the host running the driver, and is owned by the user of the Spark Worker, which will be the same for each job. So jobs submitted by multiple users will be able to read each other's keytabs. In *client mode*, it looks like the delegation tokens are written to a file (HADOOP_TOKEN_FILE_LOCATION) on the host running the executor, which suffers from the same problem as the keytab in cluster mode. The problem is then that a kerberos-authenticated user submitting their job would be unaware that their credentials are being leaked to other users. Is this an accurate description of the issue? 2) I understand that YARN writes delegation tokens via {{amContainer.setTokens()}}, which ultimately results in the delegation token being written to a file owned by the submitting user. However, since the "submitting user" is a Kerberos user, not a Unix user, I'm assuming that {{hadoop.security.auth_to_local}} is what maps the Kerberos user to the Unix user who runs the ApplicationMaster and owns that file. Is that correct? To avoid the shared-file problem for delegation tokens, our Mesos implementation currently has the Executor issue an RPC call to fetch the delegation token from the driver. 
There therefore isn't any need for at-rest access control, and if in-motion interception is in the user's threat model, then users can be sure to run Spark with SSL. We avoid the shared-file problem for keytabs entirely, because there's no need to distribute the keytab, at least in client mode. Unlike YARN, the driver and the equivalent of the "ApplicationMaster" in Mesos are one and the same. They both exist in the same process, the {{spark-submit}} process. We're probably going to punt on cluster mode for now, just for simplicity, but we should be able to solve this in cluster mode as well, because unlike standalone, and much like YARN, Mesos controls what user the driver runs as. What do you think of the above approach? If you see any blockers, I would very much appreciate teasing those out now rather than during the PR. Thanks! was (Author: mgummelt): Hi [~vanzin], [~ganger85] and Strat.io are pulling back their Mesos Kerberos implementation for now, and we at Mesosphere are about to submit a PR to upstream our implementation. I have a few questions I'd like to run by you to make sure that PR goes smoothly. 1) I've been following your comments on this Spark Standalone Kerberos PR: https://github.com/apache/spark/pull/17530. It looks like your concern is that in *cluster mode*, the keytab is written to a file on the host running the driver, and is owned by the user of the Spark Worker, which will be the same for each job. So jobs submitted by multiple users will be able to read each other's keytabs. In *client mode*, it looks like the delegation tokens are written to a file (HADOOP_TOKEN_FILE_LOCATION) on the host running the executor, which suffers from the same problem as the keytab in cluster mode. The problem is then that a kerberos-authenticated user submitting their job would be unaware that their credentials are being leaked to other users. Is this an accurate description of the issue? 
2) I understand that YARN writes delegation tokens via {{amContainer.setTokens()}}, which ultimately results in the delegation token being written to a file owned by the submitting user. However, since the "submitting user" is a Kerberos user, not a Unix user, I'm assuming that {{hadoop.security.auth_to_local}} is what maps the Kerberos user to the Unix user who runs the ApplicationMaster and owns that file. Is that correct? To avoid the shared-file problem for delegation tokens, our Mesos implementation currently has the Executor issue an RPC call to fetch the delegation token from the driver. There therefore isn't any need for at-rest encryption, and if in-motion encryption is in the user's threat model, then users can be sure to run Spark with SSL. We avoid the shared-file problem for keytabs entirely, because there's no need to distribute the keytab, at least in client mode. Unlike YARN, the driver and the equivalent of the "ApplicationMaster" in Mesos are one and the same. They both exist in the same process,
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962440#comment-15962440 ] Michael Gummelt commented on SPARK-16742: - Hi [~vanzin], [~ganger85] and Strat.io are pulling back their Mesos Kerberos implementation for now, and we at Mesosphere are about to submit a PR to upstream our implementation. I have a few questions I'd like to run by you to make sure that PR goes smoothly. 1) I've been following your comments on this Spark Standalone Kerberos PR: https://github.com/apache/spark/pull/17530. It looks like your concern is that in *cluster mode*, the keytab is written to a file on the host running the driver, and is owned by the user of the Spark Worker, which will be the same for each job. So jobs submitted by multiple users will be able to read each other's keytabs. In *client mode*, it looks like the delegation tokens are written to a file (HADOOP_TOKEN_FILE_LOCATION) on the host running the executor, which suffers from the same problem as the keytab in cluster mode. The problem is then that a kerberos-authenticated user submitting their job would be unaware that their credentials are being leaked to other users. Is this an accurate description of the issue? 2) I understand that YARN writes delegation tokens via {{amContainer.setTokens()}}, which ultimately results in the delegation token being written to a file owned by the submitting user. However, since the "submitting user" is a Kerberos user, not a Unix user, I'm assuming that {{hadoop.security.auth_to_local}} is what maps the Kerberos user to the Unix user who runs the ApplicationMaster and owns that file. Is that correct? To avoid the shared-file problem for delegation tokens, our Mesos implementation currently has the Executor issue an RPC call to fetch the delegation token from the driver. 
There therefore isn't any need for at-rest encryption, and if in-motion encryption is in the user's threat model, then users can be sure to run Spark with SSL. We avoid the shared-file problem for keytabs entirely, because there's no need to distribute the keytab, at least in client mode. Unlike YARN, the driver and the equivalent of the "ApplicationMaster" in Mesos are one and the same. They both exist in the same process, the {{spark-submit}} process. We're probably going to punt on cluster mode for now, just for simplicity, but we should be able to solve this in cluster mode as well, because unlike standalone, and much like YARN, Mesos controls what user the driver runs as. What do you think of the above approach? If you see any blockers, I would very much appreciate teasing those out now rather than during the PR. Thanks! > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
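The token-over-RPC idea discussed above can be sketched in miniature (plain Python sockets, not Spark's RPC layer; the token bytes and the single-connection setup are invented for illustration): the driver keeps the delegation token in memory and hands it out over a connection on request, so nothing sensitive ever rests in a file shared between users.

```python
import socket
import threading

# The delegation token lives only in driver memory (value is made up).
DELEGATION_TOKEN = b"HDFS_DELEGATION_TOKEN:example"

def driver(server_sock):
    # Accept one "executor" connection and send the token over the wire.
    conn, _addr = server_sock.accept()
    with conn:
        conn.sendall(DELEGATION_TOKEN)

# Driver side: listen on an ephemeral localhost port.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=driver, args=(srv,), daemon=True).start()

# Executor side: fetch the token over the connection instead of reading
# a shared on-disk file such as HADOOP_TOKEN_FILE_LOCATION.
with socket.create_connection(("127.0.0.1", port)) as c:
    token = c.recv(1024)

print(token == DELEGATION_TOKEN)
```

In the real design the channel would of course be authenticated and, per the comment above, encrypted with SSL if in-motion interception is in the threat model; the point of the sketch is only that a fetch-on-demand token removes the at-rest shared-file problem.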
[jira] [Resolved] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-20270. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17577 [https://github.com/apache/spark/pull/17577] > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Critical > Fix For: 2.2.0 > > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes Long member > ids in our application when the id is so big that it cannot be represented > losslessly by a Double. > Here is an example of how this happens. With > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > Note that even when the value is not null, Spark will cast the Long into > Double first; then, if it's not null, Spark will cast it back to Long, which > results in losing precision. > The original value should not be changed if it's not null, but Spark > changes the value, which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
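The precision loss behind this bug is visible with plain Python floats, which are the same 64-bit IEEE-754 doubles Spark casts through (the id below is one of the values from the report). A double has a 53-bit significand, so integers near 2^63 are only representable in steps of 2^10 = 1024:

```python
# A 63-bit id cannot be represented exactly by a 64-bit double, so the
# Long -> Double -> Long round trip that the old na.fill plan performed
# silently changes the value.
member_id = 9123146099426677101          # id taken from the bug report
round_tripped = int(float(member_id))    # what cast(cast(a as double) as bigint) does

print(member_id)
print(round_tripped)
print(member_id == round_tripped)        # False: the id was silently altered
```

This is exactly why the fixed plan casts the *fill value* to bigint instead of casting the column to double: the non-null Long values are then never touched.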
[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962422#comment-15962422 ] Hyukjin Kwon commented on SPARK-20193: -- Maybe, yeah, but I guess we can't change the method signature as it breaks binary compatibility. > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Priority: Minor > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at 
org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 39 elided > {quote}
[jira] [Closed] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry.X.He closed SPARK-20266. -- > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Commented] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962415#comment-15962415 ] Jerry.X.He commented on SPARK-20266: [~hyukjin.kwon] thank you. I don't think it is Spark's problem either, so tomorrow I will reinstall another version of Spark; if the same question is still here, there may be some problem in my environment ... I will report this question in the appropriate channel, thank you .. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Commented] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962407#comment-15962407 ] Hyukjin Kwon commented on SPARK-20266: -- Please refer to "Mailing Lists" in http://spark.apache.org/community.html. Subscribe to the mailing list and send an email to the address. I resolved this JIRA as it is apparently a question and it does not look like a Spark problem. It might be an issue, but judging from the details in the current JIRA, nothing indicates it is an issue within Spark. If you are pretty sure that it is an issue in Spark, please reopen with more details. Otherwise, I guess asking first is better. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Resolved] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20259. -- Resolution: Duplicate Actually, the title refers to pushing down the join. I am resolving this. > Support push down join optimizations in DataFrameReader when loading from JDBC > -- > > Key: SPARK-20259 > URL: https://issues.apache.org/jira/browse/SPARK-20259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: John Muller >Priority: Minor > > Given two dataframes loaded from the same JDBC connection: > {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} > val ordersDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.orders") > .option("user", "username") > .option("password", "password") > .load().toDS > > val productDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.product") > .option("user", "username") > .option("password", "password") > .load().toDS > > ordersDF.createOrReplaceTempView("orders") > productDF.createOrReplaceTempView("product") > // Followed by a join between them: > val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o > INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") > {code} > Catalyst should optimize the query to be: > SELECT northwind.product.name, SUM(northwind.orders.qty) > FROM northwind.orders > INNER JOIN northwind.product ON > northwind.orders.product_id = northwind.product.product_id > GROUP BY p.name
[jira] [Commented] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962404#comment-15962404 ] Hyukjin Kwon commented on SPARK-20259: -- If so, I guess it is a duplicate of SPARK-12449. I'd close this if it does not get updated for a while, say a couple of weeks, assuming it refers to pushing down the join. > Support push down join optimizations in DataFrameReader when loading from JDBC > -- > > Key: SPARK-20259 > URL: https://issues.apache.org/jira/browse/SPARK-20259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: John Muller >Priority: Minor > > Given two dataframes loaded from the same JDBC connection: > {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} > val ordersDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.orders") > .option("user", "username") > .option("password", "password") > .load().toDS > > val productDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.product") > .option("user", "username") > .option("password", "password") > .load().toDS > > ordersDF.createOrReplaceTempView("orders") > productDF.createOrReplaceTempView("product") > // Followed by a join between them: > val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o > INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") > {code} > Catalyst should optimize the query to be: > SELECT northwind.product.name, SUM(northwind.orders.qty) > FROM northwind.orders > INNER JOIN northwind.product ON > northwind.orders.product_id = northwind.product.product_id > GROUP BY p.name
[jira] [Commented] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962401#comment-15962401 ] Jerry.X.He commented on SPARK-20266: [~hyukjin.kwon] I'm sorry, I didn't know how to ask questions; I saw that feedback could be submitted here, so I submitted it here. Sorry, could you tell me where the "user mailing list" is? I'm a green hand, thank you. I had searched those posts before, and they did not fix this problem. Here are some logs of tests in my cluster; or maybe I am thinking about this wrongly, please help me check them.
1. ufw status and ssh connectivity
root@master:/usr/local/ProgramFiles# ufw status
Status: inactive
root@master:/usr/local/ProgramFiles# ssh slave01
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-62-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Last login: Sat Apr 8 21:33:44 2017 from 192.168.0.119
root@slave01:~# ufw status
Status: inactive
root@slave01:~# ssh slave02
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-62-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Last login: Sat Apr 8 21:10:33 2017 from 192.168.0.119
root@slave02:~# ufw status
Status: inactive
root@slave02:~#
2. network connectivity by ip or FQDN
2.1. nc in master
root@master:/usr/local/ProgramFiles# netcat -l 12306
root@master:/usr/local/ProgramFiles# nc -l 12306
root@master:/usr/local/ProgramFiles# nc -l 12306
root@master:/usr/local/ProgramFiles# nc -l 12306
2.2. nc in slave01
root@slave01:~# nc -vz 192.168.0.180 12306
Connection to 192.168.0.180 12306 port [tcp/*] succeeded!
root@slave01:~# nc -vz master 12306
Connection to master 12306 port [tcp/*] succeeded!
2.3. nc in slave02
root@slave02:/usr/local/ProgramFiles# nc -vz 192.168.0.180 12306
Connection to 192.168.0.180 12306 port [tcp/*] succeeded!
root@slave02:/usr/local/ProgramFiles# nc -vz master 12306
Connection to master 12306 port [tcp/*] succeeded!
root@slave02:/usr/local/ProgramFiles#
> ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > >
[jira] [Resolved] (SPARK-20264) asm should be non-test dependency in sql/core
[ https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20264. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 > asm should be non-test dependency in sql/core > - > > Key: SPARK-20264 > URL: https://issues.apache.org/jira/browse/SPARK-20264 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.2, 2.2.0 > > > The sql/core module currently declares asm as a test-scope dependency. > Transitively it should actually be a normal dependency, since the core > module itself defines it. This occasionally confuses IntelliJ. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
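The fix described above amounts to a one-line scope change in the Maven build. A hypothetical sketch (the artifact coordinates below are assumptions for illustration, not copied from the actual sql/core POM):

```xml
<!-- sql/core/pom.xml (sketch): drop the test scope so asm becomes a normal
     compile-scope dependency, matching how the core module already uses it. -->
<dependency>
  <groupId>org.apache.xbean</groupId>
  <artifactId>xbean-asm5-shaded</artifactId>
  <!-- was: <scope>test</scope> — removed per SPARK-20264 -->
</dependency>
```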
[jira] [Commented] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962400#comment-15962400 ] Xiao Li commented on SPARK-20259: - Pushing join into JDBC data sources? > Support push down join optimizations in DataFrameReader when loading from JDBC > -- > > Key: SPARK-20259 > URL: https://issues.apache.org/jira/browse/SPARK-20259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: John Muller >Priority: Minor > > Given two DataFrames loaded from the same JDBC connection: > {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid} > val ordersDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.orders") > .option("user", "username") > .option("password", "password") > .load() > > val productDF = spark.read > .format("jdbc") > .option("url", "jdbc:postgresql:dbserver") > .option("dbtable", "northwind.product") > .option("user", "username") > .option("password", "password") > .load() > > ordersDF.createOrReplaceTempView("orders") > productDF.createOrReplaceTempView("product") > // Followed by a join between them: > val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o > INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name") > {code} > Catalyst should optimize the query to be: > SELECT northwind.product.name, SUM(northwind.orders.qty) > FROM northwind.orders > INNER JOIN northwind.product ON > northwind.orders.product_id = northwind.product.product_id > GROUP BY northwind.product.name -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
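Until such a pushdown exists, the join can be pushed to the database by hand, since the JDBC source accepts a subquery as the {{dbtable}} option. A sketch under that approach (connection details are the placeholders from the example above):

```scala
// Manual workaround: let the database execute the join by wrapping the whole
// query in the "dbtable" option. Spark then reads only the aggregated rows.
val ordersByProduct = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable",
    """(SELECT p.name, SUM(o.qty) AS qty
      |   FROM northwind.orders AS o
      |   INNER JOIN northwind.product AS p ON o.product_id = p.product_id
      |  GROUP BY p.name) AS t""".stripMargin)
  .option("user", "username")
  .option("password", "password")
  .load()
```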
[jira] [Updated] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20271: --- Description: Just to share some code I implemented to help easily create a custom Transformer in one line of code. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded like other transformers and can be integrated into a pipeline normally. It can be used widely in many use cases like conditional conversion (if...else...), type conversion, to/from Array, to/from Vector, and many string ops. was: Just to share some code I implemented to help easily create a custom Transformer in one line of code. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded like other transformers and can be integrated into a pipeline normally. It can be used widely in many use cases and you can find some examples in the PR. > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > like conditional conversion (if...else...), type conversion, to/from Array, > to/from Vector, and many string ops.
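For context, a sketch of how such a transformer might sit in a pipeline. Only {{FuncTransformer}} and its constructor come from the proposal; the {{setInputCol}}/{{setOutputCol}} setters and column names are assumed here for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical usage of the proposed one-line transformer.
val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1.0 else 0.0)
  .setInputCol("rawLabel")
  .setOutputCol("label")

// It would participate in a Pipeline like any other stage, and be
// persisted along with the fitted PipelineModel.
val pipeline = new Pipeline()
  .setStages(Array(labelConverter, new LogisticRegression()))
```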
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20271: Assignee: Apache Spark > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Apache Spark > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20271: Assignee: (was: Apache Spark) > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962398#comment-15962398 ] Apache Spark commented on SPARK-20271: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/17583 > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded like other transformers and can be > integrated into a pipeline normally. It can be used widely in many use cases > and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20239) Improve HistoryServer ACL mechanism
[ https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20239: Assignee: Apache Spark > Improve HistoryServer ACL mechanism > --- > > Key: SPARK-20239 > URL: https://issues.apache.org/jira/browse/SPARK-20239 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Apache Spark > > The current SHS (Spark History Server) has two different ACLs: > * The ACL of the base URL, controlled by "spark.acls.enabled" or > "spark.ui.acls.enabled". With this enabled, only users configured in > "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user > who started the SHS, can list all the applications; otherwise none of them can > be listed. This also affects the REST APIs that list the summary of all > apps or of one app. > * The per-application ACL, controlled by "spark.history.ui.acls.enabled". > With this enabled, only the history admin user and the user/group who ran an app can > access the details of that app. > With these two ACLs, we may encounter several unexpected behaviors: > 1. If the base URL's ACL is enabled but user "A" has no view permission, user "A" > cannot see the app list but can still access the details of their own app. > 2. If the base URL's ACL is disabled, user "A" can see the summary of > all the apps, even ones not run by user "A", but cannot access the details. > 3. The history admin ACL grants no permission to list all apps if the admin user is > not added to the base URL's ACL. > These unexpected behaviors arise mainly because we have two different ACLs; > ideally we should have only one to manage everything. > So to improve the SHS's ACL mechanism, we should: > * Unify the two different ACLs into one, and always honor it (both in the base > URL and app details). > * Let a user partially list and display only the apps they ran, according to > the ACLs in the event log.
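The two ACL families can be seen side by side in a spark-defaults.conf sketch; the property names are the ones quoted above, and the user names are illustrative:

```properties
# Base-URL ACL: gates the application list and the listing REST APIs.
spark.acls.enabled            true
spark.admin.acls              admin_user
spark.ui.view.acls            user_a

# Per-application ACL: gates the detail pages of each application.
spark.history.ui.acls.enabled true
```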
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20239) Improve HistoryServer ACL mechanism
[ https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962394#comment-15962394 ] Apache Spark commented on SPARK-20239: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/17582 > Improve HistoryServer ACL mechanism > --- > > Key: SPARK-20239 > URL: https://issues.apache.org/jira/browse/SPARK-20239 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao > > The current SHS (Spark History Server) has two different ACLs: > * The ACL of the base URL, controlled by "spark.acls.enabled" or > "spark.ui.acls.enabled". With this enabled, only users configured in > "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user > who started the SHS, can list all the applications; otherwise none of them can > be listed. This also affects the REST APIs that list the summary of all > apps or of one app. > * The per-application ACL, controlled by "spark.history.ui.acls.enabled". > With this enabled, only the history admin user and the user/group who ran an app can > access the details of that app. > With these two ACLs, we may encounter several unexpected behaviors: > 1. If the base URL's ACL is enabled but user "A" has no view permission, user "A" > cannot see the app list but can still access the details of their own app. > 2. If the base URL's ACL is disabled, user "A" can see the summary of > all the apps, even ones not run by user "A", but cannot access the details. > 3. The history admin ACL grants no permission to list all apps if the admin user is > not added to the base URL's ACL. > These unexpected behaviors arise mainly because we have two different ACLs; > ideally we should have only one to manage everything. > So to improve the SHS's ACL mechanism, we should: > * Unify the two different ACLs into one, and always honor it (both in the base > URL and app details).
> * Let a user partially list and display only the apps they ran, according to > the ACLs in the event log. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20239) Improve HistoryServer ACL mechanism
[ https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20239: Assignee: (was: Apache Spark) > Improve HistoryServer ACL mechanism > --- > > Key: SPARK-20239 > URL: https://issues.apache.org/jira/browse/SPARK-20239 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao > > The current SHS (Spark History Server) has two different ACLs: > * The ACL of the base URL, controlled by "spark.acls.enabled" or > "spark.ui.acls.enabled". With this enabled, only users configured in > "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user > who started the SHS, can list all the applications; otherwise none of them can > be listed. This also affects the REST APIs that list the summary of all > apps or of one app. > * The per-application ACL, controlled by "spark.history.ui.acls.enabled". > With this enabled, only the history admin user and the user/group who ran an app can > access the details of that app. > With these two ACLs, we may encounter several unexpected behaviors: > 1. If the base URL's ACL is enabled but user "A" has no view permission, user "A" > cannot see the app list but can still access the details of their own app. > 2. If the base URL's ACL is disabled, user "A" can see the summary of > all the apps, even ones not run by user "A", but cannot access the details. > 3. The history admin ACL grants no permission to list all apps if the admin user is > not added to the base URL's ACL. > These unexpected behaviors arise mainly because we have two different ACLs; > ideally we should have only one to manage everything. > So to improve the SHS's ACL mechanism, we should: > * Unify the two different ACLs into one, and always honor it (both in the base > URL and app details). > * Let a user partially list and display only the apps they ran, according to > the ACLs in the event log.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
yuhao yang created SPARK-20271: -- Summary: Add FuncTransformer to simplify custom transformer creation Key: SPARK-20271 URL: https://issues.apache.org/jira/browse/SPARK-20271 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Just to share some code I implemented to help easily create a custom Transformer in one line of code. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded like other transformers and can be integrated into a pipeline normally. It can be used widely in many use cases and you can find some examples in the PR. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
[ https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20253. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17569 [https://github.com/apache/spark/pull/17569] > Remove unnecessary nullchecks of a return value from Spark runtime routines > in generated Java code > -- > > Key: SPARK-20253 > URL: https://issues.apache.org/jira/browse/SPARK-20253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > While we know several Spark runtime routines never return null (e.g., > {{UnsafeArrayData.toDoubleArray()}}), the generated code by Catalyst always > checks whether the return value is null or not. > Removing this null check reduces both the Java bytecode size and > the native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
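The shape of the redundant check, sketched in plain Java (illustrative only, not actual Catalyst output; {{toDoubleArray}} below is a stand-in that, like the real runtime routine, never returns null):

```java
public class NullcheckSketch {
    // Stand-in for a Spark runtime routine that never returns null.
    static double[] toDoubleArray() { return new double[] {1.0, 2.0}; }

    public static void main(String[] args) {
        // Before: generated code guards the call with a null check that can
        // never fire, inflating both the bytecode and the JIT-compiled code.
        double[] values = toDoubleArray();
        boolean isNull = (values == null); // provably always false
        double[] guarded = isNull ? null : values;

        // After SPARK-20253: call the routine directly, no check emitted.
        double[] direct = toDoubleArray();

        System.out.println(guarded.length == direct.length);
    }
}
```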
[jira] [Assigned] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code
[ https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20253: --- Assignee: Kazuaki Ishizaki > Remove unnecessary nullchecks of a return value from Spark runtime routines > in generated Java code > -- > > Key: SPARK-20253 > URL: https://issues.apache.org/jira/browse/SPARK-20253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > While we know several Spark runtime routines never return null (e.g. > {{UnsafeArrayData.toDoubleArray()}}, the generated code by Catalyst always > checks whether the return value is null or not. > It is good to remove this nullcheck for reducing Java bytecode size and > reducing the native code size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20248) Spark SQL add limit parameter to enhance the reliability.
[ https://issues.apache.org/jira/browse/SPARK-20248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962359#comment-15962359 ] Apache Spark commented on SPARK-20248: -- User 'shaolinliu' has created a pull request for this issue: https://github.com/apache/spark/pull/17581 > Spark SQL add limit parameter to enhance the reliability. > - > > Key: SPARK-20248 > URL: https://issues.apache.org/jira/browse/SPARK-20248 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 > Environment: 2.1.0 >Reporter: shaolinliu >Priority: Minor > > When using the Thrift server, it is difficult to constrain users' SQL > statements. > When a user queries a large table without a LIMIT, the Thrift server > process's memory consumption can make the service unstable. > In general, such queries are not correct usage, because if you really need to > return the whole table: > 1. if you use the data for computation, you can complete the computation in > the cluster and return only the result; > 2. if you want to obtain the data, you can store it in HDFS. > For this scenario, it is recommended to add a > "spark.sql.thriftserver.retainedResults" parameter: > 1. when it is 0, the user's operation is not restricted; > 2. when it is greater than 0, a query's own LIMIT is used if present; > otherwise this value limits the query's result. > The user's LIMIT takes priority because a user who specifies a LIMIT is, > in general, aware of the exact meaning of the query. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
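The proposed priority rule can be sketched as follows; only the parameter name comes from this issue, and the function itself is hypothetical:

```scala
// Hypothetical decision logic for the proposed
// "spark.sql.thriftserver.retainedResults" cap.
def effectiveLimit(userLimit: Option[Int], retainedResults: Int): Option[Int] =
  userLimit match {
    case Some(l)                     => Some(l)               // user's LIMIT wins
    case None if retainedResults > 0 => Some(retainedResults) // cap unbounded queries
    case None                        => None                  // 0 = no restriction
  }
```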
[jira] [Comment Edited] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962318#comment-15962318 ] Nan Zhu edited comment on SPARK-20251 at 4/10/17 12:16 AM: --- more details here: it is expected that the compute() method for the next batch is executed before the app shuts down; however, the app should eventually shut down, since we have signalled the awaiting condition set in awaitTermination(). However, this "eventual shutdown" did not happen... (this issue did not happen consistently) was (Author: codingcat): more details here, by "be proceeding", I mean it is expected that the compute() method for the next batch is executed before the app shuts down; however, the app should eventually shut down, since we have signalled the awaiting condition set in awaitTermination(). However, this "eventual shutdown" did not happen... (this issue did not happen consistently) > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params? > Screenshots from the Spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962318#comment-15962318 ] Nan Zhu commented on SPARK-20251: - more details here, by "be proceeding", I mean it is expected that the compute() method for the next batch is executed before the app shuts down; however, the app should eventually shut down, since we have signalled the awaiting condition set in awaitTermination(). However, this "eventual shutdown" did not happen... (this issue did not happen consistently) > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params? > Screenshots from the Spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962313#comment-15962313 ] Nan Zhu edited comment on SPARK-20251 at 4/9/17 11:57 PM: -- Why is this an invalid report? I have been observing the same behavior recently after upgrading to Spark 2.1. The basic idea (on my side) is that an exception thrown from the DStream.compute() method should shut down the app instead of proceeding (as the error handling in Spark Streaming releases the await lock set in awaitTermination()). I am still looking at the threads within Spark Streaming to see what is happening; can we change it back to a valid case and give me more time to investigate? was (Author: codingcat): Why is this an invalid report? I have been observing the same behavior recently after upgrading to Spark 2.1. The basic idea (on my side): an exception thrown from the DStream.compute() method should shut down the app instead of proceeding (as the error handling in Spark Streaming releases the await lock set in awaitTermination()). I am still looking at the threads within Spark Streaming to see what is happening; can we change it back to a valid case and give me more time to investigate? > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params?
> Screenshots from spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure
[ https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962313#comment-15962313 ] Nan Zhu commented on SPARK-20251: - Why is this an invalid report? I have been observing the same behavior recently after upgrading to Spark 2.1. The basic idea (on my side): an exception thrown from the DStream.compute() method should shut down the app instead of proceeding (as the error handling in Spark Streaming releases the await lock set in awaitTermination()). I am still looking at the threads within Spark Streaming to see what is happening; can we change it back to a valid case and give me more time to investigate? > Spark streaming skips batches in a case of failure > -- > > Key: SPARK-20251 > URL: https://issues.apache.org/jira/browse/SPARK-20251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Roman Studenikin > > We are experiencing strange behaviour of a Spark streaming application. > Sometimes it just skips a batch in case of a job failure and starts working on > the next one. > We expect it to attempt to reprocess the batch, not to skip it. Is it a bug, > or are we missing some important configuration params? > Screenshots from the Spark UI: > http://pasteboard.co/1oRW0GDUX.png > http://pasteboard.co/1oSjdFpbc.png -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
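For context, a sketch of the driver pattern under discussion (standard streaming API; the failing {{compute()}} lives inside the DStream graph, which is elided here):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: if a DStream's compute() throws, the streaming error handler
// signals the condition that awaitTermination() waits on, so the app is
// expected to shut down rather than silently skip the failed batch.
val ssc = new StreamingContext(conf, Seconds(10))
// ... build the DStream graph here ...
ssc.start()
try {
  ssc.awaitTermination() // released once an error is reported
} finally {
  ssc.stop(stopSparkContext = true, stopGracefully = false)
}
```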
[jira] [Assigned] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
[ https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20260: - Assignee: Vijay Krishna Ramesh Priority: Minor (was: Trivial) > MLUtils parseLibSVMRecord has incorrect string interpolation for error message > -- > > Key: SPARK-20260 > URL: https://issues.apache.org/jira/browse/SPARK-20260 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Vijay Krishna Ramesh >Assignee: Vijay Krishna Ramesh >Priority: Minor > Fix For: 2.1.2, 2.2.0 > > > There is missing string interpolation for the error message, which causes it > to not actually display the line that failed. See > https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message
[ https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20260. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 17572 [https://github.com/apache/spark/pull/17572] > MLUtils parseLibSVMRecord has incorrect string interpolation for error message > -- > > Key: SPARK-20260 > URL: https://issues.apache.org/jira/browse/SPARK-20260 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Vijay Krishna Ramesh >Priority: Trivial > Fix For: 2.2.0, 2.1.2 > > > There is missing string interpolation for the error message, which causes it > to not actually display the line that failed. See > https://github.com/apache/spark/pull/17572/files for a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
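The bug class in miniature (illustrative Scala, not the actual MLUtils source): without the {{s}} prefix, Scala performs no interpolation, so the failing input never appears in the message.

```scala
val line = "1 1:0.5 bad-token"
// Missing interpolator: the message contains the literal text "$line".
val wrong = "Failed to parse line: $line"
// With the s-prefix, the offending input is actually shown.
val right = s"Failed to parse line: $line"
// wrong == "Failed to parse line: $line"
// right == "Failed to parse line: 1 1:0.5 bad-token"
```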
[jira] [Comment Edited] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962160#comment-15962160 ] Hyukjin Kwon edited comment on SPARK-20266 at 4/9/17 3:21 PM: -- I am resolving this as it sounds like a question, and questions should be asked on the user mailing list first. Maybe http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically is helpful. was (Author: hyukjin.kwon): I am resolving this as it sounds like a question, and questions should be asked on the user mailing list first. Maybe http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically is helpful. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20266) ExecutorBackend blocked at "UserGroupInformation.doAs"
[ https://issues.apache.org/jira/browse/SPARK-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20266. -- Resolution: Invalid I am resolving this as it sounds like a question, and questions should be asked on the user mailing list first. Maybe http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically is helpful. > ExecutorBackend blocked at "UserGroupInformation.doAs" > -- > > Key: SPARK-20266 > URL: https://issues.apache.org/jira/browse/SPARK-20266 > Project: Spark > Issue Type: Question > Components: Project Infra >Affects Versions: 1.6.2 >Reporter: Jerry.X.He >Priority: Minor > Attachments: logsSubmitByIdeaAtClient.zip, > logsSubmitBySparkSubmitAtSlave02.zip > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962141#comment-15962141 ] Maciej Szymkiewicz commented on SPARK-10931: [~vlad.feinberg] It is worth noting that without {{parent}} some features (like {{CrossValidator}} or {{TrainValidationSplit}}) are crippled. > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests.
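The fix sketched in SPARK-10931 ("modifying Estimator.fit to copy Param values") and the {{parent}} dependency Maciej mentions can be illustrated with a minimal mock, not actual PySpark code: the `Params`, `Estimator`, and `Model` classes below are simplified stand-ins, and `copyParamsTo` is a hypothetical helper, but the shape of the fix is the same: `fit` copies the estimator's Param values onto the returned model and records itself as the model's parent, which is what meta-algorithms like CrossValidator rely on.

```python
# Illustrative sketch (NOT actual PySpark source) of copying Param values
# from an Estimator to the Model it produces, as proposed in SPARK-10931.

class Params:
    """Minimal stand-in for pyspark.ml.param.Params."""
    def __init__(self):
        self._paramMap = {}   # explicitly-set Param values
        self.parent = None    # the Estimator that produced this object

    def set(self, name, value):
        self._paramMap[name] = value
        return self

    def getOrDefault(self, name):
        return self._paramMap[name]

    def copyParamsTo(self, other):
        # Hypothetical helper: propagate every set Param value to `other`.
        other._paramMap.update(self._paramMap)
        return other


class Model(Params):
    pass


class Estimator(Params):
    def fit(self, dataset):
        model = self._fit(dataset)
        # The proposed fix: the model carries the estimator's Params and
        # knows its parent, so CrossValidator-style code can inspect both.
        self.copyParamsTo(model)
        model.parent = self
        return model

    def _fit(self, dataset):
        return Model()  # actual training elided


est = Estimator().set("maxIter", 10).set("regParam", 0.01)
model = est.fit(dataset=[])
print(model.getOrDefault("maxIter"), model.parent is est)  # → 10 True
```

Without the `copyParamsTo` call in `fit`, the returned model would have an empty param map and a `None` parent, which is exactly the crippled state described in the comment above.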
[jira] [Updated] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20269: --- Description: 1. When running the streaming Kafka examples, a Java word-count producer is currently missing, which makes it harder for Java developers to learn and test; add a class JavaKafkaWordCountProducer. 2. When running the JavaKafkaWordCount example, I find there is no Java word-count producer, while the KafkaWordCount example does have a Scala word-count producer. I think we should provide the corresponding example code to help Java developers learn and test. 3. My project team develops Spark applications mostly with Java and the Java API. was: When running the streaming Kafka examples, a Java word-count producer is currently missing, which makes it harder for Java developers to learn and test. > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > 1. When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test; add a class JavaKafkaWordCountProducer. > 2. When running the JavaKafkaWordCount example, I find there is no Java word-count producer, > while the KafkaWordCount example does have a Scala word-count producer. > I think we should provide the corresponding example code to help Java > developers learn and test. > 3. My project team develops Spark applications mostly with Java > and the Java API.
[jira] [Updated] (SPARK-20269) add java class 'JavaWordCountProducer' to provide a java word count producer
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20269: --- Summary: add java class 'JavaWordCountProducer' to provide a java word count producer (was: add JavaWordCountProducer in streaming examples) > add java class 'JavaWordCountProducer' to provide a java word count producer > - > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > 1. When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test; add a class JavaKafkaWordCountProducer. > 2. When running the JavaKafkaWordCount example, I find there is no Java word-count producer, > while the KafkaWordCount example does have a Scala word-count producer. > I think we should provide the corresponding example code to help Java > developers learn and test. > 3. My project team develops Spark applications mostly with Java > and the Java API.
[jira] [Resolved] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20268. --- Resolution: Not A Problem > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Reopened] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-20268: --- > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Commented] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962101#comment-15962101 ] guoxiaolongzte commented on SPARK-20269: https://github.com/apache/spark/pull/17578 is invalid; I have closed that PR. Please see https://github.com/apache/spark/pull/17580. Thank you. > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Closed] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hayri Volkan Agun closed SPARK-20268. - Resolution: Fixed > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Commented] (SPARK-20268) Arbitrary RDD element (Fast return) instead of using first
[ https://issues.apache.org/jira/browse/SPARK-20268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962098#comment-15962098 ] Hayri Volkan Agun commented on SPARK-20268: --- Hi Owen, if the first element is the fastest, let's close it. > Arbitrary RDD element (Fast return) instead of using first > -- > > Key: SPARK-20268 > URL: https://issues.apache.org/jira/browse/SPARK-20268 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Hayri Volkan Agun >Priority: Minor > > Most of the ML and MLlib algorithms somehow need the column size of the RDD > vector (RDD[Vector]). So instead of getting the first element by rdd.first(), > a fast return could be made to compute the vector length from an arbitrary > RDD element. It could also be named any().
[jira] [Commented] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962088#comment-15962088 ] Apache Spark commented on SPARK-20269: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17578 > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Assigned] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20269: Assignee: Apache Spark > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Assigned] (SPARK-20269) add JavaWordCountProducer in streaming examples
[ https://issues.apache.org/jira/browse/SPARK-20269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20269: Assignee: (was: Apache Spark) > add JavaWordCountProducer in streaming examples > -- > > Key: SPARK-20269 > URL: https://issues.apache.org/jira/browse/SPARK-20269 > Project: Spark > Issue Type: Improvement > Components: Examples, Structured Streaming >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte >Priority: Minor > > When running the streaming Kafka examples, a Java word-count producer is currently > missing, which makes it harder for Java developers to learn and test.
[jira] [Assigned] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20270: Assignee: Apache Spark (was: DB Tsai) > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Critical > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes the member id > in Long in our application if the member id can not be represented by Double > losslessly when the member id is very big. > Here is an example how this happens, with > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code}. > Note that even the value is not null, Spark will cast the Long into Double > first. Then if it's not null, Spark will cast it back to Long which results > in losing precision. > The behavior should be that the original value should not be changed if it's > not null, but Spark will change the value which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
[jira] [Assigned] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20270: Assignee: DB Tsai (was: Apache Spark) > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Critical > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes the member id > in Long in our application if the member id can not be represented by Double > losslessly when the member id is very big. > Here is an example how this happens, with > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code}. > Note that even the value is not null, Spark will cast the Long into Double > first. Then if it's not null, Spark will cast it back to Long which results > in losing precision. > The behavior should be that the original value should not be changed if it's > not null, but Spark will change the value which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
[jira] [Commented] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
[ https://issues.apache.org/jira/browse/SPARK-20270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962065#comment-15962065 ] Apache Spark commented on SPARK-20270: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/17577 > na.fill will change the values in long or integer when the default value is > in double > - > > Key: SPARK-20270 > URL: https://issues.apache.org/jira/browse/SPARK-20270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Critical > > This bug was partially addressed in SPARK-18555, but the root cause isn't > completely solved. This bug is pretty critical since it changes the member id > in Long in our application if the member id can not be represented by Double > losslessly when the member id is very big. > Here is an example how this happens, with > {code} > Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), > (9123146099426677101L, null), > (9123146560113991650L, 1.6), (null, null)).toDF("a", > "b").na.fill(0.2), > {code} > the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as > bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as > double) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code}. > Note that even the value is not null, Spark will cast the Long into Double > first. Then if it's not null, Spark will cast it back to Long which results > in losing precision. > The behavior should be that the original value should not be changed if it's > not null, but Spark will change the value which is wrong. 
> With the PR, the logical plan will be > {code} > == Analyzed Logical Plan == > a: bigint, b: double > Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, > coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] > +- Project [_1#229L AS a#232L, _2#230 AS b#233] >+- LocalRelation [_1#229L, _2#230] > {code} > which behaves correctly without changing the original Long values.
[jira] [Created] (SPARK-20270) na.fill will change the values in long or integer when the default value is in double
DB Tsai created SPARK-20270: --- Summary: na.fill will change the values in long or integer when the default value is in double Key: SPARK-20270 URL: https://issues.apache.org/jira/browse/SPARK-20270 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: DB Tsai Assignee: DB Tsai Priority: Critical This bug was partially addressed in SPARK-18555, but the root cause isn't completely solved. This bug is pretty critical since it changes the Long member ids in our application whenever a member id is too large to be represented losslessly by a Double. Here is an example of how this happens; with {code} Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null), (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2), {code} the logical plan will be {code} == Analyzed Logical Plan == a: bigint, b: double Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241] +- Project [_1#229L AS a#232L, _2#230 AS b#233] +- LocalRelation [_1#229L, _2#230] {code} Note that even when the value is not null, Spark will first cast the Long to Double; then, if it is not null, Spark casts it back to Long, which loses precision. The original value should not be changed when it is not null, but Spark changes it, which is wrong. With the PR, the logical plan will be {code} == Analyzed Logical Plan == a: bigint, b: double Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] +- Project [_1#229L AS a#232L, _2#230 AS b#233] +- LocalRelation [_1#229L, _2#230] {code} which behaves correctly without changing the original Long values.
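The precision loss at the heart of SPARK-20270 can be demonstrated outside Spark with a minimal sketch: a 64-bit integer above 2**53 cannot round-trip through an IEEE-754 double, which is exactly what the pre-fix plan's `cast(a as double)` followed by `cast(... as bigint)` does. Plain Python floats are IEEE-754 doubles, so they show the same effect; this is an illustration of the arithmetic, not Spark code.

```python
# Round-tripping a large Long through a double, as the pre-fix na.fill
# plan does, silently changes the value: doubles have a 53-bit mantissa,
# so integers above 2**53 are rounded to the nearest representable value.

member_id = 9123146099426677101  # one of the Long values from the report

as_double = float(member_id)     # cast(a as double)
round_tripped = int(as_double)   # cast(... as bigint)

print(member_id == round_tripped)  # False: the id silently changed
print(round_tripped - member_id)   # non-zero drift introduced by the cast
print(member_id > 2**53)           # True: beyond double's exact-int range
```

The fixed plan avoids this by casting the *fill value* (0.2) to bigint once, and leaving non-null Long column values untouched.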
[jira] [Assigned] (SPARK-19991) FileSegmentManagedBuffer performance improvement.
[ https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19991: - Assignee: Sean Owen > FileSegmentManagedBuffer performance improvement. > - > > Key: SPARK-19991 > URL: https://issues.apache.org/jira/browse/SPARK-19991 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.0.2, 2.1.0 >Reporter: Guoqiang Li >Assignee: Sean Owen >Priority: Minor > Fix For: 2.2.0 > > > When we do not set the value of the configuration items > {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, > each call to the FileSegmentManagedBuffer.nioByteBuffer or > FileSegmentManagedBuffer.createInputStream method creates a > NoSuchElementException instance. This is a relatively expensive operation. > The shuffle-server thread's stack: > {noformat} > "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 > nid=0x28d12 runnable [0x7f71af93e000] >java.lang.Thread.State: RUNNABLE > at java.lang.Throwable.fillInStackTrace(Native Method) > at java.lang.Throwable.fillInStackTrace(Throwable.java:783) > - locked <0x0007a930f080> (a java.util.NoSuchElementException) > at java.lang.Throwable.(Throwable.java:265) > at java.lang.Exception.(Exception.java:66) > at java.lang.RuntimeException.(RuntimeException.java:62) > at > java.util.NoSuchElementException.(NoSuchElementException.java:57) > at > org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38) > at > org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31) > at > org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50) > at > org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157) > at > org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54) > at > 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728) > at > org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835) > at > org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017) > at > org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256) > at > org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) > at > org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) > at > 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) > at >
[jira] [Resolved] (SPARK-19991) FileSegmentManagedBuffer performance improvement.
[ https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19991. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17567 [https://github.com/apache/spark/pull/17567] > FileSegmentManagedBuffer performance improvement. > - > > Key: SPARK-19991 > URL: https://issues.apache.org/jira/browse/SPARK-19991 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.0.2, 2.1.0 >Reporter: Guoqiang Li >Priority: Minor > Fix For: 2.2.0 > > > When we do not set the value of the configuration items > {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, > each call to the FileSegmentManagedBuffer.nioByteBuffer or > FileSegmentManagedBuffer.createInputStream method creates a > NoSuchElementException instance. This is a relatively expensive operation. > The shuffle-server thread's stack: > {noformat} > "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 > nid=0x28d12 runnable [0x7f71af93e000] >java.lang.Thread.State: RUNNABLE > at java.lang.Throwable.fillInStackTrace(Native Method) > at java.lang.Throwable.fillInStackTrace(Throwable.java:783) > - locked <0x0007a930f080> (a java.util.NoSuchElementException) > at java.lang.Throwable.(Throwable.java:265) > at java.lang.Exception.(Exception.java:66) > at java.lang.RuntimeException.(RuntimeException.java:62) > at > java.util.NoSuchElementException.(NoSuchElementException.java:57) > at > org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38) > at > org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31) > at > org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50) > at > org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157) > at > org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132) > at > 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728) > at > org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835) > at > org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017) > at > org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256) > at > org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) > at > org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) > at > 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
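The performance pattern behind SPARK-19991 (in Java, the dominant cost is `Throwable.fillInStackTrace` running on every exception construction, as the stack above shows) can be sketched generically: a config lookup that signals "key absent" by throwing pays for an exception object on every call on the hot path, while an exception-free lookup with a default does not. The sketch below is not Spark's code; `get_with_exception` and `get_with_lookup` are hypothetical names illustrating the two designs, here in Python where exceptions are cheaper than in Java but the shape of the fix is the same.

```python
# Two ways to read an unset config key with a default. The first mirrors
# a ConfigProvider.get() that throws for missing keys (an exception object
# is built on every call); the second never constructs an exception.

import timeit

config = {}  # the keys in question are unset, as in the report

def get_with_exception(key, default):
    # throw-and-catch per lookup: the pattern the JIRA flags as costly
    try:
        return config[key]
    except KeyError:
        return default

def get_with_lookup(key, default):
    # exception-free path: a plain presence check with a default
    return config.get(key, default)

slow = timeit.timeit(lambda: get_with_exception("spark.shuffle.io.lazyFD", True), number=100_000)
fast = timeit.timeit(lambda: get_with_lookup("spark.shuffle.io.lazyFD", True), number=100_000)
print(f"exception path: {slow:.3f}s, plain lookup: {fast:.3f}s")
```

Both functions return the same value; the fix merged in PR 17567 amounts to moving the shuffle server off the throwing path so no throwable (and no stack capture) is created per fetch.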
[jira] [Created] (SPARK-20269) add JavaWordCountProducer in streaming examples
guoxiaolongzte created SPARK-20269: -- Summary: add JavaWordCountProducer in streaming examples Key: SPARK-20269 URL: https://issues.apache.org/jira/browse/SPARK-20269 Project: Spark Issue Type: Improvement Components: Examples, Structured Streaming Affects Versions: 2.1.0 Reporter: guoxiaolongzte Priority: Minor When running the streaming Kafka examples, a Java word-count producer is currently missing, which makes it harder for Java developers to learn and test.