[jira] [Commented] (SPARK-16683) Group by does not work after multiple joins of the same dataframe

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095790#comment-16095790
 ] 

Apache Spark commented on SPARK-16683:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/18697

> Group by does not work after multiple joins of the same dataframe
> -
>
> Key: SPARK-16683
> URL: https://issues.apache.org/jira/browse/SPARK-16683
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
> Environment: local and yarn
>Reporter: Witold Jędrzejewski
> Attachments: code_2.0.txt, Duplicates Problem Presentation.json
>
>
> When I join a dataframe, group by a field from it, then join it again on a different field and group by a field from it, the second aggregation does not trigger.
> A minimal example showing the problem is attached as text to paste into spark-shell (code_2.0.txt).
> The detailed description, minimal example, workaround and possible cause are in the attachment, in the form of a Zeppelin notebook.
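For reference, a rough sketch of the join/groupBy/join/groupBy pattern being described (hypothetical data and column names; the actual reproducer is the attached code_2.0.txt):

{code}
// spark-shell; shape of the reported scenario only, not the attached example.
val base   = Seq((1, "x"), (2, "y"), (2, "y")).toDF("id", "name")
val lookup = Seq((1, 10), (2, 20)).toDF("id", "score")

val first  = base.join(lookup, "id").groupBy("name").count()   // first aggregation
val second = first.join(base, "name").groupBy("id").count()    // second aggregation
second.explain(true)  // check whether the second Aggregate survives optimization
second.show()
{code}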






[jira] [Assigned] (SPARK-16683) Group by does not work after multiple joins of the same dataframe

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16683:


Assignee: (was: Apache Spark)

> Group by does not work after multiple joins of the same dataframe
> -
>
> Key: SPARK-16683
> URL: https://issues.apache.org/jira/browse/SPARK-16683
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
> Environment: local and yarn
>Reporter: Witold Jędrzejewski
> Attachments: code_2.0.txt, Duplicates Problem Presentation.json
>
>
> When I join a dataframe, group by a field from it, then join it again on a different field and group by a field from it, the second aggregation does not trigger.
> A minimal example showing the problem is attached as text to paste into spark-shell (code_2.0.txt).
> The detailed description, minimal example, workaround and possible cause are in the attachment, in the form of a Zeppelin notebook.






[jira] [Assigned] (SPARK-16683) Group by does not work after multiple joins of the same dataframe

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16683:


Assignee: Apache Spark

> Group by does not work after multiple joins of the same dataframe
> -
>
> Key: SPARK-16683
> URL: https://issues.apache.org/jira/browse/SPARK-16683
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
> Environment: local and yarn
>Reporter: Witold Jędrzejewski
>Assignee: Apache Spark
> Attachments: code_2.0.txt, Duplicates Problem Presentation.json
>
>
> When I join a dataframe, group by a field from it, then join it again on a different field and group by a field from it, the second aggregation does not trigger.
> A minimal example showing the problem is attached as text to paste into spark-shell (code_2.0.txt).
> The detailed description, minimal example, workaround and possible cause are in the attachment, in the form of a Zeppelin notebook.






[jira] [Assigned] (SPARK-21497) Pull non-deterministic joining keys from Join operator

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21497:


Assignee: Apache Spark

> Pull non-deterministic joining keys from Join operator
> --
>
> Key: SPARK-21497
> URL: https://issues.apache.org/jira/browse/SPARK-21497
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Currently SparkSQL doesn't support non-deterministic join conditions in Join. This kind of join condition can be useful in some cases, e.g., http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973.
> There seems to be no standard behavior for pulling non-deterministic join conditions out of the Join operator. Based on the discussion on https://github.com/apache/spark/pull/18652#issuecomment-316344905, https://github.com/apache/spark/pull/18652#issuecomment-316391759 and https://github.com/apache/spark/pull/18652#issuecomment-316665649, Hive doesn't give non-deterministic join conditions any special consideration and simply pushes them down or uses them as join keys.
> In this attempt, we initially allow non-deterministic equi-join keys in Join operators. Because, based on SparkSQL's join implementations, the equi-join keys are evaluated once on the joining tables, pulling equi-join keys out of Join operators won't change the number of calls to non-deterministic expressions. This is safer than other kinds of join conditions, e.g. rand(10) > a && rand(20) < b, where pulling the condition up or pushing it down may change the number of calls to rand().
> We also add a SQL conf to control this new behavior. It is disabled by default.






[jira] [Assigned] (SPARK-21497) Pull non-deterministic joining keys from Join operator

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21497:


Assignee: (was: Apache Spark)

> Pull non-deterministic joining keys from Join operator
> --
>
> Key: SPARK-21497
> URL: https://issues.apache.org/jira/browse/SPARK-21497
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> Currently SparkSQL doesn't support non-deterministic join conditions in Join. This kind of join condition can be useful in some cases, e.g., http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973.
> There seems to be no standard behavior for pulling non-deterministic join conditions out of the Join operator. Based on the discussion on https://github.com/apache/spark/pull/18652#issuecomment-316344905, https://github.com/apache/spark/pull/18652#issuecomment-316391759 and https://github.com/apache/spark/pull/18652#issuecomment-316665649, Hive doesn't give non-deterministic join conditions any special consideration and simply pushes them down or uses them as join keys.
> In this attempt, we initially allow non-deterministic equi-join keys in Join operators. Because, based on SparkSQL's join implementations, the equi-join keys are evaluated once on the joining tables, pulling equi-join keys out of Join operators won't change the number of calls to non-deterministic expressions. This is safer than other kinds of join conditions, e.g. rand(10) > a && rand(20) < b, where pulling the condition up or pushing it down may change the number of calls to rand().
> We also add a SQL conf to control this new behavior. It is disabled by default.






[jira] [Commented] (SPARK-21497) Pull non-deterministic joining keys from Join operator

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095756#comment-16095756
 ] 

Apache Spark commented on SPARK-21497:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18652

> Pull non-deterministic joining keys from Join operator
> --
>
> Key: SPARK-21497
> URL: https://issues.apache.org/jira/browse/SPARK-21497
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> Currently SparkSQL doesn't support non-deterministic join conditions in Join. This kind of join condition can be useful in some cases, e.g., http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973.
> There seems to be no standard behavior for pulling non-deterministic join conditions out of the Join operator. Based on the discussion on https://github.com/apache/spark/pull/18652#issuecomment-316344905, https://github.com/apache/spark/pull/18652#issuecomment-316391759 and https://github.com/apache/spark/pull/18652#issuecomment-316665649, Hive doesn't give non-deterministic join conditions any special consideration and simply pushes them down or uses them as join keys.
> In this attempt, we initially allow non-deterministic equi-join keys in Join operators. Because, based on SparkSQL's join implementations, the equi-join keys are evaluated once on the joining tables, pulling equi-join keys out of Join operators won't change the number of calls to non-deterministic expressions. This is safer than other kinds of join conditions, e.g. rand(10) > a && rand(20) < b, where pulling the condition up or pushing it down may change the number of calls to rand().
> We also add a SQL conf to control this new behavior. It is disabled by default.






[jira] [Updated] (SPARK-21497) Pull non-deterministic joining keys from Join operator

2017-07-20 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-21497:

Description: 
Currently SparkSQL doesn't support non-deterministic join conditions in Join. This kind of join condition can be useful in some cases, e.g., http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973.

There seems to be no standard behavior for pulling non-deterministic join conditions out of the Join operator. Based on the discussion on https://github.com/apache/spark/pull/18652#issuecomment-316344905, https://github.com/apache/spark/pull/18652#issuecomment-316391759 and https://github.com/apache/spark/pull/18652#issuecomment-316665649, Hive doesn't give non-deterministic join conditions any special consideration and simply pushes them down or uses them as join keys.

In this attempt, we initially allow non-deterministic equi-join keys in Join operators. Because, based on SparkSQL's join implementations, the equi-join keys are evaluated once on the joining tables, pulling equi-join keys out of Join operators won't change the number of calls to non-deterministic expressions. This is safer than other kinds of join conditions, e.g. rand(10) > a && rand(20) < b, where pulling the condition up or pushing it down may change the number of calls to rand().

We also add a SQL conf to control this new behavior. It is disabled by default.
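For illustration, the kind of condition being discussed (a sketch; on 2.2.0 the analyzer rejects non-deterministic expressions in join conditions, and the proposal would allow only the equi-join-key form, behind a conf):

{code}
import org.apache.spark.sql.functions.rand

val left  = spark.range(100).toDF("a")
val right = spark.range(100).toDF("b")

// Non-deterministic expression inside an equi-join key: evaluated once per row of each
// side, so pulling it out of the Join does not change how often rand() is called.
left.join(right, (left("a") + (rand(10) * 10).cast("long")) === right("b")).explain()

// By contrast, a condition like rand(10) > a && rand(20) < b is not an equi-join key;
// pushing it down or pulling it up may change the number of rand() calls.
{code}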



> Pull non-deterministic joining keys from Join operator
> --
>
> Key: SPARK-21497
> URL: https://issues.apache.org/jira/browse/SPARK-21497
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> Currently SparkSQL doesn't support non-deterministic join conditions in Join. This kind of join condition can be useful in some cases, e.g., http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973.
> There seems to be no standard behavior for pulling non-deterministic join conditions out of the Join operator. Based on the discussion on https://github.com/apache/spark/pull/18652#issuecomment-316344905, https://github.com/apache/spark/pull/18652#issuecomment-316391759 and https://github.com/apache/spark/pull/18652#issuecomment-316665649, Hive doesn't give non-deterministic join conditions any special consideration and simply pushes them down or uses them as join keys.
> In this attempt, we initially allow non-deterministic equi-join keys in Join operators. Because, based on SparkSQL's join implementations, the equi-join keys are evaluated once on the joining tables, pulling equi-join keys out of Join operators won't change the number of calls to non-deterministic expressions. This is safer than other kinds of join conditions, e.g. rand(10) > a && rand(20) < b, where pulling the condition up or pushing it down may change the number of calls to rand().
> We also add a SQL conf to control this new behavior. It is disabled by default.






[jira] [Created] (SPARK-21497) Pull non-deterministic joining keys from Join operator

2017-07-20 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-21497:
---

 Summary: Pull non-deterministic joining keys from Join operator
 Key: SPARK-21497
 URL: https://issues.apache.org/jira/browse/SPARK-21497
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Liang-Chi Hsieh









[jira] [Commented] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe

2017-07-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095744#comment-16095744
 ] 

Shixiong Zhu commented on SPARK-21425:
--

[~rdub] I just realized we never documented local-cluster mode. You can use "--master local-cluster[N, cores, memory]" to start a cluster that runs the driver, worker and master in one process, and executors in their own processes (see https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala).

Spark assumes accumulators are always captured by closures. When starting a task in an executor, Spark deserializes the closure; in this step, it creates a copy of each accumulator in the executor. When a task finishes, the value in the copied accumulator is sent back to the driver. The driver merges all values collected from tasks and sets the final value on the driver-side accumulator. Because static accumulators cannot be captured by closures, this mechanism doesn't work for them. See AccumulatorV2.writeReplace and AccumulatorV2.readObject if you want to dig into the details.
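For illustration, a minimal sketch of the closure-captured pattern described above (hypothetical names; not taken from the ticket):

{code}
// Closure-captured accumulator: serialized with the task closure, copied on each
// executor, and the copies are merged back into the driver-side instance.
val acc = spark.sparkContext.longAccumulator("events")
spark.sparkContext.parallelize(1 to 1000).foreach { _ => acc.add(1) }
println(acc.value)  // 1000 on the driver

// An accumulator referenced only through a top-level object ("static") is not part of
// the task closure, so the copy/merge mechanism above never applies to it.
{code}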

> LongAccumulator, DoubleAccumulator not threadsafe
> -
>
> Key: SPARK-21425
> URL: https://issues.apache.org/jira/browse/SPARK-21425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [AccumulatorV2 
> docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43]
>  acknowledge that accumulators must be concurrent-read-safe, but afaict they 
> must also be concurrent-write-safe.
> The same docs imply that {{Int}} and {{Long}} meet either/both of these 
> criteria, when afaict they do not.
> Relatedly, the provided 
> [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291]
>  and 
> [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370]
>  are not thread-safe, and should be expected to behave undefinedly when 
> multiple concurrent tasks on the same executor write to them.
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] 
> with some simple applications that demonstrate incorrect results from 
> {{LongAccumulator}}'s.
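A standalone sketch of the unsynchronized update the description refers to (illustrative only; whether lost updates actually show up depends on timing):

{code}
import org.apache.spark.util.LongAccumulator

// Two threads adding to one LongAccumulator; add() does a plain read-modify-write,
// so concurrent adds can lose increments and print a value below 2000000.
val acc = new LongAccumulator
val threads = Seq.fill(2)(new Thread(new Runnable {
  def run(): Unit = (1 to 1000000).foreach(_ => acc.add(1L))
}))
threads.foreach(_.start())
threads.foreach(_.join())
println(acc.value)
{code}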






[jira] [Updated] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server

2017-07-20 Thread Xin Yu Pan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Yu Pan updated SPARK-21495:
---
Description: 
We hit an issue when enabling authentication and SASL encryption; see the parameters in bold in the following list.
spark.local.dir /tmp/xpan-spark-161
spark.eventLog.dir file:///home/xpan/spark-conf/event
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/home/xpan/spark-conf/event
spark.history.ui.port 18085
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7448
spark.shuffle.reduceLocality.enabled false
spark.master.port 7087
spark.master.rest.port 6077
spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
*spark.authenticate true
spark.authenticate.secret 5828d44b-f9b9-4033-b1f5-21d1e3273ec2
spark.authenticate.enableSaslEncryption false
spark.network.sasl.serverAlwaysEncrypt false*


We run the simple SparkPi example and there are exception messages even though the application completes.

# cat 
spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out.1
... ...
17/07/20 02:57:30 INFO spark.SecurityManager: SecurityManager: authentication 
enabled; ui acls disabled; users with view permissions: Set(xpan); users with 
modify permissions: Set(xpan)
17/07/20 02:57:31 INFO deploy.ExternalShuffleService: Starting shuffle service 
on port 7448 with useSasl = true
17/07/20 02:58:04 INFO shuffle.ExternalShuffleBlockResolver: Registered 
executor AppExecId{appId=app-20170720025800-, execId=0} with 
ExecutorShuffleInfo{localDirs=[/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0],
 subDirsPerLocalDir=64, 
shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/07/20 02:58:11 INFO security.sasl: DIGEST41:Unmatched MACs
17/07/20 02:58:11 WARN server.TransportChannelHandler: Exception in connection 
from /172.29.10.77:50616
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: 
DIGEST-MD5: Out of order sequencing of messages from server. Got: 125 Expected: 
123
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:785)
Caused by: javax.security.sasl.SaslException: DIGEST-MD5: Out of order 
sequencing of messages from server. Got: 125 Expected: 123
at 
com.ibm.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1535)
at 
com.ibm.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:231)
at 
org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:149)
at 
org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:127)
at 
org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:102)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
... 13 more
17/07/20 02:58:11 ERROR server.TransportRequestHandler: Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=908084716000, 
chunkIndex=1}, 
buffer=FileSegmentManagedBuffer{file=/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0/0c/shuffle_0_17_0.data,
 offset=1893612, length=302981}} to /172.29.10.77:50616; closing connection
java.nio.channels.ClosedChannelException


[jira] [Updated] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server

2017-07-20 Thread Xin Yu Pan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Yu Pan updated SPARK-21495:
---
Description: 
We hit an issue when enabling authentication and SASL encryption; see the parameters in bold in the following list.
spark.local.dir /tmp/xpan-spark-161
spark.eventLog.dir file:///home/xpan/spark-conf/event
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/home/xpan/spark-conf/event
spark.history.ui.port 18085
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7448
spark.shuffle.reduceLocality.enabled false
spark.master.port 7087
spark.master.rest.port 6077
spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
*spark.authenticate true
spark.authenticate.secret 5828d44b-f9b9-4033-b1f5-21d1e3273ec2
spark.authenticate.enableSaslEncryption false
spark.network.sasl.serverAlwaysEncrypt false*


We run the simple SparkPi example and there are exception messages even though the application completes.

# cat 
spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out
Spark Command: /opt/xpan/cws22-0713/jre/8.0.3.21/linux-x86_64/bin/java -cp 
/opt/xpan/spark-1.6.1-bin-hadoop2.6/conf/:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/xpan/hadoop-2.8.0/etc/hadoop/
 -Xms2g -Xmx2g org.apache.spark.deploy.ExternalShuffleService

17/07/20 06:24:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
17/07/20 06:24:10 INFO spark.SecurityManager: Changing view acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: Changing modify acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: SecurityManager: authentication 
enabled; ui acls disabled; users with view permissions: Set(xpan); users with 
modify permissions: Set(xpan)
17/07/20 06:24:11 INFO deploy.ExternalShuffleService: Starting shuffle service 
on port 7448 with useSasl = true
# cat 
spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out.1
... ...
17/07/20 02:57:31 INFO deploy.ExternalShuffleService: Starting shuffle service 
on port 7448 with useSasl = true
17/07/20 02:58:04 INFO shuffle.ExternalShuffleBlockResolver: Registered 
executor AppExecId{appId=app-20170720025800-, execId=0} with 
ExecutorShuffleInfo{localDirs=[/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0],
 subDirsPerLocalDir=64, 
shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/07/20 02:58:11 INFO security.sasl: DIGEST41:Unmatched MACs
17/07/20 02:58:11 WARN server.TransportChannelHandler: Exception in connection 
from /172.29.10.77:50616
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: 
DIGEST-MD5: Out of order sequencing of messages from server. Got: 125 Expected: 
123
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:785)
Caused by: javax.security.sasl.SaslException: DIGEST-MD5: Out of order 
sequencing of messages from server. Got: 125 Expected: 123
at 

[jira] [Updated] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server

2017-07-20 Thread Xin Yu Pan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Yu Pan updated SPARK-21495:
---
Description: 
We hit an issue when enabling authentication and SASL encryption; see the parameters in bold in the following list.
spark.local.dir /tmp/xpan-spark-161
spark.eventLog.dir file:///home/xpan/spark-conf/event
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/home/xpan/spark-conf/event
spark.history.ui.port 18085
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7448
spark.shuffle.reduceLocality.enabled false
spark.master.port 7087
spark.master.rest.port 6077
spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
spark.authenticate true
spark.authenticate.secret 5828d44b-f9b9-4033-b1f5-21d1e3273ec2
spark.authenticate.enableSaslEncryption false
spark.network.sasl.serverAlwaysEncrypt false


We run the simple SparkPi example and there are exception messages even though the application completes.

# cat 
spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out
Spark Command: /opt/xpan/cws22-0713/jre/8.0.3.21/linux-x86_64/bin/java -cp 
/opt/xpan/spark-1.6.1-bin-hadoop2.6/conf/:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/xpan/hadoop-2.8.0/etc/hadoop/
 -Xms2g -Xmx2g org.apache.spark.deploy.ExternalShuffleService

17/07/20 06:24:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
17/07/20 06:24:10 INFO spark.SecurityManager: Changing view acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: Changing modify acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: SecurityManager: authentication 
enabled; ui acls disabled; users with view permissions: Set(xpan); users with 
modify permissions: Set(xpan)
17/07/20 06:24:11 INFO deploy.ExternalShuffleService: Starting shuffle service 
on port 7448 with useSasl = true
# cat 
spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out.1
... ...
17/07/20 02:57:31 INFO deploy.ExternalShuffleService: Starting shuffle service 
on port 7448 with useSasl = true
17/07/20 02:58:04 INFO shuffle.ExternalShuffleBlockResolver: Registered 
executor AppExecId{appId=app-20170720025800-, execId=0} with 
ExecutorShuffleInfo{localDirs=[/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0],
 subDirsPerLocalDir=64, 
shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/07/20 02:58:11 INFO security.sasl: DIGEST41:Unmatched MACs
17/07/20 02:58:11 WARN server.TransportChannelHandler: Exception in connection 
from /172.29.10.77:50616
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: 
DIGEST-MD5: Out of order sequencing of messages from server. Got: 125 Expected: 
123
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:785)
Caused by: javax.security.sasl.SaslException: DIGEST-MD5: Out of order 
sequencing of messages from server. Got: 125 Expected: 123
at 

[jira] [Created] (SPARK-21496) Support codegen for TakeOrderedAndProjectExec

2017-07-20 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-21496:


 Summary: Support codegen for TakeOrderedAndProjectExec
 Key: SPARK-21496
 URL: https://issues.apache.org/jira/browse/SPARK-21496
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jiang Xingbo
Priority: Minor


The operator `SortExec` supports codegen, but `TakeOrderedAndProjectExec` doesn't. Perhaps we should add codegen support for `TakeOrderedAndProjectExec` as well, but we should benchmark it carefully first.
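For context, a query shape that plans to `TakeOrderedAndProjectExec` (a sketch; exact plan text can differ between versions):

{code}
// ORDER BY + LIMIT is planned as TakeOrderedAndProject rather than SortExec + Limit,
// and that operator currently does not participate in whole-stage codegen.
spark.range(1000).selectExpr("id", "id % 7 AS k").orderBy("k").limit(10).explain()
{code}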






[jira] [Created] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server

2017-07-20 Thread Xin Yu Pan (JIRA)
Xin Yu Pan created SPARK-21495:
--

 Summary: DIGEST-MD5: Out of order sequencing of messages from 
server
 Key: SPARK-21495
 URL: https://issues.apache.org/jira/browse/SPARK-21495
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.6.1
 Environment: OS: RedHat 7.1 64bit
Spark: 1.6.1

Reporter: Xin Yu Pan


We hit an issue when enabling authentication and SASL encryption; see the parameters in bold in the following list.
spark.local.dir /tmp/xpan-spark-161
spark.eventLog.dir file:///home/xpan/spark-conf/event
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/home/xpan/spark-conf/event
spark.history.ui.port 18085
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7448
spark.shuffle.reduceLocality.enabled false
spark.master.port 7087
spark.master.rest.port 6077
spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
**spark.authenticate true
spark.authenticate.secret 5828d44b-f9b9-4033-b1f5-21d1e3273ec2
spark.authenticate.enableSaslEncryption false
spark.network.sasl.serverAlwaysEncrypt false**


We run the simple SparkPi example and there are exception messages even though the application completes.

# cat 
spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out
Spark Command: /opt/xpan/cws22-0713/jre/8.0.3.21/linux-x86_64/bin/java -cp 
/opt/xpan/spark-1.6.1-bin-hadoop2.6/conf/:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/xpan/hadoop-2.8.0/etc/hadoop/
 -Xms2g -Xmx2g org.apache.spark.deploy.ExternalShuffleService

17/07/20 06:24:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
17/07/20 06:24:10 INFO spark.SecurityManager: Changing view acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: Changing modify acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: SecurityManager: authentication 
enabled; ui acls disabled; users with view permissions: Set(xpan); users with 
modify permissions: Set(xpan)
17/07/20 06:24:11 INFO deploy.ExternalShuffleService: Starting shuffle service 
on port 7448 with useSasl = true
# cat 
spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out.1
... ...
17/07/20 02:57:31 INFO deploy.ExternalShuffleService: Starting shuffle service 
on port 7448 with useSasl = true
17/07/20 02:58:04 INFO shuffle.ExternalShuffleBlockResolver: Registered 
executor AppExecId{appId=app-20170720025800-, execId=0} with 
ExecutorShuffleInfo{localDirs=[/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0],
 subDirsPerLocalDir=64, 
shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/07/20 02:58:11 INFO security.sasl: DIGEST41:Unmatched MACs
17/07/20 02:58:11 WARN server.TransportChannelHandler: Exception in connection 
from /172.29.10.77:50616
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: 
DIGEST-MD5: Out of order sequencing of messages from server. Got: 125 Expected: 
123
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 

[jira] [Comment Edited] (SPARK-21486) Fail when using an aliased column of an aliased table from a subquery

2017-07-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095689#comment-16095689
 ] 

Liang-Chi Hsieh edited comment on SPARK-21486 at 7/21/17 2:34 AM:
--

Since 2.2.0, it is not allowed to use the qualifier inside a subquery. Please 
see SPARK-21335. It is considered a wrong SQL query.

So, this query works:

{code}
scala> sql("SELECT ta FROM (SELECT a as ta FROM test a21)").show()
+---+
| ta|
+---+
|  1|
|  2|
|  3|
+---+
{code}

But this query can't work:

{code}
scala> sql("SELECT a21.ta FROM (SELECT a as ta FROM test a21)").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input columns: [__auto_generated_subquery_name.ta]; line 1 pos 7;
'Project ['a21.ta]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [a#7 AS ta#31]
  +- SubqueryAlias a21
 +- SubqueryAlias test
+- Project [_1#3 AS a#7, _2#4 AS b#8, _3#5 AS c#9]
   +- LocalRelation [_1#3, _2#4, _3#5]
{code}


was (Author: viirya):

Since 2.2.0, it is not allowed to use the qualifier inside a subquery. Please 
see SPARK-21335.

So, this query works:

{code}
scala> sql("SELECT ta FROM (SELECT a as ta FROM test a21)").show()
+---+
| ta|
+---+
|  1|
|  2|
|  3|
+---+
{code}

But this query can't work:

{code}
scala> sql("SELECT a21.ta FROM (SELECT a as ta FROM test a21)").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input columns: [__auto_generated_subquery_name.ta]; line 1 pos 7;
'Project ['a21.ta]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [a#7 AS ta#31]
  +- SubqueryAlias a21
 +- SubqueryAlias test
+- Project [_1#3 AS a#7, _2#4 AS b#8, _3#5 AS c#9]
   +- LocalRelation [_1#3, _2#4, _3#5]
{code}

> Fail when using an aliased column of an aliased table from a subquery
> -
>
> Key: SPARK-21486
> URL: https://issues.apache.org/jira/browse/SPARK-21486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Miguel Angel Fernandez Diaz
>
> SparkSQL cannot resolve a qualified column when using the column alias and 
> the table alias from a subquery. 
> Example:
> {code:sql}
> SELECT a21.ta FROM (SELECT tip_amount as ta FROM yellow a21)
> {code}
> Error:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input 
> columns: [ta]; line 1 pos 7;
> 'Project ['a21.ta]
> +- Project [tip_amount#505 AS ta#489]
>+- SubqueryAlias a21
>   +- SubqueryAlias yellow
>  +- 
> Relation[VendorID#490,tpep_pickup_datetime#491,tpep_dropoff_datetime#492,passenger_count#493,trip_distance#494,pickup_longitude#495,pickup_latitude#496,RatecodeID#497,store_and_fwd_flag#498,dropoff_longitude#499,dropoff_latitude#500,payment_type#501,fare_amount#502,extra#503,mta_tax#504,tip_amount#505,tolls_amount#506,improvement_surcharge#507,total_amount#508]
>  csv
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> 

[jira] [Commented] (SPARK-21486) Fail when using an aliased column of an aliased table from a subquery

2017-07-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095689#comment-16095689
 ] 

Liang-Chi Hsieh commented on SPARK-21486:
-


Since 2.2.0, it is not allowed to use the qualifier inside a subquery. Please 
see SPARK-21335.

So, this query works:

{code}
scala> sql("SELECT ta FROM (SELECT a as ta FROM test a21)").show()
+---+
| ta|
+---+
|  1|
|  2|
|  3|
+---+
{code}

But this query can't work:

{code}
scala> sql("SELECT a21.ta FROM (SELECT a as ta FROM test a21)").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input columns: [__auto_generated_subquery_name.ta]; line 1 pos 7;
'Project ['a21.ta]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [a#7 AS ta#31]
  +- SubqueryAlias a21
 +- SubqueryAlias test
+- Project [_1#3 AS a#7, _2#4 AS b#8, _3#5 AS c#9]
   +- LocalRelation [_1#3, _2#4, _3#5]
{code}

> Fail when using an aliased column of an aliased table from a subquery
> -
>
> Key: SPARK-21486
> URL: https://issues.apache.org/jira/browse/SPARK-21486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Miguel Angel Fernandez Diaz
>
> SparkSQL cannot resolve a qualified column when using the column alias and 
> the table alias from a subquery. 
> Example:
> {code:sql}
> SELECT a21.ta FROM (SELECT tip_amount as ta FROM yellow a21)
> {code}
> Error:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input 
> columns: [ta]; line 1 pos 7;
> 'Project ['a21.ta]
> +- Project [tip_amount#505 AS ta#489]
>+- SubqueryAlias a21
>   +- SubqueryAlias yellow
>  +- 
> Relation[VendorID#490,tpep_pickup_datetime#491,tpep_dropoff_datetime#492,passenger_count#493,trip_distance#494,pickup_longitude#495,pickup_latitude#496,RatecodeID#497,store_and_fwd_flag#498,dropoff_longitude#499,dropoff_latitude#500,payment_type#501,fare_amount#502,extra#503,mta_tax#504,tip_amount#505,tolls_amount#506,improvement_surcharge#507,total_amount#508]
>  csv
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> 

[jira] [Commented] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe

2017-07-20 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095678#comment-16095678
 ] 

Ryan Williams commented on SPARK-21425:
---

[~zsxwing] yea, it's static accumulators, and seems to only be a problem in 
"local" mode. I just updated [my comment 
above|https://issues.apache.org/jira/browse/SPARK-21425?focusedCommentId=16090017=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16090017]
 because I'd previously thought I'd been able to repro this in cluster mode, 
but now looking back at what I tried, that's not the case afaict. 

Small nit: I'm not sure what you mean by "local-cluster mode… will always be 0" 
though… I definitely get nonzero results on the driver with master "local\[*\]".

Anyway, thanks for investigating.

> LongAccumulator, DoubleAccumulator not threadsafe
> -
>
> Key: SPARK-21425
> URL: https://issues.apache.org/jira/browse/SPARK-21425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [AccumulatorV2 
> docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43]
>  acknowledge that accumulators must be concurrent-read-safe, but afaict they 
> must also be concurrent-write-safe.
> The same docs imply that {{Int}} and {{Long}} meet either/both of these 
> criteria, when afaict they do not.
> Relatedly, the provided 
> [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291]
>  and 
> [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370]
>  are not thread-safe, and should be expected to behave undefinedly when 
> multiple concurrent tasks on the same executor write to them.
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] 
> with some simple applications that demonstrate incorrect results from 
> {{LongAccumulator}}'s.






[jira] [Comment Edited] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe

2017-07-20 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090017#comment-16090017
 ] 

Ryan Williams edited comment on SPARK-21425 at 7/21/17 2:12 AM:


Yea, static accumulators; -I was able to reproduce it in cluster mode where 
multiple tasks on the same executor are writing to that executor's single 
instance of the same static accumulator-. ** I think I was mistaken when 
I wrote this. I tried getting a setup in yarn-mode that would write to static 
accumulators in a few ways, but wasn't able to run successfully with tasks 
writing to them, so I'm basically unable to repro the problem in cluster-mode. 
**

I'm now thinking of this mostly as a surprise (to me!) about interactions 
between closure-serialization and accumulators… I assumed (incorrectly) that 
references to accumulators in task closures were treated specially in some way, 
but I see now that they're not.

How unexpected/undesirable this behavior is, is debatable; I inadvertently fell 
into a situation where I was making static accumulators while working around a 
scalatest bug, as mentioned above, but in retrospect it feels pretty contrived 
to have ended up in that situation.

Docs update, precautionary threadsafe-ing of \{Long,Double\}Accumulator, and 
taking no action all seem reasonable to me.


was (Author: rdub):
Yea, static accumulators; I was able to reproduce it in cluster mode where 
multiple tasks on the same executor are writing to that executor's single 
instance of the same static accumulator.

I'm now thinking of this mostly as a surprise (to me!) about interactions 
between closure-serialization and accumulators… I assumed (incorrectly) that 
references to accumulators in task closures were treated specially in some way, 
but I see now that they're not.

How unexpected/undesirable this behavior is, is debatable; I inadvertently fell 
into a situation where I was making static accumulators while working around a 
scalatest bug, as mentioned above, but in retrospect it feels pretty contrived 
to have ended up in that situation.

Docs update, precautionary threadsafe-ing of \{Long,Double\}Accumulator, and 
taking no action all seem reasonable to me.

> LongAccumulator, DoubleAccumulator not threadsafe
> -
>
> Key: SPARK-21425
> URL: https://issues.apache.org/jira/browse/SPARK-21425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [AccumulatorV2 
> docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43]
>  acknowledge that accumulators must be concurrent-read-safe, but afaict they 
> must also be concurrent-write-safe.
> The same docs imply that {{Int}} and {{Long}} meet either/both of these 
> criteria, when afaict they do not.
> Relatedly, the provided 
> [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291]
>  and 
> [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370]
>  are not thread-safe, and should be expected to behave undefinedly when 
> multiple concurrent tasks on the same executor write to them.
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] 
> with some simple applications that demonstrate incorrect results from 
> {{LongAccumulator}}'s.






[jira] [Updated] (SPARK-20960) make ColumnVector public

2017-07-20 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-20960:
-
Description: 
ColumnVector is an internal interface in Spark SQL, which is only used for 
vectorized parquet reader to represent the in-memory columnar format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.

  was:
_emphasized text_ColumnVector is an internal interface in Spark SQL, which is 
only used for vectorized parquet reader to represent the in-memory columnar 
format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.


> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> ColumnVector is an internal interface in Spark SQL, which is only used for 
> vectorized parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way for data exchanges between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in data 
> source framework, we can use ColumnVector to build a more efficient UDF API, 
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow(basically just a wrapper over Arrow), so that external systems(like 
> Python Pandas DataFrame) can build ColumnVector very easily.






[jira] [Updated] (SPARK-20960) make ColumnVector public

2017-07-20 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-20960:
-
Description: 
_emphasized text_ColumnVector is an internal interface in Spark SQL, which is 
only used for vectorized parquet reader to represent the in-memory columnar 
format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.

  was:
ColumnVector is an internal interface in Spark SQL, which is only used for 
vectorized parquet reader to represent the in-memory columnar format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.


> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> _emphasized text_ColumnVector is an internal interface in Spark SQL, which is 
> only used for vectorized parquet reader to represent the in-memory columnar 
> format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way for data exchanges between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in data 
> source framework, we can use ColumnVector to build a more efficient UDF API, 
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow(basically just a wrapper over Arrow), so that external systems(like 
> Python Pandas DataFrame) can build ColumnVector very easily.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21485) API Documentation for Spark SQL functions

2017-07-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095600#comment-16095600
 ] 

Reynold Xin commented on SPARK-21485:
-

Pretty cool. Would be great to just generate the function list as part of the 
Spark documentation.


> API Documentation for Spark SQL functions
> -
>
> Key: SPARK-21485
> URL: https://issues.apache.org/jira/browse/SPARK-21485
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> It looks we can generate the documentation from {{ExpressionDescription}} and 
> {{ExpressionInfo}} for Spark's SQL function documentation.
> I had some time to play with this so I just made a rough version - 
> https://spark-test.github.io/sparksqldoc/
> Codes I used are as below :
> In {{pyspark}} shell:
> {code}
> from collections import namedtuple
> ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended")
> jinfos = spark.sparkContext._jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctions()
> infos = []
> for jinfo in jinfos:
>     name = jinfo.getName()
>     usage = jinfo.getUsage()
>     usage = usage.replace("_FUNC_", name) if usage is not None else usage
>     extended = jinfo.getExtended()
>     extended = extended.replace("_FUNC_", name) if extended is not None else extended
>     infos.append(ExpressionInfo(
>         className=jinfo.getClassName(),
>         usage=usage,
>         name=name,
>         extended=extended))
> with open("index.md", 'w') as mdfile:
>     strip = lambda s: "\n".join(map(lambda u: u.strip(), s.split("\n")))
>     for info in sorted(infos, key=lambda i: i.name):
>         mdfile.write("### %s\n\n" % info.name)
>         if info.usage is not None:
>             mdfile.write("%s\n\n" % strip(info.usage))
>         if info.extended is not None:
>             mdfile.write("```%s```\n\n" % strip(info.extended))
> {code}
> This change had to be made first before running the code above:
> {code:none}
> +++ b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
> @@ -17,9 +17,15 @@
>  package org.apache.spark.sql.api.python
>  
> +import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
> +import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
>  import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
>  import org.apache.spark.sql.types.DataType
>  
>  private[sql] object PythonSQLUtils {
>  
>    def parseDataType(typeText: String): DataType = CatalystSqlParser.parseDataType(typeText)
> +
> +  def listBuiltinFunctions(): Array[ExpressionInfo] = {
> +    FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
> +  }
>  }
> {code}
> And then, I ran this:
> {code}
> mkdir docs
> echo "site_name: Spark SQL 2.3.0" >> mkdocs.yml
> echo "theme: readthedocs" >> mkdocs.yml
> mv index.md docs/index.md
> mkdocs serve
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Iurii Antykhovych (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095587#comment-16095587
 ] 

Iurii Antykhovych edited comment on SPARK-21491 at 7/21/17 12:30 AM:
-

I searched for all such places in the whole GraphX module.
These are the only three occurrences of such tuples-to-map conversion I could boost 
without major refactoring.
Let GraphX be the pilot module for such an optimization.

The changes in `LabelPropagation` and `ShortestPaths` are on a hot execution 
path.
The fix in the `PageRank` class is executed once per run, so it is not 'hot' enough. 
Shall I revert it?




was (Author: sereneant):
I searched for all such places in the whole GraphX module.
These are only three occurrences of such tuples-to-map conversion I could boost 
without major refactoring.
Let GraphX be the pilot module of such an optimization)

The only change on a hot path of execution is the code is the one on 
`graphx.lib.ShortestPaths` class.
The rest is executed once per run, not 'hot' enough. 
Shall I revert it (LabelPropagation.scala, PageRank.scala)?



> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle

2017-07-20 Thread Udit Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-21494:
--
Attachment: logs.zip

> Spark 2.2.0 AES encryption not working with External shuffle
> 
>
> Key: SPARK-21494
> URL: https://issues.apache.org/jira/browse/SPARK-21494
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.2.0
> Environment: AWS EMR
>Reporter: Udit Mehrotra
> Attachments: logs.zip
>
>
> Spark’s new AES based authentication mechanism does not seem to work when 
> configured with external shuffle service on YARN. 
> Here is the stack trace for the error we see in the driver logs:
> ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: 
> Unable to create executor due to Unable to register with external shuffle 
> server due to: java.lang.IllegalArgumentException: Authentication failed.
> at 
> org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>  
> Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
> spark.network.crypto.enabled true
> spark.network.crypto.saslFallback false
> spark.authenticate   true
>  
> Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both 
> Spark and YARN side is not giving much information either about why 
> authentication fails. The driver and node manager logs have been attached to 
> the JIRA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle

2017-07-20 Thread Udit Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-21494:
--
Description: 
Spark’s new AES based authentication mechanism does not seem to work when 
configured with external shuffle service on YARN. 

Here is the stack trace for the error we see in the driver logs:
ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable 
to create executor due to Unable to register with external shuffle server due 
to: java.lang.IllegalArgumentException: Authentication failed.
at 
org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 
Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false
spark.authenticate   true
 
Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark 
and YARN side is not giving much information either about why authentication 
fails. The driver and node manager logs have been attached to the JIRA.

  was:
Spark’s new AES based authentication mechanism does not seem to work when 
configured with external shuffle service on YARN. Here is the stack trace for 
the error we see in the driver logs:
ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable 
to create executor due to Unable to register with external shuffle server due 
to: java.lang.IllegalArgumentException: Authentication failed.
at 
org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 
Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false
spark.authenticate   true
 
Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark 
and YARN side is not giving much information either about why authentication 
fails. The driver and node manager logs have been attached to the JIRA.

[jira] [Created] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle

2017-07-20 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-21494:
-

 Summary: Spark 2.2.0 AES encryption not working with External 
shuffle
 Key: SPARK-21494
 URL: https://issues.apache.org/jira/browse/SPARK-21494
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle
Affects Versions: 2.2.0
 Environment: AWS EMR
Reporter: Udit Mehrotra


Spark’s new AES based authentication mechanism does not seem to work when 
configured with external shuffle service on YARN. Here is the stack trace for 
the error we see in the driver logs:
ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable 
to create executor due to Unable to register with external shuffle server due 
to: java.lang.IllegalArgumentException: Authentication failed.
at 
org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at 
org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 
Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’:
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false
spark.authenticate   true
 
Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark 
and YARN side is not giving much information either about why authentication 
fails. The driver and node manager logs have been attached to the JIRA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Iurii Antykhovych (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095587#comment-16095587
 ] 

Iurii Antykhovych commented on SPARK-21491:
---

I searched for all such places in the whole GraphX module.
These are the only three occurrences of such tuples-to-map conversion I could boost 
without major refactoring.
Let GraphX be the pilot module for such an optimization.

The only change on a hot execution path is the one in the 
`graphx.lib.ShortestPaths` class.
The rest is executed once per run, so it is not 'hot' enough. 
Shall I revert it (LabelPropagation.scala, PageRank.scala)?



> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full

2017-07-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095546#comment-16095546
 ] 

Thomas Graves commented on SPARK-21460:
---

I didn't think that was the case, but I took a look at the code and I guess I was 
wrong; it definitely appears to be reliant on the listener bus. That is really 
bad in my opinion: we are intentionally dropping events and we know that will 
cause issues.

> Spark dynamic allocation breaks when ListenerBus event queue runs full
> --
>
> Key: SPARK-21460
> URL: https://issues.apache.org/jira/browse/SPARK-21460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Environment: Spark 2.1 
> Hadoop 2.6
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: dynamic_allocation, performance, scheduler, yarn
>
> When ListenerBus event queue runs full, spark dynamic allocation stops 
> working - Spark fails to shrink number of executors when there are no active 
> jobs (Spark driver "thinks" there are active jobs since it didn't capture 
> when they finished) .
> ps. What's worse it also makes Spark flood YARN RM with reservation requests, 
> so YARN preemption doesn't function properly too (we're on Spark 2.1 / Hadoop 
> 2.6). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21493) Add more metrics to External Shuffle Service

2017-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095466#comment-16095466
 ] 

Sean Owen commented on SPARK-21493:
---

How is this different from SPARK-21334 that you opened?

> Add more metrics to External Shuffle Service
> 
>
> Key: SPARK-21493
> URL: https://issues.apache.org/jira/browse/SPARK-21493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Raajay Viswanathan
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current set of metrics in the external shuffle service are fairly 
> limited. To debug failure of the shuffle service, it would be good to get 
> more information regarding the state of the shuffle service. As a first cut, 
> the following metrics seem important:
> 1. The amount of heap memory used by the External Shuffle Service process
> 2. The amount of direct buffer (off-heap) memory allocated to External 
> Shuffle Service. In the external shuffle service, Netty uses off-heap memory. 
> Monitoring its usage can help in allocating appropriate resources and can 
> also be used to raise alarms when the allocated memory exceeds a threshold.
> 3. The queue length in Netty event loops. Chunk Fetch Requests (or) Open 
> Block requests can be dropped as a result of Netty queue overflows (resulting 
> in FetchFailure). Having hard data on queue size can help in attributing 
> cause of failures.
> Please let me know of other metrics (from Shuffle Service perspective) that 
> would be good to have. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21493) Add more metrics to External Shuffle Service

2017-07-20 Thread Raajay Viswanathan (JIRA)
Raajay Viswanathan created SPARK-21493:
--

 Summary: Add more metrics to External Shuffle Service
 Key: SPARK-21493
 URL: https://issues.apache.org/jira/browse/SPARK-21493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Raajay Viswanathan
Priority: Minor


The current set of metrics in the external shuffle service is fairly limited. 
To debug failures of the shuffle service, it would be good to get more 
information regarding its state. As a first cut, the following metrics seem 
important:

1. The amount of heap memory used by the External Shuffle Service process
2. The amount of direct buffer (off-heap) memory allocated to External Shuffle 
Service. In the external shuffle service, Netty uses off-heap memory. 
Monitoring its usage can help in allocating appropriate resources and can also 
be used to raise alarms when the allocated memory exceeds a threshold.
3. The queue length in Netty event loops. Chunk Fetch Requests (or) Open Block 
requests can be dropped as a result of Netty queue overflows (resulting in 
FetchFailure). Having hard data on queue size can help in attributing cause of 
failures.

Please let me know of other metrics (from Shuffle Service perspective) that 
would be good to have. 
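
For the first two items, the JVM already exposes the raw numbers through standard MXBeans. Below is a minimal sketch of reading them; how they would be wired into the shuffle service's own metrics registry is left out, and note that Netty's pooled/unsafe allocations may not be fully reflected in the "direct" buffer pool.

{code}
import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

// Heap usage of the process hosting the shuffle service (item 1).
val heapUsed: Long = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed

// NIO direct (off-heap) buffer memory, a proxy for Netty's direct allocations (item 2).
val directUsed: Long = ManagementFactory
  .getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
  .filter(_.getName == "direct")
  .map(_.getMemoryUsed)
  .sum

println(s"heapUsed=$heapUsed directUsed=$directUsed")
{code}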



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full

2017-07-20 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095447#comment-16095447
 ] 

Saisai Shao commented on SPARK-21460:
-

I think this is basically a ListenerBus issue, not a dynamic allocation issue. 
ExecutorAllocationManager registers a listener and relies on it to decide whether 
to increase or decrease executors. Because the ListenerBus drops events when its 
queue is full, that listener no longer receives them and fails to decrease the 
executors, as mentioned in your description.
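
Not a fix for the root cause, but as a stopgap one can make drops less likely by enlarging the listener bus queue. A sketch follows; the queue-size key is the Spark 2.x name as I recall it, so please verify it against the version in use.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")                  // required for dynamic allocation on YARN
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")  // larger queue => fewer dropped events
{code}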

> Spark dynamic allocation breaks when ListenerBus event queue runs full
> --
>
> Key: SPARK-21460
> URL: https://issues.apache.org/jira/browse/SPARK-21460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Environment: Spark 2.1 
> Hadoop 2.6
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: dynamic_allocation, performance, scheduler, yarn
>
> When ListenerBus event queue runs full, spark dynamic allocation stops 
> working - Spark fails to shrink number of executors when there are no active 
> jobs (Spark driver "thinks" there are active jobs since it didn't capture 
> when they finished) .
> ps. What's worse it also makes Spark flood YARN RM with reservation requests, 
> so YARN preemption doesn't function properly too (we're on Spark 2.1 / Hadoop 
> 2.6). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21490) SparkLauncher may fail to redirect streams

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21490:


Assignee: Apache Spark

> SparkLauncher may fail to redirect streams
> --
>
> Key: SPARK-21490
> URL: https://issues.apache.org/jira/browse/SPARK-21490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> While investigating a different issue, I noticed that the redirection 
> handling in SparkLauncher is faulty.
> In the default case or when users use only log redirection, things should 
> work fine.
> But if users try to redirect just the stdout of a child process without 
> redirecting stderr, the launcher won't detect that and the child process may 
> hang because stderr is not being read and its buffer fills up.
> The code should detect these cases and redirect any streams that haven't been 
> explicitly redirected by the user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21490) SparkLauncher may fail to redirect streams

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095445#comment-16095445
 ] 

Apache Spark commented on SPARK-21490:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18696

> SparkLauncher may fail to redirect streams
> --
>
> Key: SPARK-21490
> URL: https://issues.apache.org/jira/browse/SPARK-21490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While investigating a different issue, I noticed that the redirection 
> handling in SparkLauncher is faulty.
> In the default case or when users use only log redirection, things should 
> work fine.
> But if users try to redirect just the stdout of a child process without 
> redirecting stderr, the launcher won't detect that and the child process may 
> hang because stderr is not being read and its buffer fills up.
> The code should detect these cases and redirect any streams that haven't been 
> explicitly redirected by the user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21490) SparkLauncher may fail to redirect streams

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21490:


Assignee: (was: Apache Spark)

> SparkLauncher may fail to redirect streams
> --
>
> Key: SPARK-21490
> URL: https://issues.apache.org/jira/browse/SPARK-21490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While investigating a different issue, I noticed that the redirection 
> handling in SparkLauncher is faulty.
> In the default case or when users use only log redirection, things should 
> work fine.
> But if users try to redirect just the stdout of a child process without 
> redirecting stderr, the launcher won't detect that and the child process may 
> hang because stderr is not being read and its buffer fills up.
> The code should detect these cases and redirect any streams that haven't been 
> explicitly redirected by the user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095432#comment-16095432
 ] 

Apache Spark commented on SPARK-12717:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/18695

> pyspark broadcast fails when using multiple threads
> ---
>
> Key: SPARK-12717
> URL: https://issues.apache.org/jira/browse/SPARK-12717
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Linux, python 2.6 or python 2.7.
>Reporter: Edward Walker
>Priority: Critical
> Attachments: run.log
>
>
> The following multi-threaded program that uses broadcast variables 
> consistently throws exceptions like:  *Exception("Broadcast variable '18' not 
> loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
> 
> from optparse import OptionParser
> 
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
> 
> def do_process(x, w):
>     return x * w.value
> 
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
> 
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
>     threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism)]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output:
> {code:title=abridge_run.txt|borderStyle=solid}
> % spark-submit --master local[10] bug_spark.py --parallelism 

[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin

2017-07-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095425#comment-16095425
 ] 

Zhan Zhang commented on SPARK-21492:


Root cause: in SortMergeJoin (inner/leftOuter/rightOuter), one side's sorted 
iterator may not be exhausted, so that chunk of memory cannot be released, 
causing a memory leak and performance degradation.
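
A minimal sketch of the pattern (not Spark's actual SortMergeJoinExec code): an iterator that frees its backing memory only when fully consumed will hold that memory until the task ends if the join abandons it early, so a fix needs some explicit early-release hook. The class and names below are invented for this sketch.

{code}
// Toy illustration only.
class ResourceBackedIterator[T](underlying: Iterator[T], release: () => Unit) extends Iterator[T] {
  private var released = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) close()          // normal path: release once the iterator is exhausted
    more
  }

  override def next(): T = underlying.next()

  // Early-release hook: without something like this, a partially consumed side of the
  // join keeps its sorted buffer allocated until task completion.
  def close(): Unit = if (!released) { released = true; release() }
}
{code}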

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhan Zhang
>
> In SortMergeJoin, if the iterator is not exhausted, there will be memory leak 
> caused by the Sort. The memory is not released until the task end, and cannot 
> be used by other operators causing performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095418#comment-16095418
 ] 

Apache Spark commented on SPARK-21492:
--

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18694

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhan Zhang
>
> In SortMergeJoin, if the iterator is not exhausted, there will be memory leak 
> caused by the Sort. The memory is not released until the task end, and cannot 
> be used by other operators causing performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21492) Memory leak in SortMergeJoin

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21492:


Assignee: (was: Apache Spark)

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhan Zhang
>
> In SortMergeJoin, if the iterator is not exhausted, there will be memory leak 
> caused by the Sort. The memory is not released until the task end, and cannot 
> be used by other operators causing performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21492) Memory leak in SortMergeJoin

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21492:


Assignee: Apache Spark

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhan Zhang
>Assignee: Apache Spark
>
> In SortMergeJoin, if the iterator is not exhausted, there will be memory leak 
> caused by the Sort. The memory is not released until the task end, and cannot 
> be used by other operators causing performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21492) Memory leak in SortMergeJoin

2017-07-20 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-21492:
--

 Summary: Memory leak in SortMergeJoin
 Key: SPARK-21492
 URL: https://issues.apache.org/jira/browse/SPARK-21492
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Zhan Zhang


In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak 
caused by the Sort. The memory is not released until the task ends, and cannot 
be used by other operators, causing a performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095413#comment-16095413
 ] 

Sean Owen commented on SPARK-21491:
---

I see it now, yeah, and it's used in one place in Spark:

{code}
  private[sql] def fieldsMap(fields: Array[StructField]): Map[String, StructField] = {
    import scala.collection.breakOut
    fields.map(s => (s.name, s))(breakOut)
  }
{code}

OK so it's overriding the builder that map uses to construct its target. I get 
it.

I think it could be worth making this change, where it makes a non-trivial 
difference to performance. Is this a hotspot? I'd like to get several of them 
at once if we're going to change any of these instances.

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Iurii Antykhovych (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095398#comment-16095398
 ] 

Iurii Antykhovych edited comment on SPARK-21491 at 7/20/17 9:22 PM:


This is relevant to all Scala versions starting from 2.8; it's in 
{{scala.collection.breakOut}}.
The problem with {{collection.map(...).toMap}} is that it creates an intermediate 
collection of tuples, which is then converted to a map;
that leads to performance degradation and excess object allocation.
The price of {{collection.breakOut}} is code readability, which I guess suffers 
significantly compared to the {{.toMap}} method.



was (Author: sereneant):
This is relevant to all scala versions starting from 2.8, it's in 
`scala.collection.breakOut`.
The problem with `collection.map(...).toMap` is in creation of intermediate 
collection of tuples, that is converted to map then;
that leads to performance degradation and excess object allocation.
The price of `collection.breakOut` is the code readability, it significantly 
suffers I guess, compared to '.toMap' method.


> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Iurii Antykhovych (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095398#comment-16095398
 ] 

Iurii Antykhovych commented on SPARK-21491:
---

This is relevant to all Scala versions starting from 2.8; it's in 
`scala.collection.breakOut`.
The problem with `collection.map(...).toMap` is that it creates an intermediate 
collection of tuples, which is then converted to a map;
that leads to performance degradation and excess object allocation.
The price of `collection.breakOut` is code readability, which I guess suffers 
significantly compared to the `.toMap` method.
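
A small example of the difference, with the caveat that `breakOut` only exists up to Scala 2.12 (it was dropped in 2.13). `Field` is a stand-in type made up for this snippet.

{code}
import scala.collection.breakOut

case class Field(name: String, idx: Int)             // stand-in for StructField
val fields = Array(Field("a", 0), Field("b", 1))

// Materializes an intermediate Array[(String, Field)] before building the Map.
val viaToMap: Map[String, Field] = fields.map(f => (f.name, f)).toMap

// breakOut hands map() a Map builder directly, so no intermediate tuple collection is built.
val viaBreakOut: Map[String, Field] = fields.map(f => (f.name, f))(breakOut)
{code}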


> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21489) Update release docs to point out Python 2.6 support is removed.

2017-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095386#comment-16095386
 ] 

Sean Owen commented on SPARK-21489:
---

I think that was just changed: 
https://github.com/apache/spark/commit/5b61cc6d629d537a7e5bcbd5205d1c8a43b14d43

> Update release docs to point out Python 2.6 support is removed.
> ---
>
> Key: SPARK-21489
> URL: https://issues.apache.org/jira/browse/SPARK-21489
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: holdenk
>Priority: Trivial
>
> In the current RDD programming guide docs for Spark 2.2.0 we say "Python 2.6 
> support may be removed in Spark 2.2.0", but we forgot to update that to say it 
> was removed, which could lead to some confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21491:


Assignee: (was: Apache Spark)

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21491:


Assignee: Apache Spark

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Assignee: Apache Spark
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095378#comment-16095378
 ] 

Apache Spark commented on SPARK-21491:
--

User 'SereneAnt' has created a pull request for this issue:
https://github.com/apache/spark/pull/18693

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095379#comment-16095379
 ] 

Sean Owen commented on SPARK-21491:
---

I can't even find this in Scala. The article talks about 2.8. Is it obsolete? 
What's the performance problem, and what is different that improves it?

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Iurii Antykhovych (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iurii Antykhovych updated SPARK-21491:
--
Priority: Trivial  (was: Minor)

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakout}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-20 Thread Iurii Antykhovych (JIRA)
Iurii Antykhovych created SPARK-21491:
-

 Summary: Performance enhancement: eliminate creation of 
intermediate collections
 Key: SPARK-21491
 URL: https://issues.apache.org/jira/browse/SPARK-21491
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 2.2.0
Reporter: Iurii Antykhovych
Priority: Minor


Simple performance optimization in a few places of GraphX:
{{Traversable.toMap}} can be replaced with {{collection.breakout}}.
This would eliminate creation of an intermediate collection of tuples, see
[Stack Overflow 
article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21490) SparkLauncher may fail to redirect streams

2017-07-20 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21490:
--

 Summary: SparkLauncher may fail to redirect streams
 Key: SPARK-21490
 URL: https://issues.apache.org/jira/browse/SPARK-21490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin
Priority: Minor


While investigating a different issue, I noticed that the redirection handling 
in SparkLauncher is faulty.

In the default case or when users use only log redirection, things should work 
fine.

But if users try to redirect just the stdout of a child process without 
redirecting stderr, the launcher won't detect that and the child process may 
hang because stderr is not being read and its buffer fills up.

The code should detect these cases and redirect any streams that haven't been 
explicitly redirected by the user.
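
Until that is fixed, a usage sketch of the safe pattern follows: redirect both streams explicitly so the child can never block on an unread stderr pipe. The application jar, main class, and output paths below are made up for the example.

{code}
import java.io.File
import org.apache.spark.launcher.SparkLauncher

// Sketch only: redirect stdout AND stderr, not just one of them.
val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")          // hypothetical application jar
  .setMainClass("com.example.Main")            // hypothetical main class
  .setMaster("local[*]")
  .redirectOutput(new File("/tmp/app.out"))
  .redirectError(new File("/tmp/app.err"))     // don't leave stderr unredirected
  .startApplication()
{code}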



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21489) Update release docs to point out Python 2.6 support is removed.

2017-07-20 Thread holdenk (JIRA)
holdenk created SPARK-21489:
---

 Summary: Update release docs to point out Python 2.6 support is 
removed.
 Key: SPARK-21489
 URL: https://issues.apache.org/jira/browse/SPARK-21489
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 2.2.0, 2.2.1
Reporter: holdenk
Priority: Trivial


In the current RDD programming guide docs for Spark 2.2.0 we say "Python 2.6 
support may be removed in Spark 2.2.0", but we forgot to update that to say it 
was removed, which could lead to some confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe

2017-07-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095323#comment-16095323
 ] 

Shixiong Zhu edited comment on SPARK-21425 at 7/20/17 8:45 PM:
---

[~srowen] The issue is static accumulators. Right? They won't be 
serialized/deserialized with tasks and cannot be reported back to driver. If 
running in local-cluster mode or a real cluster, they will always be 0 in the 
driver side.

The overhead comes from the memory barrier introduced by synchronization. I tried 
to improve this but gave up due to the significant performance regression: 
https://github.com/apache/spark/pull/15065

`avg` is a good point. But it's not a big deal considering it just shows on UI 
and will be fixed eventually when the job finishes.


was (Author: zsxwing):
[~srowen] The issue is static accumulators. Right? They won't be 
serialized/deserialized with tasks and cannot be reported back to driver. If 
running in local-cluster mode or a real cluster, they will always be 0 in the 
driver side.

The overhead comes from memory barrier introduced by synchronization. I tried 
to improved this but gave up due to the significant performance regression: 
https://github.com/apache/spark/pull/15065

`avg` is a good point. But it's not a big deal considering it just shows on UI 
and will be fixed eventually when the job finishes.

> LongAccumulator, DoubleAccumulator not threadsafe
> -
>
> Key: SPARK-21425
> URL: https://issues.apache.org/jira/browse/SPARK-21425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [AccumulatorV2 
> docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43]
>  acknowledge that accumulators must be concurrent-read-safe, but afaict they 
> must also be concurrent-write-safe.
> The same docs imply that {{Int}} and {{Long}} meet either/both of these 
> criteria, when afaict they do not.
> Relatedly, the provided 
> [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291]
>  and 
> [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370]
>  are not thread-safe, and should be expected to behave undefinedly when 
> multiple concurrent tasks on the same executor write to them.
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] 
> with some simple applications that demonstrate incorrect results from 
> {{LongAccumulator}}'s.
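
A minimal sketch (not the linked repro repo) of the concurrent-write pattern at issue: several threads sharing one LongAccumulator instance. If add() is not synchronized, the observed value can fall short of the expected total; the app name and counts below are arbitrary.

{code}
import org.apache.spark.sql.SparkSession

object AccumStress {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[8]").appName("accum-stress").getOrCreate()
    val acc = spark.sparkContext.longAccumulator("hits")

    // Eight threads write to the same accumulator instance concurrently.
    val threads = (1 to 8).map { _ =>
      new Thread(new Runnable {
        override def run(): Unit = (1 to 1000000).foreach(_ => acc.add(1L))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())

    // With unsynchronized adds, this can print less than 8000000.
    println(s"expected ${8 * 1000000}, got ${acc.value}")
    spark.stop()
  }
}
{code}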



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe

2017-07-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095323#comment-16095323
 ] 

Shixiong Zhu edited comment on SPARK-21425 at 7/20/17 8:38 PM:
---

[~srowen] The issue is static accumulators. Right? They won't be 
serialized/deserialized with tasks and cannot be reported back to driver. If 
running in local-cluster mode or a real cluster, they will always be 0 in the 
driver side.

The overhead comes from the memory barrier introduced by synchronization. I tried 
to improve this but gave up due to the significant performance regression: 
https://github.com/apache/spark/pull/15065

`avg` is a good point. But it's not a big deal considering it just shows on UI 
and will be fixed eventually when the job finishes.


was (Author: zsxwing):
[~srowen] The issue is static accumulators, right? They won't be 
serialized/deserialized with tasks and cannot be reported back to the driver. If 
running in local-cluster mode or a real cluster, they will always be 0 on the 
driver side.

The overhead comes from the memory barrier introduced by synchronization. I tried 
to improve this but gave up due to a significant performance regression: 
https://github.com/apache/spark/pull/15065

> LongAccumulator, DoubleAccumulator not threadsafe
> -
>
> Key: SPARK-21425
> URL: https://issues.apache.org/jira/browse/SPARK-21425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [AccumulatorV2 
> docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43]
>  acknowledge that accumulators must be concurrent-read-safe, but afaict they 
> must also be concurrent-write-safe.
> The same docs imply that {{Int}} and {{Long}} meet either/both of these 
> criteria, when afaict they do not.
> Relatedly, the provided 
> [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291]
>  and 
> [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370]
> are not thread-safe, and should be expected to behave in undefined ways when 
> multiple concurrent tasks on the same executor write to them.
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] 
> with some simple applications that demonstrate incorrect results from 
> {{LongAccumulator}}s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe

2017-07-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095323#comment-16095323
 ] 

Shixiong Zhu commented on SPARK-21425:
--

[~srowen] The issue is static accumulators, right? They won't be 
serialized/deserialized with tasks and cannot be reported back to the driver. If 
running in local-cluster mode or a real cluster, they will always be 0 on the 
driver side.

The overhead comes from the memory barrier introduced by synchronization. I tried 
to improve this but gave up due to a significant performance regression: 
https://github.com/apache/spark/pull/15065
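
As background (not from this thread), a minimal sketch of the driver-registered 
accumulator pattern, which does get serialized with tasks and merged back on the 
driver, in contrast to an accumulator held in a static/object field. The local 
master and names are assumptions for the sketch.
{code:java}
import org.apache.spark.sql.SparkSession

object AccumulatorReportingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("acc-demo").getOrCreate()
    val sc = spark.sparkContext

    // Registered on the driver through the SparkContext: per-task copies are
    // serialized with each task and merged back into this instance on completion.
    val events = sc.longAccumulator("events")

    sc.parallelize(1 to 1000, numSlices = 8).foreach(_ => events.add(1))

    println(s"events = ${events.value}")  // 1000, visible on the driver

    spark.stop()
  }
}
{code}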

> LongAccumulator, DoubleAccumulator not threadsafe
> -
>
> Key: SPARK-21425
> URL: https://issues.apache.org/jira/browse/SPARK-21425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> [AccumulatorV2 
> docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43]
>  acknowledge that accumulators must be concurrent-read-safe, but afaict they 
> must also be concurrent-write-safe.
> The same docs imply that {{Int}} and {{Long}} meet either/both of these 
> criteria, when afaict they do not.
> Relatedly, the provided 
> [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291]
>  and 
> [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370]
> are not thread-safe, and should be expected to behave in undefined ways when 
> multiple concurrent tasks on the same executor write to them.
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] 
> with some simple applications that demonstrate incorrect results from 
> {{LongAccumulator}}s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21417) Detect transitive join conditions via expressions

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21417:


Assignee: (was: Apache Spark)

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.example
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, executable, query plan for the second query (ideally 
> equivalent to that of the 

[jira] [Assigned] (SPARK-21417) Detect transitive join conditions via expressions

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21417:


Assignee: Apache Spark

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>Assignee: Apache Spark
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.example
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A correctly working, executable, query plan for the second query (ideally 
> 

[jira] [Commented] (SPARK-21417) Detect transitive join conditions via expressions

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095317#comment-16095317
 ] 

Apache Spark commented on SPARK-21417:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/18692
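
As a side note (not part of the report quoted below), two workarounds that follow 
directly from the report and the error message itself, sketched under default 
settings and assuming the TRIPLES temp view from the reproduction is registered:
{code:java}
// Workaround 1: state the transitive equality explicitly so the planner sees an
// equi-join key (this is the report's "First Query", which works).
spark.sql(
  "SELECT A.s FROM TRIPLES A, TRIPLES B " +
  "WHERE A.s = B.s AND A.s = 'dbr:Leipzig'").show(10)

// Workaround 2: allow the cartesian product that the optimizer currently plans
// for the constant-only predicates, as the error message suggests.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.sql(
  "SELECT A.s FROM TRIPLES A, TRIPLES B " +
  "WHERE A.s = 'dbr:Leipzig' AND B.s = 'dbr:Leipzig'").show(10)
{code}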

> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb, column names ca, cb, and an arbitrary 
> (deterministic) expression expr, Spark should be capable of inferring join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is that Apache 
> Spark does not find an executable plan to process the query.
> Self-contained example:
> {code:java}
> package my.example
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these 
> relations.;
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
>   at 
> org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   ...
> Run completed in 6 seconds, 833 milliseconds.
> Total number of tests run: 1
> Suites: completed 1, aborted 0
> Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> {noformat}
> Expected:
> A 

[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2017-07-20 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095299#comment-16095299
 ] 

holdenk commented on SPARK-7146:


So it seems like there is (more recent) agreement that exposing this as a 
DeveloperApi could be useful. In my own talks, when I mention how to write 
custom ML pipeline stages, I wish this were available as a developer API.
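
To make the motivation concrete, a hedged sketch (not from this thread) of what a 
custom pipeline stage looks like today when the shared param traits are not 
accessible; UpperCaseTransformer and its params are made-up names.
{code:java}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructType}

// Because traits like HasInputCol/HasOutputCol are private[ml], every custom
// stage has to re-declare equivalent params by hand.
class UpperCaseTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upperCase"))

  val inputCol  = new Param[String](this, "inputCol", "input column name")
  val outputCol = new Param[String](this, "outputCol", "output column name")
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), upper(col($(inputCol)).cast(StringType)))

  override def transformSchema(schema: StructType): StructType =
    schema.add($(outputCol), StringType)

  override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
}
{code}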

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21478) Unpersist a DF also unpersists related DFs

2017-07-20 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095275#comment-16095275
 ] 

Roberto Mirizzi edited comment on SPARK-21478 at 7/20/17 8:06 PM:
--

Sorry about that. I totally misunderstood you. My bad :-) 


was (Author: roberto.mirizzi):
Sorry about that. I totally misunderstood you. :-) 

> Unpersist a DF also unpersists related DFs
> --
>
> Key: SPARK-21478
> URL: https://issues.apache.org/jira/browse/SPARK-21478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Roberto Mirizzi
>
> Starting with Spark 2.1.1 I observed this bug. Here are the steps to 
> reproduce it:
> # create a DF
> # persist it
> # count the items in it
> # create a new DF as a transformation of the previous one
> # persist it
> # count the items in it
> # unpersist the first DF
> Once you do that, you will see that the 2nd DF is gone as well.
> The code to reproduce it is:
> {code:java}
> val x1 = Seq(1).toDF()
> x1.persist()
> x1.count()
> assert(x1.storageLevel.useMemory)
> val x11 = x1.select($"value" * 2)
> x11.persist()
> x11.count()
> assert(x11.storageLevel.useMemory)
> x1.unpersist()
> assert(!x1.storageLevel.useMemory)
> //the following assertion FAILS
> assert(x11.storageLevel.useMemory)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21478) Unpersist a DF also unpersists related DFs

2017-07-20 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095275#comment-16095275
 ] 

Roberto Mirizzi commented on SPARK-21478:
-

Sorry about that. I totally misunderstood you. :-) 

> Unpersist a DF also unpersists related DFs
> --
>
> Key: SPARK-21478
> URL: https://issues.apache.org/jira/browse/SPARK-21478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Roberto Mirizzi
>
> Starting with Spark 2.1.1 I observed this bug. Here are the steps to 
> reproduce it:
> # create a DF
> # persist it
> # count the items in it
> # create a new DF as a transformation of the previous one
> # persist it
> # count the items in it
> # unpersist the first DF
> Once you do that, you will see that the 2nd DF is gone as well.
> The code to reproduce it is:
> {code:java}
> val x1 = Seq(1).toDF()
> x1.persist()
> x1.count()
> assert(x1.storageLevel.useMemory)
> val x11 = x1.select($"value" * 2)
> x11.persist()
> x11.count()
> assert(x11.storageLevel.useMemory)
> x1.unpersist()
> assert(!x1.storageLevel.useMemory)
> //the following assertion FAILS
> assert(x11.storageLevel.useMemory)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21478) Unpersist a DF also unpersists related DFs

2017-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095208#comment-16095208
 ] 

Sean Owen commented on SPARK-21478:
---

I said I can reproduce it; I agree with you.

> Unpersist a DF also unpersists related DFs
> --
>
> Key: SPARK-21478
> URL: https://issues.apache.org/jira/browse/SPARK-21478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Roberto Mirizzi
>
> Starting with Spark 2.1.1 I observed this bug. Here are the steps to 
> reproduce it:
> # create a DF
> # persist it
> # count the items in it
> # create a new DF as a transformation of the previous one
> # persist it
> # count the items in it
> # unpersist the first DF
> Once you do that, you will see that the 2nd DF is gone as well.
> The code to reproduce it is:
> {code:java}
> val x1 = Seq(1).toDF()
> x1.persist()
> x1.count()
> assert(x1.storageLevel.useMemory)
> val x11 = x1.select($"value" * 2)
> x11.persist()
> x11.count()
> assert(x11.storageLevel.useMemory)
> x1.unpersist()
> assert(!x1.storageLevel.useMemory)
> //the following assertion FAILS
> assert(x11.storageLevel.useMemory)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.

2017-07-20 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095203#comment-16095203
 ] 

Kaushal Prajapati commented on SPARK-19908:
---

[~bojanbabic] This error occurs when the serialized data exceeds the Kryo 
buffer's default limit (64m). Increase the buffer size using this property, e.g.:
{noformat}
spark.kryoserializer.buffer.max=256m
{noformat}
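
For completeness, a hedged sketch of applying this when building the session; the 
property names are real Spark settings, the app name and value are illustrative:
{code:java}
import org.apache.spark.sql.SparkSession

// Example only: enable Kryo and raise the serialization buffer ceiling above the
// 64m default (256m is an illustrative value, not a recommendation).
val spark = SparkSession.builder()
  .appName("kryo-buffer-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer.max", "256m")
  .getOrCreate()
{code}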


> Direct buffer memory OOM should not cause stage retries.
> 
>
> Key: SPARK-19908
> URL: https://issues.apache.org/jira/browse/SPARK-19908
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently, if there is a java.lang.OutOfMemoryError: Direct buffer memory, the 
> exception will be changed to a FetchFailedException, causing stage retries.
> org.apache.spark.shuffle.FetchFailedException: Direct buffer memory
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> 

[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Raajay Viswanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095204#comment-16095204
 ] 

Raajay Viswanathan commented on SPARK-21334:


I think SPARK-18364 aims to implement metrics in YarnShuffleService. The 
current issue builds on existing work (SPARK-16405) for ExternalShuffleService 
and implements a reporting service that was previously missing.

> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system does not report values ever.
> 2. -The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from the disk and the succeeding transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21243) Limit the number of maps in a single shuffle fetch

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095199#comment-16095199
 ] 

Apache Spark commented on SPARK-21243:
--

User 'dhruve' has created a pull request for this issue:
https://github.com/apache/spark/pull/18691

> Limit the number of maps in a single shuffle fetch
> --
>
> Key: SPARK-21243
> URL: https://issues.apache.org/jira/browse/SPARK-21243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Dhruve Ashar
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.3.0
>
>
> Right now Spark can limit the number of parallel fetches and also limits the 
> amount of data in one fetch, but one fetch to a host could be for hundreds of 
> blocks. In one instance we saw 450+ blocks. When you have hundreds of those and 
> thousands of reducers fetching, that becomes a lot of metadata and can run the 
> Node Manager out of memory. We should add a config to limit the number of maps 
> per fetch to reduce the load on the NM.
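
For reference, a sketch of the existing shuffle-fetch knobs the description alludes 
to (the values here are illustrative, not defaults or recommendations); the 
per-fetch map limit proposed in this issue is what the linked pull request adds on 
top of these:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-fetch-limits")
  .config("spark.reducer.maxSizeInFlight", "48m")  // cap on total size of in-flight fetch requests
  .config("spark.reducer.maxReqsInFlight", "64")   // cap on number of concurrent fetch requests
  .getOrCreate()
{code}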



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21478) Unpersist a DF also unpersists related DFs

2017-07-20 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095187#comment-16095187
 ] 

Roberto Mirizzi commented on SPARK-21478:
-

That's weird that you are not able to reproduce it. 
Did you just launch the spark-shell and copy/paste the above commands? 
I tried on Spark 2.1.0, 2.1.1 and 2.2.0, both on AWS and on my local machine. 
Spark 2.1.0 doesn't exhibit any issue; Spark 2.1.1 and 2.2.0 fail the last 
assertion.

About your point "I do think one wants to be able to persist the result and not 
the original though", it depends on the specific use case. My example was a 
trivial one to reproduce the problem, but as you can imagine you may want to 
persist the first DF, do a bunch of operations reusing it, and then generate a 
new DF, persist it, and unpersist the old one when you don't need it anymore.
It looks like a serious problem to me. 

This is my entire output:

{code:java}
$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/07/20 11:44:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/07/20 11:44:44 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Spark context Web UI available at http://10.15.16.46:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1500576281010).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_71)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val x1 = Seq(1).toDF()
x1: org.apache.spark.sql.DataFrame = [value: int]

scala> x1.persist()
res0: x1.type = [value: int]

scala> x1.count()
res1: Long = 1

scala> assert(x1.storageLevel.useMemory)

scala> 

scala> val x11 = x1.select($"value" * 2)
x11: org.apache.spark.sql.DataFrame = [(value * 2): int]

scala> x11.persist()
res3: x11.type = [(value * 2): int]

scala> x11.count()
res4: Long = 1

scala> assert(x11.storageLevel.useMemory)

scala> 

scala> x1.unpersist()
res6: x1.type = [value: int]

scala> 

scala> assert(!x1.storageLevel.useMemory)

scala> //the following assertion FAILS

scala> assert(x11.storageLevel.useMemory)
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  ... 48 elided

scala> 
{code}
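
Not from the ticket: a sketch of one way to decouple a derived DataFrame's cached 
data from its parent before unpersisting the parent, assuming Spark 2.1+ where 
Dataset.checkpoint() is available. The checkpoint directory is a made-up path, and 
whether this sidesteps the cascading unpersist described here has not been verified 
against this exact bug.
{code:java}
import spark.implicits._  // already in scope in spark-shell

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical path

val x1 = Seq(1).toDF()
x1.persist()
x1.count()

// checkpoint() is eager by default: it materializes x11 and returns a DataFrame
// whose logical plan no longer references x1's plan.
val x11 = x1.select($"value" * 2).checkpoint()
x11.persist()
x11.count()

x1.unpersist()
// With unrelated plans, unpersisting x1 should no longer drop x11's cached data.
assert(x11.storageLevel.useMemory)
{code}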


> Unpersist a DF also unpersists related DFs
> --
>
> Key: SPARK-21478
> URL: https://issues.apache.org/jira/browse/SPARK-21478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Roberto Mirizzi
>
> Starting with Spark 2.1.1 I observed this bug. Here are the steps to 
> reproduce it:
> # create a DF
> # persist it
> # count the items in it
> # create a new DF as a transformation of the previous one
> # persist it
> # count the items in it
> # unpersist the first DF
> Once you do that, you will see that the 2nd DF is gone as well.
> The code to reproduce it is:
> {code:java}
> val x1 = Seq(1).toDF()
> x1.persist()
> x1.count()
> assert(x1.storageLevel.useMemory)
> val x11 = x1.select($"value" * 2)
> x11.persist()
> x11.count()
> assert(x11.storageLevel.useMemory)
> x1.unpersist()
> assert(!x1.storageLevel.useMemory)
> //the following assertion FAILS
> assert(x11.storageLevel.useMemory)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21488) Make saveAsTable() and createOrReplaceTempView() return dataframe of created table/ created view

2017-07-20 Thread Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21488:
--
Description: 
It would be great to make saveAsTable() return a dataframe of the created table, 
so you could pipe the result further, for example:

{code}
mv_table_df = (sqlc.sql('''
SELECT ...
FROM 
''')
.write.format("parquet").mode("overwrite")
.saveAsTable('test.parquet_table')
.createOrReplaceTempView('mv_table')
)
{code}

... The above code currently fails, as expected, with:
{noformat}
AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView'
{noformat}

If this is implemented, we can skip a step like
{code}
sqlc.sql('SELECT * FROM test.parquet_table').createOrReplaceTempView('mv_table')
{code}

We have this pattern very frequently. 

A further improvement can be made if createOrReplaceTempView() also returns the 
dataframe object, so that in one pipeline of calls we can 
- create an external table 
- create a dataframe reference to the newly created table, both for Spark SQL 
and as a Spark variable.

  was:
It would be great to make saveAsTable() return a dataframe of the created table, 
so you could pipe the result further, for example:

{code}
(sqlc.sql('''
SELECT ...
FROM 
''')
.write.format("parquet").mode("overwrite")
.saveAsTable('test.parquet_table')
.createOrReplaceTempView('mv_table')
)
{code}

... The above code currently fails, as expected, with:
{noformat}
AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView'
{noformat}

If this is implemented, we can skip a step like
{code}
sqlc.sql('SELECT * FROM test.parquet_table').createOrReplaceTempView('mv_table')
{code}

We have this pattern very frequently. 



> Make saveAsTable() and createOrReplaceTempView() return dataframe of created 
> table/ created view
> 
>
> Key: SPARK-21488
> URL: https://issues.apache.org/jira/browse/SPARK-21488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>
> It would be great to make saveAsTable() return a dataframe of the created table, 
> so you could pipe the result further, for example:
> {code}
> mv_table_df = (sqlc.sql('''
> SELECT ...
> FROM 
> ''')
> .write.format("parquet").mode("overwrite")
> .saveAsTable('test.parquet_table')
> .createOrReplaceTempView('mv_table')
> )
> {code}
> ... The above code currently fails, as expected, with:
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView'
> {noformat}
> If this is implemented, we can skip a step like
> {code}
> sqlc.sql('SELECT * FROM 
> test.parquet_table').createOrReplaceTempView('mv_table')
> {code}
> We have this pattern very frequently. 
> A further improvement can be made if createOrReplaceTempView() also returns the 
> dataframe object, so that in one pipeline of calls we can 
> - create an external table 
> - create a dataframe reference to the newly created table, both for Spark SQL 
> and as a Spark variable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21488) Make saveAsTable() and createOrReplaceTempView() return dataframe of created table/ created view

2017-07-20 Thread Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21488:
--
Summary: Make saveAsTable() and createOrReplaceTempView() return dataframe 
of created table/ created view  (was: Make saveAsTable() return dataframe of 
created table)

> Make saveAsTable() and createOrReplaceTempView() return dataframe of created 
> table/ created view
> 
>
> Key: SPARK-21488
> URL: https://issues.apache.org/jira/browse/SPARK-21488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>
> It would be great to make saveAsTable() return a dataframe of the created table, 
> so you could pipe the result further, for example:
> {code}
> (sqlc.sql('''
> SELECT ...
> FROM 
> ''')
> .write.format("parquet").mode("overwrite")
> .saveAsTable('test.parquet_table')
> .createOrReplaceTempView('mv_table')
> )
> {code}
> ... The above code currently fails, as expected, with:
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView'
> {noformat}
> If this is implemented, we can skip a step like
> {code}
> sqlc.sql('SELECT * FROM 
> test.parquet_table').createOrReplaceTempView('mv_table')
> {code}
> We have this pattern very frequently. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21488) Make saveAsTable() return dataframe of created table

2017-07-20 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-21488:
-

 Summary: Make saveAsTable() return dataframe of created table
 Key: SPARK-21488
 URL: https://issues.apache.org/jira/browse/SPARK-21488
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.2.0
Reporter: Ruslan Dautkhanov


It would be great to make saveAsTable() return a dataframe of the created table, 
so you could pipe the result further, for example:

{code}
(sqlc.sql('''
SELECT ...
FROM 
''')
.write.format("parquet").mode("overwrite")
.saveAsTable('test.parquet_table')
.createOrReplaceTempView('mv_table')
)
{code}

... The above code currently fails, as expected, with:
{noformat}
AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView'
{noformat}

If this is implemented, we can skip a step like
{code}
sqlc.sql('SELECT * FROM test.parquet_table').createOrReplaceTempView('mv_table')
{code}

We have this pattern very frequently. 




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21463) Output of StructuredStreaming tables don't respect user specified schema when reading back the table

2017-07-20 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21463.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Output of StructuredStreaming tables don't respect user specified schema when 
> reading back the table
> 
>
> Key: SPARK-21463
> URL: https://issues.apache.org/jira/browse/SPARK-21463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.3.0
>
>
> When using the MetadataLogFileIndex to read back a table, we don't respect 
> the user-provided schema as the proper column types. This can lead to issues 
> such as strings that look like dates being truncated to DateType, or longs 
> being narrowed to IntegerType just because no value outside the integer range 
> happens to exist in the data.
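
For context, a sketch of the read-back pattern the description refers to, assuming 
a SparkSession named spark (as in spark-shell); the schema fields and the path are 
made up for illustration:
{code:java}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val userSchema = StructType(Seq(
  StructField("id", LongType),          // should stay LongType even if all values fit in an Int
  StructField("eventDate", StringType)  // should stay StringType even if values look like dates
))

// Reading the streaming sink's output directory back as a batch DataFrame.
val df = spark.read
  .schema(userSchema)
  .parquet("/path/to/stream/output")    // hypothetical path
{code}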



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21478) Unpersist a DF also unpersists related DFs

2017-07-20 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21478:
-
Component/s: (was: Spark Core)
 SQL

> Unpersist a DF also unpersists related DFs
> --
>
> Key: SPARK-21478
> URL: https://issues.apache.org/jira/browse/SPARK-21478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Roberto Mirizzi
>
> Starting with Spark 2.1.1 I observed this bug. Here are the steps to 
> reproduce it:
> # create a DF
> # persist it
> # count the items in it
> # create a new DF as a transformation of the previous one
> # persist it
> # count the items in it
> # unpersist the first DF
> Once you do that, you will see that the 2nd DF is gone as well.
> The code to reproduce it is:
> {code:java}
> val x1 = Seq(1).toDF()
> x1.persist()
> x1.count()
> assert(x1.storageLevel.useMemory)
> val x11 = x1.select($"value" * 2)
> x11.persist()
> x11.count()
> assert(x11.storageLevel.useMemory)
> x1.unpersist()
> assert(!x1.storageLevel.useMemory)
> //the following assertion FAILS
> assert(x11.storageLevel.useMemory)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full

2017-07-20 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095131#comment-16095131
 ] 

Ruslan Dautkhanov commented on SPARK-21460:
---

[~Dhruve Ashar], I can email the logs to you, although they are not very 
revealing; basically the problem starts at 
{noformat}
ERROR [2017-05-15 10:37:53,350] ({dag-scheduler-event-loop} 
Logging.scala[logError]:70) - Dropping SparkListenerEvent because no remaining 
room in event queue. 
This likely means one of the SparkListeners is too slow and cannot keep up with 
the rate at which tasks are being started by the scheduler. 
{noformat}
and then nothing interesting. 

We were hitting it constantly until the following changes were made:
- disabled concurrentSQL
- increased spark.scheduler.listenerbus.eventqueue.size to 55000
- set spark.dynamicAllocation.maxExecutors to 210 (it was previously unset/unlimited)

Since then we have seen it only rarely - a few times (the changes were made back 
in May). It was happening mostly with several users who were using concurrentSQL 
actively (submitting multiple jobs before the previous ones completed). 
concurrentSQL isn't the root cause, though - it just makes the ListenerBus event 
queue fill up more quickly. Again, we have seen the same issue a few times even 
after the workaround changes above were made, including disabling concurrentSQL.
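
For reference, a sketch of how the workaround settings mentioned above might be 
applied when building the session; the property names are the ones from this 
comment, the values are the ones used here and not general recommendations:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("listener-bus-workaround")
  .config("spark.scheduler.listenerbus.eventqueue.size", "55000")
  .config("spark.dynamicAllocation.maxExecutors", "210")
  .getOrCreate()
{code}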

> Spark dynamic allocation breaks when ListenerBus event queue runs full
> --
>
> Key: SPARK-21460
> URL: https://issues.apache.org/jira/browse/SPARK-21460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Environment: Spark 2.1 
> Hadoop 2.6
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: dynamic_allocation, performance, scheduler, yarn
>
> When the ListenerBus event queue runs full, Spark dynamic allocation stops 
> working - Spark fails to shrink the number of executors when there are no active 
> jobs (the Spark driver "thinks" there are active jobs since it didn't capture 
> when they finished).
> P.S. What's worse, it also makes Spark flood the YARN RM with reservation 
> requests, so YARN preemption doesn't function properly either (we're on Spark 
> 2.1 / Hadoop 2.6). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095095#comment-16095095
 ] 

Robert Kruszewski commented on SPARK-21334:
---

I think this is a dupe of https://issues.apache.org/jira/browse/SPARK-18364

> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system does not report values ever.
> 2. -The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from the disk and the succeeding transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21334:


Assignee: Apache Spark

> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>Assignee: Apache Spark
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system does not report values ever.
> 2. -The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from the disk and the succeeding transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095063#comment-16095063
 ] 

Apache Spark commented on SPARK-21334:
--

User 'raajay' has created a pull request for this issue:
https://github.com/apache/spark/pull/18690

> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system does not report values ever.
> 2. -The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from the disk and the succeeding transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21334:


Assignee: (was: Apache Spark)

> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system does not report values ever.
> 2. -The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from the disk and the succeeding transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Raajay Viswanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095056#comment-16095056
 ] 

Raajay Viswanathan commented on SPARK-21334:


[~jerryshao] I am using the external shuffle service in a standalone manner.

Also, I was mistaken in claiming that the "blockTransferRate" is incorrect. It 
is handled properly in the latest version of Spark; it was fixed as part 
of [SPARK-20994]. 

> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system does not report values ever.
> 2. -The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from the disk and the succeeding transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Raajay Viswanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raajay Viswanathan updated SPARK-21334:
---
Description: 
SPARK-16405 introduced metrics for the external shuffle service. However, as it 
currently stands, there are two issues.

1. The shuffle service metrics system does not report values ever.
--2. The current method for determining "blockTransferRate" is incorrect. The 
entire block is assumed to be transferred once the OpenBlocks message is 
processed. The actual data fetch from the disk and the succeeding transfer over 
the wire happen much later, when MessageEncoder invokes encode on the 
ChunkFetchSuccess message. --


  was:
SPARK-16405 introduced metrics for the external shuffle service. However, as it 
currently stands, there are two issues.

1. The shuffle service metrics system does not report values ever.
-2. The current method for determining "blockTransferRate" is incorrect. The 
entire block is assumed to be transferred once the OpenBlocks message is 
processed. The actual data fetch from the disk and the succeeding transfer over 
the wire happen much later, when MessageEncoder invokes encode on the 
ChunkFetchSuccess message. -



> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system never reports any values.
> --2. The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from disk and the subsequent transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message. --



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Raajay Viswanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raajay Viswanathan updated SPARK-21334:
---
Description: 
SPARK-16405 introduced metrics for the external shuffle service. However, as it 
currently stands, there are two issues.

1. The shuffle service metrics system never reports any values.
2. -The current method for determining "blockTransferRate" is incorrect. The 
entire block is assumed to be transferred once the OpenBlocks message is 
processed. The actual data fetch from disk and the subsequent transfer over 
the wire happen much later, when MessageEncoder invokes encode on the 
ChunkFetchSuccess message.-


  was:
SPARK-16405 introduced metrics for the external shuffle service. However, as it 
currently stands, there are two issues.

1. The shuffle service metrics system never reports any values.
--2. The current method for determining "blockTransferRate" is incorrect. The 
entire block is assumed to be transferred once the OpenBlocks message is 
processed. The actual data fetch from disk and the subsequent transfer over 
the wire happen much later, when MessageEncoder invokes encode on the 
ChunkFetchSuccess message. --



> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system never reports any values.
> 2. -The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from disk and the subsequent transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21334) Fix metrics for external shuffle service

2017-07-20 Thread Raajay Viswanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raajay Viswanathan updated SPARK-21334:
---
Description: 
SPARK-16405 introduced metrics for the external shuffle service. However, as it 
currently stands, there are two issues.

1. The shuffle service metrics system never reports any values.
-2. The current method for determining "blockTransferRate" is incorrect. The 
entire block is assumed to be transferred once the OpenBlocks message is 
processed. The actual data fetch from disk and the subsequent transfer over 
the wire happen much later, when MessageEncoder invokes encode on the 
ChunkFetchSuccess message. -


  was:
SPARK-16405 introduced metrics for the external shuffle service. However, as it 
currently stands, there are two issues.

1. The shuffle service metrics system never reports any values.
2. The current method for determining "blockTransferRate" is incorrect. The 
entire block is assumed to be transferred once the OpenBlocks message is 
processed. The actual data fetch from disk and the subsequent transfer over 
the wire happen much later, when MessageEncoder invokes encode on the 
ChunkFetchSuccess message. 



> Fix metrics for external shuffle service
> 
>
> Key: SPARK-21334
> URL: https://issues.apache.org/jira/browse/SPARK-21334
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Raajay Viswanathan
>  Labels: external-shuffle-service
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SPARK-16405 introduced metrics for the external shuffle service. However, as it 
> currently stands, there are two issues.
> 1. The shuffle service metrics system never reports any values.
> -2. The current method for determining "blockTransferRate" is incorrect. The 
> entire block is assumed to be transferred once the OpenBlocks message is 
> processed. The actual data fetch from disk and the subsequent transfer 
> over the wire happen much later, when MessageEncoder invokes encode on the 
> ChunkFetchSuccess message. -



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21142.
---
Resolution: Fixed

> spark-streaming-kafka-0-10 has too fat dependency on kafka
> --
>
> Key: SPARK-21142
> URL: https://issues.apache.org/jira/browse/SPARK-21142
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Tim Van Wassenhove
>Assignee: Tim Van Wassenhove
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
> should only need a dependency on kafka-clients.
> The only reason there is a dependency on kafka (the full server) is the 
> KafkaTestUtils class, which runs in-memory tests against a kafka broker. This 
> class should be moved to src/test.
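
Until the test utility is moved, one user-side workaround is to exclude the 
transitive server artifact and depend on kafka-clients directly. The sbt sketch 
below is illustrative only; the artifact names and versions are assumptions to 
adapt to your build, and compilation only stays clean if nothing in the 
application references KafkaTestUtils or other server-side classes.

{code:scala}
// build.sbt (sketch): keep the streaming connector but drop the full kafka server jar.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.1")
    .exclude("org.apache.kafka", "kafka_2.11"),
  "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
)
{code}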



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21142:
-

 Assignee: Tim Van Wassenhove
Fix Version/s: 2.3.0
   Issue Type: Improvement  (was: Bug)

Resolved by https://github.com/apache/spark/pull/18353

> spark-streaming-kafka-0-10 has too fat dependency on kafka
> --
>
> Key: SPARK-21142
> URL: https://issues.apache.org/jira/browse/SPARK-21142
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Tim Van Wassenhove
>Assignee: Tim Van Wassenhove
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
> should only need a dependency on kafka-clients.
> The only reason there is a dependency on kafka (the full server) is the 
> KafkaTestUtils class, which runs in-memory tests against a kafka broker. This 
> class should be moved to src/test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19531) History server doesn't refresh jobs for long-life apps like thriftserver

2017-07-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19531.

   Resolution: Fixed
 Assignee: Oleg Danilov
Fix Version/s: 2.3.0

> History server doesn't refresh jobs for long-life apps like thriftserver
> 
>
> Key: SPARK-19531
> URL: https://issues.apache.org/jira/browse/SPARK-19531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Oleg Danilov
>Assignee: Oleg Danilov
> Fix For: 2.3.0
>
>
> If spark.history.fs.logDirectory points to HDFS, then the Spark history server 
> doesn't refresh the jobs page. This is caused by Hadoop: while writing to the 
> .inprogress file, Hadoop doesn't update the file length until close, and therefore 
> Spark's history server is not able to detect any changes.
> I'm going to submit a PR to fix this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20394) Replication factor value Not changing properly

2017-07-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20394.

Resolution: Workaround

Great. I doubt we'll fix this in 1.6 at this point, and I believe this is fixed 
in 2.x, so let's close this.

> Replication factor value Not changing properly
> --
>
> Key: SPARK-20394
> URL: https://issues.apache.org/jira/browse/SPARK-20394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Kannan Subramanian
>
> To save a SparkSQL dataframe to a persistent hive table, I use the steps below:
> a) RegisterTempTable on the dataframe as tempTable
> b) create table <tablename> (cols) PartitionedBy(col1, col2) stored as 
> parquet
> c) Insert into <tablename> partition(col1, col2) select * from tempTable
> I have set dfs.replication equal to "1" on the hiveContext object, but it did 
> not work properly: the replica count is 1 for 80 % of the generated parquet 
> files on HDFS, while the default replication of 3 applies to the remaining 20 % 
> of parquet files. I am not sure why the setting is not reflected in all the 
> generated parquet files. Please let me know if you have any suggestions or solutions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21417) Detect transitive join conditions via expressions

2017-07-20 Thread Claus Stadler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094844#comment-16094844
 ] 

Claus Stadler edited comment on SPARK-21417 at 7/20/17 3:41 PM:


Hi Anton,

I have a rough idea of how the issue could be addressed, but I feel it would be 
better if the SPARK team provided feedback on the idea and possibly revised it. 
I am also not very familiar with the code, so it would be good if, once the 
conceptual approach is sorted out, you could provide at least an initial 
implementation. Should there be any small issues left, I may be able to 
contribute fixes if needed.

Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar'

* Normalize the selection expression into disjunctive normal form => { {(a.x = 
'foo'), ('foo' = b.y)}, {(c.z = 'bar')} }
* Normalize the argument order of equality expressions => { {(a.x = 'foo'), (b.y = 
'foo')}, {(c.z = 'bar')} }
* For each clause in the DNF, infer transitive join conditions using the 
following approach; Given {(a.x = 'foo'), ('foo' = b.y)}
  * Create a multimap from all equality expressions of the form ee.ff = expr, 
which in our example gives { 'foo' => {a.x, b.y} }
  * For every multimap key having multiple values, infer the appropriate equality 
expressions, in this case a.x = b.y.
  * Note: If there are more than 2 values, in general one can create equality 
expressions for each pair-wise combination of distinct elements in the value 
set. Alternatively, one could pick a random element and equate all other 
values to it. I am not sure which representation is most beneficial for the 
underlying SPARK query optimizer.




was (Author: aklakan):
Hi Anton,

I have a rough idea of how the issue could be addressed, but I feel it would be 
better if the SPARK team provided feedback on the idea and possibly revised it. 
I am also not very familiar with the code, so it would be good if, once the 
conceptual approach is sorted out, you could provide at least an initial 
implementation. Should there be any small issues left, I may be able to 
contribute fixes if needed.

Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar'

* Normalize the selection expression into disjunctive normal form => { {(a.x = 
'foo'), ('foo' = b.y)}, {(c.z = 'bar')} }
* Normalize the argument order of equality expressions => { {(a.x = 'foo'), (b.y = 
'foo')}, {(c.z = 'bar')} }
* For each clause in the DNF, infer transitive join conditions using the 
following approach: Given {(a.x = 'foo'), ('foo' = b.y)}:
  * Create a multimap from all equality expressions of the form ee.ff = expr, 
which in our example gives { 'foo' => {a.x, b.y} }
  * For every multimap key having multiple values, infer the appropriate equality 
expressions, in this case a.x = b.y.
  * Note: If there are more than 2 values, in general one can create equality 
expressions for each pair-wise combination of distinct elements in the value 
set. Alternatively, one could pick a random element and equate all other 
values to it. I am not sure which representation is most beneficial for the 
underlying SPARK query optimizer.



> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb column names ca, cb, and an arbitrary 
> (deterministic) expression expr then calcite should be capable to infer join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is, that Apache 
> SPARK does not find an executable plan to process the query.
> Self contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL 

[jira] [Comment Edited] (SPARK-21417) Detect transitive join conditions via expressions

2017-07-20 Thread Claus Stadler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094844#comment-16094844
 ] 

Claus Stadler edited comment on SPARK-21417 at 7/20/17 3:40 PM:


Hi Anton,

I have a rough idea of how the issue could be addressed, but I feel it would be 
better if the SPARK team provided feedback on the idea and possibly revised it. 
I am also not very familiar with the code, so it would be good if, once the 
conceptual approach is sorted out, you could provide at least an initial 
implementation. Should there be any small issues left, I may be able to 
contribute fixes if needed.

Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar'

* Normalize the selection expression into disjunctive normal form => { {(a.x = 
'foo'), ('foo' = b.y)}, {(c.z = 'bar')} }
* Normalize the argument order of equality expressions => { {(a.x = 'foo'), (b.y = 
'foo')}, {(c.z = 'bar')} }
* For each clause in the DNF, infer transitive join conditions using the 
following approach: Given {(a.x = 'foo'), ('foo' = b.y)}:
  * Create a multimap from all equality expressions of the form ee.ff = expr, 
which in our example gives { 'foo' => {a.x, b.y} }
  * For every multimap key having multiple values, infer the appropriate equality 
expressions, in this case a.x = b.y.
  * Note: If there are more than 2 values, in general one can create equality 
expressions for each pair-wise combination of distinct elements in the value 
set. Alternatively, one could pick a random element and equate all other 
values to it. I am not sure which representation is most beneficial for the 
underlying SPARK query optimizer.




was (Author: aklakan):
Hi Anton,

I have a rough idea of how the issue could be addressed, but I feel it would be 
better if the SPARK team provided feedback on the idea and possibly revised it. 
I am also not very familiar with the code, so it would be good if, once the 
conceptual approach is sorted out, you could provide at least an initial 
implementation. Should there be any small issues left, I may be able to 
contribute fixes if needed.

Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar'

* Normalize the selection expression into disjunctive normal form => { {(a.x = 
'foo'), ('foo' = b.y)}, {(c.z = 'bar')} }
* Normalize the argument order of equality expressions => { {(a.x = 'foo'), (b.y = 
'foo')}, {(c.z = 'bar')} }
* For each clause in the DNF, infer transitive join conditions using the 
following approach: Given e.g. {(a.x = 'foo'), ('foo' = b.y)}:
  * Create a multimap from all equality expressions of the form ee.ff = expr, 
which in our example gives { 'foo' => {a.x, b.y} }
  * For every multimap key having multiple values, infer the appropriate equality 
expressions, in this case a.x = b.y.
  * Note: If there are more than 2 values, in general one can create equality 
expressions for each pair-wise combination of distinct elements in the value 
set. Alternatively, one could pick a random element and equate all other 
values to it. I am not sure which representation is most beneficial for the 
underlying SPARK query optimizer.



> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb column names ca, cb, and an arbitrary 
> (deterministic) expression expr then calcite should be capable to infer join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is, that Apache 
> SPARK does not find an executable plan to process the query.
> Self contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   

[jira] [Commented] (SPARK-21417) Detect transitive join conditions via expressions

2017-07-20 Thread Claus Stadler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094844#comment-16094844
 ] 

Claus Stadler commented on SPARK-21417:
---

Hi Anton,

I have a rough idea of how the issue could be addressed, but I feel it would be 
better if the SPARK team provided feedback on the idea and possibly revised it. 
I am also not very familiar with the code, so it would be good if, once the 
conceptual approach is sorted out, you could provide at least an initial 
implementation. Should there be any small issues left, I may be able to 
contribute fixes if needed.

Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar'

* Normalize the selection expression into disjunctive normal form => { {(a.x = 
'foo'), ('foo' = b.y)}, {(c.z = 'bar')} }
* Normalize the argument order of equality expressions => { {(a.x = 'foo'), (b.y = 
'foo')}, {(c.z = 'bar')} }
* For each clause in the DNF, infer transitive join conditions using the 
following approach: Given e.g. {(a.x = 'foo'), ('foo' = b.y)}:
  * Create a multimap from all equality expressions of the form ee.ff = expr, 
which in our example gives { 'foo' => {a.x, b.y} }
  * For every multimap key having multiple values, infer the appropriate equality 
expressions, in this case a.x = b.y.
  * Note: If there are more than 2 values, in general one can create equality 
expressions for each pair-wise combination of distinct elements in the value 
set. Alternatively, one could pick a random element and equate all other 
values to it. I am not sure which representation is most beneficial for the 
underlying SPARK query optimizer. (A minimal sketch of the inference step 
follows below.)
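
For what it's worth, below is a minimal, self-contained Scala sketch of the 
inference step from the bullets above. It operates on plain string pairs rather 
than Catalyst Expression objects, and all names are illustrative assumptions, 
not a proposed Spark implementation.

{code:scala}
object TransitiveJoinConditionSketch {

  // One DNF clause, as a set of already order-normalized equalities column -> expression,
  // e.g. Set("a.x" -> "'foo'", "b.y" -> "'foo'").
  type Clause = Set[(String, String)]

  /** Infer column = column conditions implied by columns equated to the same expression. */
  def inferredEqualities(clause: Clause): Set[(String, String)] = {
    // Multimap: expression -> all columns equated to it within this clause.
    val byExpr: Map[String, Set[String]] =
      clause.groupBy(_._2).map { case (expr, pairs) => expr -> pairs.map(_._1) }
    // For every expression with several columns, emit each pair-wise combination.
    byExpr.values
      .filter(_.size > 1)
      .flatMap(cols => cols.toSeq.sorted.combinations(2).map { case Seq(l, r) => (l, r) })
      .toSet
  }

  def main(args: Array[String]): Unit = {
    // Clause from the example above: (a.x = 'foo' and 'foo' = b.y), already normalized.
    val clause: Clause = Set("a.x" -> "'foo'", "b.y" -> "'foo'")
    println(inferredEqualities(clause)) // Set((a.x,b.y))
  }
}
{code}

Emitting every pair-wise combination keeps the inference symmetric; whether the 
optimizer prefers that or the pick-one-representative variant is exactly the open 
question in the last bullet.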



> Detect transitive join conditions via expressions
> -
>
> Key: SPARK-21417
> URL: https://issues.apache.org/jira/browse/SPARK-21417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Claus Stadler
>
> _Disclaimer: The nature of this report is similar to that of 
> https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my 
> understanding) uses its own SQL implementation, the requested improvement has 
> to be treated as a separate issue._
> Given table aliases ta, tb column names ca, cb, and an arbitrary 
> (deterministic) expression expr then calcite should be capable to infer join 
> conditions by transitivity:
> {noformat}
> ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb
> {noformat}
> The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries 
> such as
> {code:java}
> SELECT {
>   dbr:Leipzig a ?type .
>   dbr:Leipzig dbo:mayor ?mayor
> }
> {code}
> result in an SQL query similar to
> {noformat}
> SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig'
> {noformat}
> A consequence of the join condition not being recognized is, that Apache 
> SPARK does not find an executable plan to process the query.
> Self contained example:
> {code:java}
> package my.package;
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.scalatest._
> class TestSparkSqlJoin extends FlatSpec {
>   "SPARK SQL processor" should "be capable of handling transitive join 
> conditions" in {
> val spark = SparkSession
>   .builder()
>   .master("local[2]")
>   .appName("Spark SQL parser bug")
>   .getOrCreate()
> import spark.implicits._
> // The schema is encoded in a string
> val schemaString = "s p o"
> // Generate the schema based on the string of schema
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, StringType, nullable = true))
> val schema = StructType(fields)
> val data = List(("s1", "p1", "o1"))
> val dataRDD = spark.sparkContext.parallelize(data).map(attributes => 
> Row(attributes._1, attributes._2, attributes._3))
> val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES")
> df.createOrReplaceTempView("TRIPLES")
> println("First Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = 
> 'dbr:Leipzig'").show(10)
> println("Second Query")
> spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' 
> AND B.s = 'dbr:Leipzig'").show(10)
>   }
> }
> {code}
> Output (excerpt):
> {noformat}
> First Query
> ...
> +---+
> |  s|
> +---+
> +---+
> Second Query
> - should be capable of handling transitive join conditions *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Detected cartesian product for 
> INNER join between logical plans
> Project [s#3]
> +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig))
>+- LogicalRDD [s#3, p#4, o#5]
> and
> Project
> +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig))
>+- LogicalRDD [s#20, p#21, o#22]
> Join condition is missing or trivial.
> Use 

[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

2017-07-20 Thread David Kats (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094797#comment-16094797
 ] 

David Kats commented on SPARK-15544:


Confirming the same issue with Spark 2.1.0 and 2.2.0, Ubuntu 14.04, ZooKeeper 
3.4.5:

{noformat}
2017-07-20 12:48:25,151 INFO ClientCnxn: Client session timed out, have not 
heard from server in 35022ms for sessionid 0x15d5fb6dc7d0009, closing socket 
connection and attempting reconnect
2017-07-20 12:48:25,254 INFO ConnectionStateManager: State change: SUSPENDED
2017-07-20 12:48:25,268 INFO ZooKeeperLeaderElectionAgent: We have lost 
leadership
2017-07-20 12:48:25,295 ERROR Master: Leadership has been revoked -- master 
shutting down.
{noformat}


> Bouncing Zookeeper node causes Active spark master to exit
> --
>
> Key: SPARK-15544
> URL: https://issues.apache.org/jira/browse/SPARK-15544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>Reporter: Steven Lowenthal
>
> Shutting down a single zookeeper node caused the spark master to exit. The 
> master should have connected to a second zookeeper node. 
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x154dfc0426b0054, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x254c701f28d0053, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
> leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master 
> shutting down. }}
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"

2017-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094794#comment-16094794
 ] 

Sean Owen commented on SPARK-21487:
---

I think this is a YARN issue/question.

> WebUI-Executors Page results in "Request is a replay (34) attack"
> -
>
> Key: SPARK-21487
> URL: https://issues.apache.org/jira/browse/SPARK-21487
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: lishuming
>Priority: Minor
>
> We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors 
> Page` became empty, with the exception below.
> The `Executor Page` is rendered using JavaScript rather than Scala in 
> 2.1.1, but I don't know why that causes this result.
> Perhaps "two queries are submitted at the same time and have the same timestamp may 
> cause this result", but I'm not sure.
> ResourceManager log:
> {code:java}
> 2017-07-20 20:39:09,371 WARN 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
> Authentication exception: GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> {code}
> Safari explorer console
> {code:java}
> Failed to load resource: the server responded with a status of 403 
> (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request 
> is a replay 
> (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
> {code}
> Related Links:
> https://issues.apache.org/jira/browse/HIVE-12481
> https://issues.apache.org/jira/browse/HADOOP-8830



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"

2017-07-20 Thread lishuming (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lishuming updated SPARK-21487:
--
Description: 
We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors Page` 
became empty, with the exception below.

The `Executor Page` is rendered using JavaScript rather than Scala in 2.1.1, 
but I don't know why that causes this result.

Perhaps "two queries are submitted at the same time and have the same timestamp may 
cause this result", but I'm not sure.

ResourceManager log:
{code:java}
2017-07-20 20:39:09,371 WARN 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
Authentication exception: GSSException: Failure unspecified at GSS-API level 
(Mechanism level: Request is a replay (34))
{code}

Safari explorer console
{code:java}
Failed to load resource: the server responded with a status of 403 
(GSSException: Failure unspecified at GSS-API level (Mechanism level: Request 
is a replay 
(34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
{code}

Related Links:
https://issues.apache.org/jira/browse/HIVE-12481
https://issues.apache.org/jira/browse/HADOOP-8830

  was:
We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors Page` 
became empty, with the exception below.

The `Executor Page` is rendered using JavaScript rather than Scala in 2.1.1, 
but I don't know why that causes this result.

Perhaps "two queries are submitted at the same time and have the same timestamp may 
cause this result", but I'm not sure.

ResourceManager log:
{code:java}
2017-07-20 20:39:09,371 WARN 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
Authentication exception: GSSException: Failure unspecified at GSS-API level 
(Mechanism level: Request is a replay (34))
{code}

Safari explorer console
{code:java}
Failed to load resource: the server responded with a status of 403 
(GSSException: Failure unspecified at GSS-API level (Mechanism level: Request 
is a replay 
(34)))http://hadoop280.lt.163.org:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
{code}

Recent Links:
https://issues.apache.org/jira/browse/HIVE-12481
https://issues.apache.org/jira/browse/HADOOP-8830


> WebUI-Executors Page results in "Request is a replay (34) attack"
> -
>
> Key: SPARK-21487
> URL: https://issues.apache.org/jira/browse/SPARK-21487
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: lishuming
>Priority: Minor
>
> We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors 
> Page` became empty, with the exception below.
> The `Executor Page` is rendered using JavaScript rather than Scala in 
> 2.1.1, but I don't know why that causes this result.
> Perhaps "two queries are submitted at the same time and have the same timestamp may 
> cause this result", but I'm not sure.
> ResourceManager log:
> {code:java}
> 2017-07-20 20:39:09,371 WARN 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
> Authentication exception: GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> {code}
> Safari explorer console
> {code:java}
> Failed to load resource: the server responded with a status of 403 
> (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request 
> is a replay 
> (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
> {code}
> Related Links:
> https://issues.apache.org/jira/browse/HIVE-12481
> https://issues.apache.org/jira/browse/HADOOP-8830



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"

2017-07-20 Thread lishuming (JIRA)
lishuming created SPARK-21487:
-

 Summary: WebUI-Executors Page results in "Request is a replay (34) 
attack"
 Key: SPARK-21487
 URL: https://issues.apache.org/jira/browse/SPARK-21487
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.1
Reporter: lishuming
Priority: Minor


We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors Page` 
became empty, with the exception below.

The `Executor Page` is rendered using JavaScript rather than Scala in 2.1.1, 
but I don't know why that causes this result.

Perhaps "two queries are submitted at the same time and have the same timestamp may 
cause this result", but I'm not sure.

ResourceManager log:
{code:java}
2017-07-20 20:39:09,371 WARN 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
Authentication exception: GSSException: Failure unspecified at GSS-API level 
(Mechanism level: Request is a replay (34))
{code}

Safari explorer console
{code:java}
Failed to load resource: the server responded with a status of 403 
(GSSException: Failure unspecified at GSS-API level (Mechanism level: Request 
is a replay 
(34)))http://hadoop280.lt.163.org:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html
{code}

Recent Links:
https://issues.apache.org/jira/browse/HIVE-12481
https://issues.apache.org/jira/browse/HADOOP-8830



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full

2017-07-20 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094689#comment-16094689
 ] 

Dhruve Ashar commented on SPARK-21460:
--

[~Tagar] Can you attach the driver logs to help with the investigation? I am 
not able to reproduce this issue on my end and would like to look into it 
further. Also, how frequently are you hitting this?

> Spark dynamic allocation breaks when ListenerBus event queue runs full
> --
>
> Key: SPARK-21460
> URL: https://issues.apache.org/jira/browse/SPARK-21460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Environment: Spark 2.1 
> Hadoop 2.6
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: dynamic_allocation, performance, scheduler, yarn
>
> When the ListenerBus event queue runs full, spark dynamic allocation stops 
> working - Spark fails to shrink the number of executors when there are no active 
> jobs (the Spark driver "thinks" there are active jobs since it didn't capture 
> when they finished).
> ps. What's worse, it also makes Spark flood the YARN RM with reservation requests, 
> so YARN preemption doesn't function properly either (we're on Spark 2.1 / Hadoop 
> 2.6). 
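
A mitigation commonly tried for this failure mode is to enlarge the listener bus 
event queue so that executor-lifecycle events are not dropped in the first place. 
A sketch follows; the property name below is the one used by Spark 2.x builds at 
the time of writing, and the value is an assumption to tune, not a recommendation.

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: a larger event queue lowers the chance that executor-lifecycle
// events are dropped, which is what leaves dynamic allocation with a stale view.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000") // default is 10000
{code}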



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20394) Replication factor value Not changing properly

2017-07-20 Thread Kannan Subramanian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094686#comment-16094686
 ] 

Kannan Subramanian commented on SPARK-20394:


Yes. I have edited hdfs-site.xml to change the replication value and overridden 
the existing Spark environment setting. It works now.

Thanks for your response
Kannan
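
For readers who would rather not edit hdfs-site.xml, a minimal sketch of the same 
idea at the job level is below (Spark 1.6-era API). Whether every file written by 
the insert honours the client-side setting is exactly what this issue is about, so 
treat it as a sketch, not a guaranteed fix.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch: set the HDFS client replication factor on the Hadoop configuration
// used by the writing job, instead of (or in addition to) editing hdfs-site.xml.
val sc = new SparkContext(new SparkConf().setAppName("replication-sketch"))
sc.hadoopConfiguration.set("dfs.replication", "1")

val hiveContext = new HiveContext(sc)
// ...then register the temp table and run the insert exactly as in the description,
// e.g. hiveContext.sql("insert into <tablename> partition(col1, col2) select * from tempTable")
{code}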

> Replication factor value Not changing properly
> --
>
> Key: SPARK-20394
> URL: https://issues.apache.org/jira/browse/SPARK-20394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Kannan Subramanian
>
> To save a SparkSQL dataframe to a persistent hive table, I use the steps below:
> a) RegisterTempTable on the dataframe as tempTable
> b) create table <tablename> (cols) PartitionedBy(col1, col2) stored as 
> parquet
> c) Insert into <tablename> partition(col1, col2) select * from tempTable
> I have set dfs.replication equal to "1" on the hiveContext object, but it did 
> not work properly: the replica count is 1 for 80 % of the generated parquet 
> files on HDFS, while the default replication of 3 applies to the remaining 20 % 
> of parquet files. I am not sure why the setting is not reflected in all the 
> generated parquet files. Please let me know if you have any suggestions or solutions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21472) Introduce ArrowColumnVector as a reader for Arrow vectors.

2017-07-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21472:
---

Assignee: Takuya Ueshin

> Introduce ArrowColumnVector as a reader for Arrow vectors.
> --
>
> Key: SPARK-21472
> URL: https://issues.apache.org/jira/browse/SPARK-21472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.3.0
>
>
> Introducing {{ArrowColumnVector}} as a reader for Arrow vectors.
> It extends {{ColumnVector}}, so we will be able to use it with 
> {{ColumnarBatch}} and its functionalities.
> Currently it supports primitive types and {{StringType}}, {{ArrayType}} and 
> {{StructType}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21472) Introduce ArrowColumnVector as a reader for Arrow vectors.

2017-07-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21472.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18680
[https://github.com/apache/spark/pull/18680]

> Introduce ArrowColumnVector as a reader for Arrow vectors.
> --
>
> Key: SPARK-21472
> URL: https://issues.apache.org/jira/browse/SPARK-21472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
> Fix For: 2.3.0
>
>
> Introducing {{ArrowColumnVector}} as a reader for Arrow vectors.
> It extends {{ColumnVector}}, so we will be able to use it with 
> {{ColumnarBatch}} and its functionalities.
> Currently it supports primitive types and {{StringType}}, {{ArrayType}} and 
> {{StructType}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter

2017-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094610#comment-16094610
 ] 

Apache Spark commented on SPARK-10063:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18689

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call the 
> output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)

2017-07-20 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094607#comment-16094607
 ] 

Aseem Bansal edited comment on SPARK-21483 at 7/20/17 12:29 PM:


Some pseudo code to show what I am trying to achieve

{code:java}
class MyTransformer implements Serializable {

    public FeaturesAndLabel transform(RawData rawData) {
        // Some logic which creates Features and Labels from raw data. RawData
        // is just a java bean.
        // FeaturesAndLabel is a bean which contains a SparseVector as features,
        // and a double as label.
    }
}
{code}

{code:java}
Dataset<RawData> dataset = // read from somewhere and create a Dataset of the RawData bean
Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform);

// use features and labels for machine learning
{code}



was (Author: anshbansal):
Some pseudo code to show what I am trying to achieve

{code:java}
class MyTransformer implements Serializable {

    public FeaturesAndLabel transform(RawData rawData) {
        // Some logic which creates Features and Labels from raw data.
        // FeaturesAndLabel is a bean which contains a SparseVector as features,
        // and a double as label.
    }
}
{code}

{code:java}
Dataset<RawData> dataset = // read from somewhere and create a Dataset of the RawData bean
Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform);

// use features and labels for machine learning
{code}


> Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in 
> Encoders.bean(Vector.class)
> --
>
> Key: SPARK-21483
> URL: https://issues.apache.org/jira/browse/SPARK-21483
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Aseem Bansal
>Priority: Minor
>
> The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant 
> as far as Spark is concerned.
> This makes it impossible to create a Vector via a dataset.transform. It should 
> be made bean-compliant so it can be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)

2017-07-20 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094607#comment-16094607
 ] 

Aseem Bansal commented on SPARK-21483:
--

Some pseudo code to show what I am trying to achieve

{code:java}
class MyTransformer implements Serializable {

    public FeaturesAndLabel transform(RawData rawData) {
        // Some logic which creates Features and Labels from raw data.
        // FeaturesAndLabel is a bean which contains a SparseVector as features,
        // and a double as label.
    }
}
{code}

{code:java}
Dataset<RawData> dataset = // read from somewhere and create a Dataset of the RawData bean
Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform);

// use features and labels for machine learning
{code}
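
For completeness, a self-contained Scala sketch of the encoder call this request 
is about is below. The FeaturesAndLabel bean here is a hypothetical stand-in for 
the one in the pseudo code above; per this issue, the Encoders.bean call is 
expected to fail on the Vector-typed property in the affected versions, which is 
the point being illustrated.

{code:scala}
import java.util.Arrays

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical bean mirroring FeaturesAndLabel from the pseudo code above.
class FeaturesAndLabel extends Serializable {
  private var features: Vector = _
  private var label: Double = 0.0
  def getFeatures: Vector = features
  def setFeatures(v: Vector): Unit = features = v
  def getLabel: Double = label
  def setLabel(l: Double): Unit = label = l
}

object BeanEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("bean-encoder-sketch").getOrCreate()

    // The crux of the request: this call currently trips over the Vector-typed
    // property, because Vector is not bean-compliant as far as Encoders.bean is concerned.
    val encoder = Encoders.bean(classOf[FeaturesAndLabel])

    val fl = new FeaturesAndLabel()
    fl.setFeatures(Vectors.sparse(3, Array(0), Array(1.0)))
    fl.setLabel(1.0)
    spark.createDataset(Arrays.asList(fl))(encoder).show()

    spark.stop()
  }
}
{code}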


> Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in 
> Encoders.bean(Vector.class)
> --
>
> Key: SPARK-21483
> URL: https://issues.apache.org/jira/browse/SPARK-21483
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Aseem Bansal
>Priority: Minor
>
> The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant 
> as far as Spark is concerned.
> This makes it impossible to create a Vector via a dataset.transform. It should 
> be made bean-compliant so it can be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-20 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094603#comment-16094603
 ] 

Peng Meng commented on SPARK-21476:
---

I am optimizing RF and GBT these days. If no one else is working on this, I can 
work on it. Thanks.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and which is 
> actually used in transform, is that of predictRaw. Broadcasting is not being 
> used in predictRaw.
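
For context, the generic pattern the report is asking for looks roughly like the 
sketch below (the names are placeholders and this is not the 
RandomForestClassificationModel code itself): broadcast the large, read-only model 
state once instead of letting it be captured and serialized into every task.

{code:scala}
import org.apache.spark.sql.SparkSession

// Placeholder for heavyweight, read-only model state (e.g. the forest's trees).
case class ModelState(weights: Array[Double]) {
  def score(x: Double): Double = weights.sum * x
}

object BroadcastInPredictionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("broadcast-sketch").getOrCreate()
    import spark.implicits._

    // Broadcast once: each executor fetches a single copy instead of having the
    // state serialized into every task that evaluates the prediction function.
    val bcModel = spark.sparkContext.broadcast(ModelState(Array.fill(100000)(0.1)))

    val scored = Seq(1.0, 2.0, 3.0).toDS().map(x => bcModel.value.score(x))
    scored.show()

    spark.stop()
  }
}
{code}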



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2017-07-20 Thread Ioana Delaney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094547#comment-16094547
 ] 

Ioana Delaney commented on SPARK-19842:
---

Yes, we've been working to productize our code. Our progress was slowed down 
due to some other ongoing projects, but we are planning to open the first round 
of PRs in the next few weeks if the community agrees with this feature. 


> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21477) Mark LocalTableScanExec's input data transient

2017-07-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21477:
---

Assignee: Xiao Li

> Mark LocalTableScanExec's input data transient
> --
>
> Key: SPARK-21477
> URL: https://issues.apache.org/jira/browse/SPARK-21477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Mark the parameters rows and unsafeRow of LocalTableScanExec as transient. This 
> avoids serializing unneeded objects.
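
For readers less familiar with the mechanism, a tiny generic Scala sketch of what 
marking a field transient buys is below (the class is illustrative, not 
LocalTableScanExec itself): a @transient field is skipped by Java serialization, so 
large driver-side data is not dragged along when the object is shipped.

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative plan-like node: the big input rows live on the driver and are not
// needed after serialization, so the field is marked @transient.
class ExampleExec(@transient val inputRows: Seq[String], val name: String) extends Serializable

object TransientFieldSketch {
  def main(args: Array[String]): Unit = {
    val node = new ExampleExec(Seq.fill(1000000)("row"), "example")

    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(node)
    out.close()

    // Stays small because inputRows was never written out.
    println(s"serialized size = ${bytes.size()} bytes")
  }
}
{code}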



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


