[jira] [Commented] (SPARK-16683) Group by does not work after multiple joins of the same dataframe
[ https://issues.apache.org/jira/browse/SPARK-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095790#comment-16095790 ]

Apache Spark commented on SPARK-16683:
--------------------------------------

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/18697

> Group by does not work after multiple joins of the same dataframe
> -----------------------------------------------------------------
>
>                 Key: SPARK-16683
>                 URL: https://issues.apache.org/jira/browse/SPARK-16683
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>         Environment: local and yarn
>            Reporter: Witold Jędrzejewski
>         Attachments: code_2.0.txt, Duplicates Problem Presentation.json
>
> When I join a dataframe, group by a field from it, then join it again by a
> different field and group by a field from it, the second aggregation does
> not trigger.
> A minimal example showing the problem is attached as text to paste into
> spark-shell (code_2.0.txt).
> The detailed description, minimal example, workaround and possible cause
> are in the attachment, in the form of a Zeppelin notebook.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16683) Group by does not work after multiple joins of the same dataframe
[ https://issues.apache.org/jira/browse/SPARK-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16683:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-16683) Group by does not work after multiple joins of the same dataframe
[ https://issues.apache.org/jira/browse/SPARK-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16683:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-21497) Pull non-deterministic joining keys from Join operator
[ https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21497:
------------------------------------

    Assignee: Apache Spark

> Pull non-deterministic joining keys from Join operator
> ------------------------------------------------------
>
>                 Key: SPARK-21497
>                 URL: https://issues.apache.org/jira/browse/SPARK-21497
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Apache Spark
>
> Currently SparkSQL doesn't support non-deterministic joining conditions in
> Join. This kind of joining condition can be useful in some cases, e.g.,
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973.
> There seems to be no standard behavior for pulling non-deterministic joining
> conditions out of a Join operator. Based on the discussion on
> https://github.com/apache/spark/pull/18652#issuecomment-316344905,
> https://github.com/apache/spark/pull/18652#issuecomment-316391759 and
> https://github.com/apache/spark/pull/18652#issuecomment-316665649, Hive
> doesn't give non-deterministic join conditions any special consideration and
> simply pushes them down or uses them as joining keys.
> In this attempt, we initially allow non-deterministic equi-join keys in Join
> operators. Because, in SparkSQL's join implementations, the equi-join keys
> are evaluated once per row of the joining tables, pulling equi-join keys out
> of Join operators won't change the number of calls to non-deterministic
> expressions. This is safer than other kinds of joining conditions, e.g.
> rand(10) > a && rand(20) < b, where pulling the condition up or pushing it
> down could change the number of calls to rand().
> We also add a SQL conf to control this new behavior. It is disabled by
> default.
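The safety argument in the description (an equi-join key is evaluated once per input row, while a general joining condition is evaluated once per candidate pair) can be sketched outside Spark. The following is a minimal Python model with plain lists standing in for the joining tables; it is illustrative only, not Spark code:

```python
# Illustrative call-count comparison; plain Python, not Spark.
def counted(fn):
    """Wrap fn and count how many times it is invoked."""
    calls = {"n": 0}
    def wrapped(*args):
        calls["n"] += 1
        return fn(*args)
    return wrapped, calls

left = [1, 2, 3]
right = [4, 5, 6]

# Equi-join on a (stand-in for a non-deterministic) key expression:
# the key is evaluated once per input row, m + n times in total.
key, key_calls = counted(lambda x: x % 2)
left_keyed = [(key(x), x) for x in left]
right_keyed = [(key(y), y) for y in right]
equi = [(x, y) for kx, x in left_keyed for ky, y in right_keyed if kx == ky]
assert key_calls["n"] == len(left) + len(right)      # 3 + 3 = 6 evaluations

# A general joining condition such as rand(10) > a && rand(20) < b is
# evaluated once per candidate pair, m * n times, so moving it up or down
# the plan changes how often the non-deterministic expression runs.
cond, cond_calls = counted(lambda x, y: x < y)
theta = [(x, y) for x in left for y in right if cond(x, y)]
assert cond_calls["n"] == len(left) * len(right)     # 3 * 3 = 9 evaluations
```

This is why the proposal restricts the new behavior to equi-join keys: moving them between operators preserves the evaluation count, which a general theta condition does not.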
[jira] [Assigned] (SPARK-21497) Pull non-deterministic joining keys from Join operator
[ https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21497:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-21497) Pull non-deterministic joining keys from Join operator
[ https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095756#comment-16095756 ]

Apache Spark commented on SPARK-21497:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18652
[jira] [Updated] (SPARK-21497) Pull non-deterministic joining keys from Join operator
[ https://issues.apache.org/jira/browse/SPARK-21497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-21497:
------------------------------------
[jira] [Created] (SPARK-21497) Pull non-deterministic joining keys from Join operator
Liang-Chi Hsieh created SPARK-21497:
------------------------------------

             Summary: Pull non-deterministic joining keys from Join operator
                 Key: SPARK-21497
                 URL: https://issues.apache.org/jira/browse/SPARK-21497
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Liang-Chi Hsieh
[jira] [Commented] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe
[ https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095744#comment-16095744 ]

Shixiong Zhu commented on SPARK-21425:
--------------------------------------

[~rdub] I just realized we never documented local-cluster mode. You can use "--master local-cluster[N, cores, memory]" to start a cluster that runs the driver, worker and master in one process, and executors in their own processes (see https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala).

Spark assumes accumulators are always captured by closures. When starting a task in an executor, Spark deserializes the closure; in this step it creates a copy of each accumulator in the executor. When a task finishes, the value in the copied accumulator is sent back to the driver. The driver merges all the values collected from tasks and sets the final value on the driver-side accumulator. Because static accumulators cannot be captured by closures, the mechanism doesn't work for them. See AccumulatorV2.writeReplace and AccumulatorV2.readObject if you want to dig into the details.

> LongAccumulator, DoubleAccumulator not threadsafe
> -------------------------------------------------
>
>                 Key: SPARK-21425
>                 URL: https://issues.apache.org/jira/browse/SPARK-21425
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Ryan Williams
>            Priority: Minor
>
> The [AccumulatorV2 docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43]
> acknowledge that accumulators must be concurrent-read-safe, but afaict they
> must also be concurrent-write-safe.
> The same docs imply that {{Int}} and {{Long}} meet either/both of these
> criteria, when afaict they do not.
> Relatedly, the provided
> [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291]
> and
> [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370]
> are not thread-safe, and should be expected to behave in undefined ways when
> multiple concurrent tasks on the same executor write to them.
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum]
> with some simple applications that demonstrate incorrect results from
> {{LongAccumulator}}s.
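The copy/merge protocol described in the comment can be modeled in a few lines. Below is a simplified, single-threaded Python sketch; it is not Spark's actual AccumulatorV2 implementation, and the class and method names merely mirror the shapes mentioned above:

```python
import copy

class LongAccumulator:
    """Simplified model of the copy/merge protocol described in the comment.
    Not Spark code; the names just mirror AccumulatorV2's shape."""
    def __init__(self):
        self._value = 0

    def add(self, v):
        # plain read-modify-write: this is the step that is not thread-safe
        self._value += v

    def merge(self, other):
        self._value += other._value

    @property
    def value(self):
        return self._value

driver_acc = LongAccumulator()

def run_task(n_rows):
    # deserializing the task closure gives the executor its own copy
    local = copy.deepcopy(driver_acc)
    for _ in range(n_rows):
        local.add(1)
    return local  # this copy's value is what gets sent back to the driver

# both tasks start from a copy of the driver-side value; the driver then
# merges the values collected from the finished tasks
results = [run_task(10), run_task(20)]
for finished in results:
    driver_acc.merge(finished)
print(driver_acc.value)  # -> 30

# The thread-safety complaint in the issue: if two threads shared one copy,
# the read-modify-write inside add() could interleave and lose an update.
shared = LongAccumulator()
a = shared._value         # thread A reads 0
b = shared._value         # thread B reads 0 before A writes back
shared._value = a + 1     # A writes 1
shared._value = b + 1     # B also writes 1: one increment is lost
print(shared.value)       # -> 1, not 2
```

The per-task copies explain why single-threaded accumulators usually work: each task mutates only its own copy, and contention arises only when multiple threads within one task (or a shared static accumulator) hit the same instance.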
[jira] [Updated] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server
[ https://issues.apache.org/jira/browse/SPARK-21495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xin Yu Pan updated SPARK-21495:
-------------------------------

    Description:

We hit an issue when enabling authentication and SASL encryption; see the four parameters marked in bold (*...*) in the following list.

spark.local.dir /tmp/xpan-spark-161
spark.eventLog.dir file:///home/xpan/spark-conf/event
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/home/xpan/spark-conf/event
spark.history.ui.port 18085
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7448
spark.shuffle.reduceLocality.enabled false
spark.master.port 7087
spark.master.rest.port 6077
spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
*spark.authenticate true*
*spark.authenticate.secret 5828d44b-f9b9-4033-b1f5-21d1e3273ec2*
*spark.authenticate.enableSaslEncryption false*
*spark.network.sasl.serverAlwaysEncrypt false*

We run the simple SparkPi example, and there are exception messages even though the application completes.

# cat spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out.1
... ...
17/07/20 02:57:30 INFO spark.SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(xpan); users with modify permissions: Set(xpan)
17/07/20 02:57:31 INFO deploy.ExternalShuffleService: Starting shuffle service on port 7448 with useSasl = true
17/07/20 02:58:04 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=app-20170720025800-, execId=0} with ExecutorShuffleInfo{localDirs=[/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/07/20 02:58:11 INFO security.sasl: DIGEST41:Unmatched MACs
17/07/20 02:58:11 WARN server.TransportChannelHandler: Exception in connection from /172.29.10.77:50616
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 125 Expected: 123
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at java.lang.Thread.run(Thread.java:785)
Caused by: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 125 Expected: 123
	at com.ibm.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1535)
	at com.ibm.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:231)
	at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:149)
	at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:127)
	at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:102)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
	... 13 more
17/07/20 02:58:11 ERROR server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=908084716000, chunkIndex=1}, buffer=FileSegmentManagedBuffer{file=/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0/0c/shuffle_0_17_0.data, offset=1893612, length=302981}} to /172.29.10.77:50616; closing connection
java.nio.channels.ClosedChannelException
[jira] [Updated] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server
[ https://issues.apache.org/jira/browse/SPARK-21495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xin Yu Pan updated SPARK-21495:
-------------------------------
[jira] [Updated] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server
[ https://issues.apache.org/jira/browse/SPARK-21495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xin Yu Pan updated SPARK-21495:
-------------------------------
[jira] [Created] (SPARK-21496) Support codegen for TakeOrderedAndProjectExec
Jiang Xingbo created SPARK-21496:
---------------------------------

             Summary: Support codegen for TakeOrderedAndProjectExec
                 Key: SPARK-21496
                 URL: https://issues.apache.org/jira/browse/SPARK-21496
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Jiang Xingbo
            Priority: Minor

The operator `SortExec` supports codegen, but `TakeOrderedAndProjectExec` doesn't. Perhaps we should add codegen support for `TakeOrderedAndProjectExec` as well, but we should also benchmark it carefully.
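For context, `TakeOrderedAndProjectExec` is the physical operator for an ORDER BY ... LIMIT k query followed by a projection: it keeps only the top k rows instead of fully sorting the input. The sketch below models what the operator computes (not its codegen); the function name and row layout are made up for illustration:

```python
import heapq

# Hypothetical rows: (sort key, payload); the layout is made up for this sketch.
rows = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]

def take_ordered_and_project(rows, k, key, project):
    """Top-k by `key` without a full sort, then apply the projection;
    this mirrors what the physical operator computes, not its codegen."""
    topk = heapq.nsmallest(k, rows, key=key)   # heap-based ORDER BY ... LIMIT k
    return [project(r) for r in topk]

print(take_ordered_and_project(rows, 3,
                               key=lambda r: r[0],
                               project=lambda r: r[1]))  # -> ['a', 'b', 'c']
```

The careful benchmarking suggested above matters because the heap-based top-k is already cheap; generated code would need to beat the interpreted heap path to be worthwhile.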
[jira] [Created] (SPARK-21495) DIGEST-MD5: Out of order sequencing of messages from server
Xin Yu Pan created SPARK-21495:
--
Summary: DIGEST-MD5: Out of order sequencing of messages from server
Key: SPARK-21495
URL: https://issues.apache.org/jira/browse/SPARK-21495
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.6.1
Environment: OS: RedHat 7.1 64bit, Spark: 1.6.1
Reporter: Xin Yu Pan

We hit an issue when enabling authentication and SASL encryption; see the bold entries in the following parameter list.

spark.local.dir /tmp/xpan-spark-161
spark.eventLog.dir file:///home/xpan/spark-conf/event
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/home/xpan/spark-conf/event
spark.history.ui.port 18085
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 14d
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7448
spark.shuffle.reduceLocality.enabled false
spark.master.port 7087
spark.master.rest.port 6077
spark.executor.extraJavaOptions -Djava.security.egd=file:/dev/./urandom
**spark.authenticate true**
**spark.authenticate.secret 5828d44b-f9b9-4033-b1f5-21d1e3273ec2**
**spark.authenticate.enableSaslEncryption false**
**spark.network.sasl.serverAlwaysEncrypt false**

We run the simple SparkPi example, and there are exception messages even though the application completes.
# cat spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out
Spark Command: /opt/xpan/cws22-0713/jre/8.0.3.21/linux-x86_64/bin/java -cp /opt/xpan/spark-1.6.1-bin-hadoop2.6/conf/:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/xpan/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/xpan/hadoop-2.8.0/etc/hadoop/ -Xms2g -Xmx2g org.apache.spark.deploy.ExternalShuffleService
17/07/20 06:24:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/20 06:24:10 INFO spark.SecurityManager: Changing view acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: Changing modify acls to: xpan
17/07/20 06:24:10 INFO spark.SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(xpan); users with modify permissions: Set(xpan)
17/07/20 06:24:11 INFO deploy.ExternalShuffleService: Starting shuffle service on port 7448 with useSasl = true

# cat spark-1.6.1-bin-hadoop2.6/logs/spark-xpan-org.apache.spark.deploy.ExternalShuffleService-1-cws-75.out.1
... ...
17/07/20 02:57:31 INFO deploy.ExternalShuffleService: Starting shuffle service on port 7448 with useSasl = true
17/07/20 02:58:04 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=app-20170720025800-, execId=0} with ExecutorShuffleInfo{localDirs=[/tmp/xpan-spark-161/spark-8e4885a3-c463-4dfb-a396-04e16b65fd1e/executor-be15fcd0-c946-4c83-ba25-3b20bbce5b0e/blockmgr-0fd2658a-ce15-4d56-901c-4c746161bbe0], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/07/20 02:58:11 INFO security.sasl: DIGEST41:Unmatched MACs
17/07/20 02:58:11 WARN server.TransportChannelHandler: Exception in connection from /172.29.10.77:50616
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 125 Expected: 123
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at
[jira] [Comment Edited] (SPARK-21486) Fail when using aliased column of an aliased table from a subquery
[ https://issues.apache.org/jira/browse/SPARK-21486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095689#comment-16095689 ]

Liang-Chi Hsieh edited comment on SPARK-21486 at 7/21/17 2:34 AM:
--

Since 2.2.0, it is not allowed to use the qualifier inside a subquery. Please see SPARK-21335. Such a query is considered invalid SQL. So, this query works:

{code}
scala> sql("SELECT ta FROM (SELECT a as ta FROM test a21)").show()
+---+
| ta|
+---+
|  1|
|  2|
|  3|
+---+
{code}

But this query can't work:

{code}
scala> sql("SELECT a21.ta FROM (SELECT a as ta FROM test a21)").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input columns: [__auto_generated_subquery_name.ta]; line 1 pos 7;
'Project ['a21.ta]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [a#7 AS ta#31]
      +- SubqueryAlias a21
         +- SubqueryAlias test
            +- Project [_1#3 AS a#7, _2#4 AS b#8, _3#5 AS c#9]
               +- LocalRelation [_1#3, _2#4, _3#5]
{code}

was (Author: viirya): Since 2.2.0, it is not allowed to use the qualifier inside a subquery. Please see SPARK-21335. So, this query works:

{code}
scala> sql("SELECT ta FROM (SELECT a as ta FROM test a21)").show()
+---+
| ta|
+---+
|  1|
|  2|
|  3|
+---+
{code}

But this query can't work:

{code}
scala> sql("SELECT a21.ta FROM (SELECT a as ta FROM test a21)").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input columns: [__auto_generated_subquery_name.ta]; line 1 pos 7;
'Project ['a21.ta]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [a#7 AS ta#31]
      +- SubqueryAlias a21
         +- SubqueryAlias test
            +- Project [_1#3 AS a#7, _2#4 AS b#8, _3#5 AS c#9]
               +- LocalRelation [_1#3, _2#4, _3#5]
{code}

> Fail when using aliased column of an aliased table from a subquery
> -
>
> Key: SPARK-21486
> URL: https://issues.apache.org/jira/browse/SPARK-21486
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.1.1, 2.2.0
> Reporter: Miguel Angel Fernandez Diaz
>
> SparkSQL cannot resolve a qualified column when using the column alias and
> the table alias from a subquery.
> Example: > {code:sql} > SELECT a21.ta FROM (SELECT tip_amount as ta FROM yellow a21) > {code} > Error: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input > columns: [ta]; line 1 pos 7; > 'Project ['a21.ta] > +- Project [tip_amount#505 AS ta#489] >+- SubqueryAlias a21 > +- SubqueryAlias yellow > +- > Relation[VendorID#490,tpep_pickup_datetime#491,tpep_dropoff_datetime#492,passenger_count#493,trip_distance#494,pickup_longitude#495,pickup_latitude#496,RatecodeID#497,store_and_fwd_flag#498,dropoff_longitude#499,dropoff_latitude#500,payment_type#501,fare_amount#502,extra#503,mta_tax#504,tip_amount#505,tolls_amount#506,improvement_surcharge#507,total_amount#508] > csv > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at >
[jira] [Commented] (SPARK-21486) Fail when using aliased column of an aliased table from a subquery
[ https://issues.apache.org/jira/browse/SPARK-21486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095689#comment-16095689 ]

Liang-Chi Hsieh commented on SPARK-21486:
-

Since 2.2.0, it is not allowed to use the qualifier inside a subquery. Please see SPARK-21335. So, this query works:

{code}
scala> sql("SELECT ta FROM (SELECT a as ta FROM test a21)").show()
+---+
| ta|
+---+
|  1|
|  2|
|  3|
+---+
{code}

But this query can't work:

{code}
scala> sql("SELECT a21.ta FROM (SELECT a as ta FROM test a21)").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input columns: [__auto_generated_subquery_name.ta]; line 1 pos 7;
'Project ['a21.ta]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [a#7 AS ta#31]
      +- SubqueryAlias a21
         +- SubqueryAlias test
            +- Project [_1#3 AS a#7, _2#4 AS b#8, _3#5 AS c#9]
               +- LocalRelation [_1#3, _2#4, _3#5]
{code}

> Fail when using aliased column of an aliased table from a subquery
> -
>
> Key: SPARK-21486
> URL: https://issues.apache.org/jira/browse/SPARK-21486
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.1.1, 2.2.0
> Reporter: Miguel Angel Fernandez Diaz
>
> SparkSQL cannot resolve a qualified column when using the column alias and
> the table alias from a subquery.
> Example: > {code:sql} > SELECT a21.ta FROM (SELECT tip_amount as ta FROM yellow a21) > {code} > Error: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '`a21.ta`' given input > columns: [ta]; line 1 pos 7; > 'Project ['a21.ta] > +- Project [tip_amount#505 AS ta#489] >+- SubqueryAlias a21 > +- SubqueryAlias yellow > +- > Relation[VendorID#490,tpep_pickup_datetime#491,tpep_dropoff_datetime#492,passenger_count#493,trip_distance#494,pickup_longitude#495,pickup_latitude#496,RatecodeID#497,store_and_fwd_flag#498,dropoff_longitude#499,dropoff_latitude#500,payment_type#501,fare_amount#502,extra#503,mta_tax#504,tip_amount#505,tolls_amount#506,improvement_surcharge#507,total_amount#508] > csv > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at >
[jira] [Commented] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe
[ https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095678#comment-16095678 ] Ryan Williams commented on SPARK-21425: --- [~zsxwing] yea, it's static accumulators, and seems to only be a problem in "local" mode. I just updated [my comment above|https://issues.apache.org/jira/browse/SPARK-21425?focusedCommentId=16090017=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16090017] because I'd previously thought I'd been able to repro this in cluster mode, but now looking back at what I tried, that's not the case afaict. Small nit: I'm not sure what you mean by "local-cluster mode… will always be 0" though… I definitely get nonzero results on the driver with master "local\[*\]". Anyway, thanks for investigating. > LongAccumulator, DoubleAccumulator not threadsafe > - > > Key: SPARK-21425 > URL: https://issues.apache.org/jira/browse/SPARK-21425 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > [AccumulatorV2 > docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43] > acknowledge that accumulators must be concurrent-read-safe, but afaict they > must also be concurrent-write-safe. > The same docs imply that {{Int}} and {{Long}} meet either/both of these > criteria, when afaict they do not. > Relatedly, the provided > [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291] > and > [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370] > are not thread-safe, and should be expected to behave undefinedly when > multiple concurrent tasks on the same executor write to them. 
> [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] > with some simple applications that demonstrate incorrect results from > {{LongAccumulator}}'s. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
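The race described in SPARK-21425 is the classic lost-update problem: `sum += v` on a plain field is a non-atomic read-modify-write, so two concurrent task threads can both read the same old value and one increment is lost. The following standalone Java sketch (not Spark code; the class name, thread count, and iteration count are invented for illustration) races concurrent writers on a plain `long` while a `java.util.concurrent.atomic.LongAdder` records every update:

```java
import java.util.concurrent.atomic.LongAdder;

// Illustration of why an unsynchronized accumulator loses updates under
// concurrent writes, and why an atomic adder does not.
public class AccumulatorRace {
    static long unsafeSum; // plain field: "unsafeSum += 1" is not atomic

    static long run(int threads, int perThread) {
        unsafeSum = 0;
        LongAdder safeSum = new LongAdder(); // thread-safe accumulator

        Runnable task = () -> {
            for (int i = 0; i < perThread; i++) {
                unsafeSum += 1;  // racy: concurrent read-modify-write
                safeSum.add(1);  // atomic: no updates are lost
            }
        };
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) { ts[i] = new Thread(task); ts[i].start(); }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return safeSum.sum();
    }

    public static void main(String[] args) {
        long safe = run(8, 100_000);
        // safe is always threads * perThread; unsafeSum usually falls short.
        System.out.println("unsafe=" + unsafeSum + " safe=" + safe);
    }
}
```

Running this typically prints an `unsafe` total below the expected 800000 while the `LongAdder` total is exact, which is the behavior the ticket describes for `LongAccumulator` under concurrent tasks in local mode.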
[jira] [Comment Edited] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe
[ https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090017#comment-16090017 ]

Ryan Williams edited comment on SPARK-21425 at 7/21/17 2:12 AM:

Yea, static accumulators; -I was able to reproduce it in cluster mode where multiple tasks on the same executor are writing to that executor's single instance of the same static accumulator-.

** I think I was mistaken when I wrote this. I tried getting a setup in yarn-mode that would write to static accumulators in a few ways, but wasn't able to run successfully with tasks writing to them, so I'm basically unable to repro the problem in cluster mode.

** I'm now thinking of this mostly as a surprise (to me!) about interactions between closure-serialization and accumulators… I assumed (incorrectly) that references to accumulators in task closures were treated specially in some way, but I see now that they're not.

How unexpected/undesirable this behavior is, is debatable; I inadvertently fell into a situation where I was making static accumulators while working around a scalatest bug, as mentioned above, but in retrospect it feels pretty contrived to have ended up in that situation. A docs update, precautionary threadsafe-ing of \{Long,Double\}Accumulator, and taking no action all seem reasonable to me.

was (Author: rdub): Yea, static accumulators; I was able to reproduce it in cluster mode where multiple tasks on the same executor are writing to that executor's single instance of the same static accumulator. I'm now thinking of this mostly as a surprise (to me!) about interactions between closure-serialization and accumulators… I assumed (incorrectly) that references to accumulators in task closures were treated specially in some way, but I see now that they're not.
How unexpected/undesirable this behavior is debatable; I inadvertently fell into a situation where I was making static accumulators while working around a scalatest bug, as mentioned above, but in retrospect it feels pretty contrived to have ended up in that situation. Docs update, precautionary threadsafe-ing of \{Long,Double\}Accumulator, and taking no action all seem reasonable to me. > LongAccumulator, DoubleAccumulator not threadsafe > - > > Key: SPARK-21425 > URL: https://issues.apache.org/jira/browse/SPARK-21425 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > [AccumulatorV2 > docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43] > acknowledge that accumulators must be concurrent-read-safe, but afaict they > must also be concurrent-write-safe. > The same docs imply that {{Int}} and {{Long}} meet either/both of these > criteria, when afaict they do not. > Relatedly, the provided > [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291] > and > [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370] > are not thread-safe, and should be expected to behave undefinedly when > multiple concurrent tasks on the same executor write to them. > [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] > with some simple applications that demonstrate incorrect results from > {{LongAccumulator}}'s. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20960) make ColumnVector public
[ https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-20960: - Description: ColumnVector is an internal interface in Spark SQL, which is only used for vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow(basically just a wrapper over Arrow), so that external systems(like Python Pandas DataFrame) can build ColumnVector very easily. was: _emphasized text_ColumnVector is an internal interface in Spark SQL, which is only used for vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow(basically just a wrapper over Arrow), so that external systems(like Python Pandas DataFrame) can build ColumnVector very easily. > make ColumnVector public > > > Key: SPARK-20960 > URL: https://issues.apache.org/jira/browse/SPARK-20960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > ColumnVector is an internal interface in Spark SQL, which is only used for > vectorized parquet reader to represent the in-memory columnar format. 
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a
> more efficient way for data exchanges between Spark and external systems. For
> example, we can use ColumnVector to build the columnar read API in the data
> source framework, we can use ColumnVector to build a more efficient UDF API,
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache
> Arrow (basically just a wrapper over Arrow), so that external systems (like
> Python Pandas DataFrame) can build ColumnVector very easily.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
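To make the "in-memory columnar format" idea behind ColumnVector concrete, here is a minimal, hypothetical sketch in Java. This is not Spark's actual ColumnVector API; the class `IntColumn` and its methods are invented for illustration. The point is the layout: one column's values live in a contiguous primitive array with a separate null mask, so a vectorized reader or UDF can scan it without per-row object allocation:

```java
// Hypothetical sketch of a columnar vector (NOT Spark's ColumnVector API):
// values stored contiguously per column, nulls tracked in a parallel mask.
public class IntColumn {
    private final int[] values;     // packed column values
    private final boolean[] isNull; // null mask, one flag per row
    private int size;               // number of rows appended so far

    public IntColumn(int capacity) {
        values = new int[capacity];
        isNull = new boolean[capacity];
    }

    public void appendInt(int v) { values[size] = v; isNull[size] = false; size++; }
    public void appendNull()     { isNull[size] = true; size++; }

    public boolean isNullAt(int row) { return isNull[row]; }
    public int getInt(int row)       { return values[row]; }
    public int numRows()             { return size; }

    // A consumer can loop over the packed array directly -- this tight loop
    // is what makes columnar data exchange and vectorized execution cheap.
    public long sumNonNull() {
        long sum = 0;
        for (int i = 0; i < size; i++) if (!isNull[i]) sum += values[i];
        return sum;
    }
}
```

An Arrow-backed implementation, as proposed in the ticket, would keep the same accessor shape but delegate storage to Arrow buffers, which is what lets external systems consume the data without copying.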
[jira] [Updated] (SPARK-20960) make ColumnVector public
[ https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-20960: - Description: _emphasized text_ColumnVector is an internal interface in Spark SQL, which is only used for vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow(basically just a wrapper over Arrow), so that external systems(like Python Pandas DataFrame) can build ColumnVector very easily. was: ColumnVector is an internal interface in Spark SQL, which is only used for vectorized parquet reader to represent the in-memory columnar format. In Spark 2.3 we want to make ColumnVector public, so that we can provide a more efficient way for data exchanges between Spark and external systems. For example, we can use ColumnVector to build the columnar read API in data source framework, we can use ColumnVector to build a more efficient UDF API, etc. We also want to introduce a new ColumnVector implementation based on Apache Arrow(basically just a wrapper over Arrow), so that external systems(like Python Pandas DataFrame) can build ColumnVector very easily. > make ColumnVector public > > > Key: SPARK-20960 > URL: https://issues.apache.org/jira/browse/SPARK-20960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > _emphasized text_ColumnVector is an internal interface in Spark SQL, which is > only used for vectorized parquet reader to represent the in-memory columnar > format. 
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a
> more efficient way for data exchanges between Spark and external systems. For
> example, we can use ColumnVector to build the columnar read API in the data
> source framework, we can use ColumnVector to build a more efficient UDF API,
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache
> Arrow (basically just a wrapper over Arrow), so that external systems (like
> Python Pandas DataFrame) can build ColumnVector very easily.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21485) API Documentation for Spark SQL functions
[ https://issues.apache.org/jira/browse/SPARK-21485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095600#comment-16095600 ]

Reynold Xin commented on SPARK-21485:
-

Pretty cool. Would be great to just generate the function list as part of the Spark documentation.

> API Documentation for Spark SQL functions
> -
>
> Key: SPARK-21485
> URL: https://issues.apache.org/jira/browse/SPARK-21485
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SQL
> Affects Versions: 2.3.0
> Reporter: Hyukjin Kwon
>
> It looks like we can generate the documentation from {{ExpressionDescription}} and
> {{ExpressionInfo}} for Spark's SQL function documentation.
> I had some time to play with this so I just made a rough version -
> https://spark-test.github.io/sparksqldoc/
> The code I used is as below.
> In {{pyspark}} shell:
> {code}
> from collections import namedtuple
>
> ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended")
>
> jinfos = spark.sparkContext._jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctions()
> infos = []
> for jinfo in jinfos:
>     name = jinfo.getName()
>     usage = jinfo.getUsage()
>     usage = usage.replace("_FUNC_", name) if usage is not None else usage
>     extended = jinfo.getExtended()
>     extended = extended.replace("_FUNC_", name) if extended is not None else extended
>     infos.append(ExpressionInfo(
>         className=jinfo.getClassName(),
>         usage=usage,
>         name=name,
>         extended=extended))
>
> with open("index.md", 'w') as mdfile:
>     strip = lambda s: "\n".join(map(lambda u: u.strip(), s.split("\n")))
>     for info in sorted(infos, key=lambda i: i.name):
>         mdfile.write("### %s\n\n" % info.name)
>         if info.usage is not None:
>             mdfile.write("%s\n\n" % strip(info.usage))
>         if info.extended is not None:
>             mdfile.write("```%s```\n\n" % strip(info.extended))
> {code}
> This change had to be made first before running the code above:
> {code:none}
> +++ b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
> @@ -17,9 +17,15 @@
>  package org.apache.spark.sql.api.python
>
> +import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
> +import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
>  import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
>  import org.apache.spark.sql.types.DataType
>
>  private[sql] object PythonSQLUtils {
>    def parseDataType(typeText: String): DataType = CatalystSqlParser.parseDataType(typeText)
> +
> +  def listBuiltinFunctions(): Array[ExpressionInfo] = {
> +    FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
> +  }
>  }
> {code}
> And then, I ran this:
> {code}
> mkdir docs
> echo "site_name: Spark SQL 2.3.0" >> mkdocs.yml
> echo "theme: readthedocs" >> mkdocs.yml
> mv index.md docs/index.md
> mkdocs serve
> {code}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095587#comment-16095587 ]

Iurii Antykhovych edited comment on SPARK-21491 at 7/21/17 12:30 AM:
-

I searched for all such places in the whole GraphX module. These are the only three occurrences of such tuple-to-map conversions I could speed up without major refactoring. Let GraphX be the pilot module of such an optimization.

The changes in `LabelPropagation` and `ShortestPaths` are on a hot execution path. The fix in the `PageRank` class is executed once per run, so it is not 'hot' enough. Shall I revert it?

was (Author: sereneant): I searched for all such places in the whole GraphX module. These are only three occurrences of such tuples-to-map conversion I could boost without major refactoring. Let GraphX be the pilot module of such an optimization) The only change on a hot path of execution is the one in the `graphx.lib.ShortestPaths` class. The rest is executed once per run, not 'hot' enough. Shall I revert it (LabelPropagation.scala, PageRank.scala)?

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
> Issue Type: Improvement
> Components: GraphX
> Affects Versions: 2.2.0
> Reporter: Iurii Antykhovych
> Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakOut}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
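The optimization discussed in SPARK-21491 is Scala-specific (`collection.breakOut` lets `map` build the target `Map` directly instead of first materializing a collection of tuples and calling `.toMap` on it), but the general shape translates to any language. The following plain-Java sketch, with invented method names, contrasts the two-pass version that allocates a throwaway list of pairs with the one-pass version that builds the map directly:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// General shape of the "eliminate intermediate collections" optimization
// (plain Java, not the Scala collection.breakOut API itself).
public class DirectBuild {
    // Two-pass version: allocates an intermediate list of entries, then copies it.
    static Map<Integer, Integer> viaIntermediate(int[] xs) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        for (int x : xs) pairs.add(new AbstractMap.SimpleEntry<>(x, x * x));
        Map<Integer, Integer> m = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : pairs) m.put(e.getKey(), e.getValue());
        return m;
    }

    // One-pass version: same result, no intermediate collection allocated.
    static Map<Integer, Integer> direct(int[] xs) {
        Map<Integer, Integer> m = new HashMap<>();
        for (int x : xs) m.put(x, x * x);
        return m;
    }
}
```

Both produce the same map; the one-pass form simply skips the temporary list and its per-element entry objects, which is exactly the saving the ticket targets on GraphX's hot paths.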
[jira] [Updated] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle
[ https://issues.apache.org/jira/browse/SPARK-21494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated SPARK-21494: -- Attachment: logs.zip > Spark 2.2.0 AES encryption not working with External shuffle > > > Key: SPARK-21494 > URL: https://issues.apache.org/jira/browse/SPARK-21494 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.2.0 > Environment: AWS EMR >Reporter: Udit Mehrotra > Attachments: logs.zip > > > Spark’s new AES based authentication mechanism does not seem to work when > configured with external shuffle service on YARN. > Here is the stack trace for the error we see in the driver logs: > ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: > Unable to create executor due to Unable to register with external shuffle > server due to: java.lang.IllegalArgumentException: Authentication failed. > at > org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > at > 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) > at > org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) > at > org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > > Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’: > spark.network.crypto.enabled true > spark.network.crypto.saslFallback false > spark.authenticate true > > Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both > Spark and YARN side is not giving much information either about why > authentication fails. The driver and node manager logs have been attached to > the JIRA. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle
[ https://issues.apache.org/jira/browse/SPARK-21494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated SPARK-21494: -- Description: Spark’s new AES based authentication mechanism does not seem to work when configured with external shuffle service on YARN. Here is the stack trace for the error we see in the driver logs: ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable to create executor due to Unable to register with external shuffle server due to: java.lang.IllegalArgumentException: Authentication failed. at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) Here are the settings we are configuring in 
‘spark-defaults’ and ‘yarn-site’: spark.network.crypto.enabled true spark.network.crypto.saslFallback false spark.authenticate true Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark and YARN side is not giving much information either about why authentication fails. The driver and node manager logs have been attached to the JIRA.
[jira] [Created] (SPARK-21494) Spark 2.2.0 AES encryption not working with External shuffle
Udit Mehrotra created SPARK-21494: - Summary: Spark 2.2.0 AES encryption not working with External shuffle Key: SPARK-21494 URL: https://issues.apache.org/jira/browse/SPARK-21494 Project: Spark Issue Type: Bug Components: Block Manager, Shuffle Affects Versions: 2.2.0 Environment: AWS EMR Reporter: Udit Mehrotra Spark’s new AES based authentication mechanism does not seem to work when configured with external shuffle service on YARN. Here is the stack trace for the error we see in the driver logs: ERROR YarnScheduler: Lost executor 40 on ip-10-167-104-125.ec2.internal: Unable to create executor due to Unable to register with external shuffle server due to: java.lang.IllegalArgumentException: Authentication failed. at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at 
org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) Here are the settings we are configuring in ‘spark-defaults’ and ‘yarn-site’: spark.network.crypto.enabled true spark.network.crypto.saslFallback false spark.authenticate true Turning on DEBUG logs for class ‘org.apache.spark.network.crypto’ on both Spark and YARN side is not giving much information either about why authentication fails. The driver and node manager logs have been attached to the JIRA. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095587#comment-16095587 ] Iurii Antykhovych commented on SPARK-21491: --- I searched for all such places in the whole GraphX module. There are only three occurrences of such tuples-to-map conversion that I could boost without major refactoring. Let GraphX be the pilot module of such an optimization. The only change on a hot path of execution is the one in the `graphx.lib.ShortestPaths` class. The rest is executed once per run, not 'hot' enough. Shall I revert the others (LabelPropagation.scala, PageRank.scala)? > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full
[ https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095546#comment-16095546 ] Thomas Graves commented on SPARK-21460: --- I didn't think that was the case, but I took a look at the code and I guess I was wrong; it definitely appears to be reliant on the listener bus. That is really bad in my opinion. We are intentionally dropping events and we know that will cause issues. > Spark dynamic allocation breaks when ListenerBus event queue runs full > -- > > Key: SPARK-21460 > URL: https://issues.apache.org/jira/browse/SPARK-21460 > Project: Spark > Issue Type: Bug > Components: Scheduler, YARN >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Environment: Spark 2.1 > Hadoop 2.6 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: dynamic_allocation, performance, scheduler, yarn > > When ListenerBus event queue runs full, spark dynamic allocation stops > working - Spark fails to shrink number of executors when there are no active > jobs (Spark driver "thinks" there are active jobs since it didn't capture > when they finished) . > ps. What's worse it also makes Spark flood YARN RM with reservation requests, > so YARN preemption doesn't function properly too (we're on Spark 2.1 / Hadoop > 2.6). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21493) Add more metrics to External Shuffle Service
[ https://issues.apache.org/jira/browse/SPARK-21493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095466#comment-16095466 ] Sean Owen commented on SPARK-21493: --- How is this different from SPARK-21334 that you opened? > Add more metrics to External Shuffle Service > > > Key: SPARK-21493 > URL: https://issues.apache.org/jira/browse/SPARK-21493 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Raajay Viswanathan >Priority: Minor > Original Estimate: 336h > Remaining Estimate: 336h > > The current set of metrics in the external shuffle service are fairly > limited. To debug failure of the shuffle service, it would be good to get > more information regarding the state of the shuffle service. As a first cut, > the following metrics seem important: > 1. The amount of heap memory used by the External Shuffle Service process > 2. The amount of direct buffer (off-heap) memory allocated to External > Shuffle Service. In the external shuffle service, Netty uses off-heap memory. > Monitoring its usage can help in allocating appropriate resources and can > also be used to raise alarms when the allocated memory exceeds a threshold. > 3. The queue length in Netty event loops. Chunk Fetch Requests (or) Open > Block requests can be dropped as a result of Netty queue overflows (resulting > in FetchFailure). Having hard data on queue size can help in attributing > cause of failures. > Please let me know of other metrics (from Shuffle Service perspective) that > would be good to have. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21493) Add more metrics to External Shuffle Service
Raajay Viswanathan created SPARK-21493: -- Summary: Add more metrics to External Shuffle Service Key: SPARK-21493 URL: https://issues.apache.org/jira/browse/SPARK-21493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Raajay Viswanathan Priority: Minor The current set of metrics in the external shuffle service are fairly limited. To debug failure of the shuffle service, it would be good to get more information regarding the state of the shuffle service. As a first cut, the following metrics seem important: 1. The amount of heap memory used by the External Shuffle Service process 2. The amount of direct buffer (off-heap) memory allocated to External Shuffle Service. In the external shuffle service, Netty uses off-heap memory. Monitoring its usage can help in allocating appropriate resources and can also be used to raise alarms when the allocated memory exceeds a threshold. 3. The queue length in Netty event loops. Chunk Fetch Requests (or) Open Block requests can be dropped as a result of Netty queue overflows (resulting in FetchFailure). Having hard data on queue size can help in attributing cause of failures. Please let me know of other metrics (from Shuffle Service perspective) that would be good to have. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full
[ https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095447#comment-16095447 ] Saisai Shao commented on SPARK-21460: - I think this is basically a ListenerBus issue, not a dynamic allocation issue, because ExecutorAllocationManager registers a listener and relies on it to decide whether to increase or decrease executors. Now, because of the failure of the ListenerBus, the listener in ExecutorAllocationManager cannot work and fails to decrease the executors, as mentioned in your description. > Spark dynamic allocation breaks when ListenerBus event queue runs full > -- > > Key: SPARK-21460 > URL: https://issues.apache.org/jira/browse/SPARK-21460 > Project: Spark > Issue Type: Bug > Components: Scheduler, YARN >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Environment: Spark 2.1 > Hadoop 2.6 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: dynamic_allocation, performance, scheduler, yarn > > When ListenerBus event queue runs full, spark dynamic allocation stops > working - Spark fails to shrink number of executors when there are no active > jobs (Spark driver "thinks" there are active jobs since it didn't capture > when they finished) . > ps. What's worse it also makes Spark flood YARN RM with reservation requests, > so YARN preemption doesn't function properly too (we're on Spark 2.1 / Hadoop > 2.6). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
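The failure mode described in the comments above — a lossy event bus starving a listener of the completion events it needs for bookkeeping — can be sketched in a few lines. This is a hedged illustration with made-up names (`BoundedBus`, `ActiveJobListener`), not Spark's actual ListenerBus or ExecutorAllocationManager API:

```python
from collections import deque

# Hypothetical sketch: a bounded event queue that silently drops events once
# full, and a listener that tracks "active jobs" much as a resource manager
# would track outstanding work.
class BoundedBus:
    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity
        self.dropped = 0

    def post(self, event):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # event is lost forever, like a full ListenerBus
        else:
            self.queue.append(event)

class ActiveJobListener:
    def __init__(self):
        self.active = 0

    def on_event(self, event):
        if event == "job_start":
            self.active += 1
        elif event == "job_end":
            self.active -= 1

def drain(bus, listener):
    while bus.queue:
        listener.on_event(bus.queue.popleft())

bus = BoundedBus(capacity=15)
listener = ActiveJobListener()
for _ in range(10):
    bus.post("job_start")
for _ in range(10):
    bus.post("job_end")  # the last 5 "job_end" events overflow and are dropped
drain(bus, listener)
print(listener.active, bus.dropped)  # 5 5: listener still "sees" 5 active jobs
```

Because the dropped events are all completions, the listener's count can never return to zero, which mirrors the driver "thinking" jobs are still active and never shrinking the executor pool.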
[jira] [Assigned] (SPARK-21490) SparkLauncher may fail to redirect streams
[ https://issues.apache.org/jira/browse/SPARK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21490: Assignee: Apache Spark > SparkLauncher may fail to redirect streams > -- > > Key: SPARK-21490 > URL: https://issues.apache.org/jira/browse/SPARK-21490 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > While investigating a different issue, I noticed that the redirection > handling in SparkLauncher is faulty. > In the default case or when users use only log redirection, things should > work fine. > But if users try to redirect just the stdout of a child process without > redirecting stderr, the launcher won't detect that and the child process may > hang because stderr is not being read and its buffer fills up. > The code should detect these cases and redirect any streams that haven't been > explicitly redirected by the user. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21490) SparkLauncher may fail to redirect streams
[ https://issues.apache.org/jira/browse/SPARK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095445#comment-16095445 ] Apache Spark commented on SPARK-21490: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18696 > SparkLauncher may fail to redirect streams > -- > > Key: SPARK-21490 > URL: https://issues.apache.org/jira/browse/SPARK-21490 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While investigating a different issue, I noticed that the redirection > handling in SparkLauncher is faulty. > In the default case or when users use only log redirection, things should > work fine. > But if users try to redirect just the stdout of a child process without > redirecting stderr, the launcher won't detect that and the child process may > hang because stderr is not being read and its buffer fills up. > The code should detect these cases and redirect any streams that haven't been > explicitly redirected by the user. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21490) SparkLauncher may fail to redirect streams
[ https://issues.apache.org/jira/browse/SPARK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21490: Assignee: (was: Apache Spark) > SparkLauncher may fail to redirect streams > -- > > Key: SPARK-21490 > URL: https://issues.apache.org/jira/browse/SPARK-21490 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While investigating a different issue, I noticed that the redirection > handling in SparkLauncher is faulty. > In the default case or when users use only log redirection, things should > work fine. > But if users try to redirect just the stdout of a child process without > redirecting stderr, the launcher won't detect that and the child process may > hang because stderr is not being read and its buffer fills up. > The code should detect these cases and redirect any streams that haven't been > explicitly redirected by the user. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
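The stdout/stderr pitfall behind this SparkLauncher bug is general to child-process management: if a stream is piped but never read, the OS pipe buffer (commonly around 64 KiB on Linux) fills and the child blocks on write. A hedged, non-Spark sketch of the safe pattern in Python (the child program and sizes are illustrative):

```python
import subprocess
import sys

# Illustrative child that writes far more to stderr than a pipe buffer holds.
# If the parent piped stderr but never read it, the child's writes would
# eventually block and the child would appear to hang.
child_code = (
    "import sys\n"
    "sys.stderr.write('x' * 1000000)\n"   # ~1 MB of stderr output
    "sys.stdout.write('done')\n"
)

# Safe pattern: consume (or explicitly discard) every piped stream.
# communicate() reads stdout and stderr concurrently, so neither fills up.
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()
print(out.decode(), len(err))
```

The fix described in the issue follows the same principle: any stream the user did not explicitly redirect must still be drained (or redirected) by the launcher so the child never blocks.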
[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095432#comment-16095432 ] Apache Spark commented on SPARK-12717: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/18695 > pyspark broadcast fails when using multiple threads > --- > > Key: SPARK-12717 > URL: https://issues.apache.org/jira/browse/SPARK-12717 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 > Environment: Linux, python 2.6 or python 2.7. >Reporter: Edward Walker >Priority: Critical > Attachments: run.log > > > The following multi-threaded program that uses broadcast variables > consistently throws exceptions like: *Exception("Broadcast variable '18' not > loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
> from optparse import OptionParser
>
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
>
> def do_process(x, w):
>     return x * w.value
>
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
>
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
>     threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism)]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output:
> {code:title=abridge_run.txt|borderStyle=solid}
> % spark-submit --master local[10] bug_spark.py --parallelism
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095425#comment-16095425 ] Zhan Zhang commented on SPARK-21492: Root cause: in the SortMergeJoin (inner/leftOuter/rightOuter), one side of the sorted iterator may not be exhausted; that chunk of memory thus cannot be released, causing a memory leak and performance degradation. > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhan Zhang > > In SortMergeJoin, if the iterator is not exhausted, there will be memory leak > caused by the Sort. The memory is not released until the task end, and cannot > be used by other operators causing performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
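The unexhausted-iterator retention described in the root-cause comment can be illustrated abstractly. The `SortedSide` class below is invented for illustration (it is not Spark's sorter); it only shows why the buffer behind a partially consumed sorted iterator stays resident until exhaustion or an explicit close:

```python
# Made-up sketch: a "sorted side" of a merge join keeps its whole in-memory
# buffer alive until its iterator is either exhausted or explicitly closed.
class SortedSide:
    def __init__(self, rows):
        self.buffer = sorted(rows)   # memory is held here
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.buffer is None or self.pos >= len(self.buffer):
            raise StopIteration
        row = self.buffer[self.pos]
        self.pos += 1
        if self.pos == len(self.buffer):
            self.buffer = None       # released only on full exhaustion
        return row

    def close(self):
        self.buffer = None           # eager release: the idea behind the fix

left = SortedSide(range(1000))
right = SortedSide(range(10))

# An inner merge join can stop as soon as the smaller side runs out...
for _ in right:
    next(left)

# ...but the larger side's buffer is still held:
print(left.buffer is not None)   # True: 1000-row buffer still referenced
left.close()
print(left.buffer is None)       # True: memory reclaimable before task end
```

In the real operator the retained memory is task-scoped sorter memory rather than a Python list, so it stays unavailable to other operators until the task ends, which is the leak the JIRA describes.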
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095418#comment-16095418 ] Apache Spark commented on SPARK-21492: -- User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/18694 > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhan Zhang > > In SortMergeJoin, if the iterator is not exhausted, there will be memory leak > caused by the Sort. The memory is not released until the task end, and cannot > be used by other operators causing performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21492: Assignee: (was: Apache Spark) > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhan Zhang > > In SortMergeJoin, if the iterator is not exhausted, there will be memory leak > caused by the Sort. The memory is not released until the task end, and cannot > be used by other operators causing performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21492: Assignee: Apache Spark > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhan Zhang >Assignee: Apache Spark > > In SortMergeJoin, if the iterator is not exhausted, there will be memory leak > caused by the Sort. The memory is not released until the task end, and cannot > be used by other operators causing performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21492) Memory leak in SortMergeJoin
Zhan Zhang created SPARK-21492: -- Summary: Memory leak in SortMergeJoin Key: SPARK-21492 URL: https://issues.apache.org/jira/browse/SPARK-21492 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Zhan Zhang In SortMergeJoin, if the iterator is not exhausted, there will be memory leak caused by the Sort. The memory is not released until the task end, and cannot be used by other operators causing performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095413#comment-16095413 ] Sean Owen commented on SPARK-21491: --- I see it now, yeah, and it's used in one place in Spark: {code} private[sql] def fieldsMap(fields: Array[StructField]): Map[String, StructField] = { import scala.collection.breakOut fields.map(s => (s.name, s))(breakOut) } {code} OK so it's overriding the builder that map uses to construct its target. I get it. I think it could be worth making this change, where it makes a non-trivial difference to performance. Is this a hotspot? I'd like to get several of them at once if we're going to change any of these instances. > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095398#comment-16095398 ] Iurii Antykhovych edited comment on SPARK-21491 at 7/20/17 9:22 PM: This is relevant to all scala versions starting from 2.8, it's in `scala.collection.breakOut`. The problem with {{collection.map(...).toMap}} is in creation of intermediate collection of tuples, that is converted to map then; that leads to performance degradation and excess object allocation. The price of {{collection.breakOut}} is the code readability, it significantly suffers I guess, compared to {{.toMap}} method. was (Author: sereneant): This is relevant to all scala versions starting from 2.8, it's in `scala.collection.breakOut`. The problem with `collection.map(...).toMap` is in creation of intermediate collection of tuples, that is converted to map then; that leads to performance degradation and excess object allocation. The price of `collection.breakOut` is the code readability, it significantly suffers I guess, compared to '.toMap' method. > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095398#comment-16095398 ] Iurii Antykhovych commented on SPARK-21491: --- This is relevant to all scala versions starting from 2.8, it's in `scala.collection.breakOut`. The problem with `collection.map(...).toMap` is in creation of intermediate collection of tuples, that is converted to map then; that leads to performance degradation and excess object allocation. The price of `collection.breakOut` is the code readability, it significantly suffers I guess, compared to '.toMap' method. > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
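{{breakOut}} itself is Scala-specific, but the allocation pattern the comment describes — materializing a throwaway collection of tuples before building the target map — can be illustrated in any language. A hedged Python analogy (the names and data are illustrative, not from GraphX):

```python
# Analogy for the Scala point above: build the target dict directly instead
# of first materializing an intermediate list of (key, value) tuples.
fields = ["a", "bb", "ccc"]

# Roughly analogous to Scala's fields.map(s => (s.name, s)).toMap:
# a full list of tuples exists in memory before the dict is built.
pairs = [(s, len(s)) for s in fields]
by_len_indirect = dict(pairs)

# Roughly analogous to fields.map(...)(breakOut): entries go straight
# into the target mapping, with no intermediate list of tuples.
by_len_direct = {s: len(s) for s in fields}

print(by_len_direct == by_len_indirect)  # True: same result, fewer allocations
```

The trade-off is the same one the comment raises: the direct form saves an allocation per element, but in Scala the {{breakOut}} spelling is noticeably less readable than {{.toMap}}, so it is usually reserved for hot paths.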
[jira] [Commented] (SPARK-21489) Update release docs to point out Python 2.6 support is removed.
[ https://issues.apache.org/jira/browse/SPARK-21489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095386#comment-16095386 ] Sean Owen commented on SPARK-21489: --- I think that was just changed: https://github.com/apache/spark/commit/5b61cc6d629d537a7e5bcbd5205d1c8a43b14d43 > Update release docs to point out Python 2.6 support is removed. > --- > > Key: SPARK-21489 > URL: https://issues.apache.org/jira/browse/SPARK-21489 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: holdenk >Priority: Trivial > > In the current RDD programming guide docs for Spark 2.2.0 we currently say > "Python 2.6 support may be removed in Spark 2.2.0", but we forgot to update > that to say it was removed, which could lead to some confusion. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21491: Assignee: (was: Apache Spark) > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21491: Assignee: Apache Spark > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Assignee: Apache Spark >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095378#comment-16095378 ] Apache Spark commented on SPARK-21491: -- User 'SereneAnt' has created a pull request for this issue: https://github.com/apache/spark/pull/18693 > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095379#comment-16095379 ] Sean Owen commented on SPARK-21491: --- I can't even find this in Scala. The article talks about 2.8. Is it obsolete? What's the performance problem, and what is different that improves it? > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakOut}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
[ https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iurii Antykhovych updated SPARK-21491: -- Priority: Trivial (was: Minor) > Performance enhancement: eliminate creation of intermediate collections > --- > > Key: SPARK-21491 > URL: https://issues.apache.org/jira/browse/SPARK-21491 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 2.2.0 >Reporter: Iurii Antykhovych >Priority: Trivial > > Simple performance optimization in a few places of GraphX: > {{Traversable.toMap}} can be replaced with {{collection.breakout}}. > This would eliminate creation of an intermediate collection of tuples, see > [Stack Overflow > article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections
Iurii Antykhovych created SPARK-21491: - Summary: Performance enhancement: eliminate creation of intermediate collections Key: SPARK-21491 URL: https://issues.apache.org/jira/browse/SPARK-21491 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 2.2.0 Reporter: Iurii Antykhovych Priority: Minor Simple performance optimization in a few places of GraphX: {{Traversable.toMap}} can be replaced with {{collection.breakOut}}. This would eliminate creation of an intermediate collection of tuples, see [Stack Overflow article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21490) SparkLauncher may fail to redirect streams
Marcelo Vanzin created SPARK-21490: -- Summary: SparkLauncher may fail to redirect streams Key: SPARK-21490 URL: https://issues.apache.org/jira/browse/SPARK-21490 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Marcelo Vanzin Priority: Minor While investigating a different issue, I noticed that the redirection handling in SparkLauncher is faulty. In the default case or when users use only log redirection, things should work fine. But if users try to redirect just the stdout of a child process without redirecting stderr, the launcher won't detect that and the child process may hang because stderr is not being read and its buffer fills up. The code should detect these cases and redirect any streams that haven't been explicitly redirected by the user. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
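The failure mode described above can be sketched with the JDK's own `ProcessBuilder` rather than SparkLauncher itself (the shell command is illustrative and assumes a POSIX `sh`): a child process writing to a pipe nobody reads can block once the OS pipe buffer fills, so every stream needs an explicit destination.

```scala
import java.lang.ProcessBuilder.Redirect

// Child writes to both stdout and stderr.
val pb = new ProcessBuilder("sh", "-c", "echo out; echo err 1>&2")

// Give BOTH streams an explicit destination. Redirecting only stdout and
// leaving stderr as an unread pipe is exactly the hang scenario the
// ticket describes for SparkLauncher.
pb.redirectOutput(Redirect.INHERIT)
pb.redirectError(Redirect.INHERIT)

val exit = pb.start().waitFor()
assert(exit == 0) // child exits cleanly because neither pipe can fill up
```

The proposed fix amounts to the launcher applying this "no stream left dangling" rule automatically for whichever streams the user did not redirect.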
[jira] [Created] (SPARK-21489) Update release docs to point out Python 2.6 support is removed.
holdenk created SPARK-21489: --- Summary: Update release docs to point out Python 2.6 support is removed. Key: SPARK-21489 URL: https://issues.apache.org/jira/browse/SPARK-21489 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 2.2.0, 2.2.1 Reporter: holdenk Priority: Trivial In the current RDD programming guide docs for Spark 2.2.0 we currently say "Python 2.6 support may be removed in Spark 2.2.0", but we forgot to update that to say it was removed, which could lead to some confusion. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe
[ https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095323#comment-16095323 ] Shixiong Zhu edited comment on SPARK-21425 at 7/20/17 8:45 PM: --- [~srowen] The issue is static accumulators. Right? They won't be serialized/deserialized with tasks and cannot be reported back to driver. If running in local-cluster mode or a real cluster, they will always be 0 in the driver side. The overhead comes from memory barrier introduced by synchronization. I tried to improve this but gave up due to the significant performance regression: https://github.com/apache/spark/pull/15065 `avg` is a good point. But it's not a big deal considering it just shows on UI and will be fixed eventually when the job finishes. was (Author: zsxwing): [~srowen] The issue is static accumulators. Right? They won't be serialized/deserialized with tasks and cannot be reported back to driver. If running in local-cluster mode or a real cluster, they will always be 0 in the driver side. The overhead comes from memory barrier introduced by synchronization. I tried to improved this but gave up due to the significant performance regression: https://github.com/apache/spark/pull/15065 `avg` is a good point. But it's not a big deal considering it just shows on UI and will be fixed eventually when the job finishes. > LongAccumulator, DoubleAccumulator not threadsafe > - > > Key: SPARK-21425 > URL: https://issues.apache.org/jira/browse/SPARK-21425 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > [AccumulatorV2 > docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43] > acknowledge that accumulators must be concurrent-read-safe, but afaict they > must also be concurrent-write-safe. > The same docs imply that {{Int}} and {{Long}} meet either/both of these > criteria, when afaict they do not. 
> Relatedly, the provided > [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291] > and > [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370] > are not thread-safe, and should be expected to behave undefinedly when > multiple concurrent tasks on the same executor write to them. > [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] > with some simple applications that demonstrate incorrect results from > {{LongAccumulator}}'s. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
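The underlying hazard, and the synchronization whose memory-barrier cost the comment above weighs, can be sketched with a toy counter (the class names are illustrative, not Spark's actual accumulator code):

```scala
// Unsafe: `sum += v` is a read-modify-write, so concurrent writers can
// lose updates -- the LongAccumulator behavior the ticket reports.
class UnsafeCounter { var sum = 0L; def add(v: Long): Unit = sum += v }

// Safe: a lock makes add atomic, at the cost of a memory barrier per
// call (the overhead that made the linked PR a performance regression).
class SafeCounter {
  private var sum = 0L
  def add(v: Long): Unit = synchronized { sum += v }
  def value: Long = synchronized { sum }
}

val safe = new SafeCounter
val threads = (1 to 4).map(_ =>
  new Thread(() => (1 to 100000).foreach(_ => safe.add(1L))))
threads.foreach(_.start())
threads.foreach(_.join())
// Exact with synchronization; an UnsafeCounter under the same load
// would typically come up short.
assert(safe.value == 400000L)
```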
[jira] [Comment Edited] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe
[ https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095323#comment-16095323 ] Shixiong Zhu edited comment on SPARK-21425 at 7/20/17 8:38 PM: --- [~srowen] The issue is static accumulators. Right? They won't be serialized/deserialized with tasks and cannot be reported back to driver. If running in local-cluster mode or a real cluster, they will always be 0 in the driver side. The overhead comes from memory barrier introduced by synchronization. I tried to improved this but gave up due to the significant performance regression: https://github.com/apache/spark/pull/15065 `avg` is a good point. But it's not a big deal considering it just shows on UI and will be fixed eventually when the job finishes. was (Author: zsxwing): [~srowen] The issue is static accumulators. Right? They won't be serialized/deserialized with tasks and cannot be reported back to driver. If running in local-cluster mode or a real cluster, they will always be 0 in the driver side. The overhead comes from memory barrier introduced by synchronization. I tried to improved this but gave up due to the significant performance regression: https://github.com/apache/spark/pull/15065 > LongAccumulator, DoubleAccumulator not threadsafe > - > > Key: SPARK-21425 > URL: https://issues.apache.org/jira/browse/SPARK-21425 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > [AccumulatorV2 > docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43] > acknowledge that accumulators must be concurrent-read-safe, but afaict they > must also be concurrent-write-safe. > The same docs imply that {{Int}} and {{Long}} meet either/both of these > criteria, when afaict they do not. 
> Relatedly, the provided > [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291] > and > [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370] > are not thread-safe, and should be expected to behave undefinedly when > multiple concurrent tasks on the same executor write to them. > [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] > with some simple applications that demonstrate incorrect results from > {{LongAccumulator}}'s. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21425) LongAccumulator, DoubleAccumulator not threadsafe
[ https://issues.apache.org/jira/browse/SPARK-21425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095323#comment-16095323 ] Shixiong Zhu commented on SPARK-21425: -- [~srowen] The issue is static accumulators. Right? They won't be serialized/deserialized with tasks and cannot be reported back to the driver. If running in local-cluster mode or a real cluster, they will always be 0 on the driver side. The overhead comes from the memory barrier introduced by synchronization. I tried to improve this but gave up due to the significant performance regression: https://github.com/apache/spark/pull/15065 > LongAccumulator, DoubleAccumulator not threadsafe > - > > Key: SPARK-21425 > URL: https://issues.apache.org/jira/browse/SPARK-21425 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > [AccumulatorV2 > docs|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L42-L43] > acknowledge that accumulators must be concurrent-read-safe, but afaict they > must also be concurrent-write-safe. > The same docs imply that {{Int}} and {{Long}} meet either/both of these > criteria, when afaict they do not. > Relatedly, the provided > [LongAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L291] > and > [DoubleAccumulator|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L370] > are not thread-safe, and should be expected to behave undefinedly when > multiple concurrent tasks on the same executor write to them. > [Here is a repro repo|https://github.com/ryan-williams/spark-bugs/tree/accum] > with some simple applications that demonstrate incorrect results from > {{LongAccumulator}}'s.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21417) Detect transitive join conditions via expressions
[ https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21417: Assignee: (was: Apache Spark) > Detect transitive join conditions via expressions > - > > Key: SPARK-21417 > URL: https://issues.apache.org/jira/browse/SPARK-21417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Claus Stadler > > _Disclaimer: The nature of this report is similar to that of > https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my > understanding) uses its own SQL implementation, the requested improvement has > to be treated as a separate issue._ > Given table aliases ta, tb, column names ca, cb, and an arbitrary > (deterministic) expression expr, Spark should be capable of inferring join > conditions by transitivity: > {noformat} > ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb > {noformat} > The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries > such as > {code:java} > SELECT { > dbr:Leipzig a ?type . > dbr:Leipzig dbo:mayor ?mayor > } > {code} > result in an SQL query similar to > {noformat} > SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig' > {noformat} > A consequence of the join condition not being recognized is that Apache > Spark does not find an executable plan to process the query.
> Self contained example: > {code:java} > package my.package; > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.scalatest._ > class TestSparkSqlJoin extends FlatSpec { > "SPARK SQL processor" should "be capable of handling transitive join > conditions" in { > val spark = SparkSession > .builder() > .master("local[2]") > .appName("Spark SQL parser bug") > .getOrCreate() > import spark.implicits._ > // The schema is encoded in a string > val schemaString = "s p o" > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, StringType, nullable = true)) > val schema = StructType(fields) > val data = List(("s1", "p1", "o1")) > val dataRDD = spark.sparkContext.parallelize(data).map(attributes => > Row(attributes._1, attributes._2, attributes._3)) > val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES") > df.createOrReplaceTempView("TRIPLES") > println("First Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = > 'dbr:Leipzig'").show(10) > println("Second Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' > AND B.s = 'dbr:Leipzig'").show(10) > } > } > {code} > Output (excerpt): > {noformat} > First Query > ... > +---+ > | s| > +---+ > +---+ > Second Query > - should be capable of handling transitive join conditions *** FAILED *** > org.apache.spark.sql.AnalysisException: Detected cartesian product for > INNER join between logical plans > Project [s#3] > +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig)) >+- LogicalRDD [s#3, p#4, o#5] > and > Project > +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig)) >+- LogicalRDD [s#20, p#21, o#22] > Join condition is missing or trivial. 
> Use the CROSS JOIN syntax to allow cartesian products between these > relations.; > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > ... > Run completed in 6 seconds, 833 milliseconds. > Total number of tests run: 1 > Suites: completed 1, aborted 0 > Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 > *** 1 TEST FAILED *** > {noformat} > Expected: > A correctly working, executable, query plan for the second query (ideally > equivalent to that of the
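The inference the ticket asks for can be sketched as a toy rewrite over equality predicates (the `Eq` type and `inferTransitive` name are hypothetical, not Spark's Catalyst API):

```scala
// An equality predicate `left = right`; both sides kept as strings here.
case class Eq(left: String, right: String)

// If two columns are each equated to the same deterministic expression,
// derive the column-to-column equality that makes the join non-cartesian.
def inferTransitive(preds: Seq[Eq]): Seq[Eq] =
  (for {
    Eq(a, e1) <- preds
    Eq(b, e2) <- preds
    if e1 == e2 && a != b
  } yield Eq(a, b)).distinct

// The two filter predicates from the failing second query...
val preds = Seq(Eq("A.s", "'dbr:Leipzig'"), Eq("B.s", "'dbr:Leipzig'"))
val derived = inferTransitive(preds)
// ...yield the join condition whose absence triggers the
// "Detected cartesian product" error above.
assert(derived.contains(Eq("A.s", "B.s")))
```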
[jira] [Assigned] (SPARK-21417) Detect transitive join conditions via expressions
[ https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21417: Assignee: Apache Spark > Detect transitive join conditions via expressions > - > > Key: SPARK-21417 > URL: https://issues.apache.org/jira/browse/SPARK-21417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Claus Stadler >Assignee: Apache Spark > > _Disclaimer: The nature of this report is similar to that of > https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my > understanding) uses its own SQL implementation, the requested improvement has > to be treated as a separate issue._ > Given table aliases ta, tb column names ca, cb, and an arbitrary > (deterministic) expression expr then calcite should be capable to infer join > conditions by transitivity: > {noformat} > ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb > {noformat} > The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries > such as > {code:java} > SELECT { > dbr:Leipzig a ?type . > dbr:Leipzig dbo:mayor ?mayor > } > {code} > result in an SQL query similar to > {noformat} > SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig' > {noformat} > A consequence of the join condition not being recognized is, that Apache > SPARK does not find an executable plan to process the query. 
> Self contained example: > {code:java} > package my.package; > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.scalatest._ > class TestSparkSqlJoin extends FlatSpec { > "SPARK SQL processor" should "be capable of handling transitive join > conditions" in { > val spark = SparkSession > .builder() > .master("local[2]") > .appName("Spark SQL parser bug") > .getOrCreate() > import spark.implicits._ > // The schema is encoded in a string > val schemaString = "s p o" > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, StringType, nullable = true)) > val schema = StructType(fields) > val data = List(("s1", "p1", "o1")) > val dataRDD = spark.sparkContext.parallelize(data).map(attributes => > Row(attributes._1, attributes._2, attributes._3)) > val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES") > df.createOrReplaceTempView("TRIPLES") > println("First Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = > 'dbr:Leipzig'").show(10) > println("Second Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' > AND B.s = 'dbr:Leipzig'").show(10) > } > } > {code} > Output (excerpt): > {noformat} > First Query > ... > +---+ > | s| > +---+ > +---+ > Second Query > - should be capable of handling transitive join conditions *** FAILED *** > org.apache.spark.sql.AnalysisException: Detected cartesian product for > INNER join between logical plans > Project [s#3] > +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig)) >+- LogicalRDD [s#3, p#4, o#5] > and > Project > +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig)) >+- LogicalRDD [s#20, p#21, o#22] > Join condition is missing or trivial. 
> Use the CROSS JOIN syntax to allow cartesian products between these > relations.; > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > ... > Run completed in 6 seconds, 833 milliseconds. > Total number of tests run: 1 > Suites: completed 1, aborted 0 > Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 > *** 1 TEST FAILED *** > {noformat} > Expected: > A correctly working, executable, query plan for the second query (ideally >
[jira] [Commented] (SPARK-21417) Detect transitive join conditions via expressions
[ https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095317#comment-16095317 ] Apache Spark commented on SPARK-21417: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/18692 > Detect transitive join conditions via expressions > - > > Key: SPARK-21417 > URL: https://issues.apache.org/jira/browse/SPARK-21417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Claus Stadler > > _Disclaimer: The nature of this report is similar to that of > https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my > understanding) uses its own SQL implementation, the requested improvement has > to be treated as a separate issue._ > Given table aliases ta, tb column names ca, cb, and an arbitrary > (deterministic) expression expr then calcite should be capable to infer join > conditions by transitivity: > {noformat} > ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb > {noformat} > The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries > such as > {code:java} > SELECT { > dbr:Leipzig a ?type . > dbr:Leipzig dbo:mayor ?mayor > } > {code} > result in an SQL query similar to > {noformat} > SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig' > {noformat} > A consequence of the join condition not being recognized is, that Apache > SPARK does not find an executable plan to process the query. 
> Self contained example: > {code:java} > package my.package; > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.scalatest._ > class TestSparkSqlJoin extends FlatSpec { > "SPARK SQL processor" should "be capable of handling transitive join > conditions" in { > val spark = SparkSession > .builder() > .master("local[2]") > .appName("Spark SQL parser bug") > .getOrCreate() > import spark.implicits._ > // The schema is encoded in a string > val schemaString = "s p o" > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, StringType, nullable = true)) > val schema = StructType(fields) > val data = List(("s1", "p1", "o1")) > val dataRDD = spark.sparkContext.parallelize(data).map(attributes => > Row(attributes._1, attributes._2, attributes._3)) > val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES") > df.createOrReplaceTempView("TRIPLES") > println("First Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = > 'dbr:Leipzig'").show(10) > println("Second Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' > AND B.s = 'dbr:Leipzig'").show(10) > } > } > {code} > Output (excerpt): > {noformat} > First Query > ... > +---+ > | s| > +---+ > +---+ > Second Query > - should be capable of handling transitive join conditions *** FAILED *** > org.apache.spark.sql.AnalysisException: Detected cartesian product for > INNER join between logical plans > Project [s#3] > +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig)) >+- LogicalRDD [s#3, p#4, o#5] > and > Project > +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig)) >+- LogicalRDD [s#20, p#21, o#22] > Join condition is missing or trivial. 
> Use the CROSS JOIN syntax to allow cartesian products between these > relations.; > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080) > at > org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > ... > Run completed in 6 seconds, 833 milliseconds. > Total number of tests run: 1 > Suites: completed 1, aborted 0 > Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 > *** 1 TEST FAILED *** > {noformat} > Expected: > A
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095299#comment-16095299 ] holdenk commented on SPARK-7146: So it seems like there is a (more recent) agreement that exposing this as a DeveloperAPI could be useful (I know in my own talks where I mention how to write custom ML pipeline stages I wish this was available as a developer API). > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. 
> *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21478) Unpersist a DF also unpersists related DFs
[ https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095275#comment-16095275 ] Roberto Mirizzi edited comment on SPARK-21478 at 7/20/17 8:06 PM: -- Sorry about that. I totally misunderstood you. My bad :-) was (Author: roberto.mirizzi): Sorry about that. I totally misunderstood you. :-) > Unpersist a DF also unpersists related DFs > -- > > Key: SPARK-21478 > URL: https://issues.apache.org/jira/browse/SPARK-21478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Roberto Mirizzi > > Starting with Spark 2.1.1 I observed this bug. Here's are the steps to > reproduce it: > # create a DF > # persist it > # count the items in it > # create a new DF as a transformation of the previous one > # persist it > # count the items in it > # unpersist the first DF > Once you do that you will see that also the 2nd DF is gone. > The code to reproduce it is: > {code:java} > val x1 = Seq(1).toDF() > x1.persist() > x1.count() > assert(x1.storageLevel.useMemory) > val x11 = x1.select($"value" * 2) > x11.persist() > x11.count() > assert(x11.storageLevel.useMemory) > x1.unpersist() > assert(!x1.storageLevel.useMemory) > //the following assertion FAILS > assert(x11.storageLevel.useMemory) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21478) Unpersist a DF also unpersists related DFs
[ https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095275#comment-16095275 ] Roberto Mirizzi commented on SPARK-21478: - Sorry about that. I totally misunderstood you. :-) > Unpersist a DF also unpersists related DFs > -- > > Key: SPARK-21478 > URL: https://issues.apache.org/jira/browse/SPARK-21478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Roberto Mirizzi > > Starting with Spark 2.1.1 I observed this bug. Here's are the steps to > reproduce it: > # create a DF > # persist it > # count the items in it > # create a new DF as a transformation of the previous one > # persist it > # count the items in it > # unpersist the first DF > Once you do that you will see that also the 2nd DF is gone. > The code to reproduce it is: > {code:java} > val x1 = Seq(1).toDF() > x1.persist() > x1.count() > assert(x1.storageLevel.useMemory) > val x11 = x1.select($"value" * 2) > x11.persist() > x11.count() > assert(x11.storageLevel.useMemory) > x1.unpersist() > assert(!x1.storageLevel.useMemory) > //the following assertion FAILS > assert(x11.storageLevel.useMemory) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21478) Unpersist a DF also unpersists related DFs
[ https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095208#comment-16095208 ] Sean Owen commented on SPARK-21478: --- I said I can reproduce it, I agree with you. > Unpersist a DF also unpersists related DFs > -- > > Key: SPARK-21478 > URL: https://issues.apache.org/jira/browse/SPARK-21478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Roberto Mirizzi > > Starting with Spark 2.1.1 I observed this bug. Here's are the steps to > reproduce it: > # create a DF > # persist it > # count the items in it > # create a new DF as a transformation of the previous one > # persist it > # count the items in it > # unpersist the first DF > Once you do that you will see that also the 2nd DF is gone. > The code to reproduce it is: > {code:java} > val x1 = Seq(1).toDF() > x1.persist() > x1.count() > assert(x1.storageLevel.useMemory) > val x11 = x1.select($"value" * 2) > x11.persist() > x11.count() > assert(x11.storageLevel.useMemory) > x1.unpersist() > assert(!x1.storageLevel.useMemory) > //the following assertion FAILS > assert(x11.storageLevel.useMemory) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
[ https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095203#comment-16095203 ] Kaushal Prajapati commented on SPARK-19908: --- [~bojanbabic] This error occurs when your Kryo buffer usage exceeds the default limit (64m). Increase the limit by setting this property to a larger value, for example {noformat} spark.kryoserializer.buffer.max=128m {noformat} > Direct buffer memory OOM should not cause stage retries. > > > Key: SPARK-19908 > URL: https://issues.apache.org/jira/browse/SPARK-19908 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the > exception will be changed to FetchFailedException, causing stage retries. > org.apache.spark.shuffle.FetchFailedException: Direct buffer memory > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > 
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at >
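The core of the SPARK-19908 complaint: a fatal OutOfMemoryError thrown during a shuffle fetch gets wrapped into a FetchFailedException, which the scheduler treats as a retriable fetch failure, so the stage is retried instead of the task failing fast. A hedged pure-Python sketch of that distinction; the class and function names are illustrative, not Spark's actual code:

```python
# Sketch of why wrapping a fatal error as a fetch failure causes stage
# retries, and of the proposed alternative. Illustrative only.

class FetchFailedException(Exception):
    """Scheduler treats this as 'shuffle data lost' and retries the stage."""

class OutOfMemoryError_(Exception):
    """Stand-in for java.lang.OutOfMemoryError: Direct buffer memory."""

def handle_fetch_error(exc, propagate_fatal):
    """Current behavior: wrap everything as FetchFailedException.
    Proposed behavior: let fatal errors propagate so the task fails fast."""
    if propagate_fatal and isinstance(exc, OutOfMemoryError_):
        raise exc  # fail the task outright, no stage retry
    raise FetchFailedException(str(exc))  # triggers a stage retry

oom = OutOfMemoryError_("Direct buffer memory")

# Current behavior: the OOM becomes a retriable fetch failure.
try:
    handle_fetch_error(oom, propagate_fatal=False)
except FetchFailedException:
    retried = True
assert retried

# Proposed behavior: the OOM surfaces as-is.
try:
    handle_fetch_error(oom, propagate_fatal=True)
except OutOfMemoryError_:
    failed_fast = True
assert failed_fast
```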
[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095204#comment-16095204 ] Raajay Viswanathan commented on SPARK-21334: I think SPARK-18364 aims to implement metrics in YarnShuffleService. The current issue builds on existing work (SPARK-16405) for ExternalShuffleService and implements a reporting service that was earlier missing. > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > 2. -The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message.- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21243) Limit the number of maps in a single shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-21243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095199#comment-16095199 ] Apache Spark commented on SPARK-21243: -- User 'dhruve' has created a pull request for this issue: https://github.com/apache/spark/pull/18691 > Limit the number of maps in a single shuffle fetch > -- > > Key: SPARK-21243 > URL: https://issues.apache.org/jira/browse/SPARK-21243 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: Dhruve Ashar >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.3.0 > > > Right now spark can limit the # of parallel fetches and also limits the > amount of data in one fetch, but one fetch to a host could be for 100's of > blocks. In one instance we saw 450+ blocks. When you have 100's of those and > 1000's of reducers fetching that becomes a lot of metadata and can run the > Node Manager out of memory. We should add a config to limit the # of maps per > fetch to reduce the load on the NM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21478) Unpersist a DF also unpersists related DFs
[ https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095187#comment-16095187 ] Roberto Mirizzi commented on SPARK-21478: - That's weird you are not able to reproduce it. Did you just launch the spark-shell and copied/pasted the above commands? I tried on Spark 2.1.0, 2.1.1 and 2.2.0, both on AWS and on my local machine. Spark 2.1.0 doesn't exhibit any issue, Spark 2.1.1 and 2.2.0 fail the last assertion. About your point "I do think one wants to be able to persist the result and not the original though", it depends on the specific use case. My example was a trivial one to reproduce the problem, but as you can imagine you may want to persist the first DF, do a bunch of operations reusing it, and then generate a new DF, persist it, and unpersist the old one when you don't need it anymore. It looks like a serious problem to me. This is my entire output: {code:java} $ spark-shell Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/07/20 11:44:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/07/20 11:44:44 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Spark context Web UI available at http://10.15.16.46:4040 Spark context available as 'sc' (master = local[*], app id = local-1500576281010). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.2.0 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_71) Type in expressions to have them evaluated. Type :help for more information. 
scala> val x1 = Seq(1).toDF() x1: org.apache.spark.sql.DataFrame = [value: int] scala> x1.persist() res0: x1.type = [value: int] scala> x1.count() res1: Long = 1 scala> assert(x1.storageLevel.useMemory) scala> scala> val x11 = x1.select($"value" * 2) x11: org.apache.spark.sql.DataFrame = [(value * 2): int] scala> x11.persist() res3: x11.type = [(value * 2): int] scala> x11.count() res4: Long = 1 scala> assert(x11.storageLevel.useMemory) scala> scala> x1.unpersist() res6: x1.type = [value: int] scala> scala> assert(!x1.storageLevel.useMemory) scala> //the following assertion FAILS scala> assert(x11.storageLevel.useMemory) java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) ... 48 elided scala> {code} > Unpersist a DF also unpersists related DFs > -- > > Key: SPARK-21478 > URL: https://issues.apache.org/jira/browse/SPARK-21478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Roberto Mirizzi > > Starting with Spark 2.1.1 I observed this bug. Here's are the steps to > reproduce it: > # create a DF > # persist it > # count the items in it > # create a new DF as a transformation of the previous one > # persist it > # count the items in it > # unpersist the first DF > Once you do that you will see that also the 2nd DF is gone. > The code to reproduce it is: > {code:java} > val x1 = Seq(1).toDF() > x1.persist() > x1.count() > assert(x1.storageLevel.useMemory) > val x11 = x1.select($"value" * 2) > x11.persist() > x11.count() > assert(x11.storageLevel.useMemory) > x1.unpersist() > assert(!x1.storageLevel.useMemory) > //the following assertion FAILS > assert(x11.storageLevel.useMemory) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
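A plausible model of the regression reported above (hedged; the real logic lives in Spark's CacheManager): starting in 2.1.1, unpersisting a DataFrame also drops cached plans that were built on top of it, i.e. unpersist cascades to dependents. A minimal pure-Python sketch contrasting cascading with non-cascading eviction, tracking direct dependents only for brevity:

```python
# Minimal model of cascading cache eviction, illustrating the reported
# behavior. This is a sketch, not Spark's actual CacheManager code.

class Cache:
    def __init__(self):
        self._parents = {}   # plan -> set of parent plans it derives from
        self._cached = set()

    def persist(self, plan, parents=()):
        self._cached.add(plan)
        self._parents[plan] = set(parents)

    def is_cached(self, plan):
        return plan in self._cached

    def unpersist(self, plan, cascade):
        self._cached.discard(plan)
        if cascade:
            # Also drop every cached plan directly derived from `plan`.
            for child, parents in self._parents.items():
                if plan in parents:
                    self._cached.discard(child)

# Non-cascading: what the reporter expected (Spark 2.1.0 behavior).
c = Cache()
c.persist("x1")
c.persist("x11", parents=["x1"])
c.unpersist("x1", cascade=False)
assert c.is_cached("x11")

# Cascading: the observed 2.1.1/2.2.0 behavior, x11 is evicted too.
c = Cache()
c.persist("x1")
c.persist("x11", parents=["x1"])
c.unpersist("x1", cascade=True)
assert not c.is_cached("x11")
```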
[jira] [Updated] (SPARK-21488) Make saveAsTable() and createOrReplaceTempView() return dataframe of created table/ created view
[ https://issues.apache.org/jira/browse/SPARK-21488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-21488: -- Description: It would be great to make saveAsTable() return dataframe of created table, so you could pipe result further as for example {code} mv_table_df = (sqlc.sql(''' SELECT ... FROM ''') .write.format("parquet").mode("overwrite") .saveAsTable('test.parquet_table') .createOrReplaceTempView('mv_table') ) {code} ... Above code returns now expectedly: {noformat} AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView' {noformat} If this is implemented, we can skip a step like {code} sqlc.sql('SELECT * FROM test.parquet_table').createOrReplaceTempView('mv_table') {code} We have this pattern very frequently. Further improvement can be made if createOrReplaceTempView also returns dataframe object, so in one pipeline of functions we can - create an external table - create a dataframe reference to this newly created for SparkSQL and as a Spark variable. was: It would be great to make saveAsTable() return dataframe of created table, so you could pipe result further as for example {code} (sqlc.sql(''' SELECT ... FROM ''') .write.format("parquet").mode("overwrite") .saveAsTable('test.parquet_table') .createOrReplaceTempView('mv_table') ) {code} ... Above code returns now expectedly: {noformat} AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView' {noformat} If this is implemented, we can skip a step like {code} sqlc.sql('SELECT * FROM test.parquet_table').createOrReplaceTempView('mv_table') {code} We have this pattern very frequently. 
> Make saveAsTable() and createOrReplaceTempView() return dataframe of created > table/ created view > > > Key: SPARK-21488 > URL: https://issues.apache.org/jira/browse/SPARK-21488 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: Ruslan Dautkhanov > > It would be great to make saveAsTable() return dataframe of created table, > so you could pipe result further as for example > {code} > mv_table_df = (sqlc.sql(''' > SELECT ... > FROM > ''') > .write.format("parquet").mode("overwrite") > .saveAsTable('test.parquet_table') > .createOrReplaceTempView('mv_table') > ) > {code} > ... Above code returns now expectedly: > {noformat} > AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView' > {noformat} > If this is implemented, we can skip a step like > {code} > sqlc.sql('SELECT * FROM > test.parquet_table').createOrReplaceTempView('mv_table') > {code} > We have this pattern very frequently. > Further improvement can be made if createOrReplaceTempView also returns > dataframe object, so in one pipeline of functions > we can > - create an external table > - create a dataframe reference to this newly created for SparkSQL and as a > Spark variable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
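The change requested in SPARK-21488 is a standard fluent-interface tweak: have saveAsTable() return a DataFrame for the just-written table instead of None, so calls chain. A hedged pure-Python sketch of the before/after using mock classes, not pyspark's actual DataFrame/DataFrameWriter API:

```python
# Sketch of the fluent-return change requested above, with mock classes
# standing in for pyspark's real DataFrame/DataFrameWriter.

class DataFrame:
    def __init__(self, name):
        self.name = name
        self.views = {}

    def createOrReplaceTempView(self, view_name):
        # The proposal also asks this to return the DataFrame for chaining.
        self.views[view_name] = self
        return self

class Writer:
    def __init__(self, df):
        self._df = df

    def saveAsTable_current(self, table):
        # Today: persists the table but returns None, breaking the chain.
        return None

    def saveAsTable_proposed(self, table):
        # Proposed: return a DataFrame referencing the newly created table.
        return DataFrame(table)

df = DataFrame("source")

# Current behavior: chaining fails with AttributeError on None.
try:
    Writer(df).saveAsTable_current("test.parquet_table") \
              .createOrReplaceTempView("mv_table")
except AttributeError:
    chain_broke = True
assert chain_broke

# Proposed behavior: one pipeline creates the table and the temp view.
result = Writer(df).saveAsTable_proposed("test.parquet_table") \
                   .createOrReplaceTempView("mv_table")
assert result.name == "test.parquet_table"
```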
[jira] [Updated] (SPARK-21488) Make saveAsTable() and createOrReplaceTempView() return dataframe of created table/ created view
[ https://issues.apache.org/jira/browse/SPARK-21488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-21488: -- Summary: Make saveAsTable() and createOrReplaceTempView() return dataframe of created table/ created view (was: Make saveAsTable() return dataframe of created table) > Make saveAsTable() and createOrReplaceTempView() return dataframe of created > table/ created view > > > Key: SPARK-21488 > URL: https://issues.apache.org/jira/browse/SPARK-21488 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: Ruslan Dautkhanov > > It would be great to make saveAsTable() return dataframe of created table, > so you could pipe result further as for example > {code} > (sqlc.sql(''' > SELECT ... > FROM > ''') > .write.format("parquet").mode("overwrite") > .saveAsTable('test.parquet_table') > .createOrReplaceTempView('mv_table') > ) > {code} > ... Above code returns now expectedly: > {noformat} > AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView' > {noformat} > If this is implemented, we can skip a step like > {code} > sqlc.sql('SELECT * FROM > test.parquet_table').createOrReplaceTempView('mv_table') > {code} > We have this pattern very frequently. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21488) Make saveAsTable() return dataframe of created table
Ruslan Dautkhanov created SPARK-21488: - Summary: Make saveAsTable() return dataframe of created table Key: SPARK-21488 URL: https://issues.apache.org/jira/browse/SPARK-21488 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL Affects Versions: 2.2.0 Reporter: Ruslan Dautkhanov It would be great to make saveAsTable() return dataframe of created table, so you could pipe result further as for example {code} (sqlc.sql(''' SELECT ... FROM ''') .write.format("parquet").mode("overwrite") .saveAsTable('test.parquet_table') .createOrReplaceTempView('mv_table') ) {code} ... Above code returns now expectedly: {noformat} AttributeError: 'NoneType' object has no attribute 'createOrReplaceTempView' {noformat} If this is implemented, we can skip a step like {code} sqlc.sql('SELECT * FROM test.parquet_table').createOrReplaceTempView('mv_table') {code} We have this pattern very frequently. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21463) Output of StructuredStreaming tables don't respect user specified schema when reading back the table
[ https://issues.apache.org/jira/browse/SPARK-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21463. -- Resolution: Fixed Fix Version/s: 2.3.0 > Output of StructuredStreaming tables don't respect user specified schema when > reading back the table > > > Key: SPARK-21463 > URL: https://issues.apache.org/jira/browse/SPARK-21463 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.3.0 > > > When using the MetadataLogFileIndex to read back a table, we don't respect > the user provided schema as the proper column types. This can lead to issues > when trying to read strings that look like dates that get truncated to > DateType, or longs being truncated to IntegerType, just because a long value > doesn't exist. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21478) Unpersist a DF also unpersists related DFs
[ https://issues.apache.org/jira/browse/SPARK-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21478: - Component/s: (was: Spark Core) SQL > Unpersist a DF also unpersists related DFs > -- > > Key: SPARK-21478 > URL: https://issues.apache.org/jira/browse/SPARK-21478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Roberto Mirizzi > > Starting with Spark 2.1.1 I observed this bug. Here's are the steps to > reproduce it: > # create a DF > # persist it > # count the items in it > # create a new DF as a transformation of the previous one > # persist it > # count the items in it > # unpersist the first DF > Once you do that you will see that also the 2nd DF is gone. > The code to reproduce it is: > {code:java} > val x1 = Seq(1).toDF() > x1.persist() > x1.count() > assert(x1.storageLevel.useMemory) > val x11 = x1.select($"value" * 2) > x11.persist() > x11.count() > assert(x11.storageLevel.useMemory) > x1.unpersist() > assert(!x1.storageLevel.useMemory) > //the following assertion FAILS > assert(x11.storageLevel.useMemory) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full
[ https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095131#comment-16095131 ] Ruslan Dautkhanov commented on SPARK-21460: --- [~Dhruve Ashar], I can email logs to you. The logs are not very revealing, though; the problem basically starts at {noformat} ERROR [2017-05-15 10:37:53,350] ({dag-scheduler-event-loop} Logging.scala[logError]:70) - Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler. {noformat} and nothing interesting follows. We were hitting it constantly until the following changes were made: - disabled concurrentSQL - increased spark.scheduler.listenerbus.eventqueue.size to 55000 - set spark.dynamicAllocation.maxExecutors to 210 (it was previously unset/unlimited) After that we have seen it only rarely - a few times (the changes were made back in May). It was also happening mostly with users who used concurrentSQL heavily (submitting multiple jobs before previous ones completed). concurrentSQL isn't the root problem, though - it just makes the ListenerBus event queue fill up quicker. Again, even after the above workaround changes were made, including disabling concurrentSQL, we have still seen the same issue a few times. 
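The failure mode in SPARK-21460 can be modeled simply: the ListenerBus queue is bounded and drops events once full (the ERROR line above), so a listener such as the one driving dynamic allocation never learns that some work finished and keeps executors alive indefinitely. A minimal pure-Python toy model, hedged; the real code is Spark's LiveListenerBus and ExecutorAllocationManager:

```python
# Toy model of a bounded, drop-on-full listener queue and the stale
# state it leaves behind. Illustrative; not Spark's LiveListenerBus code.
from collections import deque

class ListenerBus:
    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity
        self.dropped = 0

    def post(self, event):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # event silently lost, like the ERROR above
            return
        self.queue.append(event)

class AllocationListener:
    """Tracks active jobs; executors would only shrink at zero."""
    def __init__(self):
        self.active_jobs = 0

    def on_event(self, event):
        if event == "job_start":
            self.active_jobs += 1
        elif event == "job_end":
            self.active_jobs -= 1

bus = ListenerBus(capacity=3)
listener = AllocationListener()

# Four events posted, but the queue holds three: the last job_end drops.
for event in ["job_start", "job_start", "job_end", "job_end"]:
    bus.post(event)

while bus.queue:
    listener.on_event(bus.queue.popleft())

assert bus.dropped == 1
# One job never "ended" from the listener's point of view, so dynamic
# allocation keeps executors around for work that already finished.
assert listener.active_jobs == 1
```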
> Spark dynamic allocation breaks when ListenerBus event queue runs full > -- > > Key: SPARK-21460 > URL: https://issues.apache.org/jira/browse/SPARK-21460 > Project: Spark > Issue Type: Bug > Components: Scheduler, YARN >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Environment: Spark 2.1 > Hadoop 2.6 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: dynamic_allocation, performance, scheduler, yarn > > When ListenerBus event queue runs full, spark dynamic allocation stops > working - Spark fails to shrink number of executors when there are no active > jobs (Spark driver "thinks" there are active jobs since it didn't capture > when they finished) . > ps. What's worse it also makes Spark flood YARN RM with reservation requests, > so YARN preemption doesn't function properly too (we're on Spark 2.1 / Hadoop > 2.6). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095095#comment-16095095 ] Robert Kruszewski commented on SPARK-21334: --- I think this is a dupe of https://issues.apache.org/jira/browse/SPARK-18364 > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > 2. -The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message.- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21334: Assignee: Apache Spark > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan >Assignee: Apache Spark > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > 2. -The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message.- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095063#comment-16095063 ] Apache Spark commented on SPARK-21334: -- User 'raajay' has created a pull request for this issue: https://github.com/apache/spark/pull/18690 > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > 2. -The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message.- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21334: Assignee: (was: Apache Spark) > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > 2. -The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message.- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095056#comment-16095056 ] Raajay Viswanathan commented on SPARK-21334: [~jerryshao] I am using the external shuffle service in standalone mode. Also, I was mistaken in the claim that the "blockTransferRate" is incorrect. It is computed correctly in the latest version of Spark; it was fixed as part of [SPARK-20994]. > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for the external shuffle service. However, as it > currently stands, there are two issues. > 1. The shuffle service metrics system never reports values. > 2. -The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message is > processed. The actual data fetch from disk and the subsequent transfer > over the wire happen much later, when MessageEncoder invokes encode on the > ChunkFetchSuccess message.- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raajay Viswanathan updated SPARK-21334: --- Description: SPARK-16405 introduced metrics for external shuffle service. However, as it is currently there are two issues. 1. The shuffle service metrics system does not report values ever. --2. The current method for determining "blockTransferRate" is incorrect. The entire block is assumed to be transferred once the OpenBlocks message if processed. The actual data fetch from the disk and the succeeding transfer over the wire happens much later when MessageEncoder invokes encode on ChunkFetchSuccess message. -- was: SPARK-16405 introduced metrics for external shuffle service. However, as it is currently there are two issues. 1. The shuffle service metrics system does not report values ever. -2. The current method for determining "blockTransferRate" is incorrect. The entire block is assumed to be transferred once the OpenBlocks message if processed. The actual data fetch from the disk and the succeeding transfer over the wire happens much later when MessageEncoder invokes encode on ChunkFetchSuccess message. - > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > --2. The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. 
The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message. -- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raajay Viswanathan updated SPARK-21334: --- Description: SPARK-16405 introduced metrics for external shuffle service. However, as it is currently there are two issues. 1. The shuffle service metrics system does not report values ever. 2. -The current method for determining "blockTransferRate" is incorrect. The entire block is assumed to be transferred once the OpenBlocks message if processed. The actual data fetch from the disk and the succeeding transfer over the wire happens much later when MessageEncoder invokes encode on ChunkFetchSuccess message.- was: SPARK-16405 introduced metrics for external shuffle service. However, as it is currently there are two issues. 1. The shuffle service metrics system does not report values ever. --2. The current method for determining "blockTransferRate" is incorrect. The entire block is assumed to be transferred once the OpenBlocks message if processed. The actual data fetch from the disk and the succeeding transfer over the wire happens much later when MessageEncoder invokes encode on ChunkFetchSuccess message. -- > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > 2. -The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. 
The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message.- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21334) Fix metrics for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raajay Viswanathan updated SPARK-21334: --- Description: SPARK-16405 introduced metrics for external shuffle service. However, as it is currently there are two issues. 1. The shuffle service metrics system does not report values ever. -2. The current method for determining "blockTransferRate" is incorrect. The entire block is assumed to be transferred once the OpenBlocks message if processed. The actual data fetch from the disk and the succeeding transfer over the wire happens much later when MessageEncoder invokes encode on ChunkFetchSuccess message. - was: SPARK-16405 introduced metrics for external shuffle service. However, as it is currently there are two issues. 1. The shuffle service metrics system does not report values ever. 2. The current method for determining "blockTransferRate" is incorrect. The entire block is assumed to be transferred once the OpenBlocks message if processed. The actual data fetch from the disk and the succeeding transfer over the wire happens much later when MessageEncoder invokes encode on ChunkFetchSuccess message. > Fix metrics for external shuffle service > > > Key: SPARK-21334 > URL: https://issues.apache.org/jira/browse/SPARK-21334 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Raajay Viswanathan > Labels: external-shuffle-service > Original Estimate: 168h > Remaining Estimate: 168h > > SPARK-16405 introduced metrics for external shuffle service. However, as it > is currently there are two issues. > 1. The shuffle service metrics system does not report values ever. > -2. The current method for determining "blockTransferRate" is incorrect. The > entire block is assumed to be transferred once the OpenBlocks message if > processed. 
The actual data fetch from the disk and the succeeding transfer > over the wire happens much later when MessageEncoder invokes encode on > ChunkFetchSuccess message. - -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
[ https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21142. --- Resolution: Fixed > spark-streaming-kafka-0-10 has too fat dependency on kafka > -- > > Key: SPARK-21142 > URL: https://issues.apache.org/jira/browse/SPARK-21142 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.1 >Reporter: Tim Van Wassenhove >Assignee: Tim Van Wassenhove >Priority: Minor > Fix For: 2.3.0 > > > Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it > should only need a dependency on kafka-clients. > The only reason there is a dependency on kafka (the full server) is the > KafkaTestUtils class, which runs in-memory tests against a Kafka broker. This > class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
[ https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21142: - Assignee: Tim Van Wassenhove Fix Version/s: 2.3.0 Issue Type: Improvement (was: Bug) Resolved by https://github.com/apache/spark/pull/18353 > spark-streaming-kafka-0-10 has too fat dependency on kafka > -- > > Key: SPARK-21142 > URL: https://issues.apache.org/jira/browse/SPARK-21142 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.1 >Reporter: Tim Van Wassenhove >Assignee: Tim Van Wassenhove >Priority: Minor > Fix For: 2.3.0 > > > Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it > should only need a dependency on kafka-clients. > The only reason there is a dependency on kafka (full server) is due to > KafkaTestUtils class to run in memory tests against a kafka broker. This > class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19531) History server doesn't refresh jobs for long-life apps like thriftserver
[ https://issues.apache.org/jira/browse/SPARK-19531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19531. Resolution: Fixed Assignee: Oleg Danilov Fix Version/s: 2.3.0 > History server doesn't refresh jobs for long-life apps like thriftserver > > > Key: SPARK-19531 > URL: https://issues.apache.org/jira/browse/SPARK-19531 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Oleg Danilov >Assignee: Oleg Danilov > Fix For: 2.3.0 > > > If spark.history.fs.logDirectory points to HDFS, the Spark history server > doesn't refresh the jobs page. This is caused by Hadoop: while writing to the > .inprogress file, Hadoop doesn't update the file length until close, and therefore > Spark's history server is not able to detect any changes. > I'm gonna submit a PR to fix this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20394) Replication factor value Not changing properly
[ https://issues.apache.org/jira/browse/SPARK-20394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20394. Resolution: Workaround Great. I doubt we'll fix this in 1.6 at this point, and I believe this is fixed in 2.x, so let's close this. > Replication factor value Not changing properly > -- > > Key: SPARK-20394 > URL: https://issues.apache.org/jira/browse/SPARK-20394 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.0 >Reporter: Kannan Subramanian > > To save a Spark SQL dataframe to a persistent Hive table, I use the steps below: > a) registerTempTable on the dataframe as a tempTable > b) create table (cols) PartitionedBy(col1, col2) stored as > parquet > c) Insert into partition(col1, col2) select * from tempTable > I have set dfs.replication to "1" on the hiveContext object, but it did > not work properly. That is, the replication factor is 1 for 80% of the generated parquet > files on HDFS, while the default replication factor of 3 applies to the remaining 20% of parquet files in > HDFS. I am not sure why the replication factor is not applied to all the generated > parquet files. Please let me know if you have any suggestions or solutions -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
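[Editorial note] A common workaround for the behavior reported above is to set the replication factor explicitly on the job's Hadoop configuration before writing. The sketch below is illustrative only and cannot run outside a live Spark-on-HDFS cluster; it also relies on the commonly used but private `_jsc` accessor, and whether the setting propagates to every file the Hive insert writes is exactly what this report questions.

```python
# Illustrative sketch only -- requires a live Spark + HDFS cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("replication-demo").getOrCreate()

# Ask HDFS for a replication factor of 1 for files this job creates.
# (_jsc is a private accessor for the underlying JavaSparkContext.)
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.replication", "1")

# ... registerTempTable / "INSERT INTO ... SELECT * FROM tempTable", as in the report ...
```

If some files still come out with the default factor of 3, `hdfs dfs -setrep -w 1 <path>` can normalize them after the fact.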
[jira] [Comment Edited] (SPARK-21417) Detect transitive join conditions via expressions
[ https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094844#comment-16094844 ] Claus Stadler edited comment on SPARK-21417 at 7/20/17 3:41 PM: Hi Anton, I have a rough idea how the issue could be addressed, but I feel it would be better if the SPARK team provided feedback on the idea and possibly revises it. I am also not very familiar with the code, so I would be good if once the conceptual approach is sorted out, you could provide at least an initial implementation. Should there be any small issues left, I may be able to contribute fixes if needed. Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar' * Normalize the selection expression into disjunctive normal form => { {(a.x = 'foo'), ('foo', b.y)}, {(c.z = 'bar')} } * Normalize argument order of equal expressions => { {(a.x = 'foo'), (b.y = 'foo')}, {(c.z = 'bar')} } * For each clause in the DNF, infer transitive join conditions b using the following approach; Given {(a.x = 'foo'), ('foo', b.y)} * Create a multimap from all equal expressions of the form ee.ff = expr, which in our example gives { 'foo' => {a.x, b.y } } * For every multimap key having multiple values, infer appropriate equals expressions, in this case a.x = b.y. * Note: If there is more than 2 values, in general one create equal expressions for each pair-wise combination of distinct elements in the value set. Alternatively, one could pick a random element, and equate all other values to it. Not sure what representation is most beneficial for the underlying SPARK query optimizer. was (Author: aklakan): Hi Anton, I have a rough idea how the issue could be addressed, but I feel it would be better if the SPARK team provided feedback on the idea and possibly revises it. I am also not very familiar with the code, so I would be good if once the conceptual approach is sorted out, you could provide at least an initial implementation. 
Should there be any small issues left, I may be able to contribute fixes if needed. Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar' * Normalize the selection expression into disjunctive normal form => { {(a.x = 'foo'), ('foo', b.y)}, {(c.z = 'bar')} } * Normalize argument order of equal expressions => { {(a.x = 'foo'), (b.y = 'foo')}, {(c.z = 'bar')} } * For each clause in the DNF, infer transitive join conditions b using the following approach: Given {(a.x = 'foo'), ('foo', b.y)}: * Create a multimap from all equal expressions of the form ee.ff = expr, which in our example gives { 'foo' => {a.x, b.y } } * For every multimap key having multiple values, infer appropriate equals expressions, in this case a.x = b.y. * Note: If there is more than 2 values, in general one create equal expressions for each pair-wise combination of distinct elements in the value set. Alternatively, one could pick a random element, and equate all other values to it. Not sure what representation is most beneficial for the underlying SPARK query optimizer. > Detect transitive join conditions via expressions > - > > Key: SPARK-21417 > URL: https://issues.apache.org/jira/browse/SPARK-21417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Claus Stadler > > _Disclaimer: The nature of this report is similar to that of > https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my > understanding) uses its own SQL implementation, the requested improvement has > to be treated as a separate issue._ > Given table aliases ta, tb column names ca, cb, and an arbitrary > (deterministic) expression expr then calcite should be capable to infer join > conditions by transitivity: > {noformat} > ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb > {noformat} > The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries > such as > {code:java} > SELECT { > dbr:Leipzig a ?type . 
> dbr:Leipzig dbo:mayor ?mayor > } > {code} > result in an SQL query similar to > {noformat} > SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig' > {noformat} > A consequence of the join condition not being recognized is, that Apache > SPARK does not find an executable plan to process the query. > Self contained example: > {code:java} > package my.package; > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.scalatest._ > class TestSparkSqlJoin extends FlatSpec { > "SPARK SQL processor" should "be capable of handling transitive join > conditions" in { > val spark = SparkSession > .builder() > .master("local[2]") > .appName("Spark SQL
[jira] [Comment Edited] (SPARK-21417) Detect transitive join conditions via expressions
[ https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094844#comment-16094844 ] Claus Stadler edited comment on SPARK-21417 at 7/20/17 3:40 PM: Hi Anton, I have a rough idea how the issue could be addressed, but I feel it would be better if the SPARK team provided feedback on the idea and possibly revises it. I am also not very familiar with the code, so I would be good if once the conceptual approach is sorted out, you could provide at least an initial implementation. Should there be any small issues left, I may be able to contribute fixes if needed. Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar' * Normalize the selection expression into disjunctive normal form => { {(a.x = 'foo'), ('foo', b.y)}, {(c.z = 'bar')} } * Normalize argument order of equal expressions => { {(a.x = 'foo'), (b.y = 'foo')}, {(c.z = 'bar')} } * For each clause in the DNF, infer transitive join conditions b using the following approach: Given {(a.x = 'foo'), ('foo', b.y)}: * Create a multimap from all equal expressions of the form ee.ff = expr, which in our example gives { 'foo' => {a.x, b.y } } * For every multimap key having multiple values, infer appropriate equals expressions, in this case a.x = b.y. * Note: If there is more than 2 values, in general one create equal expressions for each pair-wise combination of distinct elements in the value set. Alternatively, one could pick a random element, and equate all other values to it. Not sure what representation is most beneficial for the underlying SPARK query optimizer. was (Author: aklakan): Hi Anton, I have a rough idea how the issue could be addressed, but I feel it would be better if the SPARK team provided feedback on the idea and possibly revises it. I am also not very familiar with the code, so I would be good if once the conceptual approach is sorted out, you could provide at least an initial implementation. 
Should there be any small issues left, I may be able to contribute fixes if needed. Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar' * Normalize the selection expression into disjunctive normal form => { {(a.x = 'foo'), ('foo', b.y)}, {(c.z = 'bar')} } * Normalize argument order of equal expressions => { {(a.x = 'foo'), (b.y = 'foo')}, {(c.z = 'bar')} } * For each clause in the DNF, infer transitive join conditions b using the following approach: Given e.g. {(a.x = 'foo'), ('foo', b.y)}: * Create a multimap from all equal expressions of the form ee.ff = expr, which in our example gives { 'foo' => {a.x, b.y } } * For every multimap key having multiple values, infer appropriate equals expressions, in this case a.x = b.y. * Note: If there is more than 2 values, in general one create equal expressions for each pair-wise combination of distinct elements in the value set. Alternatively, one could pick a random element, and equate all other values to it. Not sure what representation is most beneficial for the underlying SPARK query optimizer. > Detect transitive join conditions via expressions > - > > Key: SPARK-21417 > URL: https://issues.apache.org/jira/browse/SPARK-21417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Claus Stadler > > _Disclaimer: The nature of this report is similar to that of > https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my > understanding) uses its own SQL implementation, the requested improvement has > to be treated as a separate issue._ > Given table aliases ta, tb column names ca, cb, and an arbitrary > (deterministic) expression expr then calcite should be capable to infer join > conditions by transitivity: > {noformat} > ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb > {noformat} > The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries > such as > {code:java} > SELECT { > dbr:Leipzig a ?type . 
> dbr:Leipzig dbo:mayor ?mayor > } > {code} > result in an SQL query similar to > {noformat} > SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig' > {noformat} > A consequence of the join condition not being recognized is, that Apache > SPARK does not find an executable plan to process the query. > Self contained example: > {code:java} > package my.package; > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.scalatest._ > class TestSparkSqlJoin extends FlatSpec { > "SPARK SQL processor" should "be capable of handling transitive join > conditions" in { > val spark = SparkSession > .builder() > .master("local[2]") >
[jira] [Commented] (SPARK-21417) Detect transitive join conditions via expressions
[ https://issues.apache.org/jira/browse/SPARK-21417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094844#comment-16094844 ] Claus Stadler commented on SPARK-21417: --- Hi Anton, I have a rough idea how the issue could be addressed, but I feel it would be better if the SPARK team provided feedback on the idea and possibly revised it. I am also not very familiar with the code, so it would be good if, once the conceptual approach is sorted out, you could provide at least an initial implementation. Should there be any small issues left, I may be able to contribute fixes if needed. Given an expression such as (a.x = 'foo' and 'foo' = b.y) or c.z = 'bar' * Normalize the selection expression into disjunctive normal form => { {(a.x = 'foo'), ('foo' = b.y)}, {(c.z = 'bar')} } * Normalize the argument order of equality expressions => { {(a.x = 'foo'), (b.y = 'foo')}, {(c.z = 'bar')} } * For each clause in the DNF, infer transitive join conditions using the following approach: Given e.g. {(a.x = 'foo'), (b.y = 'foo')}: * Create a multimap from all equality expressions of the form ee.ff = expr, which in our example gives { 'foo' => {a.x, b.y} } * For every multimap key having multiple values, infer the appropriate equality expressions, in this case a.x = b.y. * Note: If there are more than 2 values, in general one can create an equality expression for each pairwise combination of distinct elements in the value set. Alternatively, one could pick a random element and equate all other values to it. I am not sure which representation is most beneficial for the underlying SPARK query optimizer. 
> Detect transitive join conditions via expressions > - > > Key: SPARK-21417 > URL: https://issues.apache.org/jira/browse/SPARK-21417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Claus Stadler > > _Disclaimer: The nature of this report is similar to that of > https://issues.apache.org/jira/browse/CALCITE-1887 - yet, as SPARK (to my > understanding) uses its own SQL implementation, the requested improvement has > to be treated as a separate issue._ > Given table aliases ta, tb column names ca, cb, and an arbitrary > (deterministic) expression expr then calcite should be capable to infer join > conditions by transitivity: > {noformat} > ta.ca = expr AND tb.cb = expr -> ta.ca = tb.cb > {noformat} > The use case for us stems from SPARQL to SQL rewriting, where SPARQL queries > such as > {code:java} > SELECT { > dbr:Leipzig a ?type . > dbr:Leipzig dbo:mayor ?mayor > } > {code} > result in an SQL query similar to > {noformat} > SELECT s.rdf a, s.rdf b WHERE a.s = 'dbr:Leipzig' AND b.s = 'dbr:Leipzig' > {noformat} > A consequence of the join condition not being recognized is, that Apache > SPARK does not find an executable plan to process the query. 
> Self contained example: > {code:java} > package my.package; > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.scalatest._ > class TestSparkSqlJoin extends FlatSpec { > "SPARK SQL processor" should "be capable of handling transitive join > conditions" in { > val spark = SparkSession > .builder() > .master("local[2]") > .appName("Spark SQL parser bug") > .getOrCreate() > import spark.implicits._ > // The schema is encoded in a string > val schemaString = "s p o" > // Generate the schema based on the string of schema > val fields = schemaString.split(" ") > .map(fieldName => StructField(fieldName, StringType, nullable = true)) > val schema = StructType(fields) > val data = List(("s1", "p1", "o1")) > val dataRDD = spark.sparkContext.parallelize(data).map(attributes => > Row(attributes._1, attributes._2, attributes._3)) > val df = spark.createDataFrame(dataRDD, schema).as("TRIPLES") > df.createOrReplaceTempView("TRIPLES") > println("First Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = B.s AND A.s = > 'dbr:Leipzig'").show(10) > println("Second Query") > spark.sql("SELECT A.s FROM TRIPLES A, TRIPLES B WHERE A.s = 'dbr:Leipzig' > AND B.s = 'dbr:Leipzig'").show(10) > } > } > {code} > Output (excerpt): > {noformat} > First Query > ... > +---+ > | s| > +---+ > +---+ > Second Query > - should be capable of handling transitive join conditions *** FAILED *** > org.apache.spark.sql.AnalysisException: Detected cartesian product for > INNER join between logical plans > Project [s#3] > +- Filter (isnotnull(s#3) && (s#3 = dbr:Leipzig)) >+- LogicalRDD [s#3, p#4, o#5] > and > Project > +- Filter (isnotnull(s#20) && (s#20 = dbr:Leipzig)) >+- LogicalRDD [s#20, p#21, o#22] > Join condition is missing or trivial. > Use
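[Editorial note] The multimap-based inference described in the comment above can be sketched in a few lines (an illustrative Python sketch of the idea, not Catalyst code; the representation of DNF clauses as sets of normalized `(column, expression)` pairs is an assumption for the example):

```python
from itertools import combinations
from collections import defaultdict

def infer_join_conditions(dnf):
    """For each conjunctive clause -- a set of (column, expr) equalities,
    already normalized so the column reference is on the left -- group
    columns by the expression they are equated to and emit the implied
    column = column join conditions."""
    inferred = []
    for clause in dnf:
        by_expr = defaultdict(set)          # expr -> columns equated to it
        for column, expr in clause:
            by_expr[expr].add(column)
        conds = set()
        for columns in by_expr.values():
            # one equality per pairwise combination of distinct columns
            for a, b in combinations(sorted(columns), 2):
                conds.add((a, b))
        inferred.append(conds)
    return inferred

# (a.x = 'foo' AND 'foo' = b.y) OR c.z = 'bar', after normalization:
dnf = [
    {("a.x", "'foo'"), ("b.y", "'foo'")},   # first clause
    {("c.z", "'bar'")},                     # second clause
]
print(infer_join_conditions(dnf))           # [{('a.x', 'b.y')}, set()]
```

With three or more columns tied to one expression, this emits every pairwise equality; as the comment notes, equating everything to a single picked element would also work and yields fewer conditions.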
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094797#comment-16094797 ] David Kats commented on SPARK-15544: Confirming the same issue with Spark 2.1.0 and 2.2.0, Ubuntu 14.04, ZooKeeper 3.4.5: 2017-07-20 12:48:25,151 INFO ClientCnxn: Client session timed out, have not heard from server in 35022ms for sessionid 0x15d5fb6dc7d0009, closing socket connection and attempting reconnect 2017-07-20 12:48:25,254 INFO ConnectionStateManager: State change: SUSPENDED 2017-07-20 12:48:25,268 INFO ZooKeeperLeaderElectionAgent: We have lost leadership 2017-07-20 12:48:25,295 ERROR Master: Leadership has been revoked -- master shutting down. > Bouncing Zookeeper node causes Active spark master to exit > -- > > Key: SPARK-15544 > URL: https://issues.apache.org/jira/browse/SPARK-15544 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04. Zookeeper 3.4.6 with 3-node quorum >Reporter: Steven Lowenthal > > Shutting down a single ZooKeeper node caused the Spark master to exit. The > master should have connected to a second ZooKeeper node. 
> {code:title=log output} > 16/05/25 18:21:28 INFO master.Master: Launching executor > app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138 > 16/05/25 18:21:28 INFO master.Master: Launching executor > app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129 > 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data > from server sessionid 0x154dfc0426b0054, likely server has closed socket, > closing socket connection and attempting reconnect > 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data > from server sessionid 0x254c701f28d0053, likely server has closed socket, > closing socket connection and attempting reconnect > 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED > 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED > 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost > leadership > 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master > shutting down. }} > {code} > spark-env.sh: > {code:title=spark-env.sh} > export SPARK_LOCAL_DIRS=/ephemeral/spark/local > export SPARK_WORKER_DIR=/ephemeral/spark/work > export SPARK_LOG_DIR=/var/log/spark > export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop > export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER > -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181" > export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true" > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"
[ https://issues.apache.org/jira/browse/SPARK-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094794#comment-16094794 ] Sean Owen commented on SPARK-21487: --- I think this is a YARN issue/question. > WebUI-Executors Page results in "Request is a replay (34) attack" > - > > Key: SPARK-21487 > URL: https://issues.apache.org/jira/browse/SPARK-21487 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: lishuming >Priority: Minor > > We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors > Page` became empty, with the exception below. > The `Executors Page` is rendered with JavaScript rather than Scala in > 2.1.1, but I don't know why that causes this result. > "Two queries submitted at the same time with the same timestamp may > cause this result", but I'm not sure. > ResourceManager log: > {code:java} > 2017-07-20 20:39:09,371 WARN > org.apache.hadoop.security.authentication.server.AuthenticationFilter: > Authentication exception: GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > {code} > Safari browser console: > {code:java} > Failed to load resource: the server responded with a status of 403 > (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request > is a replay > (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html > {code} > Related Links: > https://issues.apache.org/jira/browse/HIVE-12481 > https://issues.apache.org/jira/browse/HADOOP-8830 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"
[ https://issues.apache.org/jira/browse/SPARK-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lishuming updated SPARK-21487: -- Description: We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors Page` became empty, with the exception below. The `Executor Page` is rendered using JavaScript rather than Scala in 2.1.1, but I don't know why this causes the problem. Two queries submitted at the same time with the same timestamp may cause this, but I'm not sure. ResourceManager log: {code:java} 2017-07-20 20:39:09,371 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)) {code} Safari browser console: {code:java} Failed to load resource: the server responded with a status of 403 (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html {code} Related Links: https://issues.apache.org/jira/browse/HIVE-12481 https://issues.apache.org/jira/browse/HADOOP-8830 was: We upgraded Spark version from 2.0.2 to 2.1.1 recently, WebUI `Executors Page` becomed empty, with the exception below. `Executor Page` rendering using javascript language rather than scala in 2.1.1, but I don't know why causes this result? "two queries are submitted at the same time and have the same timestamp may cause this result", but I'm not sure?
ResouceManager log: {code:java} 2017-07-20 20:39:09,371 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)) {code} Safari explorer console {code:java} Failed to load resource: the server responded with a status of 403 (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)))http://hadoop280.lt.163.org:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html {code} Recent Links: https://issues.apache.org/jira/browse/HIVE-12481 https://issues.apache.org/jira/browse/HADOOP-8830 > WebUI-Executors Page results in "Request is a replay (34) attack" > - > > Key: SPARK-21487 > URL: https://issues.apache.org/jira/browse/SPARK-21487 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: lishuming >Priority: Minor > > We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors Page` became empty, with the exception below. > The `Executor Page` is rendered using JavaScript rather than Scala in 2.1.1, but I don't know why this causes the problem. > Two queries submitted at the same time with the same timestamp may cause this, but I'm not sure.
> ResourceManager log: > {code:java} > 2017-07-20 20:39:09,371 WARN > org.apache.hadoop.security.authentication.server.AuthenticationFilter: > Authentication exception: GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > {code} > Safari browser console: > {code:java} > Failed to load resource: the server responded with a status of 403 > (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request > is a replay > (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html > {code} > Related Links: > https://issues.apache.org/jira/browse/HIVE-12481 > https://issues.apache.org/jira/browse/HADOOP-8830 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"
lishuming created SPARK-21487: - Summary: WebUI-Executors Page results in "Request is a replay (34) attack" Key: SPARK-21487 URL: https://issues.apache.org/jira/browse/SPARK-21487 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.1 Reporter: lishuming Priority: Minor We upgraded Spark from 2.0.2 to 2.1.1 recently, and the WebUI `Executors Page` became empty, with the exception below. The `Executor Page` is rendered using JavaScript rather than Scala in 2.1.1, but I don't know why this causes the problem. Two queries submitted at the same time with the same timestamp may cause this, but I'm not sure. ResourceManager log: {code:java} 2017-07-20 20:39:09,371 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: Authentication exception: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)) {code} Safari browser console: {code:java} Failed to load resource: the server responded with a status of 403 (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34)))http://hadoop280.lt.163.org:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html {code} Related Links: https://issues.apache.org/jira/browse/HIVE-12481 https://issues.apache.org/jira/browse/HADOOP-8830 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full
[ https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094689#comment-16094689 ] Dhruve Ashar commented on SPARK-21460: -- [~Tagar] Can you attach the driver logs to help with the investigation? I am not able to reproduce this issue on my end and would like to look into it further. Also, how frequently are you hitting this? > Spark dynamic allocation breaks when ListenerBus event queue runs full > -- > > Key: SPARK-21460 > URL: https://issues.apache.org/jira/browse/SPARK-21460 > Project: Spark > Issue Type: Bug > Components: Scheduler, YARN >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Environment: Spark 2.1 > Hadoop 2.6 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: dynamic_allocation, performance, scheduler, yarn > > When the ListenerBus event queue runs full, Spark dynamic allocation stops > working: Spark fails to shrink the number of executors when there are no active > jobs (the Spark driver "thinks" there are active jobs since it didn't capture > when they finished). > P.S. What's worse, it also makes Spark flood the YARN RM with reservation requests, > so YARN preemption doesn't function properly either (we're on Spark 2.1 / Hadoop > 2.6). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
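The failure mode described above can be illustrated without Spark. The sketch below is plain Java and only analogous to (not taken from) Spark's LiveListenerBus: it models a bounded event queue where, once the queue is full, further events are dropped, so a listener tracking job lifecycles never sees some "job finished" events.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Toy model of a bounded listener-event queue: once the queue is full,
// offer() returns false and the event is dropped -- analogous to how a
// full ListenerBus queue loses events such as "job end", leaving dynamic
// allocation with a stale view of which jobs are still active.
public class BoundedEventQueue {
    private final ArrayBlockingQueue<String> queue;
    private int dropped = 0;

    public BoundedEventQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Post an event; returns false (and counts a drop) if the queue is full.
    public boolean post(String event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped++; // Spark logs a warning when this happens
        }
        return accepted;
    }

    public int droppedCount() { return dropped; }

    public static void main(String[] args) {
        BoundedEventQueue bus = new BoundedEventQueue(2);
        bus.post("SparkListenerJobStart");
        bus.post("SparkListenerTaskEnd");
        bus.post("SparkListenerJobEnd"); // queue full: this event is lost
        System.out.println("dropped=" + bus.droppedCount()); // dropped=1
    }
}
```

In Spark 2.x the real queue's capacity is controlled by `spark.scheduler.listenerbus.eventqueue.size`; raising it can reduce, though not eliminate, dropped events while the underlying issue stands.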
[jira] [Commented] (SPARK-20394) Replication factor value Not changing properly
[ https://issues.apache.org/jira/browse/SPARK-20394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094686#comment-16094686 ] Kannan Subramanian commented on SPARK-20394: Yes. I have edited hdfs-site.xml to change the replication value and overridden the existing Spark environment. It works now. Thanks for your response. Kannan > Replication factor value not changing properly > -- > > Key: SPARK-20394 > URL: https://issues.apache.org/jira/browse/SPARK-20394 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.0 >Reporter: Kannan Subramanian > > To save a SparkSQL dataframe to a persistent hive table, I use the steps below: > a) registerTempTable on the dataframe as tempTable > b) create table (cols) PartitionedBy(col1, col2) stored as > parquet > c) Insert into partition(col1, col2) select * from tempTable > I have set dfs.replication to "1" in the hiveContext object, but it did > not work properly. That is, the replication factor is 1 for 80% of the generated parquet > files on HDFS, while the default replication factor of 3 applies to the remaining 20% of parquet files in > HDFS. I am not sure why the replication setting is not applied to all the generated > parquet files. Please let me know if you have any suggestions or solutions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
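For reference, a common way to set this programmatically is the following sketch (untested here; it assumes a SparkContext named `sc`, a Spark 1.6 HiveContext named `hiveContext`, and that the write tasks inherit the job's Hadoop configuration): set `dfs.replication` on the SparkContext's Hadoop configuration rather than as a SQL property, so it reaches the clients that actually create the parquet files.

```scala
// Sketch only: dfs.replication is a per-client Hadoop setting, so it must
// be present in the configuration used by the tasks writing the files.
sc.hadoopConfiguration.set("dfs.replication", "1")

// Subsequent writes through this context request replication factor 1.
hiveContext.sql(
  "INSERT INTO TABLE target PARTITION(col1, col2) SELECT * FROM tempTable")
```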
[jira] [Assigned] (SPARK-21472) Introduce ArrowColumnVector as a reader for Arrow vectors.
[ https://issues.apache.org/jira/browse/SPARK-21472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21472: --- Assignee: Takuya Ueshin > Introduce ArrowColumnVector as a reader for Arrow vectors. > -- > > Key: SPARK-21472 > URL: https://issues.apache.org/jira/browse/SPARK-21472 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.3.0 > > > Introducing {{ArrowColumnVector}} as a reader for Arrow vectors. > It extends {{ColumnVector}}, so we will be able to use it with > {{ColumnarBatch}} and its functionalities. > Currently it supports primitive types and {{StringType}}, {{ArrayType}} and > {{StructType}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21472) Introduce ArrowColumnVector as a reader for Arrow vectors.
[ https://issues.apache.org/jira/browse/SPARK-21472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21472. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18680 [https://github.com/apache/spark/pull/18680] > Introduce ArrowColumnVector as a reader for Arrow vectors. > -- > > Key: SPARK-21472 > URL: https://issues.apache.org/jira/browse/SPARK-21472 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > Fix For: 2.3.0 > > > Introducing {{ArrowColumnVector}} as a reader for Arrow vectors. > It extends {{ColumnVector}}, so we will be able to use it with > {{ColumnarBatch}} and its functionalities. > Currently it supports primitive types and {{StringType}}, {{ArrayType}} and > {{StructType}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094610#comment-16094610 ] Apache Spark commented on SPARK-10063: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18689 > Remove DirectParquetOutputCommitter > --- > > Key: SPARK-10063 > URL: https://issues.apache.org/jira/browse/SPARK-10063 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Reynold Xin >Priority: Critical > Fix For: 2.0.0 > > > When we use DirectParquetOutputCommitter on S3 and speculation is enabled, > there is a chance that we can lose data. > Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
>     if (attemptNumber > 0) {
>       Thread.sleep(15000)
>       throw new Exception("new exception")
>     } else {
>       Thread.sleep(1)
>     }
>   }
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output file to S3, then the speculative task somehow fails. Because we have to call the output stream's close method, which uploads data to S3, we actually upload the partial result generated by the failed speculative task to S3, and this file overwrites the correct file generated by the original task. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094607#comment-16094607 ] Aseem Bansal edited comment on SPARK-21483 at 7/20/17 12:29 PM: Some pseudo code to show what I am trying to achieve
{code:java}
class MyTransformer implements Serializable {
    public FeaturesAndLabel transform(RawData rawData) {
        // Some logic which creates Features and Labels from raw data. Raw data is just a java bean
        // FeaturesAndLabel is a bean which contains a SparseVector as features, and double as label
    }
}
{code}
{code:java}
Dataset dataset = // read from somewhere and create Dataset of RawData bean
Dataset featuresAndLabels = dataset.transform(new MyTransformer()::transform)
// use features and labels for machine learning
{code}
was (Author: anshbansal): Some pseudo code to show what I am trying to achieve
{code:java}
class MyTransformer implements Serializable {
    public FeaturesAndLabel transform(RawData rawData) {
        // Some logic which creates Features and Labels from raw data
        // FeaturesAndLabel is a bean which contains a SparseVector as features, and double as label
    }
}
{code}
{code:java}
Dataset dataset = // read from somewhere and create Dataset of RawData bean
Dataset featuresAndLabels = dataset.transform(new MyTransformer()::transform)
// use features and labels for machine learning
{code}
> Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094607#comment-16094607 ] Aseem Bansal commented on SPARK-21483: -- Some pseudo code to show what I am trying to achieve
{code:java}
class MyTransformer implements Serializable {
    public FeaturesAndLabel transform(RawData rawData) {
        // Some logic which creates Features and Labels from raw data
        // FeaturesAndLabel is a bean which contains a SparseVector as features, and double as label
    }
}
{code}
{code:java}
Dataset dataset = // read from somewhere and create Dataset of RawData bean
Dataset featuresAndLabels = dataset.transform(new MyTransformer()::transform)
// use features and labels for machine learning
{code}
> Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
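What "bean-compliant" means here can be checked with the JDK alone. The sketch below uses `java.beans.Introspector`, which discovers properties through getter/setter pairs in essentially the way `Encoders.bean` expects them; `FeaturesAndLabel` is a hypothetical bean with a `double[]` standing in for the Vector field, since `ml.linalg.Vector` itself exposes no such pairs.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.util.Arrays;

// Demonstrates bean compliance: a class is usable as a bean only if its
// state is reachable via matching getter/setter pairs, which the JDK's
// Introspector (like Spark's bean encoder) discovers by reflection.
public class BeanCheck {
    public static class FeaturesAndLabel {
        private double[] features; // stand-in for the SparseVector field
        private double label;
        public double[] getFeatures() { return features; }
        public void setFeatures(double[] f) { this.features = f; }
        public double getLabel() { return label; }
        public void setLabel(double l) { this.label = l; }
    }

    // Count properties that have both a getter and a setter
    // (the implicit read-only "class" property is filtered out).
    public static long writableProperties(Class<?> cls) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(cls);
        return Arrays.stream(info.getPropertyDescriptors())
                .filter(p -> p.getReadMethod() != null && p.getWriteMethod() != null)
                .count();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writableProperties(FeaturesAndLabel.class)); // 2
    }
}
```

A class like `ml.linalg.Vector`, with no such getter/setter pairs, would report zero writable properties here, which is the shape of the problem the issue describes.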
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094603#comment-16094603 ] Peng Meng commented on SPARK-21476: --- I am optimizing RF and GBT these days. If no one else is working on it, I can take this one. Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, I found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel, and the only concrete > definition that RandomForestClassificationModel provides, which is > actually used in transform, is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
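The kind of fix being discussed would look roughly like the following sketch (illustrative only; `trees` and `predictWith` are placeholder names, not the model's actual private members):

```scala
// Sketch: ship the ensemble to executors once via a broadcast variable,
// instead of serializing it into every task closure during transform().
val bcTrees = dataset.sparkSession.sparkContext.broadcast(trees)
val predictRawUDF = udf { features: Vector =>
  // Each executor reads the single broadcast copy rather than
  // deserializing the whole forest per task.
  predictWith(bcTrees.value, features)
}
```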
[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094547#comment-16094547 ] Ioana Delaney commented on SPARK-19842: --- Yes, we've been working to productize our code. Our progress was slowed down due to some other ongoing projects, but we are planning to open the first round of PRs in the next few weeks if the community agrees with this feature. > Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney > Attachments: InformationalRIConstraints.doc > > > *Informational Referential Integrity Constraints Support in Spark* > This work proposes support for _informational primary key_ and _foreign key > (referential integrity) constraints_ in Spark. The main purpose is to open up > an area of query optimization techniques that rely on referential integrity > constraints semantics. > An _informational_ or _statistical constraint_ is a constraint such as a > _unique_, _primary key_, _foreign key_, or _check constraint_, that can be > used by Spark to improve query performance. Informational constraints are not > enforced by the Spark SQL engine; rather, they are used by Catalyst to > optimize the query processing. They provide semantics information that allows > Catalyst to rewrite queries to eliminate joins, push down aggregates, remove > unnecessary Distinct operations, and perform a number of other optimizations. > Informational constraints are primarily targeted to applications that load > and analyze data that originated from a data warehouse. For such > applications, the conditions for a given constraint are known to be true, so > the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, > constraint validation, and maintenance. The document shows many examples of > query performance improvements that utilize referential integrity constraints > and can be implemented in Spark. > Link to the google doc: > [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
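As one concrete instance of the rewrite class described above, an informational foreign key can justify removing a join entirely (table and column names are illustrative, not from the attached document):

```sql
-- Assume an informational, non-nullable FK: fact.store_id REFERENCES store(id).
-- Every fact row then matches exactly one store row, so the join below
-- neither adds nor drops rows:
SELECT f.store_id, SUM(f.sales)
FROM fact f JOIN store s ON f.store_id = s.id
GROUP BY f.store_id;

-- ...and the optimizer may rewrite the query to scan a single table:
SELECT f.store_id, SUM(f.sales)
FROM fact f
GROUP BY f.store_id;
```

Because the constraint is informational, this rewrite is sound only if the loaded data actually satisfies it, which is why such constraints target warehouse-style loads where the conditions are known to hold.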
[jira] [Assigned] (SPARK-21477) Mark LocalTableScanExec's input data transient
[ https://issues.apache.org/jira/browse/SPARK-21477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21477: --- Assignee: Xiao Li > Mark LocalTableScanExec's input data transient > -- > > Key: SPARK-21477 > URL: https://issues.apache.org/jira/browse/SPARK-21477 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Mark the parameter rows and unsafeRow of LocalTableScanExec transient. It can > avoid serializing the unneeded objects. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
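The mechanism behind this change can be shown with plain Java serialization (a sketch of the principle, not Spark's code): a field marked `transient` is skipped when the object is serialized, so large driver-side-only data does not travel with the serialized plan.

```java
import java.io.*;

// Demonstrates why marking a field transient avoids serializing unneeded
// objects: the transient array below is dropped from the byte stream,
// just as LocalTableScanExec's input rows need not ship with the plan.
public class TransientDemo implements Serializable {
    int keep = 42;
    transient int[] rows = new int[1_000_000]; // never serialized

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static TransientDemo roundTrip(TransientDemo d) throws IOException, ClassNotFoundException {
        ByteArrayInputStream bis = new ByteArrayInputStream(serialize(d));
        try (ObjectInputStream ois = new ObjectInputStream(bis)) {
            return (TransientDemo) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TransientDemo copy = roundTrip(new TransientDemo());
        System.out.println(copy.keep);         // 42: serialized normally
        System.out.println(copy.rows == null); // true: transient field dropped
    }
}
```

The serialized form stays a few dozen bytes regardless of the array's size, which is exactly the saving the issue describes.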