[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-20168: --- Assignee: Yash Sharma > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams > Affects Versions: 2.1.0 > Reporter: Yash Sharma > Assignee: Yash Sharma > Labels: kinesis, streaming > Fix For: 2.3.0 > > > The Kinesis client can resume from a specified timestamp when creating a stream. > We should have an option to pass a timestamp in the config to allow Kinesis to > resume from that timestamp. > I have started initial work and will post a PR after I test the patch - > https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-20168: Fix Version/s: 2.3.0
[jira] [Resolved] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-20168. - Resolution: Done Target Version/s: 2.3.0 Resolved with https://github.com/apache/spark/pull/18029
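The feature above amounts to an AT_TIMESTAMP initial position: instead of starting from the trim horizon or the latest record, the consumer skips records that arrived before a given timestamp. A minimal pure-Python sketch of the idea (the record shape and function name are illustrative, not the Kinesis client API):

```python
from datetime import datetime, timedelta

def records_from(records, start_ts):
    """Conceptual AT_TIMESTAMP initial position: keep only records whose
    approximate arrival time is at or after the requested timestamp.
    (Illustrative only; not the Kinesis client API.)"""
    return [r for r in records if r["arrival"] >= start_ts]

t0 = datetime(2017, 12, 26)
stream = [{"id": i, "arrival": t0 + timedelta(minutes=i)} for i in range(5)]

# Resume three minutes in: earlier records are skipped
resumed = records_from(stream, t0 + timedelta(minutes=3))
assert [r["id"] for r in resumed] == [3, 4]
```

In the actual integration the timestamp would be supplied through the receiver's configuration, as the ticket proposes.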
[jira] [Commented] (SPARK-22897) Expose stageAttemptId in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303567#comment-16303567 ] Shixiong Zhu commented on SPARK-22897: -- +1 > Expose stageAttemptId in TaskContext > - > > Key: SPARK-22897 > URL: https://issues.apache.org/jira/browse/SPARK-22897 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.1.2, 2.2.1 > Reporter: Xianjin YE > Priority: Minor > > Currently, there's no easy way for an Executor to detect that a new stage has been launched, as stageAttemptId is missing. > I'd like to propose exposing stageAttemptId in TaskContext, and will send a > PR if the community thinks it's a good idea. > cc [~cloud_fan]
[jira] [Created] (SPARK-22901) Add non-deterministic to Python UDF
Xiao Li created SPARK-22901: --- Summary: Add non-deterministic to Python UDF Key: SPARK-22901 URL: https://issues.apache.org/jira/browse/SPARK-22901 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.1 Reporter: Xiao Li Add a new API for Python UDFs that allows users to change a UDF's determinism from deterministic to non-deterministic.
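For context, the reason determinism matters: an optimizer may collapse repeated calls to a function it believes is deterministic into a single evaluation, which is wrong for functions like random. A toy pure-Python sketch of the proposed flag (the class and method names here are made up, not the PySpark API):

```python
import random

class ToyUdf:
    """Toy UDF wrapper: an optimizer may freely substitute one cached
    result for repeated calls of a function marked deterministic."""
    def __init__(self, f):
        self.f = f
        self.deterministic = True
        self._cached = None

    def as_nondeterministic(self):
        # Marking the UDF non-deterministic forbids that substitution.
        self.deterministic = False
        return self

    def __call__(self):
        if self.deterministic:
            if self._cached is None:
                self._cached = self.f()
            return self._cached  # collapsed to a single evaluation
        return self.f()          # evaluated fresh on every call

random.seed(1234)
rand_udf = ToyUdf(lambda: random.randrange(1_000_000))
a, b = rand_udf(), rand_udf()
assert a == b  # treated as deterministic: one result is reused

nd_udf = ToyUdf(lambda: random.randrange(1_000_000)).as_nondeterministic()
c, d = nd_udf(), nd_udf()  # two independent draws
```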
[jira] [Comment Edited] (SPARK-21302) history server WebUI show HTTP ERROR 500
[ https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303458#comment-16303458 ] Xin Yu Pan edited comment on SPARK-21302 at 12/26/17 5:40 AM: -- I hit the same problem with the Spark 2.1.1 History Server. There are no precise steps to reproduce it from my observation. There are 180 completed applications and no incomplete applications. After restarting the History Server the problem disappears, but after running for over one day it can be hit again. Workaround: restart the History Server. The suspicious message from my Spark History Server log file:
17/12/25 14:48:32 WARN ServletHandler: /history/app-20171225144620-0142-056cf26e-0c64-4e14-bd9b-d78871e09745/1/jobs/
java.lang.NullPointerException
 at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
 at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
 at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
 at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
 at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
 at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
 at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
 at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
 at org.spark_project.jetty.server.Server.handle(Server.java:499)
 at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
 at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
 at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
 at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
 at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
 at java.lang.Thread.run(Thread.java:748)
> history server WebUI show HTTP ERROR 500 > > > Key: SPARK-21302 > URL: https://issues.apache.org/jira/browse/SPARK-21302 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.1.1 > Reporter: Jason Pan > Attachments: npe.PNG > > > When navigating to the history server WebUI and checking incomplete applications, it shows HTTP 500. > Error logs:
> 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; refreshing
> 17/07/05 20:17:44 WARN ServletHandler: /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
> java.lang.NullPointerException
>  at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>  at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>  at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>  at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>  at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>  at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>  at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>  at org.spark_project.jetty.server.Server.handle(Server.java:499)
>  at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>  at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>  at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>  at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>  at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>  at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:00 WARN ServletHandler: /
> java.lang.NullPointerException
>  at
[jira] [Commented] (SPARK-22629) incorrect handling of calls to random in UDFs
[ https://issues.apache.org/jira/browse/SPARK-22629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303480#comment-16303480 ] Xiao Li commented on SPARK-22629: - The reason is that we assume all UDFs are deterministic. The problem this JIRA hit is caused by that misuse. We are considering whether we should change the default to non-deterministic. > incorrect handling of calls to random in UDFs > - > > Key: SPARK-22629 > URL: https://issues.apache.org/jira/browse/SPARK-22629 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.1.0 > Reporter: Michael H > > {code:none}
> df_br = spark.createDataFrame([{'name': 'hello'}])
> # udf creates a random integer
> udf_random_col = udf(lambda: int(100*random.random()), IntegerType())
> # add a column to our DF using that udf
> df_br = df_br.withColumn('RAND', udf_random_col())
> df_br.show()
> +-----+----+
> | name|RAND|
> +-----+----+
> |hello|  68|
> +-----+----+
> # udf that adds 10 to an input column value
> random.seed(1234)
> udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
> # unexpected result due to re-evaluation
> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  72|           87|
> +-----+----+-------------+
> # workaround: cache the results after using the random number generating udf
> df_br.withColumn('RAND', udf_random_col()).cache().withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  68|           78|
> +-----+----+-------------+
> {code}
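The surprising RAND value above comes from re-evaluation: since the UDF is assumed deterministic, the plan is free to recompute RAND when the second query runs, drawing a new random number, while caching materializes the value once. The effect can be mimicked without Spark (illustrative only):

```python
import random

random.seed(1234)
rand_udf = lambda: int(100 * random.random())  # the random "UDF"

# Lazy column: the function is stored and re-run on every "action",
# which is why RAND changes between the two show() calls above.
first_show = rand_udf()
second_show = rand_udf()
assert first_show != second_show  # re-evaluation yields a new draw

# cache() analogue: materialize the value once and reuse it downstream
cached_rand = rand_udf()
assert cached_rand + 10 == cached_rand + 10  # consistent across uses
```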
[jira] [Commented] (SPARK-22833) [Examples] Improvements made at SparkHive Example with Scala
[ https://issues.apache.org/jira/browse/SPARK-22833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303464#comment-16303464 ] Apache Spark commented on SPARK-22833: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20081 > [Examples] Improvements made at SparkHive Example with Scala > > > Key: SPARK-22833 > URL: https://issues.apache.org/jira/browse/SPARK-22833 > Project: Spark > Issue Type: Improvement > Components: Examples > Affects Versions: 2.2.1 > Reporter: Chetan Khatri > Assignee: Chetan Khatri > Priority: Minor > Fix For: 2.3.0 > > > The current Scala Spark examples folder is missing implementations for: > * Writing a DataFrame / Dataset to a Hive managed or Hive external table using different storage formats. > * Examples of Partition, Repartition, and Coalesce with appropriate usage.
[jira] [Commented] (SPARK-21302) history server WebUI show HTTP ERROR 500
[ https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303458#comment-16303458 ] Xin Yu Pan commented on SPARK-21302: I hit the same problem with the Spark 2.1.1 History Server. There are no precise steps to reproduce it from my observation. There are 180 completed applications and no incomplete applications. After restarting the History Server the problem disappears, but after running for over one day it can be hit again. Workaround: restart the History Server.
[jira] [Commented] (SPARK-22897) Expose stageAttemptId in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303452#comment-16303452 ] Wenchen Fan commented on SPARK-22897: - https://github.com/apache/spark/pull/12248 added a new interface `TaskContext.getLocalProperty`, I think it's ok to add a new `TaskContext.stageAttemptId`
[jira] [Commented] (SPARK-22897) Expose stageAttemptId in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303451#comment-16303451 ] Wenchen Fan commented on SPARK-22897: - Sounds reasonable to me, but `TaskContext` is a public API and we should be careful when adding a new interface to it. cc [~zsxwing] too
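The use case in the ticket, detecting a retried stage from inside an executor, could look roughly like this once the field is exposed. A toy sketch with a made-up `TaskContext` stand-in (not Spark's actual class):

```python
class TaskContext:
    """Toy context carrying stage id and stage attempt id
    (illustrative stand-in; mirrors the proposal, not Spark's API)."""
    def __init__(self, stage_id, stage_attempt_id):
        self.stage_id = stage_id
        self.stage_attempt_id = stage_attempt_id

seen = {}  # per-executor memory of the last attempt id seen per stage

def is_new_attempt(ctx):
    """True when the stage was retried, i.e. the attempt id changed."""
    prev = seen.get(ctx.stage_id)
    seen[ctx.stage_id] = ctx.stage_attempt_id
    return prev is not None and prev != ctx.stage_attempt_id

assert not is_new_attempt(TaskContext(1, 0))  # first task of stage 1
assert not is_new_attempt(TaskContext(1, 0))  # same attempt, no change
assert is_new_attempt(TaskContext(1, 1))      # stage 1 was retried
```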
[jira] [Assigned] (SPARK-22870) Dynamic allocation should allow 0 idle time
[ https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22870: Assignee: (was: Apache Spark) > Dynamic allocation should allow 0 idle time > --- > > Key: SPARK-22870 > URL: https://issues.apache.org/jira/browse/SPARK-22870 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Affects Versions: 1.6.0 > Reporter: Xuefu Zhang > Priority: Minor > > As discussed in SPARK-22765, with SPARK-21656 an executor will not idle out when there are pending tasks to run. When there is no task to run, an executor will die out after {{spark.dynamicAllocation.executorIdleTimeout}}, which is currently required to be greater than zero. However, for efficiency, a user should be able to specify that an executor can die out immediately without being required to be idle for at least 1s. > This is to make {{0}} a valid value for {{spark.dynamicAllocation.executorIdleTimeout}}; special handling of such a case might be needed.
[jira] [Assigned] (SPARK-22870) Dynamic allocation should allow 0 idle time
[ https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22870: Assignee: Apache Spark
[jira] [Commented] (SPARK-22870) Dynamic allocation should allow 0 idle time
[ https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303438#comment-16303438 ] Apache Spark commented on SPARK-22870: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20080
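The proposed semantics are simple: with a timeout of 0 an idle executor is released immediately rather than after a minimum one-second wait. A sketch of the idle check (simplified, not the actual ExecutorAllocationManager logic):

```python
def should_release(idle_seconds, idle_timeout_seconds):
    """Dynamic-allocation idle check (simplified): release an executor
    once it has been idle for at least the configured timeout.
    A timeout of 0 means 'release as soon as it becomes idle'."""
    return idle_seconds >= idle_timeout_seconds

assert should_release(0, 0)        # timeout 0: released immediately
assert not should_release(30, 60)  # must stay idle for the full 60s
assert should_release(61, 60)
```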
[jira] [Comment Edited] (SPARK-22793) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292314#comment-16292314 ] zuotingbing edited comment on SPARK-22793 at 12/26/17 2:00 AM: --- Yes, the master branch also has this problem. was (Author: zuo.tingbing9): Yes, the master branch also has this problem, but the difference is big between the master and 2.0 branches. Could someone help to merge this to the master branch? > Memory leak in Spark Thrift Server > -- > > Key: SPARK-22793 > URL: https://issues.apache.org/jira/browse/SPARK-22793 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2, 2.2.1 > Reporter: zuotingbing > Priority: Critical > > 1. Start HiveThriftServer2. > 2. Connect to the thriftserver through beeline. > 3. Close beeline. > 4. Repeat steps 2 and 3 several times, which causes a memory leak. > We found many directories that are never dropped under the paths > {code:java} > hive.exec.local.scratchdir > {code} and > {code:java} > hive.exec.scratchdir > {code}. As we know, the scratchdir is added to deleteOnExit when it is created, so the cache size of FileSystem's deleteOnExit will keep increasing until the JVM terminates. > In addition, we used > {code:java} > jmap -histo:live [PID] > {code} to print the sizes of objects in the HiveThriftServer2 process, and found that the objects "org.apache.spark.sql.hive.client.HiveClientImpl" and "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even after we closed all the beeline connections, which causes the memory leak.
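The leak mechanism described above, a delete-on-exit cache that only grows while sessions come and go, can be modeled in a few lines. Pure Python, illustrative only; the `eager_cleanup` flag is a hypothetical fix, not Spark's API:

```python
# Toy model of the leak: paths registered for delete-on-exit are cached
# until process exit, so per-session scratch dirs accumulate.
delete_on_exit = set()

def open_session(session_id):
    path = f"/tmp/scratch/{session_id}"
    delete_on_exit.add(path)          # never removed while process lives
    return path

def close_session(path, eager_cleanup=False):
    if eager_cleanup:
        delete_on_exit.discard(path)  # hypothetical fix: clean up on close

for i in range(100):
    close_session(open_session(i))
assert len(delete_on_exit) == 100     # leak: cache grows per session

delete_on_exit.clear()
for i in range(100):
    close_session(open_session(i), eager_cleanup=True)
assert len(delete_on_exit) == 0       # eager cleanup bounds the cache
```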
[jira] [Commented] (SPARK-22893) Unified the data type mismatch message
[ https://issues.apache.org/jira/browse/SPARK-22893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303353#comment-16303353 ] Apache Spark commented on SPARK-22893: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/20079 > Unified the data type mismatch message > -- > > Key: SPARK-22893 > URL: https://issues.apache.org/jira/browse/SPARK-22893 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: Yuming Wang > Assignee: Yuming Wang > Fix For: 2.3.0 > > > {noformat} > spark-sql> select cast(1 as binary); > Error in query: cannot resolve 'CAST(1 AS BINARY)' due to data type mismatch: > cannot cast IntegerType to BinaryType; line 1 pos 7; > {noformat} > We should use {{dataType.simpleString}}.
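The fix amounts to printing the type's simple name ("binary") rather than its internal class name ("BinaryType") in the user-facing error. A toy sketch of the idea (the classes below are stand-ins, not Catalyst's types):

```python
class IntegerType:
    """Toy Catalyst-style type; only simpleString matters here."""
    simpleString = "int"

class BinaryType:
    simpleString = "binary"

def mismatch_message(expr, src_type, dst_type):
    # Use simpleString ("int", "binary") rather than the class name
    # ("IntegerType", "BinaryType") in user-facing errors.
    return (f"cannot resolve '{expr}' due to data type mismatch: "
            f"cannot cast {src_type.simpleString} to {dst_type.simpleString}")

msg = mismatch_message("CAST(1 AS BINARY)", IntegerType, BinaryType)
assert msg.endswith("cannot cast int to binary")
```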
[jira] [Commented] (SPARK-18115) Custom metrics Sink/Source prevent Executor from starting
[ https://issues.apache.org/jira/browse/SPARK-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303295#comment-16303295 ] Xudong Zheng commented on SPARK-18115: -- Hello everyone, can anybody give a short status update on this issue? Has it been fixed? > Custom metrics Sink/Source prevent Executor from starting > - > > Key: SPARK-18115 > URL: https://issues.apache.org/jira/browse/SPARK-18115 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.6.0 > Reporter: Kostya Golikov > > Even though there is semi-official support for custom metrics, in practice specifying either a custom sink or a custom source will lead to NoClassDefFound exceptions on the executor side (but will be fine on the driver side). > The initialization goes as follows: > 1. CoarseGrainedExecutorBackend [prepares SparkEnv for the executor|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L223] > 2. SparkEnv [initializes the MetricsSystem|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkEnv.scala#L338-L351]. In the case of an executor, it also starts it. > 3. On [`.start()` the MetricsSystem parses configuration files and creates instances of sinks and sources|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L101-L102]. This is where the issue actually happens -- it tries to instantiate classes which are not there yet, because [jars and files are downloaded downstream, in Executor|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/executor/Executor.scala#L257]. > One possible solution is to NOT start the MetricsSystem this early, just like the driver does, but postpone it until the jar with user-defined code is fetched and available on the classpath.
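The ordering problem in the description, the metrics system instantiating user classes before user jars are fetched, can be modeled without Spark (illustrative; `NameError` stands in for `NoClassDefFoundError`):

```python
registry = {}  # classes currently "on the classpath"

def instantiate(name):
    if name not in registry:
        raise NameError(f"class not found: {name}")  # ~ NoClassDefFoundError
    return registry[name]()

def start_metrics():
    # The metrics system instantiates configured sinks/sources eagerly
    return instantiate("CustomSink")

def fetch_user_jars():
    # Downloading user jars makes the custom class loadable
    registry["CustomSink"] = type("CustomSink", (), {})

# Executor today: metrics start before jars are fetched -> failure
try:
    start_metrics()
    started_early = True
except NameError:
    started_early = False
assert not started_early

# Proposed fix: fetch user jars first, then start the metrics system
fetch_user_jars()
assert start_metrics() is not None
```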
[jira] [Commented] (SPARK-22840) Incorrect results when using distinct on window
[ https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303284#comment-16303284 ] Lior Chaga commented on SPARK-22840: Gotcha [~greenhat], thanks > Incorrect results when using distinct on window > --- > > Key: SPARK-22840 > URL: https://issues.apache.org/jira/browse/SPARK-22840 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Lior Chaga > Attachments: sample.parquet.zip > > > Given the following schema: > {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- calibratedRecsHistory: double (nullable = true)
>  |    |    |-- eventTime: long (nullable = true)
>  |    |    |-- itemId: long (nullable = true)
>  |    |    |-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with all stats elements for a given id and start_time identical across rows. I've noticed inconsistent results when using a window with FIRST(stats) DESC and LAST(stats) ASC. Specifically, the latter (LAST with ASC) produces more results. This is the query that shows it: > {code}
> SELECT DISTINCT
>   id,
>   LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id SORT BY start_time DESC)
> except
> SELECT DISTINCT
>   id,
>   FIRST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, partitioned by id. Changing the order of the subqueries returns nothing... > The query with FIRST and ASC produces correct results. > The data for the sample is attached in [^sample.parquet.zip]
[jira] [Resolved] (SPARK-22840) Incorrect results when using distinct on window
[ https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lior Chaga resolved SPARK-22840. Resolution: Not A Bug
[jira] [Comment Edited] (SPARK-22840) Incorrect results when using distinct on window
[ https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303281#comment-16303281 ] Denys Zadorozhnyi edited comment on SPARK-22840 at 12/25/17 2:14 PM:
-
[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the window it will be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}}, which makes the {{last}} function always return the current value (see [https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]). If you explicitly specify the frame you should get the result you are expecting (empty set):
{code}
SELECT DISTINCT
  id,
  LAST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
except
SELECT DISTINCT
  id,
  FIRST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}
was (Author: greenhat):
[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the window it will be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}}, which makes the {{last}} function always return the current value (see [https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]).
If you explicitly specify the frame you should get the result you are expecting:
{code}
SELECT DISTINCT
  id,
  LAST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
except
SELECT DISTINCT
  id,
  FIRST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}
> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- calibratedRecsHistory: double (nullable = true)
>  |    |    |-- eventTime: long (nullable = true)
>  |    |    |-- itemId: long (nullable = true)
>  |    |    |-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, and all stats elements for a given id and start_time are identical across rows. I've noticed inconsistent results when using a window with FIRST(stats) DESC versus LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This query demonstrates it:
> {code}
> SELECT DISTINCT
>   id,
>   LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id SORT BY start_time DESC)
> except
> SELECT DISTINCT
>   id,
>   FIRST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The sample data is attached in [^sample.parquet.zip]
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22840) Incorrect results when using distinct on window
[ https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303281#comment-16303281 ] Denys Zadorozhnyi commented on SPARK-22840:
---
[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the window it will be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}}, which makes the {{last}} function always return the current value (see [https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]). If you explicitly specify the frame you should get the result you are expecting:
{code}
SELECT DISTINCT
  id,
  LAST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
except
SELECT DISTINCT
  id,
  FIRST(stats) over w
FROM sample
WINDOW w AS (PARTITION BY id SORT BY start_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}
> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- calibratedRecsHistory: double (nullable = true)
>  |    |    |-- eventTime: long (nullable = true)
>  |    |    |-- itemId: long (nullable = true)
>  |    |    |-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, and all stats elements for a given id and start_time are identical across rows. I've noticed inconsistent results when using a window with FIRST(stats) DESC versus LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This query demonstrates it:
> {code}
> SELECT DISTINCT
>   id,
>   LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id SORT BY start_time DESC)
> except
> SELECT DISTINCT
>   id,
>   FIRST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The sample data is attached in [^sample.parquet.zip]
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
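[Editor's note] The frame behavior discussed in this thread can be illustrated without Spark. The sketch below is a toy Python model, not Spark's implementation: it mimics how {{last}} evaluates under the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) versus an explicit frame spanning the whole partition. All names here are illustrative.

```python
# Toy model of window-frame semantics for LAST(...), partitioned by id.
def last_over_window(rows, frame="default"):
    """rows: list of (id, start_time, stats) tuples from one partition."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)  # SORT BY start_time DESC
    out = []
    for i, _ in enumerate(ordered):
        if frame == "default":
            # RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
            # the frame ends at the current row, so LAST is the current row itself.
            out.append(ordered[i][2])
        else:
            # ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:
            # the frame spans the whole partition, so LAST is the final row.
            out.append(ordered[-1][2])
    return out

rows = [("a", 1, "old"), ("a", 2, "new")]
print(last_over_window(rows, "default"))    # ['new', 'old'] - each row sees itself
print(last_over_window(rows, "unbounded"))  # ['old', 'old'] - whole partition
```

With the default frame, LAST returns a different value per row, so SELECT DISTINCT keeps multiple rows per id; with the unbounded frame every row agrees, which is why the explicit frame yields the expected empty set from the EXCEPT query.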
[jira] [Commented] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303277#comment-16303277 ] Apache Spark commented on SPARK-22900:
--
User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20078
> remove unnecessary restrict for streaming dynamic allocation
> 
>
> Key: SPARK-22900
> URL: https://issues.apache.org/jira/browse/SPARK-22900
> Project: Spark
> Issue Type: Improvement
> Components: DStreams
> Affects Versions: 2.3.0
> Reporter: sharkd tu
>
> When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the conf `num-executors` cannot be set. As a result, Spark allocates the default 2 executors and runs all receivers on those 2 executors, so there may be no spare CPU cores for tasks and the application gets stuck.
> In my opinion, we should remove this unnecessary restriction for streaming dynamic allocation and allow `num-executors` and `spark.streaming.dynamicAllocation.enabled=true` to be set together. When the application starts, each receiver then runs on an executor.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22900:
Assignee: Apache Spark
> remove unnecessary restrict for streaming dynamic allocation
> 
>
> Key: SPARK-22900
> URL: https://issues.apache.org/jira/browse/SPARK-22900
> Project: Spark
> Issue Type: Improvement
> Components: DStreams
> Affects Versions: 2.3.0
> Reporter: sharkd tu
> Assignee: Apache Spark
>
> When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the conf `num-executors` cannot be set. As a result, Spark allocates the default 2 executors and runs all receivers on those 2 executors, so there may be no spare CPU cores for tasks and the application gets stuck.
> In my opinion, we should remove this unnecessary restriction for streaming dynamic allocation and allow `num-executors` and `spark.streaming.dynamicAllocation.enabled=true` to be set together. When the application starts, each receiver then runs on an executor.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22900:
Assignee: (was: Apache Spark)
> remove unnecessary restrict for streaming dynamic allocation
> 
>
> Key: SPARK-22900
> URL: https://issues.apache.org/jira/browse/SPARK-22900
> Project: Spark
> Issue Type: Improvement
> Components: DStreams
> Affects Versions: 2.3.0
> Reporter: sharkd tu
>
> When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the conf `num-executors` cannot be set. As a result, Spark allocates the default 2 executors and runs all receivers on those 2 executors, so there may be no spare CPU cores for tasks and the application gets stuck.
> In my opinion, we should remove this unnecessary restriction for streaming dynamic allocation and allow `num-executors` and `spark.streaming.dynamicAllocation.enabled=true` to be set together. When the application starts, each receiver then runs on an executor.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation
sharkd tu created SPARK-22900:
-
Summary: remove unnecessary restrict for streaming dynamic allocation
Key: SPARK-22900
URL: https://issues.apache.org/jira/browse/SPARK-22900
Project: Spark
Issue Type: Improvement
Components: DStreams
Affects Versions: 2.3.0
Reporter: sharkd tu
When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the conf `num-executors` cannot be set. As a result, Spark allocates the default 2 executors and runs all receivers on those 2 executors, so there may be no spare CPU cores for tasks and the application gets stuck.
In my opinion, we should remove this unnecessary restriction for streaming dynamic allocation and allow `num-executors` and `spark.streaming.dynamicAllocation.enabled=true` to be set together. When the application starts, each receiver then runs on an executor.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
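[Editor's note] If the restriction were lifted as proposed, the combination the reporter describes might look like the following spark-submit sketch. This is a hypothetical invocation under the issue's proposal, not something current releases accept; the min/max bounds are illustrative values.

```shell
# Hypothetical once SPARK-22900 lands: request executors up front AND
# enable streaming dynamic allocation (today these are mutually exclusive).
spark-submit \
  --num-executors 8 \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  --conf spark.streaming.dynamicAllocation.minExecutors=2 \
  --conf spark.streaming.dynamicAllocation.maxExecutors=16 \
  my_streaming_app.jar
```

The idea is that the initial 8 executors give every receiver a home at startup, while dynamic allocation then scales within the configured bounds.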
[jira] [Assigned] (SPARK-22899) OneVsRestModel transform on streaming data failed.
[ https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22899:
Assignee: (was: Apache Spark)
> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
> Issue Type: Bug
> Components: ML, Structured Streaming
> Affects Versions: 2.2.1
> Reporter: Weichen Xu
>
> OneVsRestModel transform on streaming data fails because it persists the input dataset, which streaming does not support.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22899) OneVsRestModel transform on streaming data failed.
[ https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22899:
Assignee: Apache Spark
> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
> Issue Type: Bug
> Components: ML, Structured Streaming
> Affects Versions: 2.2.1
> Reporter: Weichen Xu
> Assignee: Apache Spark
>
> OneVsRestModel transform on streaming data fails because it persists the input dataset, which streaming does not support.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22899) OneVsRestModel transform on streaming data failed.
[ https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303194#comment-16303194 ] Apache Spark commented on SPARK-22899:
--
User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/20077
> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
> Issue Type: Bug
> Components: ML, Structured Streaming
> Affects Versions: 2.2.1
> Reporter: Weichen Xu
>
> OneVsRestModel transform on streaming data fails because it persists the input dataset, which streaming does not support.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22899) OneVsRestModel transform on streaming data failed.
Weichen Xu created SPARK-22899:
--
Summary: OneVsRestModel transform on streaming data failed.
Key: SPARK-22899
URL: https://issues.apache.org/jira/browse/SPARK-22899
Project: Spark
Issue Type: Bug
Components: ML, Structured Streaming
Affects Versions: 2.2.1
Reporter: Weichen Xu
OneVsRestModel transform on streaming data fails because it persists the input dataset, which streaming does not support.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
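[Editor's note] The shape of the fix is a guard that skips persistence for streaming input. The sketch below is a toy Python model of that guard, not Spark's actual Dataset API; the class and method names are illustrative.

```python
# Toy model: persist only batch inputs, since streaming Datasets reject persist().
class Dataset:
    def __init__(self, is_streaming):
        self.is_streaming = is_streaming
        self.persisted = False

    def persist(self):
        # Mirrors the failure mode: persisting a streaming Dataset raises.
        if self.is_streaming:
            raise RuntimeError("persist() is not supported on streaming Datasets")
        self.persisted = True

def one_vs_rest_transform(dataset):
    # The guard: only persist when the input is a batch Dataset.
    handle_persistence = not dataset.is_streaming
    if handle_persistence:
        dataset.persist()
    # ... score the input with one binary classifier per class here ...
    return dataset

batch = one_vs_rest_transform(Dataset(is_streaming=False))
stream = one_vs_rest_transform(Dataset(is_streaming=True))  # no longer raises
print(batch.persisted, stream.persisted)  # True False
```

The trade-off is that streaming input loses the caching benefit, but the transform at least completes instead of failing outright.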
[jira] [Resolved] (SPARK-22893) Unified the data type mismatch message
[ https://issues.apache.org/jira/browse/SPARK-22893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22893.
-
Resolution: Fixed
Assignee: Yuming Wang
Fix Version/s: 2.3.0
> Unified the data type mismatch message
> --
>
> Key: SPARK-22893
> URL: https://issues.apache.org/jira/browse/SPARK-22893
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Fix For: 2.3.0
>
>
> {noformat}
> spark-sql> select cast(1 as binary);
> Error in query: cannot resolve 'CAST(1 AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType; line 1 pos 7;
> {noformat}
> We should use {{dataType.simpleString}}.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
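[Editor's note] The intent of SPARK-22893 is a presentation change: error messages should show the SQL-facing simple type name ("int", "binary") rather than the internal class name ("IntegerType"). The sketch below is an illustrative Python model of that change, not Spark's code; the mapping table and function names are hypothetical.

```python
# Hypothetical stand-in for DataType.simpleString: internal name -> SQL name.
SIMPLE_STRING = {"IntegerType": "int", "BinaryType": "binary", "StringType": "string"}

def cast_mismatch_message(sql, from_type, to_type, use_simple_string=True):
    # With use_simple_string=False, this reproduces the pre-fix message style.
    name = (lambda t: SIMPLE_STRING.get(t, t)) if use_simple_string else (lambda t: t)
    return (f"cannot resolve '{sql}' due to data type mismatch: "
            f"cannot cast {name(from_type)} to {name(to_type)}")

# Before the fix: internal class names leak into the message.
print(cast_mismatch_message("CAST(1 AS BINARY)", "IntegerType", "BinaryType", False))
# After the fix: SQL-facing names.
print(cast_mismatch_message("CAST(1 AS BINARY)", "IntegerType", "BinaryType"))
```

The user-visible effect is the resolved message "cannot cast int to binary" instead of "cannot cast IntegerType to BinaryType".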