[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-12-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-20168:
---

Assignee: Yash Sharma

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp when creating a 
> stream. We should have an option to pass a timestamp in the config so that 
> Kinesis can resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8
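
For context, a minimal PySpark sketch of how a Kinesis stream is wired up 
today, where the initial position is limited to the existing enum values; the 
app name, stream name, endpoint, and region below are illustrative, and the 
timestamp-based position proposed here would be a third option alongside 
LATEST and TRIM_HORIZON:

{code:python}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-demo")
ssc = StreamingContext(sc, batchDuration=10)

# Today only LATEST / TRIM_HORIZON can be expressed; this issue proposes an
# additional, timestamp-based initial position.
stream = KinesisUtils.createStream(
    ssc, "kinesis-demo", "my-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.TRIM_HORIZON, 10)
{code}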



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-12-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-20168:

Fix Version/s: 2.3.0

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp when creating a 
> stream. We should have an option to pass a timestamp in the config so that 
> Kinesis can resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-12-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-20168.
-
  Resolution: Done
Target Version/s: 2.3.0

Resolved with
https://github.com/apache/spark/pull/18029

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, streaming
>
> The Kinesis client can resume from a specified timestamp when creating a 
> stream. We should have an option to pass a timestamp in the config so that 
> Kinesis can resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22897) Expose stageAttemptId in TaskContext

2017-12-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303567#comment-16303567
 ] 

Shixiong Zhu commented on SPARK-22897:
--

+1

> Expose stageAttemptId in TaskContext
> -
>
> Key: SPARK-22897
> URL: https://issues.apache.org/jira/browse/SPARK-22897
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Priority: Minor
>
> Currently, there's no easy way for an executor to detect that a new stage 
> attempt has been launched, because stageAttemptId is missing. 
> I'd like to propose exposing stageAttemptId in TaskContext, and I will send 
> a PR if the community thinks it's a good idea.
> cc [~cloud_fan]
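
As a rough illustration of how such an accessor would be used on the executor 
side, a PySpark sketch (assuming PySpark 2.2+ for TaskContext); 
{{stageAttemptId}} is the proposed, hypothetical method, while the other 
accessors already exist:

{code:python}
from pyspark import SparkContext, TaskContext

sc = SparkContext(appName="stage-attempt-demo")

def tag_with_stage(it):
    tc = TaskContext.get()
    # tc.stageId(), tc.partitionId() and tc.attemptNumber() exist today;
    # tc.stageAttemptId() is what this issue proposes (hypothetical name),
    # which would let task code tell stage re-attempts apart.
    for x in it:
        yield (tc.stageId(), tc.partitionId(), x)

print(sc.parallelize(range(4), 2).mapPartitions(tag_with_stage).collect())
{code}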



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22901) Add non-deterministic to Python UDF

2017-12-25 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22901:
---

 Summary: Add non-deterministic to Python UDF
 Key: SPARK-22901
 URL: https://issues.apache.org/jira/browse/SPARK-22901
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1
Reporter: Xiao Li


Add a new API for Python UDFs that allows users to change a UDF's determinism 
from deterministic to non-deterministic.
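
A sketch of what the new API could look like, modeled on the Scala side's 
{{asNondeterministic}}; the Python method name is the proposal here, not a 
released API:

{code:python}
import random

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Proposed: mark the UDF so the optimizer will not duplicate or re-order
# its evaluation on the assumption that it is deterministic.
rand_udf = udf(lambda: int(100 * random.random()),
               IntegerType()).asNondeterministic()

spark.range(3).select(rand_udf().alias("r")).show()
{code}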



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21302) history server WebUI show HTTP ERROR 500

2017-12-25 Thread Xin Yu Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303458#comment-16303458
 ] 

Xin Yu Pan edited comment on SPARK-21302 at 12/26/17 5:40 AM:
--

I hit the same problem with the Spark 2.1.1 History Server. From my 
observation there are no precise steps to reproduce it. There are 180 
completed applications and no incomplete applications. Restarting the History 
Server makes the problem disappear, but after running for more than a day the 
problem is hit again.

Workaround: restart the History Server.

The suspect messages from my Spark History Server log file:
17/12/25 14:48:32 WARN ServletHandler: 
/history/app-20171225144620-0142-056cf26e-0c64-4e14-bd9b-d78871e09745/1/jobs/
java.lang.NullPointerException
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.spark_project.jetty.server.Server.handle(Server.java:499)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:748)



was (Author: pxy0592):
I hit the same problem with the Spark 2.1.1 History Server. From my 
observation there are no precise steps to reproduce it. There are 180 
completed applications and no incomplete applications. Restarting the History 
Server makes the problem disappear, but after running for more than a day the 
problem is hit again.

Workaround: restart the History Server.

> history server WebUI show HTTP ERROR 500
> 
>
> Key: SPARK-21302
> URL: https://issues.apache.org/jira/browse/SPARK-21302
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Jason Pan
> Attachments: npe.PNG
>
>
> When navigating to the history server WebUI and checking incomplete 
> applications, it shows HTTP 500.
> Error logs:
> 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
> app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
> refreshing
> 17/07/05 20:17:44 WARN ServletHandler: 
> /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:00 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> 

[jira] [Commented] (SPARK-22629) incorrect handling of calls to random in UDFs

2017-12-25 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303480#comment-16303480
 ] 

Xiao Li commented on SPARK-22629:
-

The reason is that we assume all UDFs are deterministic. The problem this 
JIRA hit is caused by misuse.

We are considering whether we should change the default to non-deterministic. 

> incorrect handling of calls to random in UDFs
> -
>
> Key: SPARK-22629
> URL: https://issues.apache.org/jira/browse/SPARK-22629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael H
>
> {code:none}
> df_br = spark.createDataFrame([{'name': 'hello'}])
> # udf creates a random integer
> udf_random_col =  udf(lambda: int(100*random.random()), IntegerType())
> # add a column to our DF using that udf
> df_br = df_br.withColumn('RAND', udf_random_col())
> df_br.show()
> +-----+----+
> | name|RAND|
> +-----+----+
> |hello|  68|
> +-----+----+
> # udf that adds 10 to an input column value
> random.seed(1234)
> udf_add_ten =  udf(lambda rand: rand + 10, IntegerType())
> # unexpected result due to re-evaluation
> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  72|           87|
> +-----+----+-------------+
> # workaround: cache the results after using the random number generating udf
> df_br.withColumn('RAND', 
> udf_random_col()).cache().withColumn('RAND_PLUS_TEN', 
> udf_add_ten('RAND')).show()
> +-----+----+-------------+
> | name|RAND|RAND_PLUS_TEN|
> +-----+----+-------------+
> |hello|  68|           78|
> +-----+----+-------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22833) [Examples] Improvements made at SparkHive Example with Scala

2017-12-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303464#comment-16303464
 ] 

Apache Spark commented on SPARK-22833:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20081

> [Examples] Improvements made at SparkHive Example with Scala
> 
>
> Key: SPARK-22833
> URL: https://issues.apache.org/jira/browse/SPARK-22833
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.2.1
>Reporter: Chetan Khatri
>Assignee: Chetan Khatri
>Priority: Minor
> Fix For: 2.3.0
>
>
> The current Scala Spark examples folder is missing implementations for:
> * Writing a DataFrame / Dataset to a Hive managed or Hive external table 
> using different storage formats.
> * Partition, repartition, and coalesce, with appropriate examples.
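
A hedged sketch of what those examples could cover, in PySpark for brevity 
(the issue targets the Scala examples); the table names, path, and formats 
are illustrative and {{df}} is assumed to be an existing DataFrame:

{code:python}
# Hive managed table with an explicit storage format.
df.write.format("orc").mode("overwrite").saveAsTable("db.managed_tbl")

# Hive external table: the same call, anchored to an explicit path.
(df.write.format("parquet")
    .option("path", "/warehouse/external_tbl")
    .saveAsTable("db.external_tbl"))

# Partitioning, repartition, and coalesce (assumes df has a "year" column).
df.write.partitionBy("year").saveAsTable("db.partitioned_tbl")
df.repartition(8)   # full shuffle into 8 partitions
df.coalesce(2)      # merge down to 2 partitions without a full shuffle
{code}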



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21302) history server WebUI show HTTP ERROR 500

2017-12-25 Thread Xin Yu Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303458#comment-16303458
 ] 

Xin Yu Pan commented on SPARK-21302:


I hit the same problem with the Spark 2.1.1 History Server. From my 
observation there are no precise steps to reproduce it. There are 180 
completed applications and no incomplete applications. Restarting the History 
Server makes the problem disappear, but after running for more than a day the 
problem is hit again.

Workaround: restart the History Server.

> history server WebUI show HTTP ERROR 500
> 
>
> Key: SPARK-21302
> URL: https://issues.apache.org/jira/browse/SPARK-21302
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Jason Pan
> Attachments: npe.PNG
>
>
> When navigating to the history server WebUI and checking incomplete 
> applications, it shows HTTP 500.
> Error logs:
> 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
> app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
> refreshing
> 17/07/05 20:17:44 WARN ServletHandler: 
> /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:00 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:17 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-22897) Expose stageAttemptId in TaskContext

2017-12-25 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303452#comment-16303452
 ] 

Wenchen Fan commented on SPARK-22897:
-

https://github.com/apache/spark/pull/12248 added a new method, 
`TaskContext.getLocalProperty`, so I think it's OK to add a new 
`TaskContext.stageAttemptId`.

> Expose stageAttemptId in TaskContext
> -
>
> Key: SPARK-22897
> URL: https://issues.apache.org/jira/browse/SPARK-22897
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Priority: Minor
>
> Currently, there's no easy way for an executor to detect that a new stage 
> attempt has been launched, because stageAttemptId is missing. 
> I'd like to propose exposing stageAttemptId in TaskContext, and I will send 
> a PR if the community thinks it's a good idea.
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22897) Expose stageAttemptId in TaskContext

2017-12-25 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303451#comment-16303451
 ] 

Wenchen Fan commented on SPARK-22897:
-

Sounds reasonable to me, but `TaskContext` is a public API, so we should be 
careful when adding new interfaces to it. cc [~zsxwing] too

> Expose stageAttemptId in TaskContext
> -
>
> Key: SPARK-22897
> URL: https://issues.apache.org/jira/browse/SPARK-22897
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Priority: Minor
>
> Currently, there's no easy way for an executor to detect that a new stage 
> attempt has been launched, because stageAttemptId is missing. 
> I'd like to propose exposing stageAttemptId in TaskContext, and I will send 
> a PR if the community thinks it's a good idea.
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22870) Dynamic allocation should allow 0 idle time

2017-12-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22870:


Assignee: (was: Apache Spark)

> Dynamic allocation should allow 0 idle time
> ---
>
> Key: SPARK-22870
> URL: https://issues.apache.org/jira/browse/SPARK-22870
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>Priority: Minor
>
> As discussed in SPARK-22765, with SPARK-21656 an executor will not idle out 
> when there are pending tasks to run. When there is no task to run, an 
> executor dies out after {{spark.dynamicAllocation.executorIdleTimeout}}, 
> which is currently required to be greater than zero. However, for efficiency, 
> a user should be able to specify that an executor can die out immediately, 
> without being required to be idle for at least 1s.
> This issue is to make {{0}} a valid value for 
> {{spark.dynamicAllocation.executorIdleTimeout}}; special handling of such a 
> case might be needed.
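
For illustration, the configuration this change would permit; a sketch 
assuming the issue is implemented (current releases reject the {{0}} value):

{code:python}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")
        # Proposed: 0 would mean "release an idle executor immediately";
        # today the value must be greater than zero.
        .set("spark.dynamicAllocation.executorIdleTimeout", "0"))
{code}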



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22870) Dynamic allocation should allow 0 idle time

2017-12-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22870:


Assignee: Apache Spark

> Dynamic allocation should allow 0 idle time
> ---
>
> Key: SPARK-22870
> URL: https://issues.apache.org/jira/browse/SPARK-22870
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> As discussed in SPARK-22765, with SPARK-21656 an executor will not idle out 
> when there are pending tasks to run. When there is no task to run, an 
> executor dies out after {{spark.dynamicAllocation.executorIdleTimeout}}, 
> which is currently required to be greater than zero. However, for efficiency, 
> a user should be able to specify that an executor can die out immediately, 
> without being required to be idle for at least 1s.
> This issue is to make {{0}} a valid value for 
> {{spark.dynamicAllocation.executorIdleTimeout}}; special handling of such a 
> case might be needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22870) Dynamic allocation should allow 0 idle time

2017-12-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303438#comment-16303438
 ] 

Apache Spark commented on SPARK-22870:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/20080

> Dynamic allocation should allow 0 idle time
> ---
>
> Key: SPARK-22870
> URL: https://issues.apache.org/jira/browse/SPARK-22870
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>Priority: Minor
>
> As discussed in SPARK-22765, with SPARK-21656 an executor will not idle out 
> when there are pending tasks to run. When there is no task to run, an 
> executor dies out after {{spark.dynamicAllocation.executorIdleTimeout}}, 
> which is currently required to be greater than zero. However, for efficiency, 
> a user should be able to specify that an executor can die out immediately, 
> without being required to be idle for at least 1s.
> This issue is to make {{0}} a valid value for 
> {{spark.dynamicAllocation.executorIdleTimeout}}; special handling of such a 
> case might be needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22793) Memory leak in Spark Thrift Server

2017-12-25 Thread zuotingbing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292314#comment-16292314
 ] 

zuotingbing edited comment on SPARK-22793 at 12/26/17 2:00 AM:
---

Yes, the master branch also has this problem.


was (Author: zuo.tingbing9):
Yes, the master branch also has this problem, but the difference between the 
master branch and 2.0 is big. Could someone help merge this to the master 
branch?

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-22793
> URL: https://issues.apache.org/jira/browse/SPARK-22793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.1
>Reporter: zuotingbing
>Priority: Critical
>
> 1. Start HiveThriftServer2.
> 2. Connect to thriftserver through beeline.
> 3. Close the beeline.
> 4. Repeat steps 2 and 3 several times, which causes the memory leak.
> We found that many directories under the paths
> {code:java}
> hive.exec.local.scratchdir
> {code} and 
> {code:java}
> hive.exec.scratchdir
> {code} are never dropped. As we know, the scratchdir is added to deleteOnExit 
> when it is created, so the FileSystem deleteOnExit cache keeps growing until 
> the JVM terminates.
> In addition, we used 
> {code:java}
> jmap -histo:live [PID]
> {code} to print the sizes of live objects in the HiveThriftServer2 process. 
> The counts of "org.apache.spark.sql.hive.client.HiveClientImpl" and 
> "org.apache.hadoop.hive.ql.session.SessionState" objects keep increasing even 
> after we closed all the beeline connections, which confirms the memory leak.
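
For reproduction, the beeline round trip in steps 2-4 can be scripted; a 
sketch assuming the Thrift Server listens on the default port 10000:

{code:none}
# Each iteration opens a connection, runs a trivial query, and disconnects
# (steps 2-3); repeating it (step 4) drives the scratchdir/deleteOnExit
# growth described above.
for i in $(seq 1 50); do
  beeline -u jdbc:hive2://localhost:10000 -e 'select 1'
done
{code}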



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22893) Unified the data type mismatch message

2017-12-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303353#comment-16303353
 ] 

Apache Spark commented on SPARK-22893:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20079

> Unified the data type mismatch message
> --
>
> Key: SPARK-22893
> URL: https://issues.apache.org/jira/browse/SPARK-22893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.3.0
>
>
> {noformat}
> spark-sql> select cast(1 as binary);
> Error in query: cannot resolve 'CAST(1 AS BINARY)' due to data type mismatch: 
> cannot cast IntegerType to BinaryType; line 1 pos 7;
> {noformat}
> We should use {{dataType.simpleString}}.
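
For reference, {{simpleString}} yields the short SQL-style names the unified 
message would use; a quick PySpark check:

{code:python}
from pyspark.sql.types import BinaryType, IntegerType

print(IntegerType().simpleString())  # int
print(BinaryType().simpleString())   # binary
{code}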



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18115) Custom metrics Sink/Source prevent Executor from starting

2017-12-25 Thread Xudong Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303295#comment-16303295
 ] 

Xudong Zheng commented on SPARK-18115:
--

Hello everyone,

Can anybody give a short status update on this issue? Has it been fixed?

> Custom metrics Sink/Source prevent Executor from starting
> -
>
> Key: SPARK-18115
> URL: https://issues.apache.org/jira/browse/SPARK-18115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Kostya Golikov
>
> Even though there is semi-official support for custom metrics, in practice 
> specifying either a custom sink or a custom source will lead to 
> NoClassDefFound exceptions on the executor side (but will be fine on the 
> driver side).
> The initialization goes as follows: 
> 1. CoarseGrainedExecutorBackend [prepares SparkEnv for 
> executor|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L223]
> 2. SparkEnv [initializes 
> MetricSystem|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkEnv.scala#L338-L351].
>  In the case of an executor, it also starts it.
> 3. On [`.start()` MetricsSystem parses configuration files and creates 
> instances of sinks and 
> sources|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L101-L102].
>  This is where the issue actually happens -- it tries to instantiate classes 
> that are not there yet -- [jars and files are downloaded downstream, in 
> Executor|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/executor/Executor.scala#L257]
> One possible solution is to NOT start the MetricsSystem this early, just 
> like the driver does, but to postpone it until the jar with user-defined 
> code is fetched and available on the classpath. 
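
For context, a custom sink is wired in through {{conf/metrics.properties}}; a 
minimal sketch, where the sink name and class are placeholders for user code 
shipped in the application jar. With the executor line below, the class must 
already be on the executor's classpath when the MetricsSystem starts, which 
is exactly the ordering problem this issue describes:

{code:none}
# conf/metrics.properties -- "mysink" and the class name are hypothetical
executor.sink.mysink.class=com.example.metrics.MyCustomSink
driver.sink.mysink.class=com.example.metrics.MyCustomSink
{code}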



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22840) Incorrect results when using distinct on window

2017-12-25 Thread Lior Chaga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303284#comment-16303284
 ] 

Lior Chaga commented on SPARK-22840:


gotcha [~greenhat], thanks

> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- calibratedRecsHistory: double (nullable = true)
>  |||-- eventTime: long (nullable = true)
>  |||-- itemId: long (nullable = true)
>  |||-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with the stats 
> elements for a specific id and start_time identical across all rows. I've 
> noticed inconsistent results when using a window with FIRST(stats) DESC and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query that shows it:
> {code}
> SELECT DISTINCT
> id ,
> LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except
> SELECT DISTINCT
> id ,
> FIRST(stats) over w 
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The data for the sample is attached in [^sample.parquet.zip].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22840) Incorrect results when using distinct on window

2017-12-25 Thread Lior Chaga (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lior Chaga resolved SPARK-22840.

Resolution: Not A Bug

> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- calibratedRecsHistory: double (nullable = true)
>  |||-- eventTime: long (nullable = true)
>  |||-- itemId: long (nullable = true)
>  |||-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with the stats 
> elements for a specific id and start_time identical across all rows. I've 
> noticed inconsistent results when using a window with FIRST(stats) DESC and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query that shows it:
> {code}
> SELECT DISTINCT
> id ,
> LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except
> SELECT DISTINCT
> id ,
> FIRST(stats) over w 
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The data for the sample is attached in [^sample.parquet.zip].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22840) Incorrect results when using distinct on window

2017-12-25 Thread Denys Zadorozhnyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303281#comment-16303281
 ] 

Denys Zadorozhnyi edited comment on SPARK-22840 at 12/25/17 2:14 PM:
-

[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the 
window it'll be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}}, which 
makes the {{last}} function always return the current value (see 
[https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]).
If you explicitly specify the frame you should get the result you are 
expecting:
{code}
 |SELECT DISTINCT
|id ,
|LAST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time DESC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
|except
|SELECT DISTINCT
|id ,
|FIRST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time ASC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}


was (Author: greenhat):
[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the 
window it'll be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}} which 
makes {last} function to always return current value ( see - 
[https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]
 ).
If you explicitly specify the frame you should get the result you are expecting 
:
{code}
 |SELECT DISTINCT
|id ,
|LAST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time DESC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
|except
|SELECT DISTINCT
|id ,
|FIRST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time ASC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}

> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- calibratedRecsHistory: double (nullable = true)
>  |||-- eventTime: long (nullable = true)
>  |||-- itemId: long (nullable = true)
>  |||-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with the stats 
> elements for a specific id and start_time identical across all rows. I've 
> noticed inconsistent results when using a window with FIRST(stats) DESC and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query that shows it:
> {code}
> SELECT DISTINCT
> id ,
> LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except
> SELECT DISTINCT
> id ,
> FIRST(stats) over w 
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The data for the sample is attached in [^sample.parquet.zip].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22840) Incorrect results when using distinct on window

2017-12-25 Thread Denys Zadorozhnyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303281#comment-16303281
 ] 

Denys Zadorozhnyi edited comment on SPARK-22840 at 12/25/17 2:14 PM:
-

[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the 
window it'll be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}}, which 
makes the {last} function always return the current value (see 
[https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]).
If you explicitly specify the frame you should get the result you are 
expecting:
{code}
 |SELECT DISTINCT
|id ,
|LAST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time DESC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
|except
|SELECT DISTINCT
|id ,
|FIRST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time ASC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}


was (Author: greenhat):
[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the 
window it'll be {RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW} which makes 
{last} function to always return current value ( see - 
[https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]
 ).
If you explicitly specify the frame you should get the result you are expecting 
:
{code}
 |SELECT DISTINCT
|id ,
|LAST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time DESC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
|except
|SELECT DISTINCT
|id ,
|FIRST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time ASC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}

> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- calibratedRecsHistory: double (nullable = true)
>  |||-- eventTime: long (nullable = true)
>  |||-- itemId: long (nullable = true)
>  |||-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with the stats 
> elements for a specific id and start_time identical across all rows. I've 
> noticed inconsistent results when using a window with FIRST(stats) DESC and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query that shows it:
> {code}
> SELECT DISTINCT
> id ,
> LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except
> SELECT DISTINCT
> id ,
> FIRST(stats) over w 
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The data for the sample is attached in [^sample.parquet.zip].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22840) Incorrect results when using distinct on window

2017-12-25 Thread Denys Zadorozhnyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303281#comment-16303281
 ] 

Denys Zadorozhnyi edited comment on SPARK-22840 at 12/25/17 2:14 PM:
-

[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the 
window it'll be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}}, which 
makes the {{last}} function always return the current value (see 
[https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]).
If you explicitly specify the frame you should get the result you are 
expecting (an empty set):
{code}
 |SELECT DISTINCT
|id ,
|LAST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time DESC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
|except
|SELECT DISTINCT
|id ,
|FIRST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time ASC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}


was (Author: greenhat):
[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the 
window it'll be {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}} which 
makes {{last}} function to always return the current value ( see - 
[https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]
 ).
If you explicitly specify the frame you should get the result you are expecting 
:
{code}
 |SELECT DISTINCT
|id ,
|LAST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time DESC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
|except
|SELECT DISTINCT
|id ,
|FIRST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time ASC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}

> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- calibratedRecsHistory: double (nullable = true)
>  |||-- eventTime: long (nullable = true)
>  |||-- itemId: long (nullable = true)
>  |||-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with the stats 
> elements for a specific id and start_time identical across all rows. I've 
> noticed inconsistent results when using a window with FIRST(stats) DESC and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query that shows it:
> {code}
> SELECT DISTINCT
> id ,
> LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except
> SELECT DISTINCT
> id ,
> FIRST(stats) over w 
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The data for the sample is attached in [^sample.parquet.zip].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22840) Incorrect results when using distinct on window

2017-12-25 Thread Denys Zadorozhnyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303281#comment-16303281
 ] 

Denys Zadorozhnyi commented on SPARK-22840:
---

[~lio...@taboola.com] [~liorchaga] If you don't specify the frame for the 
window it'll be {RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}, which 
makes the {last} function always return the current value (see 
[https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#rows-between-and-range-between-clauses]).
If you explicitly specify the frame you should get the result you are 
expecting:
{code}
 |SELECT DISTINCT
|id ,
|LAST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time DESC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
|except
|SELECT DISTINCT
|id ,
|FIRST(stats) over w
|FROM sample
|WINDOW w AS (PARTITION BY id  SORT BY start_time ASC ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
{code}
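
The same frame can also be spelled through the DataFrame API; a sketch 
assuming the attached data is loaded as a DataFrame named {{sample}}:

{code:python}
from pyspark.sql import Window
from pyspark.sql import functions as F

w = (Window.partitionBy("id")
           .orderBy(F.desc("start_time"))
           .rowsBetween(Window.unboundedPreceding,
                        Window.unboundedFollowing))

# With an explicit ROWS frame, last() sees the whole partition rather than
# only the rows up to the current one.
result = sample.select("id", F.last("stats").over(w).alias("stats")).distinct()
{code}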

> Incorrect results when using distinct on window
> ---
>
> Key: SPARK-22840
> URL: https://issues.apache.org/jira/browse/SPARK-22840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lior Chaga
> Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- calibratedRecsHistory: double (nullable = true)
>  |||-- eventTime: long (nullable = true)
>  |||-- itemId: long (nullable = true)
>  |||-- recsHistory: long (nullable = true)
> {code}
> The data contains multiple rows per id and start_time, with the stats 
> elements for a specific id and start_time identical across all rows. I've 
> noticed inconsistent results when using a window with FIRST(stats) DESC and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query that shows it:
> {code}
> SELECT DISTINCT
> id ,
> LAST(stats) over w
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except
> SELECT DISTINCT
> id ,
> FIRST(stats) over w 
> FROM sample
> WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> The data for the sample is attached in [^sample.parquet.zip].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation

2017-12-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303277#comment-16303277
 ] 

Apache Spark commented on SPARK-22900:
--

User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20078

> remove unnecessary restrict for streaming dynamic allocation
> 
>
> Key: SPARK-22900
> URL: https://issues.apache.org/jira/browse/SPARK-22900
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: sharkd tu
>
> When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the 
> conf `num-executors` cannot be set. As a result, the default of 2 executors 
> is allocated and all receivers run on these 2 executors, so there may be no 
> spare CPU cores for tasks and the application stays stuck.
> In my opinion, we should remove this unnecessary restriction for streaming 
> dynamic allocation. We should be able to set `num-executors` and 
> `spark.streaming.dynamicAllocation.enabled=true` together; when the 
> application starts, each receiver will then run on an executor.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation

2017-12-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22900:


Assignee: Apache Spark

> remove unnecessary restrict for streaming dynamic allocation
> 
>
> Key: SPARK-22900
> URL: https://issues.apache.org/jira/browse/SPARK-22900
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: sharkd tu
>Assignee: Apache Spark
>
> When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the 
> conf `num-executors` cannot be set. As a result, the default of 2 executors 
> is allocated and all receivers run on these 2 executors, so there may be no 
> spare CPU cores for tasks and the application stays stuck.
> In my opinion, we should remove this unnecessary restriction for streaming 
> dynamic allocation. We should be able to set `num-executors` and 
> `spark.streaming.dynamicAllocation.enabled=true` together; when the 
> application starts, each receiver will then run on an executor.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation

2017-12-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22900:


Assignee: (was: Apache Spark)

> remove unnecessary restrict for streaming dynamic allocation
> 
>
> Key: SPARK-22900
> URL: https://issues.apache.org/jira/browse/SPARK-22900
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: sharkd tu
>
> When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the 
> conf `num-executors` cannot be set. As a result, the default of 2 executors 
> is allocated and all receivers run on these 2 executors, so there may be no 
> spare CPU cores for tasks and the application stays stuck.
> In my opinion, we should remove this unnecessary restriction for streaming 
> dynamic allocation. We should be able to set `num-executors` and 
> `spark.streaming.dynamicAllocation.enabled=true` together; when the 
> application starts, each receiver will then run on an executor.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation

2017-12-25 Thread sharkd tu (JIRA)
sharkd tu created SPARK-22900:
-

 Summary: remove unnecessary restrict for streaming dynamic 
allocation
 Key: SPARK-22900
 URL: https://issues.apache.org/jira/browse/SPARK-22900
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.3.0
Reporter: sharkd tu


When I set the conf `spark.streaming.dynamicAllocation.enabled=true`, the conf 
`num-executors` cannot be set. As a result, the default of 2 executors is 
allocated and all receivers run on these 2 executors, so there may be no spare 
CPU cores for tasks and the application stays stuck.

In my opinion, we should remove this unnecessary restriction for streaming 
dynamic allocation. We should be able to set `num-executors` and 
`spark.streaming.dynamicAllocation.enabled=true` together; when the 
application starts, each receiver will then run on an executor.
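
For illustration, the combination this issue wants to allow; a sketch (the 
current check rejects it, and the application file name is a placeholder):

{code:none}
spark-submit \
  --num-executors 10 \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  my_streaming_app.py
{code}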




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22899) OneVsRestModel transform on streaming data failed.

2017-12-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22899:


Assignee: (was: Apache Spark)

> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>
> OneVsRestModel transform on streaming data fails because it persists the 
> input dataset, which streaming does not support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22899) OneVsRestModel transform on streaming data failed.

2017-12-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22899:


Assignee: Apache Spark

> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>Assignee: Apache Spark
>
> OneVsRestModel transform on streaming data fails because it persists the 
> input dataset, which streaming does not support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22899) OneVsRestModel transform on streaming data failed.

2017-12-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303194#comment-16303194
 ] 

Apache Spark commented on SPARK-22899:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/20077

> OneVsRestModel transform on streaming data failed.
> --
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>
> OneVsRestModel transform on streaming data fails because it persists the 
> input dataset, which streaming does not support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22899) OneVsRestModel transform on streaming data failed.

2017-12-25 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-22899:
--

 Summary: OneVsRestModel transform on streaming data failed.
 Key: SPARK-22899
 URL: https://issues.apache.org/jira/browse/SPARK-22899
 Project: Spark
  Issue Type: Bug
  Components: ML, Structured Streaming
Affects Versions: 2.2.1
Reporter: Weichen Xu


OneVsRestModel transform on streaming data fails because it persists the input 
dataset, which streaming does not support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22893) Unified the data type mismatch message

2017-12-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22893.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.3.0

> Unified the data type mismatch message
> --
>
> Key: SPARK-22893
> URL: https://issues.apache.org/jira/browse/SPARK-22893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.3.0
>
>
> {noformat}
> spark-sql> select cast(1 as binary);
> Error in query: cannot resolve 'CAST(1 AS BINARY)' due to data type mismatch: 
> cannot cast IntegerType to BinaryType; line 1 pos 7;
> {noformat}
> We should use {{dataType.simpleString}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org