[jira] [Commented] (SPARK-22565) Session-based windowing

2018-09-27 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631370#comment-16631370
 ] 

Li Yuanjian commented on SPARK-22565:
-

[~kabhwan]

Many thanks for letting me know; sorry that I only searched for "session window" and 
missed SPARK-10816.

{quote}

It would be nice if you could also share your SPIP, as well as any PR or design 
doc, so that we can see where to collaborate and build a better product.

{quote}

No problem. I'll cherry-pick all related patches from our internal fork. We have 
actually been translating the internal doc for a few days and will also post a 
design doc today; let's discuss in SPARK-10816.

Thanks again for your reply.

 

> Session-based windowing
> ---
>
> Key: SPARK-22565
> URL: https://issues.apache.org/jira/browse/SPARK-22565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Richard Xin
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I came across a requirement to support session-based windowing. For example, 
> user activity comes in from Kafka, and we want to create a window per user session 
> (if the time gap between activities from the same user exceeds a predefined value, 
> a new window will be created).
> I noticed that Flink does support this kind of windowing; is there any plan/schedule 
> for Spark to support this?






[jira] [Comment Edited] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631295#comment-16631295
 ] 

Jungtaek Lim edited comment on SPARK-25380 at 9/28/18 3:41 AM:
---

IMHO it depends on how we see the issue and how we would like to tackle it.

If we think a 200 MB plan string is normal and usual, you're right that the issue 
lies in the UI and the UI should deal with it well.
 (Even a single 200 MB plan would be beyond end users' expectations, and they 
might not consider allocating enough space on the driver side for the UI, so 
purging old plans would work for some cases but not for others.)

If we don't think a 200 MB plan string is normal, we need to see an actual case 
and investigate which physical node occupies so much space in its representation, 
and whether that is really needed or just too verbose. If the huge string comes 
from representing the physical node itself, which doesn't change between batches, 
we may be able to store a template of the message for each physical node and its 
variables separately, and render them only when the page is requested.

If we knew more, we could come up with a better solution.

Since we are unlikely to get a reproducer, I wouldn't want to block anyone from 
working on this. Anyone could tackle the UI issue.

EDIT: I might have misunderstood your previous comment, so I just removed the 
lines where I mentioned it.


was (Author: kabhwan):
IMHO it depends on how we see the issue and how we would like to tackle it.

If we think a 200 MB plan string is normal and usual, you're right that the issue 
lies in the UI and the UI should deal with it well.
(Even a single 200 MB plan would be beyond end users' expectations, and they 
might not consider allocating enough space on the driver side for the UI, so 
purging old plans would work for some cases but not for others.)

If we don't think a 200 MB plan string is normal, we need to see an actual case 
and investigate which physical node occupies so much space in its representation, 
and whether that is really needed or just too verbose. If the huge string comes 
from representing the physical node itself, which doesn't change between batches, 
we may be able to store a template of the message for each physical node and its 
variables separately, and render them only when the page is requested.

If we knew more, we could come up with a better solution; according to your previous 
comment, I guess we're on the same page:
{quote}They seem to hold a lot more memory than just the plan graph structures 
do, it would be nice to know what exactly is holding on to that memory.
{quote}
Since we are unlikely to get a reproducer, I wouldn't want to block anyone from 
working on this. Anyone could tackle the UI issue.

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of the OOM heap dump opened in JVisualVM.
>  






[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631313#comment-16631313
 ] 

Jungtaek Lim commented on SPARK-25380:
--

Btw, a reproducer still helps even if we tackle this only on the UI side. There are 
a few options to avoid the memory issue:
 # Remove the feature (or add an option to "opt out") of showing the physical plan 
description in the UI.
 # Find some way to dramatically reduce the memory used to store the physical plan 
description.
 # Purge old physical plans (count- or memory-based), as sketched below.

2 and 3 can be applied individually or together. If we are interested in 2, we 
would still want the actual string to see how much we can reduce it (e.g. via 
compression/decompression).
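
A rough illustration of option 3: Spark already exposes a count-based retention 
setting for completed SQL executions in the UI, and a minimal sketch of tightening 
it (assuming the large plan strings live in those retained executions) could look 
like this.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: keep fewer completed SQL executions (and hence fewer plan description
// strings) in the UI. This is count-based; a memory-based purge would be new work.
val spark = SparkSession.builder()
  .appName("retain-fewer-plans")
  .config("spark.sql.ui.retainedExecutions", "50")  // lower than the default to cap retained plans
  .getOrCreate()
{code}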

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of the OOM heap dump opened in JVisualVM.
>  






[jira] [Updated] (SPARK-25563) Spark application hangs if container is allocated on a lost NodeManager

2018-09-27 Thread devinduan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

devinduan updated SPARK-25563:
--
Summary: Spark application hangs if container is allocated on a lost NodeManager  
(was: Spark application hangs)

> Spark application hangs if container is allocated on a lost NodeManager
> -
>
> Key: SPARK-25563
> URL: https://issues.apache.org/jira/browse/SPARK-25563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: devinduan
>Priority: Minor
>
>     I met an issue where, if I start a Spark application in YARN client mode, 
> the application sometimes hangs.
>     I checked the application logs: a container was allocated on a lost NodeManager, 
> but the AM doesn't retry to start another executor.
>     My Spark version is 2.3.1.
>     Here is my ApplicationMaster log.
>  
> 2018-09-26 05:21:15 INFO YarnRMClient:54 - Registering the ApplicationMaster
> 2018-09-26 05:21:15 INFO ConfiguredRMFailoverProxyProvider:100 - Failing over 
> to rm2 
> 2018-09-26 05:21:15 WARN Utils:66 - spark.executor.instances less than 
> spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please 
> update your configs.
> 2018-09-26 05:21:15 INFO Utils:54 - Using initial executors = 1, max of 
> spark.dynamicAllocation.initialExecutors, 
> spark.dynamicAllocation.minExecutors and spark.executor.instances
> 2018-09-26 05:21:15 INFO YarnAllocator:54 - Will request 1 executor 
> container(s), each with 24 core(s) and 20275 MB memory (including 1843 MB of 
> overhead)
> 2018-09-26 05:21:15 INFO YarnAllocator:54 - Submitted 1 unlocalized container 
> requests.
> 2018-09-26 05:21:15 INFO ApplicationMaster:54 - Started progress reporter 
> thread with (heartbeat : 3000, initial allocation : 200) intervals
> 2018-09-26 05:21:27 WARN YarnAllocator:66 - Cannot find executorId for 
> container: container_1532951609168_4721728_01_02
> 2018-09-26 05:21:27 INFO YarnAllocator:54 - Completed container 
> container_1532951609168_4721728_01_02 (state: COMPLETE, exit status: -100)
> 2018-09-26 05:21:27 WARN YarnAllocator:66 - Container marked as failed: 
> container_1532951609168_4721728_01_02. Exit status: -100. Diagnostics: 
> Container released on a *lost* node






[jira] [Created] (SPARK-25563) Spark application hangs

2018-09-27 Thread devinduan (JIRA)
devinduan created SPARK-25563:
-

 Summary: Spark application hangs
 Key: SPARK-25563
 URL: https://issues.apache.org/jira/browse/SPARK-25563
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: devinduan


    I met an issue where, if I start a Spark application in YARN client mode, 
the application sometimes hangs.
    I checked the application logs: a container was allocated on a lost NodeManager, 
but the AM doesn't retry to start another executor.
    My Spark version is 2.3.1.
    Here is my ApplicationMaster log.
 
2018-09-26 05:21:15 INFO YarnRMClient:54 - Registering the ApplicationMaster
2018-09-26 05:21:15 INFO ConfiguredRMFailoverProxyProvider:100 - Failing over 
to rm2 
2018-09-26 05:21:15 WARN Utils:66 - spark.executor.instances less than 
spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please 
update your configs.
2018-09-26 05:21:15 INFO Utils:54 - Using initial executors = 1, max of 
spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors 
and spark.executor.instances
2018-09-26 05:21:15 INFO YarnAllocator:54 - Will request 1 executor 
container(s), each with 24 core(s) and 20275 MB memory (including 1843 MB of 
overhead)
2018-09-26 05:21:15 INFO YarnAllocator:54 - Submitted 1 unlocalized container 
requests.
2018-09-26 05:21:15 INFO ApplicationMaster:54 - Started progress reporter 
thread with (heartbeat : 3000, initial allocation : 200) intervals
2018-09-26 05:21:27 WARN YarnAllocator:66 - Cannot find executorId for 
container: container_1532951609168_4721728_01_02
2018-09-26 05:21:27 INFO YarnAllocator:54 - Completed container 
container_1532951609168_4721728_01_02 (state: COMPLETE, exit status: -100)
2018-09-26 05:21:27 WARN YarnAllocator:66 - Container marked as failed: 
container_1532951609168_4721728_01_02. Exit status: -100. Diagnostics: 
Container released on a *lost* node






[jira] [Commented] (SPARK-22565) Session-based windowing

2018-09-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631301#comment-16631301
 ] 

Jungtaek Lim commented on SPARK-22565:
--

[~XuanYuan]

Hello, I started work on supporting this feature about a week ago in a different 
JIRA issue. Please refer to 
https://issues.apache.org/jira/browse/SPARK-10816 as well as the SPIP discussion 
thread on the dev@ mailing list. A WIP version of the PR is also available: 
[https://github.com/apache/spark/pull/22482]

It would be nice if you could also share your SPIP, as well as any PR or design 
doc, so that we can see where to collaborate and build a better product.

Thanks in advance!

> Session-based windowing
> ---
>
> Key: SPARK-22565
> URL: https://issues.apache.org/jira/browse/SPARK-22565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Richard Xin
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I came across a requirement to support session-based windowing. For example, 
> user activity comes in from Kafka, and we want to create a window per user session 
> (if the time gap between activities from the same user exceeds a predefined value, 
> a new window will be created).
> I noticed that Flink does support this kind of windowing; is there any plan/schedule 
> for Spark to support this?






[jira] [Comment Edited] (SPARK-22565) Session-based windowing

2018-09-27 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631298#comment-16631298
 ] 

Li Yuanjian edited comment on SPARK-22565 at 9/28/18 3:23 AM:
--

Thanks for reporting this. Actually, we also met this problem in our use case, and 
we have an implementation of session windows in our internal fork to resolve it. 
After it ran steadily online in a real production environment, we want to 
contribute it to the community within the next few days. We implemented this as a 
built-in function named session_window, with corresponding support for window 
merging in Structured Streaming. The DataFrame API and SQL usage can be quickly 
browsed in the test:
 !screenshot-1.png!


was (Author: xuanyuan):
Thanks for reporting this. Actually, we also met this problem in our use case, and 
we have an implementation of session windows in our internal fork to resolve it. 
After it ran steadily online in a real production environment, we want to 
contribute it to the community within the next few days. We implemented this as a 
built-in function named session_window. The DataFrame API and SQL usage can be 
quickly browsed in the test:
 !screenshot-1.png!
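
For readers wondering what the proposed usage might look like: the sketch below is 
only an illustration of the session_window built-in described above. The function, 
the Kafka topic name, and the 30-minute gap are assumptions taken from the 
proposal, not an existing Spark 2.x API.

{code:scala}
// Sketch only: session_window is the proposed built-in described above and is not
// available in Spark 2.x; the source and topic names are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("session-window-sketch").getOrCreate()

val events = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "user-activity")
  .load()
  .selectExpr("CAST(key AS STRING) AS userId", "timestamp AS eventTime")

// Proposed usage: group events into per-user sessions that close after a 30-minute gap.
val sessionCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(col("userId"), expr("session_window(eventTime, '30 minutes')"))
  .count()
{code}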

> Session-based windowing
> ---
>
> Key: SPARK-22565
> URL: https://issues.apache.org/jira/browse/SPARK-22565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Richard Xin
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I came across a requirement to support session-based windowing. For example, 
> user activity comes in from Kafka, and we want to create a window per user session 
> (if the time gap between activities from the same user exceeds a predefined value, 
> a new window will be created).
> I noticed that Flink does support this kind of windowing; is there any plan/schedule 
> for Spark to support this?






[jira] [Commented] (SPARK-22565) Session-based windowing

2018-09-27 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631300#comment-16631300
 ] 

Li Yuanjian commented on SPARK-22565:
-

Also cc [~zsxwing] [~tdas]: we are translating the design doc and will post a 
SPIP in the coming days. Hope you can have a look when you have time, thanks :)

> Session-based windowing
> ---
>
> Key: SPARK-22565
> URL: https://issues.apache.org/jira/browse/SPARK-22565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Richard Xin
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I came across a requirement to support session-based windowing. For example, 
> user activity comes in from Kafka, and we want to create a window per user session 
> (if the time gap between activities from the same user exceeds a predefined value, 
> a new window will be created).
> I noticed that Flink does support this kind of windowing; is there any plan/schedule 
> for Spark to support this?






[jira] [Commented] (SPARK-22565) Session-based windowing

2018-09-27 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631298#comment-16631298
 ] 

Li Yuanjian commented on SPARK-22565:
-

Thanks for reporting this. Actually, we also met this problem in our use case, and 
we have an implementation of session windows in our internal fork to resolve it. 
After it ran steadily online in a real production environment, we want to 
contribute it to the community within the next few days. We implemented this as a 
built-in function named session_window. The DataFrame API and SQL usage can be 
quickly browsed in the test:
 !screenshot-1.png!

> Session-based windowing
> ---
>
> Key: SPARK-22565
> URL: https://issues.apache.org/jira/browse/SPARK-22565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Richard Xin
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I came across a requirement to support session-based windowing. For example, 
> user activity comes in from Kafka, and we want to create a window per user session 
> (if the time gap between activities from the same user exceeds a predefined value, 
> a new window will be created).
> I noticed that Flink does support this kind of windowing; is there any plan/schedule 
> for Spark to support this?






[jira] [Updated] (SPARK-22565) Session-based windowing

2018-09-27 Thread Li Yuanjian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-22565:

Attachment: screenshot-1.png

> Session-based windowing
> ---
>
> Key: SPARK-22565
> URL: https://issues.apache.org/jira/browse/SPARK-22565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Richard Xin
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I came across a requirement to support session-based windowing. For example, 
> user activity comes in from Kafka, and we want to create a window per user session 
> (if the time gap between activities from the same user exceeds a predefined value, 
> a new window will be created).
> I noticed that Flink does support this kind of windowing; is there any plan/schedule 
> for Spark to support this?






[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631295#comment-16631295
 ] 

Jungtaek Lim commented on SPARK-25380:
--

IMHO it depends on how we see the issue and how we would like to tackle it.

If we think a 200 MB plan string is normal and usual, you're right that the issue 
lies in the UI and the UI should deal with it well.
(Even a single 200 MB plan would be beyond end users' expectations, and they 
might not consider allocating enough space on the driver side for the UI, so 
purging old plans would work for some cases but not for others.)

If we don't think a 200 MB plan string is normal, we need to see an actual case 
and investigate which physical node occupies so much space in its representation, 
and whether that is really needed or just too verbose. If the huge string comes 
from representing the physical node itself, which doesn't change between batches, 
we may be able to store a template of the message for each physical node and its 
variables separately, and render them only when the page is requested.

If we knew more, we could come up with a better solution; according to your previous 
comment, I guess we're on the same page:
{quote}They seem to hold a lot more memory than just the plan graph structures 
do, it would be nice to know what exactly is holding on to that memory.
{quote}
Since we are unlikely to get a reproducer, I wouldn't want to block anyone from 
working on this. Anyone could tackle the UI issue.
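
To make the "template plus variables" idea above concrete, here is a minimal 
sketch; the class and field names are purely illustrative and not anything that 
exists in Spark.

{code:scala}
// Sketch: keep a reusable template per physical node plus the per-batch variables,
// and only build the full description string when the UI page is actually requested.
case class PlanNodeDescription(template: String, args: Seq[String]) {
  def render: String = template.format(args: _*)
}

val node = PlanNodeDescription(
  "FileScan parquet %s, PushedFilters: %s",
  Seq("db.events", "[IsNotNull(userId)]"))

println(node.render)  // rendered lazily instead of stored as one huge string per batch
{code}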

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of the OOM heap dump opened in JVisualVM.
>  






[jira] [Created] (SPARK-25562) The Spark add audit log

2018-09-27 Thread yinghua_zh (JIRA)
yinghua_zh created SPARK-25562:
--

 Summary: The Spark add audit log
 Key: SPARK-25562
 URL: https://issues.apache.org/jira/browse/SPARK-25562
 Project: Spark
  Issue Type: New Feature
  Components: Spark Submit
Affects Versions: 2.1.0
Reporter: yinghua_zh


 
At present, Spark does not record audit logs; audit logging could be added for 
security reasons.






[jira] [Assigned] (SPARK-25560) Allow Function Injection in SparkSessionExtensions

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25560:


Assignee: Apache Spark

> Allow Function Injection in SparkSessionExtensions
> --
>
> Key: SPARK-25560
> URL: https://issues.apache.org/jira/browse/SPARK-25560
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Russell Spitzer
>Assignee: Apache Spark
>Priority: Major
>
> Currently there is no way to add a set of external functions to all sessions 
> made by users. We could add a small extension to SparkSessionExtensions which 
> would allow this to be done. 






[jira] [Assigned] (SPARK-25560) Allow Function Injection in SparkSessionExtensions

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25560:


Assignee: (was: Apache Spark)

> Allow Function Injection in SparkSessionExtensions
> --
>
> Key: SPARK-25560
> URL: https://issues.apache.org/jira/browse/SPARK-25560
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Russell Spitzer
>Priority: Major
>
> Currently there is no way to add a set of external functions to all sessions 
> made by users. We could add a small extension to SparkSessionExtensions which 
> would allow this to be done. 






[jira] [Commented] (SPARK-25560) Allow Function Injection in SparkSessionExtensions

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631257#comment-16631257
 ] 

Apache Spark commented on SPARK-25560:
--

User 'RussellSpitzer' has created a pull request for this issue:
https://github.com/apache/spark/pull/22576

> Allow Function Injection in SparkSessionExtensions
> --
>
> Key: SPARK-25560
> URL: https://issues.apache.org/jira/browse/SPARK-25560
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Russell Spitzer
>Priority: Major
>
> Currently there is no way to add a set of external functions to all sessions 
> made by users. We could add a small extension to SparkSessionExtensions which 
> would allow this to be done. 






[jira] [Assigned] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24630:


Assignee: Apache Spark

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Assignee: Apache Spark
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 






[jira] [Assigned] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24630:


Assignee: (was: Apache Spark)

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 






[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631253#comment-16631253
 ] 

Apache Spark commented on SPARK-24630:
--

User 'stczwd' has created a pull request for this issue:
https://github.com/apache/spark/pull/22575

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 






[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-27 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631236#comment-16631236
 ] 

Marcelo Vanzin commented on SPARK-25380:


If all you want is to see this live, why do you need a query that generates a 
large plan?

Just hack whatever method creates the plan string to return a really large 
string full of garbage. Same result.

The goal here is not to reproduce the original problem, but to provide a 
solution for when it happens; there's no problem with his query other than 
the plan being large and the UI not dealing with that well.
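
A trivial sketch of the kind of hack being suggested, i.e. generating an 
arbitrarily large stand-in for the plan string purely to exercise the UI (the 
size chosen here is an arbitrary assumption):

{code:scala}
// Sketch: a 200-million-character string standing in for a huge plan description.
val hugePlanString: String = "x" * (200 * 1000 * 1000)
println(s"stand-in plan description length: ${hugePlanString.length}")
{code}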

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of the OOM heap dump opened in JVisualVM.
>  






[jira] [Commented] (SPARK-25459) Add viewOriginalText back to CatalogTable

2018-09-27 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631232#comment-16631232
 ] 

Dongjoon Hyun commented on SPARK-25459:
---

[~chrisz28]. It seems that JIRA is unable to accept your id.


 !error_message.png!

> Add viewOriginalText back to CatalogTable
> -
>
> Key: SPARK-25459
> URL: https://issues.apache.org/jira/browse/SPARK-25459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Zheyuan Zhao
>Priority: Major
> Attachments: error_message.png
>
>
> The {{show create table}} command will show a lot of generated attributes for views 
> that were created by older Spark versions. See this test suite: 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala#L115.
>  






[jira] [Updated] (SPARK-25459) Add viewOriginalText back to CatalogTable

2018-09-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25459:
--
Attachment: error_message.png

> Add viewOriginalText back to CatalogTable
> -
>
> Key: SPARK-25459
> URL: https://issues.apache.org/jira/browse/SPARK-25459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Zheyuan Zhao
>Priority: Major
> Attachments: error_message.png
>
>
> The {{show create table}} command will show a lot of generated attributes for views 
> that were created by older Spark versions. See this test suite: 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala#L115.
>  






[jira] [Commented] (SPARK-10816) EventTime based sessionization

2018-09-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631206#comment-16631206
 ] 

Jungtaek Lim commented on SPARK-10816:
--

To avoid any concerns or confusion, I believe my proposal and 
map/flatMapGroupsWithState can co-exist.

map/flatMapGroupsWithState targets general cases (arbitrary, not specific to 
windowing) but can't be fully optimized for any specific case, by the nature of 
generalization. Edge cases also come out of generalization, and if we tackle the 
edge case of supporting multiple values per key with map/flatMapGroupsWithState, 
it would add non-trivial overhead for the cases that don't need multiple values 
per key, and the state function may become more complicated or take a couple of 
forms.

The point is whether a simple gap-based session window is worth treating as a 
first-class use case. Spark supports tumbling/sliding windows natively because we 
see their worth. I think we see the worth of supporting session windows since we 
have an example on sessionization, and I guess supporting them natively would give 
much more benefit than the added complexity.

The same would apply if we were to add some other API (function or DSL) for 
supporting custom windows in a follow-up issue (SPARK-2 as Arun stated). If 
we find it much more convenient and see the worth of supporting it natively 
instead of letting end users play with map/flatMapGroupsWithState, it can be the 
way to go.
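
For context on the trade-off above, this is roughly what gap-based sessionization 
looks like today with flatMapGroupsWithState, condensed in the spirit of Spark's 
StructuredSessionization example; the Event/SessionState/SessionUpdate types, the 
socket source, and the 30-minute gap are illustrative assumptions.

{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Illustrative types, not part of Spark.
case class Event(userId: String, eventTime: Timestamp)
case class SessionState(start: Long, end: Long)
case class SessionUpdate(userId: String, durationMs: Long, expired: Boolean)

object SessionizeWithState {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sessionize-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder source: lines of "userId,epochMillis" from a socket.
    val events = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()
      .as[String]
      .map { line =>
        val Array(user, ts) = line.split(",")
        Event(user, new Timestamp(ts.toLong))
      }

    val gapMs = 30 * 60 * 1000L  // assumed 30-minute session gap

    val sessions = events
      .groupByKey(_.userId)
      .flatMapGroupsWithState[SessionState, SessionUpdate](
          OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout) {
        (userId: String, batch: Iterator[Event], state: GroupState[SessionState]) =>
          if (state.hasTimedOut) {
            // No event arrived within the gap: close the session and drop the state.
            val finished = state.get
            state.remove()
            Iterator(SessionUpdate(userId, finished.end - finished.start, expired = true))
          } else {
            // Extend (or start) the session and re-arm the gap timer.
            val times = batch.map(_.eventTime.getTime).toSeq
            val old = state.getOption.getOrElse(SessionState(times.min, times.max))
            val updated = SessionState(math.min(old.start, times.min), math.max(old.end, times.max))
            state.update(updated)
            state.setTimeoutDuration(gapMs)
            Iterator(SessionUpdate(userId, updated.end - updated.start, expired = false))
          }
      }

    sessions.writeStream.outputMode("update").format("console").start().awaitTermination()
  }
}
{code}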

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>







[jira] [Commented] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql

2018-09-27 Thread Karthik Manamcheri (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631193#comment-16631193
 ] 

Karthik Manamcheri commented on SPARK-25561:


The root cause was from SPARK-17992. Ping [~michael], what are your thoughts on 
this?

> HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
> --
>
> Key: SPARK-25561
> URL: https://issues.apache.org/jira/browse/SPARK-25561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Karthik Manamcheri
>Priority: Major
>
> In HiveShim.scala, the current behavior is that if 
> hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
> call to succeed, and if it fails, we throw a RuntimeException.
> However, this might not always be the right assumption. Hive's direct SQL 
> functionality is best-effort, meaning it will fall back to ORM if direct SQL 
> fails. Spark should handle the exception correctly when Hive falls back to ORM. 






[jira] [Created] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql

2018-09-27 Thread Karthik Manamcheri (JIRA)
Karthik Manamcheri created SPARK-25561:
--

 Summary: HiveClient.getPartitionsByFilter throws an exception if 
Hive retries directSql
 Key: SPARK-25561
 URL: https://issues.apache.org/jira/browse/SPARK-25561
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Karthik Manamcheri


In HiveShim.scala, the current behavior is that if 
hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
call to succeed, and if it fails, we throw a RuntimeException.

However, this might not always be the right assumption. Hive's direct SQL 
functionality is best-effort, meaning it will fall back to ORM if direct SQL 
fails. Spark should handle the exception correctly when Hive falls back to ORM. 
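
A minimal sketch of the kind of handling being asked for; the helper and its 
fallback policy are assumptions for illustration, not the actual HiveShim code.

{code:scala}
// Hypothetical sketch, not the actual HiveShim code: treat a getPartitionsByFilter
// failure as recoverable and fall back to fetching all partitions, instead of
// assuming success just because hive.metastore.try.direct.sql is enabled.
def getPartitionsByFilterWithFallback[T](
    callByFilter: => Seq[T])(
    callAllPartitions: => Seq[T]): Seq[T] = {
  try {
    callByFilter
  } catch {
    case e: Exception =>
      // Hive's direct SQL path is best-effort; it may itself have fallen back to ORM
      // and failed there, so degrade gracefully and filter on the Spark side instead.
      callAllPartitions
  }
}
{code}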






[jira] [Comment Edited] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631186#comment-16631186
 ] 

Hyukjin Kwon edited comment on SPARK-18112 at 9/27/18 11:29 PM:


To avoid the Hive 1.2.1 fork in Spark itself altogether, please provide some input at 
https://issues.apache.org/jira/browse/SPARK-20202 on upgrading it to Hive 2.3.2. 
It is currently somewhat blocked, and I need some input there.


was (Author: hyukjin.kwon):
To avoid the Hive 1.2.1 fork in Spark itself altogether, please provide some input at 
https://issues.apache.org/jira/browse/SPARK-20202 on upgrading it to Hive 3.0.0. 
It is currently somewhat blocked, and I need some input there.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and 
> Hive 2.1.0 have also been out for a long time since then, but Spark still only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631186#comment-16631186
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

To avoid the Hive 1.2.1 fork in Spark itself altogether, please provide some input at 
https://issues.apache.org/jira/browse/SPARK-20202 on upgrading it to Hive 3.0.0. 
It is currently somewhat blocked, and I need some input there.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and 
> Hive 2.1.0 have also been out for a long time since then, but Spark still only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631184#comment-16631184
 ] 

Apache Spark commented on SPARK-25559:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22574

> Just remove the unsupported predicates in Parquet
> -
>
> Key: SPARK-25559
> URL: https://issues.apache.org/jira/browse/SPARK-25559
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Currently, in *ParquetFilters*, if one of the child predicates is not 
> supported by Parquet, the entire predicate will be thrown away. In fact, if 
> the unsupported predicate is in the top-level *And* condition, or in a child 
> reached before hitting a *Not* or *Or* condition, it's safe to just remove the 
> unsupported one and report it as an unhandled filter.
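
To illustrate the idea in the description above, here is a rough sketch using 
Spark's public org.apache.spark.sql.sources filter classes; the pruning helpers 
themselves are hypothetical, not the actual ParquetFilters code.

{code:scala}
import org.apache.spark.sql.sources.{And, EqualTo, Filter, IsNotNull}

// Hypothetical helpers: split a top-level And into its conjuncts, push down only the
// ones Parquet can handle, and report the rest as unhandled instead of dropping everything.
def splitConjuncts(filter: Filter): Seq[Filter] = filter match {
  case And(left, right) => splitConjuncts(left) ++ splitConjuncts(right)
  case other            => Seq(other)
}

def prune(filter: Filter, isConvertible: Filter => Boolean): (Seq[Filter], Seq[Filter]) =
  splitConjuncts(filter).partition(isConvertible)

// Example: suppose EqualTo is convertible but IsNotNull on a nested field is not.
val isConvertible: Filter => Boolean = {
  case _: EqualTo => true
  case _          => false
}
val (pushed, unhandled) = prune(And(EqualTo("a", 1), IsNotNull("nested.field")), isConvertible)
{code}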






[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631183#comment-16631183
 ] 

Apache Spark commented on SPARK-25559:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22574

> Just remove the unsupported predicates in Parquet
> -
>
> Key: SPARK-25559
> URL: https://issues.apache.org/jira/browse/SPARK-25559
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Currently, in *ParquetFilters*, if one of the child predicates is not 
> supported by Parquet, the entire predicate will be thrown away. In fact, if 
> the unsupported predicate is in the top-level *And* condition, or in a child 
> reached before hitting a *Not* or *Or* condition, it's safe to just remove the 
> unsupported one and report it as an unhandled filter.






[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet

2018-09-27 Thread DB Tsai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631182#comment-16631182
 ] 

DB Tsai commented on SPARK-25559:
-

https://github.com/apache/spark/pull/22574

> Just remove the unsupported predicates in Parquet
> -
>
> Key: SPARK-25559
> URL: https://issues.apache.org/jira/browse/SPARK-25559
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Currently, in *ParquetFilters*, if one of the child predicates is not 
> supported by Parquet, the entire predicate will be thrown away. In fact, if 
> the unsupported predicate is in the top-level *And* condition, or in a child 
> reached before hitting a *Not* or *Or* condition, it's safe to just remove the 
> unsupported one and report it as an unhandled filter.






[jira] [Issue Comment Deleted] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-25556:

Comment: was deleted

(was: User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22574)

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.






[jira] [Issue Comment Deleted] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-25556:

Comment: was deleted

(was: User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22574)

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.






[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631159#comment-16631159
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

{quote}
 If I use that 1.2.1 fork I was getting some query errors due to me using bloom 
filters on multiple columns of the table
{quote}

Can you file another JIRA then? Some tests were added for that (SPARK-25427). 
If that's not respected, it sounds like another problem.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and 
> Hive 2.1.0 have also been out for a long time since then, but Spark still only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)






[jira] [Commented] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1663#comment-1663
 ] 

Apache Spark commented on SPARK-25556:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22574

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.






[jira] [Assigned] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25556:


Assignee: Apache Spark  (was: DB Tsai)

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.






[jira] [Commented] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631108#comment-16631108
 ] 

Apache Spark commented on SPARK-25556:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22574

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.






[jira] [Assigned] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25556:


Assignee: DB Tsai  (was: Apache Spark)

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.






[jira] [Created] (SPARK-25560) Allow Function Injection in SparkSessionExtensions

2018-09-27 Thread Russell Spitzer (JIRA)
Russell Spitzer created SPARK-25560:
---

 Summary: Allow Function Injection in SparkSessionExtensions
 Key: SPARK-25560
 URL: https://issues.apache.org/jira/browse/SPARK-25560
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 2.4.0
Reporter: Russell Spitzer


Currently there is no way to add a set of external functions to all sessions 
made by users. We could add a small extension to SparkSessionExtensions which 
would allow this to be done. 






[jira] [Updated] (SPARK-25559) Just remove the unsupported predicates in Parquet

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-25559:

Description: Currently, in *ParquetFilters*, if one of the children 
predicates is not supported by Parquet, the entire predicates will be thrown 
away. In fact, if the unsupported predicate is in the top level *And* condition 
or in the child before hitting *Not* or *Or* condition, it's safe to just 
remove the unsupported one as unhandled filters.  (was: Currently, in 
*ParquetFilters*, if one of the children predicate is not supported by Parquet, 
the entire predicates will be thrown away. In fact, if the unsupported 
predicate is in the top level *And* condition or in the child before hitting 
*Not* or *Or* condition, it's safe to just remove the unsupported one as 
unhandled filters.)
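To make the reasoning concrete, here is a small sketch (not the actual ParquetFilters code) written against the public org.apache.spark.sql.sources filter API; the set of "supported" leaf predicates is invented for illustration:

{code:scala}
import org.apache.spark.sql.sources._

// Pretend only these leaf shapes can be pushed to Parquet.
def supported(f: Filter): Boolean = f match {
  case _: EqualTo | _: GreaterThan | _: LessThan => true
  case _ => false
}

// Returns the part of the predicate that is safe to push down. Dropping a child is
// only allowed directly under And: And(a, unsupported) implies a, and Spark still
// re-evaluates the full predicate after the scan, so pushing a weaker filter stays
// correct. Under Not, a weakened child would make the pushed filter *stronger*
// (Not flips the implication), so there it is all or nothing; Or is treated the
// same conservative way here, matching the description above.
def pushable(f: Filter): Option[Filter] = f match {
  case And(l, r) =>
    (pushable(l), pushable(r)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (Some(a), None)    => Some(a)
      case (None, Some(b))    => Some(b)
      case _                  => None
    }
  case Or(l, r) if pushable(l).contains(l) && pushable(r).contains(r) => Some(f)
  case Not(c)   if pushable(c).contains(c)                            => Some(f)
  case _: Or | _: Not => None
  case leaf => if (supported(leaf)) Some(leaf) else None
}

pushable(And(GreaterThan("a", 1), StringContains("b", "x")))      // Some(GreaterThan(a,1))
pushable(Not(And(GreaterThan("a", 1), StringContains("b", "x")))) // None
{code}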

> Just remove the unsupported predicates in Parquet
> -
>
> Key: SPARK-25559
> URL: https://issues.apache.org/jira/browse/SPARK-25559
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Currently, in *ParquetFilters*, if one of the children predicates is not 
> supported by Parquet, the entire predicates will be thrown away. In fact, if 
> the unsupported predicate is in the top level *And* condition or in the child 
> before hitting *Not* or *Or* condition, it's safe to just remove the 
> unsupported one as unhandled filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-09-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630947#comment-16630947
 ] 

Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 8:12 PM:
--

Nice work. Since I can't comment on the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs for the single-machine DF support. It supports numpy arrays as 
Java objects, though of course there are downsides.

Are there any tasks defined for this?


was (Author: skonto):
Nice work. Since I can't comment on the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs for the single-machine DF support.

Are there any tasks defined for this?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-09-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630947#comment-16630947
 ] 

Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 7:37 PM:
--

Nice work. Since I can't comment on the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs for the single-machine DF support.

Are there any tasks defined for this?


was (Author: skonto):
Nice work. Since I can't comment on the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs.

Are there any tasks defined for this?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-09-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630947#comment-16630947
 ] 

Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 7:36 PM:
--

Nice work. Since I can't comment on the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs.

Are there any tasks defined for this?


was (Author: skonto):
Nice work. Since I can't comment in the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs.

Are there any tasks defined for this?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-09-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630947#comment-16630947
 ] 

Stavros Kontopoulos edited comment on SPARK-24579 at 9/27/18 7:35 PM:
--

Nice work. Since I can't comment in the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs.

Are there any tasks defined for this?


was (Author: skonto):
Nice work. Since I can't comment in the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-09-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630947#comment-16630947
 ] 

Stavros Kontopoulos commented on SPARK-24579:
-

Nice work. Since I can't comment in the design doc: have you checked 
[https://github.com/ninia/jep/wiki/How-Jep-Works]? It could be useful for the 
Scala/Java APIs.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes

2018-09-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630912#comment-16630912
 ] 

Stavros Kontopoulos edited comment on SPARK-24724 at 9/27/18 7:06 PM:
--

AFAIK, in standalone mode passwordless (ssh) communication is needed between 
workers, or in general between the master and the workers.

Right now the design proposal uses netty RPC for task coordination for the 
barriers, via the default RpcEnv (the env.rpcEnv.setupEndpoint call).

For that part I don't see any issue, but when MPI tasks are run they may need 
extra network configuration, I suppose (like ports?).
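For reference, a minimal sketch of the user-facing side of that coordination, assuming the BarrierTaskContext API proposed for 2.4; launching the MPI job itself (and whatever extra ports it needs) is out of scope here:

{code:scala}
import org.apache.spark.BarrierTaskContext

val rdd = sc.parallelize(1 to 100, numSlices = 4)

val hostLists = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // Addresses of all tasks in the barrier stage, e.g. to build an MPI host list.
  val peers = ctx.getTaskInfos().map(_.address)
  ctx.barrier() // global sync point, coordinated over the driver-side RpcEnv
  Iterator.single(peers.mkString(","))
}.collect()
{code}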

 


was (Author: skonto):
AFAIK, in standalone mode passwordless communication is needed between workers, 
or in general between the master and the workers.

Right now the design proposal uses netty RPC for task coordination for the 
barriers, via the default RpcEnv (the env.rpcEnv.setupEndpoint call).

For that part I don't see any issue, but when MPI tasks are run they may need 
extra network configuration?

 

> Discuss necessary info and access in barrier mode + Kubernetes
> --
>
> Key: SPARK-24724
> URL: https://issues.apache.org/jira/browse/SPARK-24724
> Project: Spark
>  Issue Type: Story
>  Components: Kubernetes, ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Yinan Li
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Kubernetes. There were 
> some past and on-going attempts from the Kubernetes community. So we should 
> find someone with good knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes

2018-09-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630912#comment-16630912
 ] 

Stavros Kontopoulos commented on SPARK-24724:
-

AFAIK, in standalone mode passwordless communication is needed between workers, 
or in general between the master and the workers.

Right now the design proposal uses netty RPC for task coordination for the 
barriers, via the default RpcEnv (the env.rpcEnv.setupEndpoint call).

For that part I don't see any issue, but when MPI tasks are run they may need 
extra network configuration?

 

> Discuss necessary info and access in barrier mode + Kubernetes
> --
>
> Key: SPARK-24724
> URL: https://issues.apache.org/jira/browse/SPARK-24724
> Project: Spark
>  Issue Type: Story
>  Components: Kubernetes, ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Yinan Li
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Kubernetes. There were 
> some past and on-going attempts from the Kubernetes community. So we should 
> find someone with good knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25334) Default SessionCatalog should support UDFs

2018-09-27 Thread Swapnil Chougule (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630890#comment-16630890
 ] 

Swapnil Chougule commented on SPARK-25334:
--

+1

My work has also been impacted by the same issue.
SessionCatalog calls registerFunction to add a function to the function registry. 
However, makeFunctionExpression supports only UserDefinedAggregateFunction.

Can we have handlers for Scala UDFs, Java UDFs, and Java UDAFs as well in 
SessionCatalog?
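To make the limitation concrete, a sketch of the two paths against the default (non-Hive) catalog; the class names, table names and the exact error text are illustrative:

{code:scala}
// Works: com.example.MyAverage extends
// org.apache.spark.sql.expressions.UserDefinedAggregateFunction, the only type
// makeFunctionExpression currently knows how to turn into an expression.
spark.sql("CREATE FUNCTION my_average AS 'com.example.MyAverage'")
spark.sql("SELECT my_average(amount) FROM sales").show()

// Fails at resolution time with an AnalysisException along the lines of
// "no handler for UDAF 'com.example.MyLower'", because the class is a plain
// (non-aggregate) UDF and makeFunctionExpression has no handler for it.
spark.sql("CREATE FUNCTION my_lower AS 'com.example.MyLower'")
spark.sql("SELECT my_lower(name) FROM customers").show()

// Current workaround: session-scoped registration, which is neither persistent
// nor shared across sessions.
spark.udf.register("my_lower", (s: String) => s.toLowerCase)
{code}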

> Default SessionCatalog should support UDFs
> --
>
> Key: SPARK-25334
> URL: https://issues.apache.org/jira/browse/SPARK-25334
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Tomasz Gawęda
>Priority: Major
>
> SessionCatalog calls registerFunction to add a function to function registry. 
> However, makeFunctionExpression supports only UserDefinedAggregateFunction.
> We should make makeFunctionExpression support UserDefinedFunction, as it's 
> one of the function types.
> Currently we can use persistent functions only with the Hive metastore, but 
> the "create function" command also works with the default SessionCatalog. This 
> sometimes causes user confusion, as in 
> https://stackoverflow.com/questions/52164488/spark-hive-udf-no-handler-for-udaf-analysis-exception/52170519



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields

2018-09-27 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630879#comment-16630879
 ] 

Dongjoon Hyun commented on SPARK-25557:
---

Got it, [~dbtsai] .

> ORC predicate pushdown for nested fields
> 
>
> Key: SPARK-25557
> URL: https://issues.apache.org/jira/browse/SPARK-25557
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630871#comment-16630871
 ] 

Apache Spark commented on SPARK-25558:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22573

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630870#comment-16630870
 ] 

Apache Spark commented on SPARK-25558:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22573

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25558:


Assignee: DB Tsai  (was: Apache Spark)

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25558:


Assignee: Apache Spark  (was: DB Tsai)

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-25558:

Summary: Pushdown predicates for nested fields in DataSource Strategy   
(was: Creating predicates for nested fields in DataSource Strategy )

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25559) Just remove the unsupported predicates in Parquet

2018-09-27 Thread DB Tsai (JIRA)
DB Tsai created SPARK-25559:
---

 Summary: Just remove the unsupported predicates in Parquet
 Key: SPARK-25559
 URL: https://issues.apache.org/jira/browse/SPARK-25559
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: DB Tsai
Assignee: DB Tsai


Currently, in *ParquetFilters*, if one of the children predicate is not 
supported by Parquet, the entire predicates will be thrown away. In fact, if 
the unsupported predicate is in the top level *And* condition or in the child 
before hitting *Not* or *Or* condition, it's safe to just remove the 
unsupported one as unhandled filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25558) Creating predicates for nested fields in DataSource Strategy

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-25558:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-25556

> Creating predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25558) Creating predicates for nested fields in DataSource Strategy

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-25558:
---

Assignee: DB Tsai

> Creating predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25558) Creating predicates for nested fields in DataSource Strategy

2018-09-27 Thread DB Tsai (JIRA)
DB Tsai created SPARK-25558:
---

 Summary: Creating predicates for nested fields in DataSource 
Strategy 
 Key: SPARK-25558
 URL: https://issues.apache.org/jira/browse/SPARK-25558
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: DB Tsai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25557) ORC predicate pushdown for nested fields

2018-09-27 Thread DB Tsai (JIRA)
DB Tsai created SPARK-25557:
---

 Summary: ORC predicate pushdown for nested fields
 Key: SPARK-25557
 URL: https://issues.apache.org/jira/browse/SPARK-25557
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: DB Tsai
Assignee: Dongjoon Hyun






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-25556:
---

Assignee: DB Tsai

> Predicate Pushdown for Nested fields
> 
>
> Key: SPARK-25556
> URL: https://issues.apache.org/jira/browse/SPARK-25556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA to support predicate pushdown for nested fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet predicate pushdown for nested fields

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-17636:

Summary: Parquet predicate pushdown for nested fields  (was: Parquet filter 
push down doesn't handle struct fields)

> Parquet predicate pushdown for nested fields
> 
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Minor
> Fix For: 2.5.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25556) Predicate Pushdown for Nested fields

2018-09-27 Thread DB Tsai (JIRA)
DB Tsai created SPARK-25556:
---

 Summary: Predicate Pushdown for Nested fields
 Key: SPARK-25556
 URL: https://issues.apache.org/jira/browse/SPARK-25556
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: DB Tsai


This is an umbrella JIRA to support predicate pushdown for nested fields.
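For illustration, the kind of query this umbrella targets (the schema and path below are made up): a filter on a field inside a struct column, which currently shows up in the physical plan without a corresponding PushedFilters entry:

{code:scala}
import spark.implicits._

case class Address(city: String, zip: String)
case class Person(name: String, address: Address)

Seq(Person("alice", Address("sf", "94105")), Person("bob", Address("nyc", "10001")))
  .toDF()
  .write.mode("overwrite").parquet("/tmp/people_nested")

val q = spark.read.parquet("/tmp/people_nested")
  .filter($"address.city" === "sf")

// With nested-field pushdown, PushedFilters would contain something like
// EqualTo(address.city, sf); today the filter is only applied after the scan.
q.explain()
{code}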



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-09-27 Thread Christopher Said (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630830#comment-16630830
 ] 

Christopher Said commented on SPARK-25461:
--

For what it's worth, converting all floats to False goes against my 
expectations. 

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
> return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
> .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2018-09-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-17636:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-25556

> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Minor
> Fix For: 2.5.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25521) Job id showing null when insert into command Job is finished.

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630827#comment-16630827
 ] 

Apache Spark commented on SPARK-25521:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/22572

> Job id showing null when insert into command Job is finished.
> -
>
> Key: SPARK-25521
> URL: https://issues.apache.org/jira/browse/SPARK-25521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2018-09-25-12-01-31-871.png
>
>
> scala> spark.sql("create table x1(name string,age int) stored as parquet")
>  scala> spark.sql("insert into x1 select 'a',29")
>  check logs
>  2018-08-19 12:45:36 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 
> (TID 0) in 874 ms on localhost (executor
>  driver) (1/1)
>  2018-08-19 12:45:36 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose 
> tasks have all completed, from pool
>  2018-08-19 12:45:36 INFO DAGScheduler:54 - ResultStage 0 (sql at 
> :24) finished in 1.131 s
>  2018-08-19 12:45:36 INFO DAGScheduler:54 - Job 0 finished: sql at 
> :24, took 1.233329 s
>  2018-08-19 12:45:36 INFO FileFormatWriter:54 - Job 
> {color:#d04437}null{color} committed.
>  2018-08-19 12:45:36 INFO FileFormatWriter:54 - Finished processing stats for 
> job null.
>  res4: org.apache.spark.sql.DataFrame = []
>  
>   !image-2018-09-25-12-01-31-871.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25521) Job id showing null when insert into command Job is finished.

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25521:


Assignee: Apache Spark

> Job id showing null when insert into command Job is finished.
> -
>
> Key: SPARK-25521
> URL: https://issues.apache.org/jira/browse/SPARK-25521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Babulal
>Assignee: Apache Spark
>Priority: Minor
> Attachments: image-2018-09-25-12-01-31-871.png
>
>
> scala> spark.sql("create table x1(name string,age int) stored as parquet")
>  scala> spark.sql("insert into x1 select 'a',29")
>  check logs
>  2018-08-19 12:45:36 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 
> (TID 0) in 874 ms on localhost (executor
>  driver) (1/1)
>  2018-08-19 12:45:36 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose 
> tasks have all completed, from pool
>  2018-08-19 12:45:36 INFO DAGScheduler:54 - ResultStage 0 (sql at 
> :24) finished in 1.131 s
>  2018-08-19 12:45:36 INFO DAGScheduler:54 - Job 0 finished: sql at 
> :24, took 1.233329 s
>  2018-08-19 12:45:36 INFO FileFormatWriter:54 - Job 
> {color:#d04437}null{color} committed.
>  2018-08-19 12:45:36 INFO FileFormatWriter:54 - Finished processing stats for 
> job null.
>  res4: org.apache.spark.sql.DataFrame = []
>  
>   !image-2018-09-25-12-01-31-871.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25521) Job id showing null when insert into command Job is finished.

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25521:


Assignee: (was: Apache Spark)

> Job id showing null when insert into command Job is finished.
> -
>
> Key: SPARK-25521
> URL: https://issues.apache.org/jira/browse/SPARK-25521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2018-09-25-12-01-31-871.png
>
>
> scala> spark.sql("create table x1(name string,age int) stored as parquet")
>  scala> spark.sql("insert into x1 select 'a',29")
>  check logs
>  2018-08-19 12:45:36 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 
> (TID 0) in 874 ms on localhost (executor
>  driver) (1/1)
>  2018-08-19 12:45:36 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose 
> tasks have all completed, from pool
>  2018-08-19 12:45:36 INFO DAGScheduler:54 - ResultStage 0 (sql at 
> :24) finished in 1.131 s
>  2018-08-19 12:45:36 INFO DAGScheduler:54 - Job 0 finished: sql at 
> :24, took 1.233329 s
>  2018-08-19 12:45:36 INFO FileFormatWriter:54 - Job 
> {color:#d04437}null{color} committed.
>  2018-08-19 12:45:36 INFO FileFormatWriter:54 - Finished processing stats for 
> job null.
>  res4: org.apache.spark.sql.DataFrame = []
>  
>   !image-2018-09-25-12-01-31-871.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25533) Inconsistent message for Completed Jobs in the JobUI, when there are failed jobs, compared to spark2.2

2018-09-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25533.

   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.3

> Inconsistent message for Completed Jobs in the  JobUI, when there are failed 
> jobs, compared to spark2.2
> ---
>
> Key: SPARK-25533
> URL: https://issues.apache.org/jira/browse/SPARK-25533
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
> Attachments: Screenshot from 2018-09-26 00-42-00.png, Screenshot from 
> 2018-09-26 00-46-35.png
>
>
> Test steps:
>  1) bin/spark-shell
> {code:java}
> sc.parallelize(1 to 5, 5).collect()
> sc.parallelize(1 to 5, 2).map{ x => throw new RuntimeException("Fail 
> Job")}.collect()
> {code}
> *Output in spark - 2.3.1:*
> !Screenshot from 2018-09-26 00-42-00.png!
> *Output in spark - 2.2.1:*
> !Screenshot from 2018-09-26 00-46-35.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630781#comment-16630781
 ] 

Apache Spark commented on SPARK-25392:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/22571

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on JDBC Incomplete Application ID in Job History Page
> 4. Go to Job Tab in staged Web UI page
> 5. Click on run at AccessController.java:0 under Description column
> 6 . Click default under Pool Name column of Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws below error
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But under 
> Yarn resource page it displays the summary under Fair Scheduler Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25392:


Assignee: Apache Spark

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Minor
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on JDBC Incomplete Application ID in Job History Page
> 4. Go to Job Tab in staged Web UI page
> 5. Click on run at AccessController.java:0 under Description column
> 6 . Click default under Pool Name column of Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws below error
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But under 
> Yarn resource page it displays the summary under Fair Scheduler Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630779#comment-16630779
 ] 

Apache Spark commented on SPARK-25392:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/22571

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on JDBC Incomplete Application ID in Job History Page
> 4. Go to Job Tab in staged Web UI page
> 5. Click on run at AccessController.java:0 under Description column
> 6 . Click default under Pool Name column of Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws below error
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But under 
> Yarn resource page it displays the summary under Fair Scheduler Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25392:


Assignee: (was: Apache Spark)

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on JDBC Incomplete Application ID in Job History Page
> 4. Go to Job Tab in staged Web UI page
> 5. Click on run at AccessController.java:0 under Description column
> 6 . Click default under Pool Name column of Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws below error
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But under 
> Yarn resource page it displays the summary under Fair Scheduler Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25454) Division between operands with negative scale can cause precision loss

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25454:


Assignee: Apache Spark

> Division between operands with negative scale can cause precision loss
> --
>
> Key: SPARK-25454
> URL: https://issues.apache.org/jira/browse/SPARK-25454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> The issue was originally reported by [~bersprockets] here: 
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104.
> The problem is a precision loss when the second operand of the division is a 
> decimal with a negative scale. It was also present before 2.3, but it was 
> harder to reproduce: you had to do something like {{lit(BigDecimal(100e6))}}, 
> while now this can happen more frequently with SQL constants.
> The problem is that our logic is taken from Hive and SQL Server, where decimals 
> with negative scales are not allowed. We might also consider enforcing this in 
> 3.0 eventually. Meanwhile we can fix the logic for computing the result type of 
> a division.
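A minimal sketch of the pre-2.3 style reproduction mentioned above (assuming a running spark-shell session; the exact values are illustrative only, not taken from the original report):

{code:scala}
import org.apache.spark.sql.functions.lit

// BigDecimal(100e6) is parsed from the double's text form ("1.0E8"), so the literal
// typically carries a negative scale; dividing by it exercises the result-type logic
// described above.
val divisor = lit(BigDecimal(100e6))
spark.range(1).select(lit(BigDecimal("12345.678")) / divisor).printSchema()
{code}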



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25454) Division between operands with negative scale can cause precision loss

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25454:


Assignee: (was: Apache Spark)

> Division between operands with negative scale can cause precision loss
> --
>
> Key: SPARK-25454
> URL: https://issues.apache.org/jira/browse/SPARK-25454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> The issue was originally reported by [~bersprockets] here: 
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104.
> The problem is a precision loss when the second operand of the division is a 
> decimal with a negative scale. It was also present before 2.3, but it was 
> harder to reproduce: you had to do something like {{lit(BigDecimal(100e6))}}, 
> while now this can happen more frequently with SQL constants.
> The problem is that our logic is taken from Hive and SQL Server, where decimals 
> with negative scales are not allowed. We might also consider enforcing this in 
> 3.0 eventually. Meanwhile we can fix the logic for computing the result type of 
> a division.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10816) EventTime based sessionization

2018-09-27 Thread Arun Mahadevan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630697#comment-16630697
 ] 

Arun Mahadevan commented on SPARK-10816:


Reiterating an earlier point: "map/flatMapGroupsWithState" is GBK plus mapping the 
grouped values to state on the reduce side.

The proposed native session window implements sessions as a GBK operation 
(similar to tumbling and sliding windows) and makes it easy for users to 
express gap-based sessions. IMO, we could consider refactoring the windowing 
operations into generic "assignWindows" and "mergeWindows" constructs. 
(https://issues.apache.org/jira/browse/SPARK-2)

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25546:
-

Assignee: Marcelo Vanzin

> RDDInfo uses SparkEnv before it may have been initialized
> -
>
> Key: SPARK-25546
> URL: https://issues.apache.org/jira/browse/SPARK-25546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> This code:
> {code}
> private[spark] object RDDInfo {
>   private val callsiteLongForm = 
> SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
> {code}
> has two problems:
> - it keeps that value across different SparkEnv instances, so if, e.g., you 
> have two tests that rely on different values for that config, one of them 
> will break.
> - it assumes tests always initialize a SparkEnv; e.g., if you run 
> "core/testOnly *.AppStatusListenerSuite", it will fail because 
> {{SparkEnv.get}} returns null.
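One possible shape of a fix, sketched only for illustration (an assumption, not necessarily what the linked change does): read the config lazily per call instead of caching it in the object, and fall back to a default when no SparkEnv is running. This is a fragment of Spark's own RDDInfo object, so the referenced names resolve in that file:

{code:scala}
// Hypothetical sketch; EVENT_LOG_CALLSITE_LONG_FORM is the config key quoted above.
private[spark] object RDDInfo {
  private def callsiteLongForm: Boolean =
    Option(SparkEnv.get)                              // SparkEnv may be absent in some test suites
      .map(_.conf.get(EVENT_LOG_CALLSITE_LONG_FORM))  // re-read per call, not cached across envs
      .getOrElse(false)                               // assumed default when no SparkEnv exists
}
{code}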



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25546) RDDInfo uses SparkEnv before it may have been initialized

2018-09-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25546.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22558
[https://github.com/apache/spark/pull/22558]

> RDDInfo uses SparkEnv before it may have been initialized
> -
>
> Key: SPARK-25546
> URL: https://issues.apache.org/jira/browse/SPARK-25546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> This code:
> {code}
> private[spark] object RDDInfo {
>   private val callsiteLongForm = 
> SparkEnv.get.conf.get(EVENT_LOG_CALLSITE_LONG_FORM)
> {code}
> has two problems:
> - it keeps that value across different SparkEnv instances, so if, e.g., you 
> have two tests that rely on different values for that config, one of them 
> will break.
> - it assumes tests always initialize a SparkEnv; e.g., if you run 
> "core/testOnly *.AppStatusListenerSuite", it will fail because 
> {{SparkEnv.get}} returns null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Eugeniu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630689#comment-16630689
 ] 

Eugeniu commented on SPARK-18112:
-

_"The problem here looks, you guys completely replaced the jars into higher 
Hive jars. Therefore, it throws {{NoSuchFieldError}}_" - yes you are right. 
That was my intent. I wanted to be able to connect to metastore database 
created by a Hive client 2.x. If I use that 1.2.1 fork I was getting some query 
errors due to me using bloom filters on multiple columns of the table. My 
understanding is that Hive client 1.2.1 is not seeing that information that is 
why I was trying to replace the jars for a higher version.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25555) Generic constructs for windowing and support for custom windows

2018-09-27 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-2:
--

 Summary: Generic constructs for windowing and support for custom 
windows
 Key: SPARK-2
 URL: https://issues.apache.org/jira/browse/SPARK-2
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Arun Mahadevan


Refactor the windowing logic into generic "assignWindows" and "mergeWindows" 
constructs. The existing windows (tumbling and sliding) and generic session 
windows can be built on top of this, and it could be extended to support 
different types of custom windowing.

K,Values -> AssignWindows (produces [k, v, timestamp, window]) -> GroupByKey 
(shuffle) -> MergeWindows (optional step) -> GroupWindows -> aggregate values.
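A rough sketch of what such generic constructs might look like (the trait and method names below are illustrative assumptions, not an existing Spark API):

{code:scala}
case class Window(start: Long, end: Long)

// Assigns each element to one or more windows based on its event timestamp
// (tumbling assigns exactly one, sliding assigns several).
trait WindowAssigner[T] {
  def assignWindows(element: T, timestamp: Long): Seq[Window]
}

// Merges overlapping windows after the shuffle; a no-op for tumbling/sliding
// windows, but required for gap-based session windows.
trait WindowMerger {
  def mergeWindows(windows: Seq[Window]): Seq[Window]
}
{code}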



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-09-27 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630672#comment-16630672
 ] 

Mingjie Tang commented on SPARK-25501:
--

[~gsomogyi] Thanks so much for your hard work. Let me look at your SPIP and 
give some comments.

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Delegation token support was released in Kafka 1.1. Since Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25415:
-
Fix Version/s: (was: 3.0.0)
   2.5.0

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 2.5.0
>
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plans are logged at the "trace" level. At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering the plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.
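A hedged usage sketch of the two proposed confs (assuming a spark-shell session on a build that includes this change; the rule name is just an illustrative optimizer rule):

{code:scala}
spark.conf.set("spark.sql.optimizer.planChangeLog.level", "WARN")
spark.conf.set("spark.sql.optimizer.planChangeLog.rules",
  "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")

// Any query whose optimization triggers the named rule would then log the
// before/after plan at WARN instead of TRACE.
spark.range(10).selectExpr("id").filter("id > 5").collect()
{code}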



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25554) Avro logical types get ignored in SchemaConverters.toSqlType

2018-09-27 Thread Yanan Li (JIRA)
Yanan Li created SPARK-25554:


 Summary: Avro logical types get ignored in 
SchemaConverters.toSqlType
 Key: SPARK-25554
 URL: https://issues.apache.org/jira/browse/SPARK-25554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: Below are the Maven dependencies:
{code:xml}
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.8.2</version>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.11</artifactId>
  <version>4.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.0</version>
</dependency>
{code}
Reporter: Yanan Li


With the following Avro schema defined:
{code:java}
{
   "namespace": "com.xxx.avro",
   "name": "Book",
   "type": "record",
   "fields": [{
 "name": "name",
 "type": ["null", "string"],
 "default": null
  }, {
 "name": "author",
 "type": ["null", "string"],
 "default": null
  }, {
 "name": "published_date",
 "type": ["null", {"type": "int", "logicalType": "date"}],
 "default": null
  }
   ]
}
{code}
In the Spark schema converted from the above Avro schema, the logical type "date" gets ignored:
{code:java}
StructType(StructField(name,StringType,true),StructField(author,StringType,true),StructField(published_date,IntegerType,true))
{code}
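A minimal reproduction sketch (assuming the com.databricks spark-avro 4.0.0 artifact listed above; the trimmed schema string below is hypothetical and keeps only the field that shows the problem):

{code:scala}
import org.apache.avro.Schema
import com.databricks.spark.avro.SchemaConverters

val avroSchemaJson =
  """{"namespace":"com.xxx.avro","name":"Book","type":"record","fields":[
    |  {"name":"published_date","type":["null",{"type":"int","logicalType":"date"}],"default":null}
    |]}""".stripMargin

val sqlType = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchemaJson))
// The date logical type comes back as IntegerType instead of DateType.
println(sqlType.dataType)
{code}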



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630543#comment-16630543
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

Also, take a look at the code and the open JIRAs and PRs before complaining next 
time.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630542#comment-16630542
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

To be more specific, the code you pointed out is executed by Spark's Hive 
fork 1.2.1, which contains that configuration 
(https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1290).

That code is meant to be executed with Spark's Hive fork, so you should leave that 
jar as is. The higher Hive jars used to create the Hive client should instead be 
provided via {{spark.sql.hive.metastore.jars}}, and 
{{spark.sql.hive.metastore.version}} should be set accordingly.

The problem here looks like you completely replaced the jars with higher Hive 
jars. Therefore, it throws {{NoSuchFieldError}}.

I manually tested 1.2.1, 2.3.0 and 3.0.0 (against 
https://github.com/apache/spark/pull/21404) a few months ago against Apache 
Spark. I am pretty sure that it works for now.

If I am mistaken or have misunderstood something, please provide reproducible 
steps, or at least explain why it fails. Let me take a look.
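For reference, a hedged sketch of the configuration being described (the version number and jar path are placeholders; check the metastore versions supported by your Spark release):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  // Keep Spark's built-in Hive fork on the application classpath; these confs only
  // control the isolated client used to talk to the external metastore.
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-2.1.1/lib/*")
  .enableHiveSupport()
  .getOrCreate()
{code}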

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630485#comment-16630485
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

At the code you pointed out, Spark's Hive fork should be used. A different jar 
provided in {{spark.sql.hive.metastore.jars}} is used to create the 
corresponding client that accesses a different version of Hive via an isolated 
classloader.
The problem here is that you removed hive-1.2.1 from the jars and didn't provide 
newer Hive jars in {{spark.sql.hive.metastore.jars}} properly.


> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630487#comment-16630487
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

I asked those questions because I'm pretty sure it's a misconfiguration. That's why I 
am asking for reproducible steps.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml

2018-09-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25553:
--
Priority: Minor  (was: Major)

> Add EmptyInterpolatedStringChecker to scalastyle-config.xml
> ---
>
> Key: SPARK-25553
> URL: https://issues.apache.org/jira/browse/SPARK-25553
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Minor
>
> h4. Justification
> Empty interpolated strings are harder to read and not necessary.
>  
> More details:
> http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker
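A quick illustration of what this checker flags (a generic example, not taken from the Spark codebase):

{code:scala}
val path = "/tmp/data"
println(s"Loading data")             // flagged: the interpolator does nothing here
println("Loading data")              // preferred: plain string literal
println(s"Loading data from $path")  // fine: interpolation is actually used
{code}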



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Tavis Barr (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630472#comment-16630472
 ] 

Tavis Barr commented on SPARK-18112:


Why are you even asking these questions?  I have already pointed to the 
offending lines of code in 
/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala that are causing this 
error and explained why the error is happening (see my comments from April 
23rd).  All you have to do is remove those two parameters and push.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25523) Multi thread execute sparkSession.read().jdbc(url, table, properties) problem

2018-09-27 Thread huanghuai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630421#comment-16630421
 ] 

huanghuai commented on SPARK-25523:
---

Sorry, I know this issue is hard to understand, but my project really does run on 
Windows 10, and when an HTTP request comes in, it creates a SparkSession and 
runs a Spark application. I am at a loss as well.

I will try my best to solve this problem. If it is not solved in a few days, 
I will close this issue. I know this problem is hard for you to 
reproduce.

> Multi thread execute sparkSession.read().jdbc(url, table, properties) problem
> -
>
> Key: SPARK-25523
> URL: https://issues.apache.org/jira/browse/SPARK-25523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: h3. [IntelliJ 
> _IDEA_|http://www.baidu.com/link?url=7ZLtsOfyqR1YxLqcTU0Q-hqXWV_PsY6IzIzZoKhiXZZ4AcLrpQ4DoTG30yIN-Gs8]
>  
> local mode
>  
>Reporter: huanghuai
>Priority: Major
>
> {code:java}
> public static void test2() throws Exception {
>     String ckUrlPrefix = "jdbc:clickhouse://";
>     String quote = "`";
>     JdbcDialects.registerDialect(new JdbcDialect() {
>         @Override
>         public boolean canHandle(String url) { return url.startsWith(ckUrlPrefix); }
>
>         @Override
>         public String quoteIdentifier(String colName) { return quote + colName + quote; }
>     });
>
>     SparkSession spark = initSpark();
>     String ckUrl = "jdbc:clickhouse://192.168.2.148:8123/default";
>     Properties ckProp = new Properties();
>     ckProp.put("user", "default");
>     ckProp.put("password", "");
>
>     String prestoUrl = "jdbc:presto://192.168.2.148:9002/mysql-xxx/xxx";
>     Properties prestoUrlProp = new Properties();
>     prestoUrlProp.put("user", "root");
>     prestoUrlProp.put("password", "");
>
>     // new Thread(() -> {
>     //     spark.read().jdbc(ckUrl, "ontime", ckProp).show();
>     // }).start();
>     System.out.println("--");
>
>     new Thread(() -> {
>         spark.read().jdbc(prestoUrl, "tx_user", prestoUrlProp).show();
>     }).start();
>     System.out.println("--");
>
>     new Thread(() -> {
>         Dataset load = spark.read()
>                 .format("com.vertica.spark.datasource.DefaultSource")
>                 .option("host", "192.168.1.102")
>                 .option("port", 5433)
>                 .option("user", "dbadmin")
>                 .option("password", "manager")
>                 .option("db", "test")
>                 .option("dbschema", "public")
>                 .option("table", "customers")
>                 .load();
>         load.printSchema();
>         load.show();
>     }).start();
>     System.out.println("--");
> }
>
> public static SparkSession initSpark() throws Exception {
>     return SparkSession.builder()
>             .master("spark://dsjkfb1:7077") // spark://dsjkfb1:7077
>             .appName("Test")
>             .config("spark.executor.instances", 3)
>             .config("spark.executor.cores", 2)
>             .config("spark.cores.max", 6)
>             //.config("spark.default.parallelism", 1)
>             .config("spark.submit.deployMode", "client")
>             .config("spark.driver.memory", "2G")
>             .config("spark.executor.memory", "3G")
>             .config("spark.driver.maxResultSize", "2G")
>             .config("spark.local.dir", "d:\\tmp")
>             .config("spark.driver.host", "192.168.2.148")
>             .config("spark.scheduler.mode", "FAIR")
>             .config("spark.jars",
>                     "F:\\project\\xxx\\vertica-jdbc-7.0.1-0.jar," +
>                     "F:\\project\\xxx\\clickhouse-jdbc-0.1.40.jar," +
>                     "F:\\project\\xxx\\vertica-spark-connector-9.1-2.1.jar," +
>                     "F:\\project\\xxx\\presto-jdbc-0.189-mining.jar")
>             .getOrCreate();
> }
> {code}
>  
> The above is the code.
> Question: if I open the Vertica JDBC read as well, the thread stays pending 
> forever.
> And the driver log looks like this:
>  
> 2018-09-26 10:32:51 INFO SharedState:54 - Setting 
> hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir 
> ('file:/C:/Users/admin/Desktop/test-project/sparktest/spark-warehouse/').
>  2018-09-26 10:32:51 INFO SharedState:54 - Warehouse path is 
> 'file:/C:/Users/admin/Desktop/test-project/sparktest/spark-warehouse/'.
>  2018-09-26 10:32:51 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@2f70d6e2{/SQL,null,AVAILABLE,@Spark}
>  2018-09-26 10:32:51 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@1d66833d{/SQL/json,null,AVAILABLE,@Spark}
>  2018-09-26 10:32:51 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@65af6f3a{/SQL/execution,null,AVAILABLE,@Spark}
>  2018-09-26 10:32:51 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@55012968{/SQL/execution/json,null,AVAILABLE,@Spark}
>  2018-09-26 10:32:51 INFO ContextHandler:781 - Started 
> o.s.j.s.ServletContextHandler@59e3f5aa{/static/sql,null,AVAILABLE,@Spark}
>  2018-09-26 10:32:52 

[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2018-09-27 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630410#comment-16630410
 ] 

Felix Cheung commented on SPARK-21291:
--

Wait. I don’t think saveAsTable is the same thing?




> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.5.0
>
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25553:


Assignee: (was: Apache Spark)

> Add EmptyInterpolatedStringChecker to scalastyle-config.xml
> ---
>
> Key: SPARK-25553
> URL: https://issues.apache.org/jira/browse/SPARK-25553
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> h4. Justification
> Empty interpolated strings are harder to read and not necessary.
>  
> More details:
> http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630401#comment-16630401
 ] 

Apache Spark commented on SPARK-25553:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22570

> Add EmptyInterpolatedStringChecker to scalastyle-config.xml
> ---
>
> Key: SPARK-25553
> URL: https://issues.apache.org/jira/browse/SPARK-25553
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> h4. Justification
> Empty interpolated strings are harder to read and not necessary.
>  
> More details:
> http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml

2018-09-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25553:


Assignee: Apache Spark

> Add EmptyInterpolatedStringChecker to scalastyle-config.xml
> ---
>
> Key: SPARK-25553
> URL: https://issues.apache.org/jira/browse/SPARK-25553
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> h4. Justification
> Empty interpolated strings are harder to read and not necessary.
>  
> More details:
> http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml

2018-09-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630400#comment-16630400
 ] 

Apache Spark commented on SPARK-25553:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22570

> Add EmptyInterpolatedStringChecker to scalastyle-config.xml
> ---
>
> Key: SPARK-25553
> URL: https://issues.apache.org/jira/browse/SPARK-25553
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> h4. Justification
> Empty interpolated strings are harder to read and not necessary.
>  
> More details:
> http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21743) top-most limit should not cause memory leak

2018-09-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21743.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> top-most limit should not cause memory leak
> ---
>
> Key: SPARK-21743
> URL: https://issues.apache.org/jira/browse/SPARK-21743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21436) Take advantage of known partitioner for distinct on RDDs

2018-09-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21436.
-
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22010
[https://github.com/apache/spark/pull/22010]

> Take advantage of known partitioner for distinct on RDDs
> --
>
> Key: SPARK-21436
> URL: https://issues.apache.org/jira/browse/SPARK-21436
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.5.0
>
>
> If we have a known partitioner we should be able to avoid the shuffle.
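A rough sketch of the idea only (an assumption for illustration, not the change that was actually merged): if the RDD already has a partitioner, all copies of a given key live in the same partition, so duplicates can be dropped per partition without another shuffle.

{code:scala}
import org.apache.spark.rdd.RDD

def distinctIfPartitioned[K, V](rdd: RDD[(K, V)]): RDD[(K, V)] =
  rdd.partitioner match {
    // Partitioned by key: identical (key, value) pairs are co-located, so a
    // per-partition de-duplication is enough and no shuffle is triggered.
    case Some(_) => rdd.mapPartitions(_.toSet.iterator, preservesPartitioning = true)
    // No partitioner: fall back to the regular shuffle-based distinct.
    case None    => rdd.distinct()
  }
{code}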



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21436) Take advantage of known partitioner for distinct on RDDs

2018-09-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21436:
---

Assignee: holdenk

> Take advantage of known partitioner for distinct on RDDs
> --
>
> Key: SPARK-21436
> URL: https://issues.apache.org/jira/browse/SPARK-21436
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
>
> If we have a known partitioner we should be able to avoid the shuffle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml

2018-09-27 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25553:
---

 Summary: Add EmptyInterpolatedStringChecker to 
scalastyle-config.xml
 Key: SPARK-25553
 URL: https://issues.apache.org/jira/browse/SPARK-25553
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.5.0
Reporter: Yuming Wang


h4. Justification

Empty interpolated strings are harder to read and not necessary.

 

More details:

http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25522) Improve type promotion for input arguments of elementAt function

2018-09-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25522:

Fix Version/s: (was: 2.5.0)
   2.4.0

>  Improve type promotion for input arguments of elementAt function
> -
>
> Key: SPARK-25522
> URL: https://issues.apache.org/jira/browse/SPARK-25522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> In ElementAt, when the first argument is a MapType, we should coerce the key 
> type and the second argument based on findTightestCommonType. This is not 
> currently happening.
> Also, when the first argument is an ArrayType, the second argument should be an 
> integer type or a smaller integral type that can be safely cast to an 
> integer type. Currently we may do an unsafe cast.
> {code:java}
> spark-sql> select element_at(array(1,2), 1.24);
> 1{code}
> {code:java}
> spark-sql> select element_at(map(1,"one", 2, "two"), 2.2);
> two{code}
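A hedged illustration of the safer alternative, making the cast explicit instead of relying on the implicit (unsafe) one described above (values taken from the examples in the description, run from a spark-shell session):

{code:scala}
// Explicitly casting the index/key makes the truncation visible in the query text.
spark.sql("select element_at(array(1, 2), cast(1.24 as int))").show()
spark.sql("select element_at(map(1, 'one', 2, 'two'), cast(2.2 as int))").show()
{code}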



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25551) Remove unused InSubquery expression

2018-09-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25551.
-
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22556
[https://github.com/apache/spark/pull/22556]

> Remove unused InSubquery expression
> ---
>
> Key: SPARK-25551
> URL: https://issues.apache.org/jira/browse/SPARK-25551
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 2.5.0
>
>
> SPARK-16958 introduced an {{InSubquery}} expression. Its only usage was 
> removed in SPARK-18874. Hence it is no longer used and can be 
> removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25551) Remove unused InSubquery expression

2018-09-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25551:
---

Assignee: Marco Gaido

> Remove unused InSubquery expression
> ---
>
> Key: SPARK-25551
> URL: https://issues.apache.org/jira/browse/SPARK-25551
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 2.5.0
>
>
> SPARK-16958 introduced an {{InSubquery}} expression. Its only usage was 
> removed in SPARK-18874. Hence it is no longer used and can be 
> removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630236#comment-16630236
 ] 

Hyukjin Kwon commented on SPARK-18112:
--

Such things should be asked to mailing list really. It still sounds like 
misconfiguration.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


