[jira] [Resolved] (SPARK-28829) Document SET ROLE ADMIN in SQL Reference

2019-08-25 Thread ABHISHEK KUMAR GUPTA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA resolved SPARK-28829.
--
Resolution: Invalid

Not valid for the Spark documentation.

> Document SET ROLE ADMIN in SQL Reference
> 
>
> Key: SPARK-28829
> URL: https://issues.apache.org/jira/browse/SPARK-28829
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>







[jira] [Commented] (SPARK-28842) Cleanup the formatting/trailing spaces in resource-managers/kubernetes/integration-tests/README.md

2019-08-25 Thread Udbhav Agrawal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915508#comment-16915508
 ] 

Udbhav Agrawal commented on SPARK-28842:


[~holdenk] Can I work on this?

> Cleanup the formatting/trailing spaces in 
> resource-managers/kubernetes/integration-tests/README.md
> --
>
> Key: SPARK-28842
> URL: https://issues.apache.org/jira/browse/SPARK-28842
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> The K8s integration testing guide currently has a bunch of trailing spaces on 
> lines, which we could clean up.






[jira] [Updated] (SPARK-28000) Add comments.sql

2019-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28000:

Description: In this ticket, we plan to add the regression test cases of 
https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/comments.sql.
  (was: In this ticket, we plan to add the regression test cases of 
https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/comments.sql.)

> Add comments.sql
> 
>
> Key: SPARK-28000
> URL: https://issues.apache.org/jira/browse/SPARK-28000
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/comments.sql.






[jira] [Resolved] (SPARK-28852) Implement GetCatalogsOperation for Thrift Server

2019-08-25 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28852.
-
Fix Version/s: 3.0.0
 Assignee: Yuming Wang
   Resolution: Fixed

> Implement GetCatalogsOperation for Thrift Server
> 
>
> Key: SPARK-28852
> URL: https://issues.apache.org/jira/browse/SPARK-28852
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-28823) Document CREATE ROLE Statement

2019-08-25 Thread jobit mathew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jobit mathew resolved SPARK-28823.
--
Resolution: Invalid

CREATE ROLE is not supported in Spark SQL, so closing the JIRA.

> Document CREATE ROLE Statement 
> ---
>
> Key: SPARK-28823
> URL: https://issues.apache.org/jira/browse/SPARK-28823
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>







[jira] [Updated] (SPARK-28861) Jetty property handling: java.lang.NumberFormatException: For input string: "unknown".

2019-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28861:
-
Summary: Jetty property handling: java.lang.NumberFormatException: For 
input string: "unknown".  (was: Jetty property handling)

> Jetty property handling: java.lang.NumberFormatException: For input string: 
> "unknown".
> --
>
> Key: SPARK-28861
> URL: https://issues.apache.org/jira/browse/SPARK-28861
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 2.4.3
>Reporter: Ketan
>Priority: Minor
>
> While processing data from certain files, a {{NumberFormatException}} was seen 
> in the logs. The processing was fine, but the following stack trace was 
> observed:
> {code}
> {"time":"2019-08-16 
> 08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""}
> java.lang.NumberFormatException: For input string: "unknown".
> {code}
> On investigation, it was found that the class Jetty contains the following:
> {code}
> BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", 
> "unknown")); 
> {code}
> which indicates that the config should have the 'timestamp' property. If the 
> property is not present, the default value 'unknown' is used, and this value 
> causes the stack trace to show up in our application's logs. It has no 
> detrimental effect on the application as such, but could be addressed.
>  
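For a bit of context on the failure mode, here is a minimal sketch in Scala. It
assumes, per the DEBUG-level stack trace above, that Jetty's formatTimestamp
ultimately parses the property value numerically; the object name and flow are
illustrative only, not Jetty's actual code.

{code}
// Illustrative sketch only (not Jetty's actual code): the fallback value
// "unknown" cannot be parsed as a number, which produces the same
// NumberFormatException seen in the DEBUG logs.
import java.util.Properties

object JettyTimestampRepro {
  def main(args: Array[String]): Unit = {
    val buildProperties = new Properties() // no "timestamp" key present
    val raw = buildProperties.getProperty("timestamp", "unknown")
    try {
      val epoch = java.lang.Long.parseLong(raw) // throws for "unknown"
      println(new java.util.Date(epoch))
    } catch {
      case e: NumberFormatException =>
        println(s"caught: $e") // For input string: "unknown"
    }
  }
}
{code}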






[jira] [Commented] (SPARK-28861) Jetty property handling: java.lang.NumberFormatException: For input string: "unknown".

2019-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915447#comment-16915447
 ] 

Hyukjin Kwon commented on SPARK-28861:
--

Can you share the full stack trace?

> Jetty property handling: java.lang.NumberFormatException: For input string: 
> "unknown".
> --
>
> Key: SPARK-28861
> URL: https://issues.apache.org/jira/browse/SPARK-28861
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 2.4.3
>Reporter: Ketan
>Priority: Minor
>
> While processing data from certain files, a {{NumberFormatException}} was seen 
> in the logs. The processing was fine, but the following stack trace was 
> observed:
> {code}
> {"time":"2019-08-16 
> 08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""}
> java.lang.NumberFormatException: For input string: "unknown".
> {code}
> On investigation, it was found that the class Jetty contains the following:
> {code}
> BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", 
> "unknown")); 
> {code}
> which indicates that the config should have the 'timestamp' property. If the 
> property is not present, the default value 'unknown' is used, and this value 
> causes the stack trace to show up in our application's logs. It has no 
> detrimental effect on the application as such, but could be addressed.
>  






[jira] [Updated] (SPARK-28861) Jetty property handling

2019-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28861:
-
Description: 
While processing data from certain files, a {{NumberFormatException}} was seen 
in the logs. The processing was fine, but the following stack trace was observed:

{code}
{"time":"2019-08-16 
08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""}
java.lang.NumberFormatException: For input string: "unknown".
{code}

On investigation, it was found that the class Jetty contains the following:

{code}
BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", 
"unknown")); 
{code}

which indicates that the config should have the 'timestamp' property. If the 
property is not present, the default value 'unknown' is used, and this value 
causes the stack trace to show up in our application's logs. It has no 
detrimental effect on the application as such, but could be addressed.
 

  was:
While processing data from certain files, a NumberFormatException was seen in 
the logs. The processing was fine, but the following stack trace was observed:

{"time":"2019-08-16 
08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""}
java.lang.NumberFormatException: For input string: "unknown".

On investigation, it was found that the class Jetty contains the following:

BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", 
"unknown")); 

which indicates that the config should have the 'timestamp' property. If the 
property is not present, the default value 'unknown' is used, and this value 
causes the stack trace to show up in our application's logs. It has no 
detrimental effect on the application as such, but could be addressed.
 


> Jetty property handling
> ---
>
> Key: SPARK-28861
> URL: https://issues.apache.org/jira/browse/SPARK-28861
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 2.4.3
>Reporter: Ketan
>Priority: Minor
>
> While processing data from certain files, a {{NumberFormatException}} was seen 
> in the logs. The processing was fine, but the following stack trace was 
> observed:
> {code}
> {"time":"2019-08-16 
> 08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""}
> java.lang.NumberFormatException: For input string: "unknown".
> {code}
> On investigation, it was found that the class Jetty contains the following:
> {code}
> BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", 
> "unknown")); 
> {code}
> which indicates that the config should have the 'timestamp' property. If the 
> property is not present, the default value 'unknown' is used, and this value 
> causes the stack trace to show up in our application's logs. It has no 
> detrimental effect on the application as such, but could be addressed.
>  






[jira] [Commented] (SPARK-28862) Read Impala view with UNION ALL operator

2019-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915446#comment-16915446
 ] 

Hyukjin Kwon commented on SPARK-28862:
--

This seems like a question rather than an issue in its current state. Let's 
interact with the mailing list first before filing an issue here.

> Read Impala view with UNION ALL operator
> 
>
> Key: SPARK-28862
> URL: https://issues.apache.org/jira/browse/SPARK-28862
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: francesco
>Priority: Major
>  Labels: Failed, Hive, PySpark, Read, TABLE, Union, VIEW, 
> materializedviews, memorymanager
>
> I would like to report an issue in PySpark 2.2.0 when it is used to read Hive 
> views that contain the UNION ALL operator. 
>  
> I attach the error: 
> WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), 
> try again. 
>  
> Is there any solution other than materializing this view? 
> Thanks,






[jira] [Resolved] (SPARK-28862) Read Impala view with UNION ALL operator

2019-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28862.
--
Resolution: Invalid

> Read Impala view with UNION ALL operator
> 
>
> Key: SPARK-28862
> URL: https://issues.apache.org/jira/browse/SPARK-28862
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: francesco
>Priority: Major
>  Labels: Failed, Hive, PySpark, Read, TABLE, Union, VIEW, 
> materializedviews, memorymanager
>
> I would like to report an issue in PySpark 2.2.0 when it is used to read Hive 
> views that contain the UNION ALL operator. 
>  
> I attach the error: 
> WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), 
> try again. 
>  
> Is there any solution other than materializing this view? 
> Thanks,






[jira] [Updated] (SPARK-28871) Some codes in 'Policy for handling multiple watermarks' does not show friendly

2019-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28871:
-
Summary: Some codes in 'Policy for handling multiple watermarks' does not 
show friendly   (was: Som codes in 'Policy for handling multiple watermarks' 
does not show friendly )

> Some codes in 'Policy for handling multiple watermarks' does not show 
> friendly 
> ---
>
> Key: SPARK-28871
> URL: https://issues.apache.org/jira/browse/SPARK-28871
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: chaiyongqiang
>Priority: Major
>  Labels: documentation
> Attachments: Policy_for_handling_multiple_watermarks.png
>
>
> The code in the 'Policy for handling multiple watermarks' section of the 
> structured-streaming-programming-guide does not render properly.






[jira] [Commented] (SPARK-28869) Roll over event log files

2019-08-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915429#comment-16915429
 ] 

Jungtaek Lim commented on SPARK-28869:
--

Started working on this now.

> Roll over event log files
> -
>
> Key: SPARK-28869
> URL: https://issues.apache.org/jira/browse/SPARK-28869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to roll over event log files in the driver and 
> let the SHS replay the multiple event log files correctly.
> This issue doesn't deal with the overall size of the event log, nor does it 
> guarantee when old event log files are deleted.






[jira] [Updated] (SPARK-28871) Som codes in 'Policy for handling multiple watermarks' does not show friendly

2019-08-25 Thread chaiyongqiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaiyongqiang updated SPARK-28871:
--
Attachment: Policy_for_handling_multiple_watermarks.png

> Som codes in 'Policy for handling multiple watermarks' does not show friendly 
> --
>
> Key: SPARK-28871
> URL: https://issues.apache.org/jira/browse/SPARK-28871
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: chaiyongqiang
>Priority: Major
>  Labels: documentation
> Attachments: Policy_for_handling_multiple_watermarks.png
>
>
> The code in the 'Policy for handling multiple watermarks' section of the 
> structured-streaming-programming-guide does not render properly.






[jira] [Updated] (SPARK-28871) Som codes in 'Policy for handling multiple watermarks' does not show friendly

2019-08-25 Thread chaiyongqiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaiyongqiang updated SPARK-28871:
--
Attachment: 屏幕快照 2019-08-26 上午9.01.12.png

> Som codes in 'Policy for handling multiple watermarks' does not show friendly 
> --
>
> Key: SPARK-28871
> URL: https://issues.apache.org/jira/browse/SPARK-28871
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: chaiyongqiang
>Priority: Major
>  Labels: documentation
>
> The code in the 'Policy for handling multiple watermarks' section of the 
> structured-streaming-programming-guide does not render properly.






[jira] [Updated] (SPARK-28871) Som codes in 'Policy for handling multiple watermarks' does not show friendly

2019-08-25 Thread chaiyongqiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaiyongqiang updated SPARK-28871:
--
Attachment: (was: 屏幕快照 2019-08-26 上午9.01.12.png)

> Som codes in 'Policy for handling multiple watermarks' does not show friendly 
> --
>
> Key: SPARK-28871
> URL: https://issues.apache.org/jira/browse/SPARK-28871
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: chaiyongqiang
>Priority: Major
>  Labels: documentation
>
> The code in the 'Policy for handling multiple watermarks' section of the 
> structured-streaming-programming-guide does not render properly.






[jira] [Created] (SPARK-28871) Som codes in 'Policy for handling multiple watermarks' does not show friendly

2019-08-25 Thread chaiyongqiang (Jira)
chaiyongqiang created SPARK-28871:
-

 Summary: Som codes in 'Policy for handling multiple watermarks' 
does not show friendly 
 Key: SPARK-28871
 URL: https://issues.apache.org/jira/browse/SPARK-28871
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.3
Reporter: chaiyongqiang


The code in the 'Policy for handling multiple watermarks' section of the 
structured-streaming-programming-guide does not render properly.
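For reference, the guide section in question documents how Spark chooses the
global watermark when a query has multiple watermarks. A minimal sketch of the
setting that section covers (assuming the
spark.sql.streaming.multipleWatermarkPolicy configuration described there):

{code}
// Minimal sketch of the setting covered by that guide section.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multiple-watermarks-demo")
  .getOrCreate()

// "min" (the default) uses the slowest stream's watermark as the global
// watermark; "max" uses the fastest stream's, which drops data that is late
// relative to the slower stream.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")
{code}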






[jira] [Created] (SPARK-28870) Snapshot old event log files to support compaction

2019-08-25 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-28870:


 Summary: Snapshot old event log files to support compaction
 Key: SPARK-28870
 URL: https://issues.apache.org/jira/browse/SPARK-28870
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue tracks the effort to compact old event log files into a snapshot, 
achieving two goals: 1) reduce the overall event log file size, and 2) speed 
up replaying event logs. It also deals with cleaning event log files, as the 
snapshot will provide a safe way to clean up old event log files without 
losing the ability to replay the whole event log.

This issue builds on top of SPARK-28869, as SPARK-28869 will create the rolled 
event log files.






[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-08-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-28594:
-
Component/s: (was: Input/Output)
 Spark Core

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled on Spark 
> Streaming, the event logs grow massively. The files continue to grow until 
> the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files 
> but not the event log files of applications that are still running.
> Could we identify a mechanism to set a "max file" size so that the file is 
> rolled over when it reaches this size?
>  
>  






[jira] [Created] (SPARK-28869) Roll over event log files

2019-08-25 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-28869:


 Summary: Roll over event log files
 Key: SPARK-28869
 URL: https://issues.apache.org/jira/browse/SPARK-28869
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue tracks the effort to roll over event log files in the driver and let 
the SHS replay the multiple event log files correctly.

This issue doesn't deal with the overall size of the event log, nor does it 
guarantee when old event log files are deleted.






[jira] [Comment Edited] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-08-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915357#comment-16915357
 ] 

Jungtaek Lim edited comment on SPARK-28594 at 8/25/19 11:47 PM:


Coincidentally, I have been working on the design of this feature for two 
weeks. Since the reporter doesn't seem to be working on it, I'll take up this 
issue and move forward.

Only the POC is done; I've just started implementing. Here's the design doc 
describing the approach:

[https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit?usp=sharing]

 

 


was (Author: kabhwan):
Coincidentally, I have been working on the design of this feature for two 
weeks. Since the reporter doesn't seem to be working on it, I'll take up this 
issue and move forward.

Only the POC is done; I've just started implementing. Here's the design doc 
describing the approach:

[https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit#heading=h.7bmfccqq7ozy]

 

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled on Spark 
> Streaming, the event logs grow massively. The files continue to grow until 
> the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files 
> but not the event log files of applications that are still running.
> Could we identify a mechanism to set a "max file" size so that the file is 
> rolled over when it reaches this size?
>  
>  






[jira] [Resolved] (SPARK-27433) Spark Structured Streaming left outer joins returns outer nulls for already matched rows

2019-08-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-27433.
--
Resolution: Duplicate

> Spark Structured Streaming left outer joins returns outer nulls for already 
> matched rows
> 
>
> Key: SPARK-27433
> URL: https://issues.apache.org/jira/browse/SPARK-27433
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Binit
>Priority: Blocker
>
> I'm basically using the example given in Spark's documentation here: 
> [https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#outer-joins-with-watermarking]
> with the built-in test stream in which one stream is ahead by 3 seconds (I was 
> originally using Kafka but ran into the same issue). The results returned the 
> matched columns correctly; however, after a while the same key is returned 
> with an outer null.
> Is this the expected behavior? Is there a way to exclude the duplicate outer 
> null results when there was a match?
> Code:
> {code}
> val testStream = session.readStream.format("rate")
>   .option("rowsPerSecond", "5").option("numPartitions", "1").load()
> val impressions = testStream.select(
>   (col("value") + 15).as("impressionAdId"),
>   col("timestamp").as("impressionTime"))
> val clicks = testStream.select(
>   col("value").as("clickAdId"),
>   col("timestamp").as("clickTime"))
> // Apply watermarks on event-time columns
> val impressionsWithWatermark = impressions.withWatermark("impressionTime", "20 seconds")
> val clicksWithWatermark = clicks.withWatermark("clickTime", "30 seconds")
> // Join with event-time constraints
> val result = impressionsWithWatermark.join(
>   clicksWithWatermark,
>   expr("""
>     clickAdId = impressionAdId AND
>     clickTime >= impressionTime AND
>     clickTime <= impressionTime + interval 10 seconds
>   """),
>   joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
> )
> val query = result.writeStream.outputMode("update").format("console")
>   .option("truncate", false).start()
> query.awaitTermination()
> {code}
> Result:
> {code}
> --- Batch: 19 ---
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> |100           |2018-05-23 22:18:38.362|100      |2018-05-23 22:18:41.362|
> |101           |2018-05-23 22:18:38.562|101      |2018-05-23 22:18:41.562|
> |102           |2018-05-23 22:18:38.762|102      |2018-05-23 22:18:41.762|
> |103           |2018-05-23 22:18:38.962|103      |2018-05-23 22:18:41.962|
> |104           |2018-05-23 22:18:39.162|104      |2018-05-23 22:18:42.162|
> --- Batch: 57 ---
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> |290           |2018-05-23 22:19:16.362|290      |2018-05-23 22:19:19.362|
> |291           |2018-05-23 22:19:16.562|291      |2018-05-23 22:19:19.562|
> |292           |2018-05-23 22:19:16.762|292      |2018-05-23 22:19:19.762|
> |293           |2018-05-23 22:19:16.962|293      |2018-05-23 22:19:19.962|
> |294           |2018-05-23 22:19:17.162|294      |2018-05-23 22:19:20.162|
> |100           |2018-05-23 22:18:38.362|null     |null                   |
> |99            |2018-05-23 22:18:38.162|null     |null                   |
> |103           |2018-05-23 22:18:38.962|null     |null                   |
> |101           |2018-05-23 22:18:38.562|null     |null                   |
> |102           |2018-05-23 22:18:38.762|null     |null                   |
> {code}
> This question was also asked on Stack Overflow; please find the link below:
> [https://stackoverflow.com/questions/50500111/spark-structured-streaming-left-outer-joins-returns-outer-nulls-for-already-matc/55616902#55616902]
> 101 & 103 have already been matched in the join, but they still appear in the 
> left outer join with outer nulls.
>  






[jira] [Updated] (SPARK-27433) Spark Structured Streaming left outer joins returns outer nulls for already matched rows

2019-08-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-27433:
-
Issue Type: Bug  (was: Question)

> Spark Structured Streaming left outer joins returns outer nulls for already 
> matched rows
> 
>
> Key: SPARK-27433
> URL: https://issues.apache.org/jira/browse/SPARK-27433
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Binit
>Priority: Blocker
>
> I'm basically using the example given in Spark's documentation here: 
> [https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#outer-joins-with-watermarking]
> with the built-in test stream in which one stream is ahead by 3 seconds (I was 
> originally using Kafka but ran into the same issue). The results returned the 
> matched columns correctly; however, after a while the same key is returned 
> with an outer null.
> Is this the expected behavior? Is there a way to exclude the duplicate outer 
> null results when there was a match?
> Code:
> {code}
> val testStream = session.readStream.format("rate")
>   .option("rowsPerSecond", "5").option("numPartitions", "1").load()
> val impressions = testStream.select(
>   (col("value") + 15).as("impressionAdId"),
>   col("timestamp").as("impressionTime"))
> val clicks = testStream.select(
>   col("value").as("clickAdId"),
>   col("timestamp").as("clickTime"))
> // Apply watermarks on event-time columns
> val impressionsWithWatermark = impressions.withWatermark("impressionTime", "20 seconds")
> val clicksWithWatermark = clicks.withWatermark("clickTime", "30 seconds")
> // Join with event-time constraints
> val result = impressionsWithWatermark.join(
>   clicksWithWatermark,
>   expr("""
>     clickAdId = impressionAdId AND
>     clickTime >= impressionTime AND
>     clickTime <= impressionTime + interval 10 seconds
>   """),
>   joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
> )
> val query = result.writeStream.outputMode("update").format("console")
>   .option("truncate", false).start()
> query.awaitTermination()
> {code}
> Result:
> {code}
> --- Batch: 19 ---
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> |100           |2018-05-23 22:18:38.362|100      |2018-05-23 22:18:41.362|
> |101           |2018-05-23 22:18:38.562|101      |2018-05-23 22:18:41.562|
> |102           |2018-05-23 22:18:38.762|102      |2018-05-23 22:18:41.762|
> |103           |2018-05-23 22:18:38.962|103      |2018-05-23 22:18:41.962|
> |104           |2018-05-23 22:18:39.162|104      |2018-05-23 22:18:42.162|
> --- Batch: 57 ---
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> |290           |2018-05-23 22:19:16.362|290      |2018-05-23 22:19:19.362|
> |291           |2018-05-23 22:19:16.562|291      |2018-05-23 22:19:19.562|
> |292           |2018-05-23 22:19:16.762|292      |2018-05-23 22:19:19.762|
> |293           |2018-05-23 22:19:16.962|293      |2018-05-23 22:19:19.962|
> |294           |2018-05-23 22:19:17.162|294      |2018-05-23 22:19:20.162|
> |100           |2018-05-23 22:18:38.362|null     |null                   |
> |99            |2018-05-23 22:18:38.162|null     |null                   |
> |103           |2018-05-23 22:18:38.962|null     |null                   |
> |101           |2018-05-23 22:18:38.562|null     |null                   |
> |102           |2018-05-23 22:18:38.762|null     |null                   |
> {code}
> This question was also asked on Stack Overflow; please find the link below:
> [https://stackoverflow.com/questions/50500111/spark-structured-streaming-left-outer-joins-returns-outer-nulls-for-already-matc/55616902#55616902]
> 101 & 103 have already been matched in the join, but they still appear in the 
> left outer join with outer nulls.
>  






[jira] [Reopened] (SPARK-27433) Spark Structured Streaming left outer joins returns outer nulls for already matched rows

2019-08-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reopened SPARK-27433:
--

Reopening to mark type and resolution correctly.

> Spark Structured Streaming left outer joins returns outer nulls for already 
> matched rows
> 
>
> Key: SPARK-27433
> URL: https://issues.apache.org/jira/browse/SPARK-27433
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Binit
>Priority: Blocker
>
> I'm basically using the example given in Spark's documentation here: 
> [https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#outer-joins-with-watermarking]
> with the built-in test stream in which one stream is ahead by 3 seconds (I was 
> originally using Kafka but ran into the same issue). The results returned the 
> matched columns correctly; however, after a while the same key is returned 
> with an outer null.
> Is this the expected behavior? Is there a way to exclude the duplicate outer 
> null results when there was a match?
> Code:
> {code}
> val testStream = session.readStream.format("rate")
>   .option("rowsPerSecond", "5").option("numPartitions", "1").load()
> val impressions = testStream.select(
>   (col("value") + 15).as("impressionAdId"),
>   col("timestamp").as("impressionTime"))
> val clicks = testStream.select(
>   col("value").as("clickAdId"),
>   col("timestamp").as("clickTime"))
> // Apply watermarks on event-time columns
> val impressionsWithWatermark = impressions.withWatermark("impressionTime", "20 seconds")
> val clicksWithWatermark = clicks.withWatermark("clickTime", "30 seconds")
> // Join with event-time constraints
> val result = impressionsWithWatermark.join(
>   clicksWithWatermark,
>   expr("""
>     clickAdId = impressionAdId AND
>     clickTime >= impressionTime AND
>     clickTime <= impressionTime + interval 10 seconds
>   """),
>   joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
> )
> val query = result.writeStream.outputMode("update").format("console")
>   .option("truncate", false).start()
> query.awaitTermination()
> {code}
> Result:
> {code}
> --- Batch: 19 ---
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> |100           |2018-05-23 22:18:38.362|100      |2018-05-23 22:18:41.362|
> |101           |2018-05-23 22:18:38.562|101      |2018-05-23 22:18:41.562|
> |102           |2018-05-23 22:18:38.762|102      |2018-05-23 22:18:41.762|
> |103           |2018-05-23 22:18:38.962|103      |2018-05-23 22:18:41.962|
> |104           |2018-05-23 22:18:39.162|104      |2018-05-23 22:18:42.162|
> --- Batch: 57 ---
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> |290           |2018-05-23 22:19:16.362|290      |2018-05-23 22:19:19.362|
> |291           |2018-05-23 22:19:16.562|291      |2018-05-23 22:19:19.562|
> |292           |2018-05-23 22:19:16.762|292      |2018-05-23 22:19:19.762|
> |293           |2018-05-23 22:19:16.962|293      |2018-05-23 22:19:19.962|
> |294           |2018-05-23 22:19:17.162|294      |2018-05-23 22:19:20.162|
> |100           |2018-05-23 22:18:38.362|null     |null                   |
> |99            |2018-05-23 22:18:38.162|null     |null                   |
> |103           |2018-05-23 22:18:38.962|null     |null                   |
> |101           |2018-05-23 22:18:38.562|null     |null                   |
> |102           |2018-05-23 22:18:38.762|null     |null                   |
> {code}
> This question was also asked on Stack Overflow; please find the link below:
> [https://stackoverflow.com/questions/50500111/spark-structured-streaming-left-outer-joins-returns-outer-nulls-for-already-matc/55616902#55616902]
> 101 & 103 have already been matched in the join, but they still appear in the 
> left outer join with outer nulls.
>  






[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-08-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915383#comment-16915383
 ] 

Jungtaek Lim commented on SPARK-28594:
--

To keep PRs smaller (and reviews easier), I will split this issue into two 
sub-issues:

1) just roll event log files (no compaction)

2) compact old event log files

Note that even rolling event log files without compaction could help in some 
extreme cases: when the log file has grown really huge for a running 
application, you may decide to drop some old logs, accepting that this loses 
the ability to replay the full log. Currently there's no way to do this - 
deleting an event log file which is open for writing would cause unexpected 
issues, and we would end up stopping the application.
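To make the first sub-issue concrete from a user's point of view, here is a
hypothetical sketch of what enabling rolling could look like; the rolling
property names below are illustrative only, not an existing Spark API.

{code}
// Hypothetical sketch: the two "rolling" properties are illustrative of the
// proposal, not an existing Spark API; spark.eventLog.enabled is real.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("long-running-streaming-app")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.rolling.enabled", "true")     // illustrative
  .config("spark.eventLog.rolling.maxFileSize", "128m") // illustrative
  .getOrCreate()
{code}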

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled on Spark 
> Streaming, the event logs grow massively. The files continue to grow until 
> the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files 
> but not the event log files of applications that are still running.
> Could we identify a mechanism to set a "max file" size so that the file is 
> rolled over when it reaches this size?
>  
>  






[jira] [Resolved] (SPARK-28868) Specify Jekyll version to 3.8.6 in release docker image

2019-08-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28868.
---
Fix Version/s: 2.4.4
   3.0.0
   Resolution: Fixed

Issue resolved by pull request 25578
[https://github.com/apache/spark/pull/25578]

> Specify Jekyll version to 3.8.6 in release docker image
> ---
>
> Key: SPARK-28868
> URL: https://issues.apache.org/jira/browse/SPARK-28868
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 3.0.0, 2.4.4
>
>
> Recently, Jekyll 4.0 was released, and it dropped Ruby 2.3 support.
> This breaks our release docker image build.






[jira] [Resolved] (SPARK-25549) High level API to collect RDD statistics

2019-08-25 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-25549.
-
Resolution: Won't Fix

> High level API to collect RDD statistics
> 
>
> Key: SPARK-25549
> URL: https://issues.apache.org/jira/browse/SPARK-25549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We have the low-level API SparkContext.submitMapStage, used for collecting 
> statistics of an RDD. However, it is too low level and not so easy to use. We 
> need a high-level API for that.






[jira] [Assigned] (SPARK-28868) Specify Jekyll version to 3.8.6 in release docker image

2019-08-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28868:
-

Assignee: Dongjoon Hyun

> Specify Jekyll version to 3.8.6 in release docker image
> ---
>
> Key: SPARK-28868
> URL: https://issues.apache.org/jira/browse/SPARK-28868
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
>
> Recently, Jekyll 4.0 was released, and it dropped Ruby 2.3 support.
> This breaks our release docker image build.






[jira] [Commented] (SPARK-25549) High level API to collect RDD statistics

2019-08-25 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915362#comment-16915362
 ] 

Liang-Chi Hsieh commented on SPARK-25549:
-

Closing this as it is not needed now.

> High level API to collect RDD statistics
> 
>
> Key: SPARK-25549
> URL: https://issues.apache.org/jira/browse/SPARK-25549
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We have the low-level API SparkContext.submitMapStage, used for collecting 
> statistics of an RDD. However, it is too low level and not so easy to use. We 
> need a high-level API for that.






[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-08-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915358#comment-16915358
 ] 

Jungtaek Lim commented on SPARK-28594:
--

I've raised the priority, as many end users are suffering from this issue, 
especially when they run long-running queries.

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> At all current Spark releases when event logging on spark streaming is 
> enabled the event logs grow massively.  The files continue to grow until the 
> application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617
> Addresses .inprogress files but not event log files that are still running.
> Identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  






[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-08-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915357#comment-16915357
 ] 

Jungtaek Lim commented on SPARK-28594:
--

Coincidentally, I have been working on the design of this feature for two 
weeks. Since the reporter doesn't seem to be working on it, I'll take up this 
issue and move forward.

Only the POC is done; I've just started implementing. Here's the design doc 
describing the approach:

[https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit#heading=h.7bmfccqq7ozy]

 

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Minor
>
> In all current Spark releases, when event logging is enabled on Spark 
> Streaming, the event logs grow massively. The files continue to grow until 
> the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files 
> but not the event log files of applications that are still running.
> Could we identify a mechanism to set a "max file" size so that the file is 
> rolled over when it reaches this size?
>  
>  






[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-08-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-28594:
-
Priority: Major  (was: Minor)

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled on Spark 
> Streaming, the event logs grow massively. The files continue to grow until 
> the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files 
> but not the event log files of applications that are still running.
> Could we identify a mechanism to set a "max file" size so that the file is 
> rolled over when it reaches this size?
>  
>  






[jira] [Updated] (SPARK-28868) Specify Jekyll version to 3.8.6 in release docker image

2019-08-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28868:
--
Priority: Blocker  (was: Major)

> Specify Jekyll version to 3.8.6 in release docker image
> ---
>
> Key: SPARK-28868
> URL: https://issues.apache.org/jira/browse/SPARK-28868
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> Recently, Jekyll 4.0 was released, and it dropped Ruby 2.3 support.
> This breaks our release docker image build.






[jira] [Created] (SPARK-28868) Specify Jekyll version to 3.8.6 in release docker image

2019-08-25 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-28868:
-

 Summary: Specify Jekyll version to 3.8.6 in release docker image
 Key: SPARK-28868
 URL: https://issues.apache.org/jira/browse/SPARK-28868
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 2.4.4, 3.0.0
Reporter: Dongjoon Hyun


Recently, Jekyll 4.0 was released, and it dropped Ruby 2.3 support.

This breaks our release docker image build.






[jira] [Created] (SPARK-28867) InMemoryStore checkpoint to speed up replay log file in HistoryServer

2019-08-25 Thread wuyi (Jira)
wuyi created SPARK-28867:


 Summary: InMemoryStore checkpoint to speed up replay log file in 
HistoryServer
 Key: SPARK-28867
 URL: https://issues.apache.org/jira/browse/SPARK-28867
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: wuyi


The HistoryServer can currently be very slow replaying a large log file the 
first time, and it always re-replays an in-progress log file after it changes. 
We could periodically checkpoint the InMemoryStore to speed up replaying log 
files.
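A rough sketch of the idea, with hypothetical names that are not Spark APIs:
remember how far into the log the store has been applied, so that a changed
in-progress file only needs its new tail replayed.

{code}
// Rough sketch with hypothetical names (not Spark APIs): track how many
// events have already been applied so re-reading an in-progress log replays
// only the new tail instead of the whole file.
final case class StoreCheckpoint(appliedEvents: Int)

def incrementalReplay(
    events: IndexedSeq[String],
    checkpoint: StoreCheckpoint,
    applyEvent: String => Unit): StoreCheckpoint = {
  events.drop(checkpoint.appliedEvents).foreach(applyEvent)
  StoreCheckpoint(events.size)
}
{code}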






[jira] [Assigned] (SPARK-27988) Add aggregates.sql - Part3

2019-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27988:


Assignee: Yuming Wang

> Add aggregates.sql - Part3
> --
>
> Key: SPARK-27988
> URL: https://issues.apache.org/jira/browse/SPARK-27988
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605






[jira] [Resolved] (SPARK-27988) Add aggregates.sql - Part3

2019-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27988.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24829
[https://github.com/apache/spark/pull/24829]

> Add aggregates.sql - Part3
> --
>
> Key: SPARK-27988
> URL: https://issues.apache.org/jira/browse/SPARK-27988
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605






[jira] [Created] (SPARK-28866) Persist item factors RDD when checkpointing in ALS

2019-08-25 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-28866:
---

 Summary: Persist item factors RDD when checkpointing in ALS
 Key: SPARK-28866
 URL: https://issues.apache.org/jira/browse/SPARK-28866
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


In the ALS ML implementation, if `implicitPrefs` is false, we checkpoint the 
RDD of item factors at intervals. Before checkpointing and materializing the 
RDD, it was not persisted, which causes recomputation. In an experiment, there 
is a performance difference between persisting and not persisting before 
checkpointing the RDD.
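A minimal sketch of the general pattern being proposed, using a stand-in RDD
rather than the actual item factors inside ALS: persist before checkpointing
so the checkpoint job reads cached blocks instead of recomputing the lineage.

{code}
// Minimal sketch with a stand-in RDD (the real change targets the item
// factors computed inside ALS): persist before checkpoint so the checkpoint
// job reuses cached blocks instead of recomputing the lineage.
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = SparkContext.getOrCreate()
sc.setCheckpointDir("/tmp/checkpoints")

val factors = sc.parallelize(1 to 1000).map(i => (i, Array.fill(10)(i.toDouble)))
factors.persist(StorageLevel.MEMORY_AND_DISK) // without this, checkpointing recomputes
factors.checkpoint()
factors.count() // materializes both the cache and the checkpoint
{code}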






[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2019-08-25 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915225#comment-16915225
 ] 

angerszhu commented on SPARK-21067:
---

The root cause is that in Spark we use just one SessionState. In HiveServer2, 
each session has its own SessionState.

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.4.0, 2.4.3
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>Priority: Major
> Attachments: SPARK-21067.patch
>
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment, but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.

[jira] [Created] (SPARK-28865) Table inheritance

2019-08-25 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-28865:
---

 Summary: Table inheritance
 Key: SPARK-28865
 URL: https://issues.apache.org/jira/browse/SPARK-28865
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


PostgreSQL implements table inheritance, which can be a useful tool for 
database designers. (SQL:1999 and later define a type inheritance feature, 
which differs in many respects from the features described here.)

 

[https://www.postgresql.org/docs/11/ddl-inherit.html]
[https://www.postgresql.org/docs/11/tutorial-inheritance.html]
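
For reference, a self-contained example of the feature, following the pattern 
in the linked PostgreSQL tutorial:

{code:sql}
-- A child table inherits all columns of its parent.
CREATE TABLE cities (
  name       text,
  population real,
  elevation  int
);

CREATE TABLE capitals (
  state char(2)
) INHERITS (cities);

-- A query on cities also returns rows stored in capitals:
SELECT name, elevation FROM cities WHERE elevation > 500;

-- ONLY restricts the scan to the parent table itself:
SELECT name, elevation FROM ONLY cities WHERE elevation > 500;
{code}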



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28818) FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present

2019-08-25 Thread Matt Hawes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915181#comment-16915181
 ] 

Matt Hawes commented on SPARK-28818:


[PR|https://github.com/apache/spark/pull/25575] created with tests to ensure it 
fixes the original issue.
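
A guess at the shape of such a fix (the PR above is authoritative): declare 
the frequent-items array elements nullable, so collected nulls no longer 
violate the row schema. `colInfo` below is a stand-in for the (name, type) 
pairs computed inside `FrequentItems`.

{code:scala}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType, StructField, StructType}

val colInfo: Seq[(String, DataType)] = Seq("val" -> StringType)

val outputCols = colInfo.map { v =>
  // containsNull = true is the one-flag change relative to the snippet quoted below.
  StructField(v._1 + "_freqItems", ArrayType(v._2, containsNull = true))
}
println(StructType(outputCols).treeString)
{code}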

> FrequentItems applies an incorrect schema to the resulting dataframe when 
> nulls are present
> ---
>
> Key: SPARK-28818
> URL: https://issues.apache.org/jira/browse/SPARK-28818
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Matt Hawes
>Priority: Minor
>
> A trivially reproducible bug in the code for `FrequentItems`. The schema for 
> the resulting arrays of frequent items is hard coded 
> (FrequentItems.scala#L122) to have non-nullable array elements:
> {code:scala}
> val outputCols = colInfo.map { v =>
>   StructField(v._1 + "_freqItems", ArrayType(v._2, false))
> }
> val schema = StructType(outputCols).toAttributes
> Dataset.ofRows(df.sparkSession, LocalRelation.fromExternalRows(schema, Seq(resultRow)))
> {code}
>  
> However, if the column contains frequent nulls, then these nulls are included 
> in the frequent-items array. This results in various errors; for example, any 
> attempt to `collect()` fails with a null pointer exception:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.Builder().getOrCreate()
> df = spark.createDataFrame([
>     (1, 'a'),
>     (2, None),
>     (3, 'b'),
> ], schema="id INTEGER, val STRING")
> rows = df.freqItems(df.columns).collect()
> {code}
>  Results in:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/dataframe.py", 
> line 533, in collect
>     sock_info = self._jdf.collectToPython()
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 
> 63, in deco
>     return f(*a, **kw)
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o40.collectToPython.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:296)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:39)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.executeCollect(LocalTableScanExec.scala:70)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3257)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3254)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
>   at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3254)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(