[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-15 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265812#comment-17265812
 ] 

dzcxzl commented on SPARK-33790:


OK, I opened a JIRA: [SPARK-34125|https://issues.apache.org/jira/browse/SPARK-34125]

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has the FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate many RPC calls and speed up the history 
> server.
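
The idea in the description can be sketched as follows. This is a minimal Java illustration under stated assumptions, not Spark's actual code: `Status`, `StatusLookup`, and `SingleFileReader` are hypothetical stand-ins for Hadoop's `FileStatus`, a file-system client, and `SingleFileEventLogFileReader`. The reader accepts a status the caller already holds and falls back to a lookup (the simulated RPC) only when none was supplied.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a file status record (path and length only). */
final class Status {
    final String path;
    final long length;

    Status(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

/** Hypothetical stand-in for a file-system client; counts simulated RPC calls. */
final class StatusLookup {
    final AtomicInteger rpcCalls = new AtomicInteger();

    Status getFileStatus(String path) {
        rpcCalls.incrementAndGet();      // each call models one remote round trip
        return new Status(path, 1024L);  // fixed size, for illustration only
    }
}

/** Hypothetical reader that reuses a status the caller already holds. */
final class SingleFileReader {
    private final StatusLookup fs;
    private final String path;
    private Status status; // null until known; not synchronized in this sketch

    SingleFileReader(StatusLookup fs, String path, Optional<Status> known) {
        this.fs = fs;
        this.path = path;
        this.status = known.orElse(null);
    }

    /** Looks the status up at most once, and not at all if it was supplied. */
    long fileSizeForLastIndex() {
        if (status == null) {
            status = fs.getFileStatus(path);
        }
        return status.length;
    }
}
```

A caller that has just listed a directory would pass `Optional.of(status)` for each entry, so scanning N files costs no extra lookups in this sketch; passing `Optional.empty()` keeps the old behaviour of one lookup per reader.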



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265731#comment-17265731
 ] 

Jungtaek Lim commented on SPARK-33790:
--

Oh, OK. I haven't encountered the issue myself, but Scala's mutable HashMap does 
appear to have it...

Would you mind filing a separate JIRA issue and raising a PR against branch-2.4? 
2.4.x is still a supported version, so the PR would be reviewed and accepted even 
if the fix does not apply to 3.x.




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265724#comment-17265724
 ] 

dzcxzl commented on SPARK-33790:


Thread stack when the history server is not working:
!http://git.dev.sh.ctripcorp.com/framework-di/spark-2.2.0/uploads/9cfa9662f563ac64f77f4d4ee6fd9243/image.png!

[https://github.com/scala/bug/issues/10436]




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265700#comment-17265700
 ] 

Jungtaek Lim commented on SPARK-33790:
--

{quote}
The following is my case 2.x version EventLoggingListener.codecMap is of type 
mutable.HashMap, which is not thread-safe and may hang.
{quote}

Could you please elaborate on the situation in which it may hang?




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265695#comment-17265695
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31187




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265693#comment-17265693
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31187




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265691#comment-17265691
 ] 

dzcxzl commented on SPARK-33790:


This is indeed a performance regression.

In my case, on the 2.x version, EventLoggingListener.codecMap is of type 
mutable.HashMap, which is not thread-safe and may hang.

In the 3.x version, this became EventLogFileReader.codecMap, and its type was 
changed to ConcurrentHashMap.
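
The difference between the two map types can be sketched as below. This is a minimal Java illustration under stated assumptions, not Spark's code: `CodecCache` and its string "codec" values are hypothetical. It shows a cache backed by `java.util.concurrent.ConcurrentHashMap`, which can be populated safely from several threads, unlike an unsynchronized hash map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical codec cache; the "codec" is just a string in this sketch. */
final class CodecCache {
    private final ConcurrentHashMap<String, String> codecs = new ConcurrentHashMap<>();
    final AtomicInteger created = new AtomicInteger();

    /** Returns the cached codec, creating it at most once per name. */
    String codecFor(String shortName) {
        return codecs.computeIfAbsent(shortName, name -> {
            created.incrementAndGet();   // counts how many codecs were built
            return "codec:" + name;
        });
    }

    /** Hits one shared cache from several threads and reports how many codecs were built. */
    static int createdAfterConcurrentLookups(int threads, int lookupsPerThread) {
        CodecCache cache = new CodecCache();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < lookupsPerThread; i++) {
                    cache.codecFor(i % 2 == 0 ? "lz4" : "snappy");
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return cache.created.get();
    }
}
```

`computeIfAbsent` on a `ConcurrentHashMap` applies the mapping function at most once per key, so concurrent scanner threads share one entry per codec name without external locking.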

With the 2.x version, the history server may stop working.

I tried the 3.x version and found that one round of scanning slowed down 
considerably: it rose from 7 min to about 23 min.

In addition, do I need to fix the thread-safety issue in version 2.x?

[~kabhwan]




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265679#comment-17265679
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31186




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265662#comment-17265662
 ] 

Jungtaek Lim commented on SPARK-33790:
--

I revisited this and realized it is a performance regression for event log v1. 
(SPARK-28869 caused the regression.)

I'll submit PRs for the older branches. This should be fixed in 3.1.x / 3.0.x as 
well.




[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2020-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250779#comment-17250779
 ] 

Apache Spark commented on SPARK-33790:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30814

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Trivial
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has FileStatus when constructing 
> SingleFileEventLogFileReader, and there is no need to get the FileStatus 
> again when SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of rpc calls and improve the speed of the history 
> server.






[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2020-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249650#comment-17249650
 ] 

Apache Spark commented on SPARK-33790:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/30780

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Reporter: dzcxzl
> Priority: Trivial
>
> FsHistoryProvider#checkForLogs already has FileStatus when constructing 
> SingleFileEventLogFileReader, and there is no need to get the FileStatus 
> again when SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of rpc calls.


