[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871626#comment-16871626 ] Hudson commented on HDFS-14440: --- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16813 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16813/]) HDFS-14440. RBF: Optimize the file write process in case of multiple (brahma: rev 8e4267650fe52eb6b6d4466fc006e7af4a1326d0) * (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcServer.java > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Fix For: HDFS-13891 > > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847162#comment-16847162 ] Íñigo Goiri commented on HDFS-14440: As far as we understand what the trade-off is, it is fine with me. If you guys want to go with the approach in [^HDFS-14440-HDFS-13891-06.patch], go ahead. I'm +1 for everything except the invokeSequential() vs invokeConcurrent(); for that I'm +0 so no blocker here. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845563#comment-16845563 ] Vinayakumar B commented on HDFS-14440: -- {quote}My only concern was the replacement of invokeSequential with invokeConcurrent. The functionality will be the same but the now we have a trade-off between latency and number of rpc calls. {quote} Yes, its a trade-off. But we should be more lean towards overall latency of Client's operation than number of RPC calls from routers(which occurs in parallel and results in no/negligible overhead). As explained by [~ayushtkn] earlier, in case of File-not-Exist-already, total number of RPC remains same, but latency improves a lot. This is usually the most frequent case. In case of file-already-exist, total number of RPCs, will be same as number of remote locations. Since executed in parallel, overall latency will be negligible. This case is less frequent as compared to former case. I guess, this should be fair enough trade-off for client's RPC performance improvement. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845492#comment-16845492 ] Íñigo Goiri commented on HDFS-14440: The getFileInfo() is a clear improvement. The code style is also better. My only concern was the replacement of invokeSequential with invokeConcurrent. The functionality will be the same but the now we have a trade-off between latency and number of rpc calls. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844738#comment-16844738 ] Vinayakumar B commented on HDFS-14440: -- {quote}I'm still wrapping my head around using invokeConcurrent() and invokeSequential()... What about using sequential for HASH and HASH_ALL and concurrent for the others? {quote} I can see, this would not be a problem at all in the latest patch. Though {{involeConcurrent()}} is used, result check from the returned map happens according to original list of remote locations which is based on the order. Also, {{getFileInfo()}} is much lighter compared to {{getBlockLocations()}}. So, I would say changes are direct and straight forward instead of earlier complex check with blocklocations. +1 > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839654#comment-16839654 ] Íñigo Goiri commented on HDFS-14440: Sorry, I was out last week. As you can see, I still have doubts about this... Probably we should bring others to give their views. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839583#comment-16839583 ] Ayush Saxena commented on HDFS-14440: - [~elgoiri] any further thoughts?? > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832735#comment-16832735 ] Ayush Saxena commented on HDFS-14440: - Back to the same place. :( Firstly, If we do so, in case of successful file write, it shall go over all the destinations, so it will take N time (N being number of namespaces) and successful writes are more than failures due to File being Already exist. Secondly, Not sure we can get the location to trigger from getFileInfo() as we got from blockLocation. Or even doubt if its worth it. Thirdly, In the previous scenario too, in case of empty file you need to make two calls, 2 time Unit, if the file is empty. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832696#comment-16832696 ] Íñigo Goiri commented on HDFS-14440: I'm still wrapping my head around using invokeConcurrent() and invokeSequential()... What about using sequential for HASH and HASH_ALL and concurrent for the others? > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832133#comment-16832133 ] Hadoop QA commented on HDFS-14440: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} HDFS-13891 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 33s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 40s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 57s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s{color} | {color:green} HDFS-13891 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 1s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 26m 11s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 85m 55s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | HDFS-14440 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967718/HDFS-14440-HDFS-13891-06.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux f41c703881aa 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | HDFS-13891 / 893c708 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26742/testReport/ | | Max. process+thread count | 1003 (vs. ulimit of 1) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26742/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832072#comment-16832072 ] Ayush Saxena commented on HDFS-14440: - Updated v06 with the said change. Pls review!!! > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch, > HDFS-14440-HDFS-13891-06.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831848#comment-16831848 ] Hadoop QA commented on HDFS-14440: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 30s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} HDFS-13891 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 49s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 1s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} HDFS-13891 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 38s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m 1s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 79m 0s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | HDFS-14440 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967673/HDFS-14440-HDFS-13891-05.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 5f05b4940f25 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | HDFS-13891 / 40963f9 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26741/testReport/ | | Max. process+thread count | 980 (vs. ulimit of 1) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26741/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831855#comment-16831855 ] Íñigo Goiri commented on HDFS-14440: No need to do toString() in the LOG, otherwise you lose the benefits of using {}. LOG takes care of doing the toString() if needed. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831781#comment-16831781 ] Ayush Saxena commented on HDFS-14440: - Have uploaded patch v5 with said changes. Pls review!!! > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch, HDFS-14440-HDFS-13891-05.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831763#comment-16831763 ] Íñigo Goiri commented on HDFS-14440: In the javadoc, "else" instead of "else". As we are doing this, it might be good to add a log debug message when we are doing "createLocation = existingLocation"; it wasn't there before but it would help debugging. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831671#comment-16831671 ] Hadoop QA commented on HDFS-14440: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} HDFS-13891 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 56s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 50s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 54s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} HDFS-13891 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 22m 20s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 71m 16s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | HDFS-14440 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967642/HDFS-14440-HDFS-13891-04.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 614c7f958a90 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | HDFS-13891 / aeb3b61 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26740/testReport/ | | Max. process+thread count | 1441 (vs. ulimit of 1) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26740/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831638#comment-16831638 ] Ayush Saxena commented on HDFS-14440: - Thanx [~elgoiri] for the review. Have uploaded patch with the said change. Pls Review!!! > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch, > HDFS-14440-HDFS-13891-04.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831121#comment-16831121 ] Íñigo Goiri commented on HDFS-14440: One minor style comment: align the javadoc (confused on why the checkstyle let's this one go) and a break line before. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830843#comment-16830843 ] Ayush Saxena commented on HDFS-14440: - [~elgoiri] can you pls give a check? > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829865#comment-16829865 ] Hadoop QA commented on HDFS-14440: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} HDFS-13891 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 14s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 33s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 55s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green} HDFS-13891 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 54s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 24m 26s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 77m 26s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | HDFS-14440 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967411/HDFS-14440-HDFS-13891-03.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 9d8e40226e36 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | HDFS-13891 / aeb3b61 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26731/testReport/ | | Max. process+thread count | 1030 (vs. ulimit of 1) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26731/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829835#comment-16829835 ] Ayush Saxena commented on HDFS-14440: - Ohhh. I got it you mean to say in case two files existed, Well anyway that would have failed from either NN. I have uploaded a new patch, trying to handle it Pls Review!!! > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch, HDFS-14440-HDFS-13891-03.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829712#comment-16829712 ] Íñigo Goiri commented on HDFS-14440: Before the patch, we had the list of remote locations say A, B, C, and D. Then we checked the first one that had the file, say B. But now the HashMap can come with A, C, D, B. If the file is in C and B, we would use C now. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829447#comment-16829447 ] Ayush Saxena commented on HDFS-14440: - bq. We should take into account the HASH order for example instead of returning any of them. I didn't catch that. Can you elaborate a little. bq. we are losing the order How?? The location according to the order will be selected here itself. {code:java} RemoteLocation createLocation = locations.get(0); {code} This will only change if the file already exist, to the location where it exist, so that NN can handle. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829437#comment-16829437 ] Íñigo Goiri commented on HDFS-14440: By using {{invokeConcurrent()}} and checking the first key in the {{Map}} we are losing the order. We should take into account the HASH order for example instead of returning any of them. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827417#comment-16827417 ] Hadoop QA commented on HDFS-14440: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m 54s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} HDFS-13891 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 14s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 54s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 50s{color} | {color:green} HDFS-13891 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} HDFS-13891 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 20s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 24m 54s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 91m 45s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | HDFS-14440 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967170/HDFS-14440-HDFS-13891-02.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 8bc5cd3a2f21 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | HDFS-13891 / 55f2f7a | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26715/testReport/ | | Max. process+thread count | 1331 (vs. ulimit of 1) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26715/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827303#comment-16827303 ] Ayush Saxena commented on HDFS-14440: - Have uploaded patch v2 changing to getFileInfo(). Ran up a general comparison just b/w getFileInfo() and getBlockLocations() and getFileInfo() tend to fair better average around 23-28 % than the getBlockLocations(). Pls Review!!! > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch, > HDFS-14440-HDFS-13891-02.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827165#comment-16827165 ] Íñigo Goiri commented on HDFS-14440: OK, let's try using getFileInfo(). > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827148#comment-16827148 ] Ayush Saxena commented on HDFS-14440: - I am putting it in table, might help understand better. |Operation|Comparison(Old/New)|Details.| |Successful Write Operation|3.83 (Approx 4 equal to number of namespaces). expectedly to increase linearly as increase in number of NS|Scenario where all namespaces are checked to confirm non availability of File, And finally if not a file is successfully written.| |Failed Write- Empty File(HASH ORDER)|1.732 (There are always two sequential call, One to getBlockLocations and then to getFileInfo )|GetBlockLocations expectdlly takes more time than getFileInfo, Guess that is the reason value isn’ t near exact 2| |Failed Write-Non Empty File(HASH)|Approx 1|All operations for both approach took around same time.| |Operations on Non-Hash Orders|Constant with new approach and same as all other scenario.|Dynamically increases depending upon the position of actual location in the results returned. Worst if the location is the last one and it is an empty file. First all locations sequentially invoked for getBlockLocations() and then for getFileInfo()| *Scenario : 4 Namespace, Each averaged on 100 write ops.* ** bq. I guess it makes sense as the first one actually requires going through the block manager while the other is just name space. We should consider this. Yes, Thanks for getting up the reason, seems fair to me. If you agree we can change to getFileInfo(). :) > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827121#comment-16827121 ] Íñigo Goiri commented on HDFS-14440: Do you mind putting the results in a table? Hard to parse for me which results is for what. I guess one compromise would be to use the old approach for HASH based mount points and the new one for SPACE? BTW, the use case for RANDOM is basically read low balance. We have files that are read from thousands of containers and we put those files in all subclusters and read from a random one. Interesting observation on the {{getBlockLocations()}} versus {{getFileInfo()}}. I guess it makes sense as the first one actually requires going through the block manager while the other is just name space. We should consider this. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827112#comment-16827112 ] Ayush Saxena commented on HDFS-14440: - Thanx [~elgoiri] I had a test setup to compare execution, So I tried on two Test setup one with Two NS and one with Four NS. In focused more on the Four NS one, For the execution of only part of the method changed, I recorded the comparison, * On the successful write scenario, For 100 file writes The comparison time avg. landed to 3.83 Approx 4 only(equal to number of NS) * On Empty File Scenario Failure, For same 100 write. Comparison Avg. landed to 1.732 Approx 2 (For HASH, Since older one is for location and other is for fileInfo, I guess fileInfo takes less time as compared getBlockLocations). * On Non Empty File Failure: The time was almost same for the method part. For Non Hash Orders, With older approaches as I said that was very dynamic and sometimes quite high too, if the location landed being among the last locations, So can't conclude from the value, But with newer that was const. like above ones. For RANDOM order, I don't think for us too, Not much use case(but can't say no one has). But Order SPACE finds fair usability and it has good performance impact there. Moreover, Anything good coming as Extras is always good.:) I didn't had the production N/W load environment for the test, So didn't capture the time seconds, As the Comparison number shall stay same at any N/W performance and in test environment that would be like I shall myself deciding how much Latency for each RPC I weasn't to create. So didn't made sense for me to record, So I judged by the comparison b/w both. Pls Review!!! > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827094#comment-16827094 ] Íñigo Goiri commented on HDFS-14440: Thanks [~ayushtkn] for checking. We should focus on the HASH approaches as those are the predictable ones. Is there any number you can share of latencies? > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827072#comment-16827072 ] Ayush Saxena commented on HDFS-14440: - I did some analysis in my test setup for just this part of code to catch some confirmation for the number of RPC's and time * For A Successful Write, The number of RPC stayed same for both approaches. Just the time improved for obvious reasons, we discussed above. * For Non Successful Write: ** For Empty Files : The minimum RPC is got was 2 with Order HASH and 4 with the new approach and time for checking was half with new approach, But for order RANDOM, I guess the optimization that you talked about(First Location always hitting, I guess didn't hold up for me) And the RPC count was also RANDOM and time too with the old Approach, But with new it stayed const and same as other cases. ** For Non-Empty : The time was same with HASH order and For other it was more and was more like dynamic. Well the time difference for the method execution depends on the n/w state and I can't put the number from prod here. Well it is quite mathematical too. Let me know if any doubts pertain. Moreover I don't think any threat to Functionality from this change. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823223#comment-16823223 ] Íñigo Goiri commented on HDFS-14440: I think we have some metrics in JMX for the number of RPC queries. It might be worth evaluating the number of calls and the latency of the calls with one approach and the other. It might also be good to have the Namenode metrics here. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822359#comment-16822359 ] Ayush Saxena commented on HDFS-14440: - Thanx [~elgoiri] for the opinions. It isn't a issue that fixes or does some changes to the functionality, so I am absolutely OK, if you think it isn't worth to have. On a thought, This just tend to save some time efforts in the file write process, I am not sure a WRITE operation is that critical or not, but READ/WRITE are an elementary operation for every FileSystem, everything revolves around. After the change I didn't find any case where the time consumed in general more than the time which was consumed earlier, Yes, there would be extra RPC in case of failure where last block was present, if thats not present we go for additional getfileInfo calls too. If thats the case still we shall be in a better time value than the previous ones. Now if we consider normal writes, i.e where a file write is success and most of our write gets succeeded, we would see in our practical deployments, the number of writes succeeded would be more as compared to the writes failing for File Already exist. So here we had to check all the namespaces for the availability of files, this is an operation we don't actually do when we directly talk to the namenode, or there is a single destination, this is an overhead as part of our multi-destination framework, and most importantly it is a required OP, we can't go away without checking. This time taken is proportional as of now to the number of Namespaces, N times. This only we optimized to 1 time unit and made independent of the number of namespaces, The scalability that we achieve is by spawning a new namespace and having it as an extra destination, Just wanted to make sure that it should incur least or no cost for the elementary process. It is something won't be that much observed if the networks and processing is too fast, but on average cases, it shows up. I might have seen it from a far away distance, Let me know if it interests you.:) > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822031#comment-16822031 ] Íñigo Goiri commented on HDFS-14440: I think we should try to exploit hashing instead of invoking everywhere. There are two cases here: * Old approach: ** The file already exists: we check in the one that we expect it to be. This is 1 RPC calls. ** The file is new: we go over all the subclusters one by one. This is N RPC calls in N times. * The approach in [^HDFS-14440-HDFS-13891-01.patch]: ** The file already exists: we check everywhere. This is N RPC calls in 1 time. ** The file is new: we check everywhere. This is N RPC calls in 1 time. I personally prefer the old approach. The new one reduces the time for an operation that is not that critical. There could even be a hybrid where we check in the one we expect and then concurrently in the rest. Are you seeing this as a bottleneck? > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821550#comment-16821550 ] Ayush Saxena commented on HDFS-14440: - Thanx [~elgoiri] I have uploaded the patch. {quote}Keep in mind that the sequential invoke relies on the idea that in most case, the files will be in the first location. {quote} If the file exists. Then the actual write won't happen, that is we shall land up getting exception from the namenode. The case where write happens shall be when file isn't present in neither of the destinations, for which all destinations shall be checked, then only conclusion can be reached.(This we can do it in parallel) And in general we should aim to optimize the working flow, as compared to the failure case. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-14440-HDFS-13891-01.patch > > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14440) RBF: Optimize the file write process in case of multiple destinations.
[ https://issues.apache.org/jira/browse/HDFS-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821386#comment-16821386 ] Íñigo Goiri commented on HDFS-14440: Thanks [~ayushtkn], I'd like to see what the code changes would look like. Keep in mind that the sequential invoke relies on the idea that in most case, the files will be in the first location. The code is optimized for this. > RBF: Optimize the file write process in case of multiple destinations. > -- > > Key: HDFS-14440 > URL: https://issues.apache.org/jira/browse/HDFS-14440 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > > In case of multiple destinations, We need to check if the file already exists > in one of the subclusters for which we use the existing getBlockLocation() > API which is by default a sequential Call, > In an ideal scenario where the file needs to be created each subcluster shall > be checked sequentially, this can be done concurrently to save time. > In another case where the file is found and if the last block is null, we > need to do getFileInfo to all the locations to get the location where the > file exists. This also can be prevented by use of ConcurrentCall since we > shall be having the remoteLocation to where the getBlockLocation returned a > non null entry. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org