[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Impala Public Jenkins has submitted this change and it was merged. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. IMPALA-4847: Simplify HdfsTable block metadata loading code This commit is a part of ground work for the upcoming multi threaded block metadata loading patches. The patch for IMPALA-4172 introduced code that groups the block location requests for partition directories that reside under the table directory into a single call to the NN in order to reduce the number of RPCs. However, it turns out that the hdfs client library internally makes one RPC per directory thus defeating the purpose of optimization. Also, this made the code unnecessarily complex since we need to map each file to its corresponding partition at runtime. This patch undos that optimization and makes HDFS calls per partition, which is much easier to understand. This also helps the upcoming patch on multi threaded block metadata loading since there is much less shared state when loading multiple partitions in parallel. Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Reviewed-on: http://gerrit.cloudera.org:8080/7652 Reviewed-by: Bharath VissapragadaTested-by: Impala Public Jenkins --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java 2 files changed, 60 insertions(+), 180 deletions(-) Approvals: Impala Public Jenkins: Verified Bharath Vissapragada: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath Vissapragada Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 4: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 4: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1158/ -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Bharath Vissapragada has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 4: Code-Review+2 Here is the perf analysis of this jira [1]. This regresses the S3 metadata loading performance by ~25% but we decide to make it up with multi threaded loading in the subsequent patch. I discussed this with Alex, Dimitris and Mostafa and they are ok with this change not being included in 2.10.0. Given 2.10.0 is branched, now, I'm submitting this for a GVO. Rebased, removed some dead code, carrying +2. [1] https://docs.google.com/spreadsheets/d/1yD47O2aIqqWP6UOBQdemdWpK-sGSpK8eyvR4HBSYUyw/edit?usp=sharing -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Hello Dimitris Tsirogiannis, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7652 to look at the new patch set (#4). Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. IMPALA-4847: Simplify HdfsTable block metadata loading code This commit is a part of ground work for the upcoming multi threaded block metadata loading patches. The patch for IMPALA-4172 introduced code that groups the block location requests for partition directories that reside under the table directory into a single call to the NN in order to reduce the number of RPCs. However, it turns out that the hdfs client library internally makes one RPC per directory thus defeating the purpose of optimization. Also, this made the code unnecessarily complex since we need to map each file to its corresponding partition at runtime. This patch undos that optimization and makes HDFS calls per partition, which is much easier to understand. This also helps the upcoming patch on multi threaded block metadata loading since there is much less shared state when loading multiple partitions in parallel. Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java 2 files changed, 60 insertions(+), 180 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/7652/4 -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis
Re: [Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Sounds good to me. Dimitris On Fri, Aug 11, 2017 at 2:21 PM, Bharath Vissapragada (Code Review) < ger...@cloudera.org> wrote: > Bharath Vissapragada has posted comments on this change. > > Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code > .. > > > Patch Set 3: > > Thanks Dimitris for the quick reviews. I'll run the metadata benchmark > again just to make sure nothing regressed. I'll run GVO after that. > > -- > To view, visit http://gerrit.cloudera.org:8080/7652 > To unsubscribe, visit http://gerrit.cloudera.org:8080/settings > > Gerrit-MessageType: comment > Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf > Gerrit-PatchSet: 3 > Gerrit-Project: Impala-ASF > Gerrit-Branch: master > Gerrit-Owner: Bharath Vissapragada> Gerrit-Reviewer: Alex Behm > Gerrit-Reviewer: Bharath Vissapragada > Gerrit-Reviewer: Dimitris Tsirogiannis > Gerrit-HasComments: No >
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Bharath Vissapragada has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 3: Thanks Dimitris for the quick reviews. I'll run the metadata benchmark again just to make sure nothing regressed. I'll run GVO after that. -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Bharath Vissapragada has uploaded a new patch set (#3). Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. IMPALA-4847: Simplify HdfsTable block metadata loading code This commit is a part of ground work for the upcoming multi threaded block metadata loading patches. The patch for IMPALA-4172 introduced code that groups the block location requests for partition directories that reside under the table directory into a single call to the NN in order to reduce the number of RPCs. However, it turns out that the hdfs client library internally makes one RPC per directory thus defeating the purpose of optimization. Also, this made the code unnecessarily complex since we need to map each file to its corresponding partition at runtime. This patch undos that optimization and makes HDFS calls per partition, which is much easier to understand. This also helps the upcoming patch on multi threaded block metadata loading since there is much less shared state when loading multiple partitions in parallel. Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java 1 file changed, 60 insertions(+), 154 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/7652/3 -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 2: (3 comments) Yay, more lined deleted :) http://gerrit.cloudera.org:8080/#/c/7652/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: PS2, Line 283: fsBlockLocationSupport nit: maybe rename to synthesizeBlocks? synthesizeBlocks = !FileSystemUtil.supporsStorageIds(fs); PS2, Line 616: fsBlockLocationSupport same comment as above PS2, Line 645: Preconditions.checkNotNull(fd); move this above L647 -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Bharath Vissapragada has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/7652/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: PS1, Line 252: for all partitions in 'partitions' > nit: "of all the 'partitions' that ..." Done Line 258:* and disk IDs. > Expand this to mention that we synthesize block metadata for FS that don't Done PS1, Line 331: if (partitions == null || partitions.isEmpty()) return; : RemoteIterator fileStatusIter = : FileSystemUtil.listFiles(fs, partDir, false); : if (fileStatusIter == null) return; : while (fileStatusIter.hasNext()) { : LocatedFileStatus fileStatus = fileStatusIter.next(); : if (!FileSystemUtil.isValidDataFile(fileStatus)) continue; : // For the purpose of synthesizing block metadata, we assume that all partitions : // with the same location have the same file format. : FileDescriptor fd = FileDescriptor.createWithSynthesizedBlockMd(fileStatus, : partitions.get(0).getFileFormat(), hostIndex_); : // Update the partitions' metadata that this file belongs to. : for (HdfsPartition partition: partitions) { : partition.getFileDescriptors().add(fd); : } : } > With the exception of L340, this function resembles loadBlockMetadata. Mayb Did some refactoring. No more synthesize* methods in HdfsTable now. LMK if it looks ok. PS1, Line 581: Path tblLocation = FileSystemUtil.createFullyQualifiedPath(getHdfsBaseDirPath()); > It looks like this is only used in L590. Maybe move it closer to that. Done PS1, Line 695: .keys() > remove Done -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Bharath Vissapragada has uploaded a new patch set (#2). Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. IMPALA-4847: Simplify HdfsTable block metadata loading code This commit is a part of ground work for the upcoming multi threaded block metadata loading patches. The patch for IMPALA-4172 introduced code that groups the block location requests for partition directories that reside under the table directory into a single call to the NN in order to reduce the number of RPCs. However, it turns out that the hdfs client library internally makes one RPC per directory thus defeating the purpose of optimization. Also, this made the code unnecessarily complex since we need to map each file to its corresponding partition at runtime. This patch undos that optimization and makes HDFS calls per partition, which is much easier to understand. This also helps the upcoming patch on multi threaded block metadata loading since there is much less shared state when loading multiple partitions in parallel. Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java 1 file changed, 60 insertions(+), 154 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/7652/2 -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dimitris Tsirogiannis
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/7652/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: PS1, Line 252: for all partitions in 'partitions' nit: "of all the 'partitions' that ..." Line 258:* and disk IDs. Expand this to mention that we synthesize block metadata for FS that don't support disk ids. Also, mention (maybe in the beginning) that 'pathDir' could be in HDFS, S3 and ADLS. PS1, Line 331: if (partitions == null || partitions.isEmpty()) return; : RemoteIterator fileStatusIter = : FileSystemUtil.listFiles(fs, partDir, false); : if (fileStatusIter == null) return; : while (fileStatusIter.hasNext()) { : LocatedFileStatus fileStatus = fileStatusIter.next(); : if (!FileSystemUtil.isValidDataFile(fileStatus)) continue; : // For the purpose of synthesizing block metadata, we assume that all partitions : // with the same location have the same file format. : FileDescriptor fd = FileDescriptor.createWithSynthesizedBlockMd(fileStatus, : partitions.get(0).getFileFormat(), hostIndex_); : // Update the partitions' metadata that this file belongs to. : for (HdfsPartition partition: partitions) { : partition.getFileDescriptors().add(fd); : } : } With the exception of L340, this function resembles loadBlockMetadata. Maybe see if there is a nice way to merge these two. PS1, Line 581: Path tblLocation = FileSystemUtil.createFullyQualifiedPath(getHdfsBaseDirPath()); It looks like this is only used in L590. Maybe move it closer to that. PS1, Line 695: .keys() remove -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath VissapragadaGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4847: Simplify HdfsTable block metadata loading code
Bharath Vissapragada has uploaded a new change for review. http://gerrit.cloudera.org:8080/7652 Change subject: IMPALA-4847: Simplify HdfsTable block metadata loading code .. IMPALA-4847: Simplify HdfsTable block metadata loading code This commit is a part of ground work for the upcoming multi threaded block metadata loading patches. The patch for IMPALA-4172 introduced code that groups the block location requests for partition directories that reside under the table directory into a single call to the NN in order to reduce the number of RPCs. However, it turns out that the hdfs client library internally makes one RPC per directory thus defeating the purpose of optimization. Also, this made the code unnecessarily complex since we need to map each file to its corresponding partition at runtime. This patch undos that optimization and makes HDFS calls per partition, which is much easier to understand. This also helps the upcoming patch on multi threaded block metadata loading since there is much less shared state when loading multiple partitions in parallel. Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf --- M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java 1 file changed, 36 insertions(+), 105 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/7652/1 -- To view, visit http://gerrit.cloudera.org:8080/7652 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I963d647bd2ba11e3843c6ef2ac6be113c74280bf Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath Vissapragada