[ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065030#comment-17065030 ]
lamber-ken edited comment on HUDI-686 at 3/24/20, 5:41 AM:
-----------------------------------------------------------

Right, this is a nice design. Some thoughts:
* If the input data is large, the number of partitions needs to be increased, because "candidates" holds all the data for each partition.
* If the number of partitions is increased, the same partition gets loaded more than once (e.g. populateFileIDs() and populateRangeAndBloomFilters()).

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]

{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD, JavaSparkContext jsc,
    HoodieTable<T> hoodieTable) {
  return recordRDD
      .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
          true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}

{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}
{code}

was (Author: lamber-ken):
Right, this is a nice design. Some thoughts:
* If the input data is large, the number of partitions needs to be increased, because "candidates" holds all the partition data.
* If the number of partitions is increased, the same partition gets loaded more than once (e.g. populateFileIDs() and populateRangeAndBloomFilters()).

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]

{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD, JavaSparkContext jsc,
    HoodieTable<T> hoodieTable) {
  return recordRDD
      .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
          true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}

{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}
{code}

> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Index, Performance
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced
> optimizations like auto-tuned parallelism/skew handling, but with a better
> out-of-box experience for small workloads.
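
The second concern in the comment above (the same Hudi partition being loaded by several tasks once parallelism goes up) can be illustrated with a small simulation that does not use Spark. This is only a sketch under stated assumptions: the class and method names (DuplicateLoadSketch, loadsForTask, totalLoads) are hypothetical, the sorted records are cut into equally sized contiguous slices as a stand-in for how a range-partitioned sort might distribute them, and a counter stands in for the actual populateFileIDs() and populateRangeAndBloomFilters() calls. Only the guard condition is taken from the quoted initIfNeeded().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical, Spark-free sketch; not part of the Hudi code base.
class DuplicateLoadSketch {

    // Counts how many times one task would (re)load partition metadata
    // while iterating over the partition paths of its slice of records.
    public static int loadsForTask(List<String> partitionPathsInTask) {
        String currentPartitionPath = null;
        int loads = 0;
        for (String partitionPath : partitionPathsInTask) {
            // Same guard as the quoted initIfNeeded(): load only on a path change.
            if (!Objects.equals(partitionPath, currentPartitionPath)) {
                currentPartitionPath = partitionPath;
                loads++; // stands in for populateFileIDs() + populateRangeAndBloomFilters()
            }
        }
        return loads;
    }

    // Assumption: sorted records are cut into `parallelism` contiguous,
    // roughly equal slices, one per task. Returns the loads summed over all tasks.
    public static int totalLoads(List<String> sortedPartitionPaths, int parallelism) {
        int n = sortedPartitionPaths.size();
        int total = 0;
        for (int task = 0; task < parallelism; task++) {
            int from = (int) ((long) n * task / parallelism);
            int to = (int) ((long) n * (task + 1) / parallelism);
            total += loadsForTask(sortedPartitionPaths.subList(from, to));
        }
        return total;
    }

    public static void main(String[] args) {
        // Twelve records spread evenly over two Hudi partition paths, already sorted.
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 6; i++) paths.add("2020/03/23");
        for (int i = 0; i < 6; i++) paths.add("2020/03/24");

        for (int parallelism : new int[] {1, 2, 3, 4}) {
            System.out.println("parallelism=" + parallelism
                + " loads=" + totalLoads(paths, parallelism));
        }
    }
}
```

In this toy setup the two partition paths are loaded twice in total with a parallelism of 1 or 2, but four times with a parallelism of 3 or 4, because each task starts with no current partition path and has to load again whatever path its slice begins with.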