[jira] [Comment Edited] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

lamber-ken (Jira) Mon, 23 Mar 2020 22:42:08 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065030#comment-17065030
 ]


lamber-ken edited comment on HUDI-686 at 3/24/20, 5:41 AM:
-----------------------------------------------------------

Right, this is a nice design. Some thoughts:
 * If the input data is large, the number of partitions needs to be increased, since "candidates" 
holds all the data for each partition.
 * If the number of partitions is increased, the same partition will be loaded more than once 
(e.g. populateFileIDs() && populateRangeAndBloomFilters()).

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
                                            JavaSparkContext jsc,
                                            HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s",
          record.getPartitionPath(), record.getRecordKey()),
      true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}{code}
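
To illustrate the second point, here is a minimal sketch, not Hudi code. It assumes the sorted records are cut into equal-sized chunks, one per Spark partition, which is a simplification of what sortBy's range partitioning actually does. PartitionLoadSketch, CountingChecker and countLoads are hypothetical names; the counter stands in for the populateFileIDs() / populateRangeAndBloomFilters() calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

final class PartitionLoadSketch {

  /** Hypothetical stand-in for a checker: counts metadata loads instead of reading files. */
  static final class CountingChecker {
    private String currentPartitionPath;
    private final Map<String, Integer> loads;

    CountingChecker(Map<String, Integer> loads) {
      this.loads = loads;
    }

    // Same shape as initIfNeeded() above: reload whenever the partition path changes.
    void initIfNeeded(String partitionPath) {
      if (!Objects.equals(partitionPath, currentPartitionPath)) {
        currentPartitionPath = partitionPath;
        loads.merge(partitionPath, 1, Integer::sum);
      }
    }
  }

  /**
   * Splits the sorted partition paths into equal chunks (assumption: one chunk per
   * Spark partition) and returns how many times each partition path was loaded.
   */
  static Map<String, Integer> countLoads(List<String> sortedPartitionPaths, int numSparkPartitions) {
    if (numSparkPartitions <= 0) {
      throw new IllegalArgumentException("numSparkPartitions must be positive");
    }
    Map<String, Integer> loads = new TreeMap<>();
    int n = sortedPartitionPaths.size();
    int chunk = (n + numSparkPartitions - 1) / numSparkPartitions; // ceiling division
    for (int start = 0; start < n; start += chunk) {
      // Each Spark partition gets its own checker, so nothing is shared across chunks.
      CountingChecker checker = new CountingChecker(loads);
      for (String path : sortedPartitionPaths.subList(start, Math.min(n, start + chunk))) {
        checker.initIfNeeded(path);
      }
    }
    return loads;
  }

  public static void main(String[] args) {
    List<String> paths = new ArrayList<>(Collections.nCopies(6, "2020/03/23"));
    paths.addAll(Collections.nCopies(2, "2020/03/24"));
    System.out.println(countLoads(paths, 1)); // prints {2020/03/23=1, 2020/03/24=1}
    System.out.println(countLoads(paths, 4)); // prints {2020/03/23=3, 2020/03/24=1}
  }
}
```

With a single Spark partition each partition path is loaded once. With four, the larger partition path spans three chunks and is loaded three times, which is the duplicate loading described above.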


was (Author: lamber-ken):
Right, this is a nice design. Some thoughts:
 * If the input data is large, the number of partitions needs to be increased, since "candidates" 
holds all of the partition data.
 * If the number of partitions is increased, the same partition will be loaded more than once 
(e.g. populateFileIDs() && populateRangeAndBloomFilters()).


> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
>                 Key: HUDI-686
>                 URL: https://issues.apache.org/jira/browse/HUDI-686
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Index, Performance
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.6.0
>
>         Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
