[GitHub] [incubator-hudi] NetsanetGeb edited a comment on issue #714: Performance Comparison of HoodieDeltaStreamer and DataSourceAPI

2019-07-31 Thread GitBox

NetsanetGeb edited a comment on issue #714: Performance Comparison of
HoodieDeltaStreamer and DataSourceAPI
URL: https://github.com/apache/incubator-hudi/issues/714#issuecomment-516753477

After I switched to Hoodie version 0.4.6, performance improved and the job
now takes 4 minutes.

![per2](https://user-images.githubusercontent.com/25975892/62195353-1c158600-b37c-11e9-905f-0b04213e614f.png)

I also added a countByKey call, similar to the one in HoodieBloomIndex, to
the HoodieDeltaStreamer class to count the records and check why it takes so
long in HoodieBloomIndex. That call took about 9 seconds, while the
countByKey in HoodieBloomIndex still takes 39 seconds. The difference seems
to come from parallelism: the first countByKey runs with a parallelism of 22,
while the one in HoodieBloomIndex runs with 2, as seen in the Spark UI below.

![per1](https://user-images.githubusercontent.com/25975892/62196336-1caf1c00-b37e-11e9-89f1-894387485ec7.png)
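
To illustrate why the partition count matters here, this is a minimal sketch in plain Python. It is not Spark or Hudi code: `count_by_key` and `busy_slots` are hypothetical helpers, and the round-robin split only stands in for however the data is really partitioned. It shows that the counts do not depend on the number of partitions, but the number of tasks that can run at once does.

```python
from collections import Counter


def count_by_key(pairs, num_partitions):
    """Count how often each key occurs, computing one partial count per partition."""
    # Assumption: a round-robin split stands in for the real partitioning.
    partitions = [pairs[i::num_partitions] for i in range(num_partitions)]
    # One partial Counter per partition, i.e. one independent task each.
    partials = [Counter(key for key, _ in part) for part in partitions]
    # Merge the partial counts into the final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def busy_slots(num_partitions, executor_slots):
    """Upper bound on tasks running concurrently: one task per partition at most."""
    return min(num_partitions, executor_slots)
```

With the numbers from this run, `busy_slots(22, 90)` is 22 and `busy_slots(2, 90)` is 2, so in this simplified model a stage with 2 partitions leaves most of the available executors idle no matter how many are provided.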

The effect becomes clear when the input size is increased from 2 GB to 27 GB.
Stages 2, 3, and 4 started with the 90 executors provided and then scaled
down accordingly, while stage 5 ran on only 2 executors from the start.

![per3](https://user-images.githubusercontent.com/25975892/62214909-3f552b00-b3a6-11e9-92b5-df197378795d.png)

How can we increase the parallelism of the bloom index, given that Hoodie
computes it internally rather than taking it from a configuration?
More generally, are there specific ways to improve the performance of bloom
indexing?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
