[GitHub] [hudi] KarthickAN edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
KarthickAN edited a comment on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-710747888 @nsivabalan I tried out Dynamic filter. It seems to be fine. It's growing along with the number of entries dynamically. That's a good feature. Thanks. However what's

[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
KarthickAN commented on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-710747888 @nsivabalan I tried out Dynamic filter. It seems to be fine. It's growing along with the number of entries dynamically. That's a good feature. Thanks. However what's the

[GitHub] [hudi] bvaradar commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
bvaradar commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-710713850 @ashishmgofficial : If I need to test with Kafka, would need a way to generate both Key and Value payload. Do you have some script to publish records to Kafka ? BTW, yeah, you are

[jira] [Created] (HUDI-1346) Fix clean and Asyn Clean when metadata table is enabled

2020-10-16 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1346: Summary: Fix clean and Asyn Clean when metadata table is enabled Key: HUDI-1346 URL: https://issues.apache.org/jira/browse/HUDI-1346 Project: Apache Hudi

[jira] [Updated] (HUDI-1346) Fix clean and Asyn Clean when metadata table is enabled

2020-10-16 Thread Prashant Wason (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1346: - Status: Open (was: New) > Fix clean and Asyn Clean when metadata table is enabled >

[jira] [Updated] (HUDI-1346) Fix clean and Asyn Clean when metadata table is enabled

2020-10-16 Thread Prashant Wason (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1346: - Status: In Progress (was: Open) > Fix clean and Asyn Clean when metadata table is enabled >

[GitHub] [hudi] bvaradar commented on issue #2162: [SUPPORT] Deltastreamer transform cannot add fields

2020-10-16 Thread GitBox
bvaradar commented on issue #2162: URL: https://github.com/apache/hudi/issues/2162#issuecomment-710699297 @liujinhui1994 : Did this work ? This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [hudi] bvaradar commented on issue #2180: [SUPPORT] Unable to read MERGE ON READ table with Snapshot option using Databricks.

2020-10-16 Thread GitBox
bvaradar commented on issue #2180: URL: https://github.com/apache/hudi/issues/2180#issuecomment-710699046 @rahulpoptani : Would it be possible to test with OSS spark version and read the snapshot to verify ? This is an

[GitHub] [hudi] codecov-io commented on pull request #2185: [HUDI-1345] Remove Hbase and htrace relocation from utilities bundle

2020-10-16 Thread GitBox
codecov-io commented on pull request #2185: URL: https://github.com/apache/hudi/pull/2185#issuecomment-710695585 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2185?src=pr=h1) Report > Merging [#2185](https://codecov.io/gh/apache/hudi/pull/2185?src=pr=desc) into

[GitHub] [hudi] codecov-io edited a comment on pull request #2185: [HUDI-1345] Remove Hbase and htrace relocation from utilities bundle

2020-10-16 Thread GitBox
codecov-io edited a comment on pull request #2185: URL: https://github.com/apache/hudi/pull/2185#issuecomment-710695585 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2185?src=pr=h1) Report > Merging [#2185](https://codecov.io/gh/apache/hudi/pull/2185?src=pr=desc) into

[jira] [Updated] (HUDI-1345) undo Hbase and htrace relocation in hudi-utilities bundle as well

2020-10-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1345: - Labels: pull-request-available (was: ) > undo Hbase and htrace relocation in hudi-utilities

[GitHub] [hudi] bhasudha opened a new pull request #2185: [HUDI-1345] Remove Hbase and htrace relocation from utilities bundle

2020-10-16 Thread GitBox
bhasudha opened a new pull request #2185: URL: https://github.com/apache/hudi/pull/2185 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the

[jira] [Updated] (HUDI-1345) undo Hbase and htrace relocation in Hudi-utilities bundle as well

2020-10-16 Thread Bhavani Sudha (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha updated HUDI-1345: Summary: undo Hbase and htrace relocation in Hudi-utilities bundle as well (was: undo base and

[jira] [Updated] (HUDI-1345) undo Hbase and htrace relocation in hudi-utilities bundle as well

2020-10-16 Thread Bhavani Sudha (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha updated HUDI-1345: Summary: undo Hbase and htrace relocation in hudi-utilities bundle as well (was: undo Hbase and

[jira] [Created] (HUDI-1345) undo base and htrace relocation in Hudi-utilities bundle as well

2020-10-16 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-1345: --- Summary: undo base and htrace relocation in Hudi-utilities bundle as well Key: HUDI-1345 URL: https://issues.apache.org/jira/browse/HUDI-1345 Project: Apache Hudi

[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

2020-10-16 Thread GitBox
vinothchandar commented on issue #1694: URL: https://github.com/apache/hudi/issues/1694#issuecomment-710520474 So to clarify, GLOBAL_SIMPLE helps when the workload is random writes and affecting every file for e.g in each write. But it is indeed slow in the sense, it ll join against the
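The GLOBAL_SIMPLE behaviour described above maps to a couple of writer configs. A minimal sketch, not taken from this thread (option names are standard Hudi configs; the values are illustrative):

```python
# Hypothetical sketch: Hudi write options selecting the GLOBAL_SIMPLE index
# discussed above. A global index enforces key uniqueness across all partitions
# (useful for random-write workloads touching every file on each write), at the
# cost of joining incoming records against all existing files.
global_simple_options = {
    "hoodie.index.type": "GLOBAL_SIMPLE",
    # with a global index, allow an update to move a record to a new partition
    "hoodie.simple.index.update.partition.path": "true",
}

print(sorted(global_simple_options))
```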

[GitHub] [hudi] nsivabalan commented on pull request #2092: [HUDI-1285] Fix merge on read DAG to make docker demo pass

2020-10-16 Thread GitBox
nsivabalan commented on pull request #2092: URL: https://github.com/apache/hudi/pull/2092#issuecomment-710060014 LGTM. Do fix the title and description since you have fixed the rollback as well. Once you are done, let me know. or feel free to go ahead and merge it.

[GitHub] [hudi] nsivabalan commented on a change in pull request #2092: [HUDI-1285] Fix merge on read DAG to make docker demo pass

2020-10-16 Thread GitBox
nsivabalan commented on a change in pull request #2092: URL: https://github.com/apache/hudi/pull/2092#discussion_r506441500 ## File path: hudi-integ-test/src/test/resources/unit-test-cow-dag.yaml ## @@ -17,23 +17,53 @@ first_insert: config: record_size: 7

[GitHub] [hudi] nsivabalan edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
nsivabalan edited a comment on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-710031904 If you wish to scale the bloom filter size along with the number of entries, you can try out dynamic bloom filter. Remember this is different from hoodie.index.type which

[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
ashishmgofficial edited a comment on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-710034993 I followed these steps : ``` - Took fresh clone of release-0.6.0 branch - applied the patch provided - build and used the jar to run below commands

[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
ashishmgofficial commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-710034993 AvroKafkaSource : ``` spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.4 --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer

[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
ashishmgofficial edited a comment on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-710034993 **AvroKafkaSource** : ``` spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.4 --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer

[GitHub] [hudi] nsivabalan edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
nsivabalan edited a comment on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-710031904 If you wish to have dynamic bloom filter that scales its size as the number of entries increase, you can try it out. Remember this is different from hoodie.index.type

[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
nsivabalan commented on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-710031904 If you wish to have a dynamic bloom filter that scales its size as the number of entries increases, you can try it out. Remember this is different from hoodie.index.type which refers
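The dynamic bloom filter mentioned here is selected through writer configs rather than hoodie.index.type. A hedged sketch of the relevant options (the keys are standard Hudi configs; the values are placeholders, not taken from the thread):

```python
# Hypothetical sketch: Hudi writer options that switch the per-file bloom
# filter from the fixed-size SIMPLE variant to DYNAMIC_V0, which grows with
# the number of entries. Values below are illustrative only.
hudi_bloom_options = {
    # index type stays BLOOM; this is separate from the filter implementation
    "hoodie.index.type": "BLOOM",
    # filter implementation serialized into each parquet file
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
    # initial sizing hint; the dynamic filter grows beyond this as needed
    "hoodie.index.bloom.num_entries": "60000",
    # target false positive probability
    "hoodie.index.bloom.fpp": "0.000000001",
    # upper bound on entries the dynamic filter will grow to accommodate
    "hoodie.bloom.index.filter.dynamic.max.entries": "100000",
}

print(sorted(hudi_bloom_options))
```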

[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
ashishmgofficial edited a comment on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-710023639 @bvaradar Isn't the ``` --source-ordering-field _ts_ms ``` Then precombine should be looking at _ts_ms for deletion, right? Delete worked fine for me as

[GitHub] [hudi] nsivabalan commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
nsivabalan commented on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-710029348 yes, you are right. The bit size used to initialize the bloom filter is a function of both numEntries and fpp: (int) Math.ceil(numEntries * (-Math.log(errorRate) / (Math.log(2) *
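The truncated Java expression above is the standard bloom filter sizing formula, m = ceil(n * -ln(p) / (ln 2)^2). A small Python sketch to reproduce it (the 60000 / 1e-9 inputs are illustrative, taken from Hudi's documented defaults):

```python
import math

def bloom_bit_size(num_entries: int, error_rate: float) -> int:
    """Bits needed for a bloom filter, mirroring the Java expression quoted
    above: ceil(numEntries * (-ln(errorRate) / (ln 2)^2))."""
    return math.ceil(num_entries * (-math.log(error_rate) / (math.log(2) ** 2)))

# With 60000 entries and fpp 1e-9, the filter needs roughly 2.6 million bits
# (a few hundred KB), which is how the bloom filter entry in each parquet
# footer can grow large under aggressive settings.
print(bloom_bit_size(60000, 1e-9))
```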

[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
ashishmgofficial edited a comment on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-710023639 @bvaradar Isn't the ``` --source-ordering-field _ts_ms ``` Then precombine should be looking at _ts_ms for deletion, right? I checked the same scenario

[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
ashishmgofficial commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-710023639 @bvaradar Isn't the ``` --source-ordering-field _ts_ms ``` Then precombine should be looking at _ts_ms for deletion, right?

[GitHub] [hudi] spyzzz commented on issue #2175: [SUPPORT] HUDI MOR/COW tuning with spark structured streaming

2020-10-16 Thread GitBox
spyzzz commented on issue #2175: URL: https://github.com/apache/hudi/issues/2175#issuecomment-71914 @naka13 Yes I will, but I'd like to make something cleaner first. Still, my avro deserialisation is taking 80% of my spark job time ... Dunno yet if there is a way

[GitHub] [hudi] lw309637554 commented on a change in pull request #2177: [HUDI-307] add test to check data type write and read consistent

2020-10-16 Thread GitBox
lw309637554 commented on a change in pull request #2177: URL: https://github.com/apache/hudi/pull/2177#discussion_r506287585 ## File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala ## @@ -194,4 +199,31 @@ class TestCOWDataSource extends

[GitHub] [hudi] leesf commented on a change in pull request #2177: [HUDI-307] add test to check data type write and read consistent

2020-10-16 Thread GitBox
leesf commented on a change in pull request #2177: URL: https://github.com/apache/hudi/pull/2177#discussion_r506276147 ## File path: hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala ## @@ -194,4 +199,31 @@ class TestCOWDataSource extends

[GitHub] [hudi] naka13 commented on issue #2175: [SUPPORT] HUDI MOR/COW tuning with spark structured streaming

2020-10-16 Thread GitBox
naka13 commented on issue #2175: URL: https://github.com/apache/hudi/issues/2175#issuecomment-709882868 @spyzzz Would it be possible for you to share the complete code? It'll be really helpful for others This is an

[GitHub] [hudi] LeoHsu0802 opened a new issue #2184: [SUPPORT] partition value be duplicated after UPSERT

2020-10-16 Thread GitBox
LeoHsu0802 opened a new issue #2184: URL: https://github.com/apache/hudi/issues/2184 Describe the problem you faced partition value be duplicated after UPSERT **Setting in Jupyter Notebook** ``` %%configure -f { "conf": { "spark.jars":

[GitHub] [hudi] codecov-io edited a comment on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-10-16 Thread GitBox
codecov-io edited a comment on pull request #2111: URL: https://github.com/apache/hudi/pull/2111#issuecomment-708984716 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2111?src=pr=h1) Report > Merging [#2111](https://codecov.io/gh/apache/hudi/pull/2111?src=pr=desc) into

[GitHub] [hudi] spyzzz commented on issue #2175: [SUPPORT] HUDI MOR/COW tuning with spark structured streaming

2020-10-16 Thread GitBox
spyzzz commented on issue #2175: URL: https://github.com/apache/hudi/issues/2175#issuecomment-709876261 After some deep research i finally found something. I first try to do only a read and write without any transformation and its was way faster (around 500K in 30s) so i tried step by

[GitHub] [hudi] KarthickAN edited a comment on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
KarthickAN edited a comment on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-709875380 @bvaradar @nsivabalan I did run some test around this issue. So I ran the job after setting the config hoodie.index.bloom.num_entries to 150 and inspected the file

[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
KarthickAN commented on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-709875380 @nsivabalan I did run some test around this issue. So I ran the job after setting the config hoodie.index.bloom.num_entries to 150 and inspected the file produced. There are

[GitHub] [hudi] bvaradar commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
bvaradar commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-709869418 BTW, it looks like both create and delete have the same last_modified_ts which means that precombine would not have deleted the records. Is this fake data ? If so, can you set the

[GitHub] [hudi] bvaradar commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-16 Thread GitBox
bvaradar commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-709868001 @ashishmgofficial : With your provided avro file, I am able to ingest without any errors. ``` spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.4 --class
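The spark-submit invocation above is truncated; a DeltaStreamer run like this is typically driven by a properties file. A hypothetical sketch (the keys are standard Hudi/Kafka configs; the topic, hosts, and field names are placeholders, not taken from the thread):

```properties
# Hypothetical kafka-source.properties sketch for a DeltaStreamer run like the
# one quoted above; values are placeholders.
hoodie.deltastreamer.source.kafka.topic=dbserver1.inventory.customers
bootstrap.servers=localhost:9092
auto.offset.reset=earliest
schema.registry.url=http://localhost:8081
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.precombine.field=_ts_ms
```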

[GitHub] [hudi] bvaradar commented on issue #2174: [SUPPORT] Auto-clean doesn't work

2020-10-16 Thread GitBox
bvaradar commented on issue #2174: URL: https://github.com/apache/hudi/issues/2174#issuecomment-709864864 @halkar : Yes, https://issues.apache.org/jira/browse/HUDI-845 tracks it This is an automated message from the Apache

[GitHub] [hudi] bvaradar closed issue #2174: [SUPPORT] Auto-clean doesn't work

2020-10-16 Thread GitBox
bvaradar closed issue #2174: URL: https://github.com/apache/hudi/issues/2174 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[jira] [Commented] (HUDI-845) Allow parallel writing and move the pending rollback work into cleaner

2020-10-16 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215226#comment-17215226 ] Balaji Varadarajan commented on HUDI-845: - Yes [~309637554]. this ticket is for tracking general

[GitHub] [hudi] halkar commented on issue #2174: [SUPPORT] Auto-clean doesn't work

2020-10-16 Thread GitBox
halkar commented on issue #2174: URL: https://github.com/apache/hudi/issues/2174#issuecomment-709830826 @bvaradar thanks for confirming. Are there any plans to support concurrent writes? I'll try to change the logic to not do concurrent writes.

[GitHub] [hudi] KarthickAN commented on issue #2178: [SUPPORT] Hudi writing 10MB worth of org.apache.hudi.bloomfilter data in each of the parquet files produced

2020-10-16 Thread GitBox
KarthickAN commented on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-709817321 @nsivabalan Please find below my answers 1. That's the average record size. I inspected the parquet files produced and calculated that based on the metrics I found there.