[GitHub] [hudi] liujinhui1994 commented on a change in pull request #1984: [HUDI-1200] Fix NullPointerException, CustomKeyGenerator does not work

2020-08-20 Thread GitBox
liujinhui1994 commented on a change in pull request #1984: URL: https://github.com/apache/hudi/pull/1984#discussion_r474439210 ## File path: hudi-spark/src/main/java/org/apache/hudi/keygen/KeyGenerator.java ## @@ -41,7 +41,7 @@ private static final String STRUCT_NAME = "hood

[GitHub] [hudi] liujinhui1994 commented on a change in pull request #1968: [HUDI-1192] Make create hive database automatically configurable

2020-08-20 Thread GitBox
liujinhui1994 commented on a change in pull request #1968: URL: https://github.com/apache/hudi/pull/1968#discussion_r474437973 ## File path: hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java ## @@ -71,6 +71,9 @@ @Parameter(names = {"--use-jdbc"

[GitHub] [hudi] liujinhui1994 commented on a change in pull request #1968: [HUDI-1192] Make create hive database automatically configurable

2020-08-20 Thread GitBox
liujinhui1994 commented on a change in pull request #1968: URL: https://github.com/apache/hudi/pull/1968#discussion_r474436197 ## File path: hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java ## @@ -117,11 +117,13 @@ private void syncHoodieTable(Stri

[GitHub] [hudi] tooptoop4 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
tooptoop4 commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-678066269 > I understand that recently we made changes in Presto to use `Path Filter` instead. @umehrot2 was that fix made on prestosql too or just prestodb? I heard new EMR 6 in September

[GitHub] [hudi] tooptoop4 edited a comment on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
tooptoop4 edited a comment on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-678066269 > I understand that recently we made changes in Presto to use `Path Filter` instead. @umehrot2 was that fix made on prestosql too or just prestodb? I heard new EMR 6 in

[GitHub] [hudi] poiyyq removed a comment on issue #1999: What difference from spark and deltaStreamer? which more efficient?

2020-08-20 Thread GitBox
poiyyq removed a comment on issue #1999: URL: https://github.com/apache/hudi/issues/1999#issuecomment-678060189 I can use spark streaming or flink to consume kafka data , then write hoodie table with Spark DataSource, right? ---

[GitHub] [hudi] poiyyq commented on issue #1999: What difference from spark and deltaStreamer? which more efficient?

2020-08-20 Thread GitBox
poiyyq commented on issue #1999: URL: https://github.com/apache/hudi/issues/1999#issuecomment-678060098 I can use spark streaming or flink to consume kafka data , then write hoodie table with Spark DataSource, right? This i

[GitHub] [hudi] poiyyq commented on issue #1999: What difference from spark and deltaStreamer? which more efficient?

2020-08-20 Thread GitBox
poiyyq commented on issue #1999: URL: https://github.com/apache/hudi/issues/1999#issuecomment-678060189 I can use spark streaming or flink to consume kafka data , then write hoodie table with Spark DataSource, right? This i

[GitHub] [hudi] cdmikechen opened a new issue #2005: [SUPPORT] hudi hive-sync in master branch (0.6.1) can not run by spark with local

2020-08-20 Thread GitBox
cdmikechen opened a new issue #2005: URL: https://github.com/apache/hudi/issues/2005 **Describe the problem you faced** A clear and concise description of the problem. Hudi in master branch (0.6.1) can not use `hive-sync` to sync to hive with error ``` Caused by: java.

[GitHub] [hudi] bvaradar commented on issue #1985: [SUPPORT]Error while running deltastreamer on top of backfilled data using Hudi

2020-08-20 Thread GitBox
bvaradar commented on issue #1985: URL: https://github.com/apache/hudi/issues/1985#issuecomment-678044206 @piyushrl : The strategy would be to orchestrate this bootstrap and handoff in 3 steps 1. Copy the earliest checkpoint from kafka and save it after ensuring your upstream so

[GitHub] [hudi] bvaradar commented on issue #1977: Error running hudi on aws glue

2020-08-20 Thread GitBox
bvaradar commented on issue #1977: URL: https://github.com/apache/hudi/issues/1977#issuecomment-678039436 @umehrot2 : Assigned this support ticket to you as this is AWS specific. This is an automated message from the Apache

[GitHub] [hudi] bvaradar commented on issue #1999: What difference from spark and deltaStreamer? which more efficient?

2020-08-20 Thread GitBox
bvaradar commented on issue #1999: URL: https://github.com/apache/hudi/issues/1999#issuecomment-678038887 DeltaStreamer gives you ability to continuously ingest data from upstream sources such as kafka/DFS log files and other hoodie tables. It manages checkpoints as well. Yes, you can als

[GitHub] [hudi] bvaradar commented on issue #1980: [SUPPORT] Small files (423KB) generated after running delete query

2020-08-20 Thread GitBox
bvaradar commented on issue #1980: URL: https://github.com/apache/hudi/issues/1980#issuecomment-678036142 @jiegzhan : This is an upcoming feature to re-cluster records in files (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance). We a

[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-678031526 Do you not see a solution for this in the near future?

[GitHub] [hudi] wangxianghu commented on a change in pull request #1974: [HUDI-1186][DOC]Add description of write commit callback by kafka to document

2020-08-20 Thread GitBox
wangxianghu commented on a change in pull request #1974: URL: https://github.com/apache/hudi/pull/1974#discussion_r474401044 ## File path: docs/_docs/2_4_configurations.cn.md ## @@ -549,7 +549,7 @@ Hudi提供了一个选项,可以通过将对该分区中的插入作为对现 此属性控制报告给驱动程序的失败记录和异常的比例 ### 写提交回调配置 -控制写提交的回调。

[GitHub] [hudi] wangxianghu commented on pull request #1920: [HUDI-1150]Fix unable to parse input partition field :1 exception whe…

2020-08-20 Thread GitBox
wangxianghu commented on pull request #1920: URL: https://github.com/apache/hudi/pull/1920#issuecomment-678028232 > @wangxianghu conflicts should be fixed. @yanghua This pr is ready for review now

[GitHub] [hudi] yanghua merged pull request #1997: [HUDI-781] Introduce HoodieTestTable for test preparation

2020-08-20 Thread GitBox
yanghua merged pull request #1997: URL: https://github.com/apache/hudi/pull/1997

[hudi] branch master updated: [HUDI-781] Introduce HoodieTestTable for test preparation (#1997)

2020-08-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 3a2ae16 [HUDI-781] Introduce HoodieTestTable fo

Build failed in Jenkins: hudi-snapshot-deployment-0.5 #376

2020-08-20 Thread Apache Jenkins Server
See Changes: -- [...truncated 2.59 KB...] cdi-api-1.0.jar cdi-api.license commons-cli-1.4.jar commons-cli.license commons-io-2.5.jar commons-io.license commons-lang3-3.5.jar

[GitHub] [hudi] nsivabalan commented on pull request #2004: [NOT_TO_BE_MERGED] [ONLY FOR TESTING] Fixing Test Suite for docker

2020-08-20 Thread GitBox
nsivabalan commented on pull request #2004: URL: https://github.com/apache/hudi/pull/2004#issuecomment-678010364 I tried the patch and fails with validation. Not sure if hive sync is the issue. Will investigate as to why the validation fails. basically the rollback did not take effect.

[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-678008043 I think spark sql to query is not an option for us, because we use redash, so redash doesn't connect to spark and my users are not tech experts. I think the only viable option

[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
umehrot2 commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-678000764 @rubenssoto until this is fixed would you be okay querying through `spark-sql` instead? Since you are using COW, you can make your spark-sql queries use spark's listing mecha

[GitHub] [hudi] liujinhui1994 commented on pull request #1970: [HUDI-1193] Upgrade http dependent version

2020-08-20 Thread GitBox
liujinhui1994 commented on pull request #1970: URL: https://github.com/apache/hudi/pull/1970#issuecomment-677995500 I am very sure that it is the http version problem. I modified the version and upgraded it to 4.4.1 and the writing was successful. https://help.aliyun.com/document_de

[GitHub] [hudi] yanghua commented on a change in pull request #2000: [MINOR] Remove unused log code in HoodieReadClient

2020-08-20 Thread GitBox
yanghua commented on a change in pull request #2000: URL: https://github.com/apache/hudi/pull/2000#discussion_r474362513 ## File path: hudi-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java ## @@ -57,9 +55,7 @@ * Provides an RDD based API for accessing/filter

[jira] [Comment Edited] (HUDI-1207) Move kafka implemetation of write commit callback to hudi-client module

2020-08-20 Thread wangxianghu (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181520#comment-17181520 ] wangxianghu edited comment on HUDI-1207 at 8/21/20, 1:43 AM: -

[GitHub] [hudi] wangxianghu commented on pull request #1886: [HUDI-1122] Introduce a kafka implementation of hoodie write commit ca…

2020-08-20 Thread GitBox
wangxianghu commented on pull request #1886: URL: https://github.com/apache/hudi/pull/1886#issuecomment-677990052 > `hudi-client` or `hudi-spark` talking a direct dependency on kafka @yanghua thanks for your time. The ticket filed here: https://issues.apache.org/jira/browse/HUDI-1207

[jira] [Commented] (HUDI-1207) Move kafka implemetation of write commit callback to hudi-client module

2020-08-20 Thread wangxianghu (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181520#comment-17181520 ] wangxianghu commented on HUDI-1207: --- Hi, [~vinoth] as discussed before, put Kafka callb

[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677988455 Hi guys thank you so much for helping me. I'm really want to use Hudi in my production environment and I migrated almost all my datasets to hudi, but until now I've been migrat

[jira] [Created] (HUDI-1207) Move kafka implemetation of write commit callback to hudi-client module

2020-08-20 Thread wangxianghu (Jira)
wangxianghu created HUDI-1207: - Summary: Move kafka implemetation of write commit callback to hudi-client module Key: HUDI-1207 URL: https://issues.apache.org/jira/browse/HUDI-1207 Project: Apache Hudi

[GitHub] [hudi] Trevor-zhang commented on pull request #2000: [MINOR] Remove unused log code in HoodieReadClient

2020-08-20 Thread GitBox
Trevor-zhang commented on pull request #2000: URL: https://github.com/apache/hudi/pull/2000#issuecomment-677981819 @yanghua please take a look when free

[GitHub] [hudi] yanghua commented on a change in pull request #1968: [HUDI-1192] Make create hive database automatically configurable

2020-08-20 Thread GitBox
yanghua commented on a change in pull request #1968: URL: https://github.com/apache/hudi/pull/1968#discussion_r474348589 ## File path: hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java ## @@ -71,6 +71,9 @@ @Parameter(names = {"--use-jdbc"}, des

[GitHub] [hudi] xushiyan commented on a change in pull request #1997: [HUDI-781] Introduce HoodieTestTable for test preparation

2020-08-20 Thread GitBox
xushiyan commented on a change in pull request #1997: URL: https://github.com/apache/hudi/pull/1997#discussion_r474349878 ## File path: hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java ## @@ -227,6 +174,8 @@ public static SparkConf getSparkConfFor

[GitHub] [hudi] yanghua commented on a change in pull request #1974: [HUDI-1186][DOC]Add description of write commit callback by kafka to document

2020-08-20 Thread GitBox
yanghua commented on a change in pull request #1974: URL: https://github.com/apache/hudi/pull/1974#discussion_r474347486 ## File path: docs/_docs/2_4_configurations.md ## @@ -512,7 +512,7 @@ Property: `hoodie.memory.writestatus.failure.fraction` This property controls what fr

[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
umehrot2 commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677977073 I understand that recently we made changes in Presto to use `Path Filter` instead. Athena is on an older version and does not have the `Path Filter` patch in Presto. So I am not sure w

[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
umehrot2 commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677974115 @vinothchandar `native parquet readers` are used only in `COW` use-case, but even then splits are fetched through `InputFormat` which also in the process does `listing`. For `MOR` use-

[GitHub] [hudi] yanghua commented on a change in pull request #1997: [HUDI-781] Introduce HoodieTestTable for test preparation

2020-08-20 Thread GitBox
yanghua commented on a change in pull request #1997: URL: https://github.com/apache/hudi/pull/1997#discussion_r474344290 ## File path: hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java ## @@ -227,6 +174,8 @@ public static SparkConf getSparkConfForT

[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677971682 @vinothchandar I opened a ticket to aws. But my perception is when you have more partition takes much more time. The same dataset with 600 partitions count takes more than one minu

[GitHub] [hudi] n3nash commented on pull request #2004: [NOT_TO_BE_MERGED] [ONLY FOR TESTING] Fixing Test Suite for docker

2020-08-20 Thread GitBox
n3nash commented on pull request #2004: URL: https://github.com/apache/hudi/pull/2004#issuecomment-677969432 @nsivabalan @vinothchandar please try to run docker with the changes in this PR

[jira] [Commented] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Nishith Agarwal (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181499#comment-17181499 ] Nishith Agarwal commented on HUDI-1204: --- Here is the PR to run the tests -> [https:/

[GitHub] [hudi] n3nash opened a new pull request #2004: [NOT_TO_BE_MERGED] [ONLY FOR TESTING] Fixing Test Suite for docker

2020-08-20 Thread GitBox
n3nash opened a new pull request #2004: URL: https://github.com/apache/hudi/pull/2004 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull

[GitHub] [hudi] rubenssoto commented on issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

2020-08-20 Thread GitBox
rubenssoto commented on issue #2003: URL: https://github.com/apache/hudi/issues/2003#issuecomment-677960641 I need to use 10 R5.4Xlarge and process worked.

[GitHub] [hudi] rubenssoto closed issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

2020-08-20 Thread GitBox
rubenssoto closed issue #2003: URL: https://github.com/apache/hudi/issues/2003

[GitHub] [hudi] brandon-stanley commented on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-20 Thread GitBox
brandon-stanley commented on issue #1960: URL: https://github.com/apache/hudi/issues/1960#issuecomment-677957741 @bhasudha This [post](https://github.com/apache/hudi/issues/1986) points out that the precombine logic can be disabled. Is this true? --

[GitHub] [hudi] yuhadooper commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-08-20 Thread GitBox
yuhadooper commented on issue #1872: URL: https://github.com/apache/hudi/issues/1872#issuecomment-677916258 to be clear: upserts worked before when we didn't have any partitions (we had about 3000 parquet files) under the table. After partitioning the tables into 15000 partitions we now ha

[GitHub] [hudi] luffyd commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-08-20 Thread GitBox
luffyd commented on issue #1872: URL: https://github.com/apache/hudi/issues/1872#issuecomment-677914305 It is weird that increasing in partitions is causing s3 throttles. S3 throttles should be function of number of files per partition and fileSizes a bit. ---

[GitHub] [hudi] yuhadooper edited a comment on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-08-20 Thread GitBox
yuhadooper edited a comment on issue #1872: URL: https://github.com/apache/hudi/issues/1872#issuecomment-677906212 Thank you! I will give this a try. Here are the exceptions I'm seeing, doing an upsert on a few billion records with around 15000 partitions. Though the parquet file exi

[GitHub] [hudi] yuhadooper commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-08-20 Thread GitBox
yuhadooper commented on issue #1872: URL: https://github.com/apache/hudi/issues/1872#issuecomment-677906212 Thank you! I will give this a try. Here are the exceptions I'm seeing, doing an upsert on a few billion records with around 15000 partitions. Though the file parquet exists und

[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

2020-08-20 Thread GitBox
luffyd commented on issue #1913: URL: https://github.com/apache/hudi/issues/1913#issuecomment-677888677 We started executing the emr job in cluster mode and not seeing this issue now. Is your job running in client mode or cluster mode? --

[GitHub] [hudi] luffyd commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-08-20 Thread GitBox
luffyd commented on issue #1872: URL: https://github.com/apache/hudi/issues/1872#issuecomment-677887900 @yuhadooper I added logs when hudi calls createMarkerFile, it was not that high. But probably other part of hudi was consuming s3 limits. Another thing is adding these s3 retry con

[GitHub] [hudi] yuhadooper commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-08-20 Thread GitBox
yuhadooper commented on issue #1872: URL: https://github.com/apache/hudi/issues/1872#issuecomment-677882423 @luffyd do you mind sharing what configurations you changed to the EMR cluster for this to work?

[GitHub] [hudi] rubenssoto edited a comment on issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

2020-08-20 Thread GitBox
rubenssoto edited a comment on issue #2003: URL: https://github.com/apache/hudi/issues/2003#issuecomment-677858275 Spark stuck on the last screen, I don't know if are doing anything. https://user-images.githubusercontent.com/36298331/90821263-010c1000-e309-11ea-8b59-895e8da9f193.png

[jira] [Assigned] (HUDI-113) Get rid of using "#" as the separator in BloomIndex shuffling

2020-08-20 Thread Leping Huang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leping Huang reassigned HUDI-113: - Assignee: Leping Huang > Get rid of using "#" as the separator in BloomIndex shuffling > -

[jira] [Assigned] (HUDI-323) Docker demo/integ-test stdout/stderr output only available on process exit

2020-08-20 Thread Leping Huang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leping Huang reassigned HUDI-323: - Assignee: Leping Huang > Docker demo/integ-test stdout/stderr output only available on process exi

[jira] [Commented] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Nishith Agarwal (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181395#comment-17181395 ] Nishith Agarwal commented on HUDI-1204: --- [~vinoth] The setup works for me now on my

[jira] [Updated] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Nishith Agarwal (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1204: -- Attachment: complex-dag-cow-2.yaml > NoClassDefFoundError with AbstractSyncTool while running Ho

[GitHub] [hudi] rubenssoto commented on issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

2020-08-20 Thread GitBox
rubenssoto commented on issue #2003: URL: https://github.com/apache/hudi/issues/2003#issuecomment-677858275 Spark stuck on the last screen, I don't know if are doing anything.

[GitHub] [hudi] rubenssoto opened a new issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

2020-08-20 Thread GitBox
rubenssoto opened a new issue #2003: URL: https://github.com/apache/hudi/issues/2003 Hi Guys, I'm trying to migrate my biggest dataset to Hudi and I'm facing some errors. Data Size: 350Gb Spark Master: 4 Cpus, 16 Gb Ram Cores Nodes: 8 R5.4xLarge = 16 cpus, 122 Gb ram EACH

[GitHub] [hudi] jpugliesi opened a new issue #2002: [SUPPORT] Inconsistent Commits between CLI and Incremental Query

2020-08-20 Thread GitBox
jpugliesi opened a new issue #2002: URL: https://github.com/apache/hudi/issues/2002 I am seeing inconsistent commit history when querying a Hudi table incrementally (with `begin_time = 0` ) vs. using the CLI's `commits show`. [Following up on the Slack thread here with @bhasudha ](h

[GitHub] [hudi] kpurella opened a new issue #2001: NPE While writing data to same partition on S3

2020-08-20 Thread GitBox
kpurella opened a new issue #2001: URL: https://github.com/apache/hudi/issues/2001 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? yes - Join the mailing list to engage in conversations and get faster

[GitHub] [hudi] vinothchandar commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
vinothchandar commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677813495 Athena question could boil down to what version of presto its running internally. Really for aws folks to answer. But on open source Presto, I want to clear up few things.

[jira] [Commented] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Vinoth Chandar (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181354#comment-17181354 ] Vinoth Chandar commented on HUDI-1204: -- folks, please use the code formatting so its

[jira] [Comment Edited] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Nishith Agarwal (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181350#comment-17181350 ] Nishith Agarwal edited comment on HUDI-1204 at 8/20/20, 5:37 PM: ---

[jira] [Comment Edited] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Nishith Agarwal (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181350#comment-17181350 ] Nishith Agarwal edited comment on HUDI-1204 at 8/20/20, 5:36 PM: ---

[jira] [Commented] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Nishith Agarwal (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181350#comment-17181350 ] Nishith Agarwal commented on HUDI-1204: --- Can you replace this in the dag [~shivnaray

[jira] [Comment Edited] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread Nishith Agarwal (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181350#comment-17181350 ] Nishith Agarwal edited comment on HUDI-1204 at 8/20/20, 5:34 PM: ---

[GitHub] [hudi] xushiyan commented on a change in pull request #1997: [HUDI-781] Introduce HoodieTestTable for test preparation

2020-08-20 Thread GitBox
xushiyan commented on a change in pull request #1997: URL: https://github.com/apache/hudi/pull/1997#discussion_r474145442 ## File path: hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java ## @@ -289,27 +245,9 @@ public static String writeParquetFile(

[GitHub] [hudi] KarthickAN edited a comment on issue #1977: Error running hudi on aws glue

2020-08-20 Thread GitBox
KarthickAN edited a comment on issue #1977: URL: https://github.com/apache/hudi/issues/1977#issuecomment-677735782 @vinothchandar @umehrot2 Thank you for responding. I was able to run it after I uploaded a custom built jar by adding the shadedPattern in the pom for org.apache.hudi.org.ecli

[GitHub] [hudi] KarthickAN edited a comment on issue #1977: Error running hudi on aws glue

2020-08-20 Thread GitBox
KarthickAN edited a comment on issue #1977: URL: https://github.com/apache/hudi/issues/1977#issuecomment-677735782 @vinothchandar @umehrot2 Thank you for responding. I was able to run it after I uploaded a custom built jar by adding the shadedPattern in the pom for org.apache.hudi.org.ecli

[GitHub] [hudi] KarthickAN edited a comment on issue #1977: Error running hudi on aws glue

2020-08-20 Thread GitBox
KarthickAN edited a comment on issue #1977: URL: https://github.com/apache/hudi/issues/1977#issuecomment-677735782 @vinothchandar @umehrot2 Thank you for responding. I was able to run it after I uploaded a custom built jar by adding the shadedPattern in the pom for org.apache.hudi.org.ecli

[GitHub] [hudi] yanghua commented on a change in pull request #1997: [HUDI-781] Introduce HoodieTestTable for test preparation

2020-08-20 Thread GitBox
yanghua commented on a change in pull request #1997: URL: https://github.com/apache/hudi/pull/1997#discussion_r474071930 ## File path: hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java ## @@ -289,27 +245,9 @@ public static String writeParquetFile(S

[GitHub] [hudi] KarthickAN commented on issue #1977: Error running hudi on aws glue

2020-08-20 Thread GitBox
KarthickAN commented on issue #1977: URL: https://github.com/apache/hudi/issues/1977#issuecomment-677735782 @vinothchandar @umehrot2 Thank you for responding. I was able to run it after I uploaded a custom built jar by adding the following in the pom org.eclipse.jetty. org

[hudi] branch master updated (b883b6d -> 34c8c9e)

2020-08-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from b883b6d [HUDI-1122] Introduce a kafka implementation of hoodie write commit ca… (#1886) add 34c8c9e [MINOR] M

[GitHub] [hudi] yanghua merged pull request #1993: [MINOR] Move HoodieUpgradeDowngradeException to exception package

2020-08-20 Thread GitBox
yanghua merged pull request #1993: URL: https://github.com/apache/hudi/pull/1993

[jira] [Closed] (HUDI-1122) Introduce a kafka implementation of hoodie write commit callback

2020-08-20 Thread vinoyang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-1122. -- Resolution: Implemented Implemented via master branch: b883b6d2682a3043eaf263c86182b19922660fd9 > Introduce a k

[jira] [Updated] (HUDI-1122) Introduce a kafka implementation of hoodie write commit callback

2020-08-20 Thread vinoyang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated HUDI-1122: --- Issue Type: New Feature (was: Task) > Introduce a kafka implementation of hoodie write commit callback > ---

[GitHub] [hudi] yanghua merged pull request #1886: [HUDI-1122] Introduce a kafka implementation of hoodie write commit ca…

2020-08-20 Thread GitBox
yanghua merged pull request #1886: URL: https://github.com/apache/hudi/pull/1886

[hudi] branch master updated (bd7814d -> b883b6d)

2020-08-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from bd7814d [HUDI-1206] Remove unused variable in Compactor (#1994) add b883b6d [HUDI-1122] Introduce a kafka impl

[GitHub] [hudi] yanghua commented on pull request #1886: [HUDI-1122] Introduce a kafka implementation of hoodie write commit ca…

2020-08-20 Thread GitBox
yanghua commented on pull request #1886: URL: https://github.com/apache/hudi/pull/1886#issuecomment-677718994 > > I was wondering can we move this implement to hudi-client module just like the way all the implementations of metrics does. > > I think we can move this down the line. `h

[jira] [Commented] (HUDI-1204) NoClassDefFoundError with AbstractSyncTool while running HoodieTestSuiteJob

2020-08-20 Thread sivabalan narayanan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181233#comment-17181233 ] sivabalan narayanan commented on HUDI-1204: --- I followed the steps as suggested,

[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677672123 I ran more tests, this time with the same table; the only difference is the partition strategy. I use Athena. Table01 with regular parquet query: select city,origin, count(1

[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2020-08-20 Thread GitBox
rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-677647272 Yeah, I could try. I ran some tests: the smaller table was partitioned by day, so I repartitioned it by year-month and now have larger files...my simple count improved a lot

[GitHub] [hudi] wangxianghu commented on pull request #2000: [MINOR] Remove unused log code in HoodieReadClient

2020-08-20 Thread GitBox
wangxianghu commented on pull request #2000: URL: https://github.com/apache/hudi/pull/2000#issuecomment-677645350 @Trevor-zhang Good catch! This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [hudi] Trevor-zhang opened a new pull request #2000: [MINOR] Remove unused log code in HoodieReadClient

2020-08-20 Thread GitBox
Trevor-zhang opened a new pull request #2000: URL: https://github.com/apache/hudi/pull/2000 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of th

[GitHub] [hudi] poiyyq opened a new issue #1999: What difference from spark and deltaStreamer? which more efficient?

2020-08-20 Thread GitBox
poiyyq opened a new issue #1999: URL: https://github.com/apache/hudi/issues/1999 As far as I know, deltaStreamer is a tool to operate Hudi. If it's just a tool, I can just use Spark syntax instead of deltaStreamer, right? This

[GitHub] [hudi] Trevor-zhang closed pull request #1998: [MINOR] Remove unused log code in HoodieReadClient

2020-08-20 Thread GitBox
Trevor-zhang closed pull request #1998: URL: https://github.com/apache/hudi/pull/1998 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go t

[GitHub] [hudi] Trevor-zhang opened a new pull request #1998: [MINOR] Remove unused log code in HoodieReadClient

2020-08-20 Thread GitBox
Trevor-zhang opened a new pull request #1998: URL: https://github.com/apache/hudi/pull/1998 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of th

[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-20 Thread wangxianghu (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1103: -- Fix Version/s: 0.6.1 > Improve the code format of Delete data demo in Quick-Start Guide > --

[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-20 Thread wangxianghu (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1103: -- Fix Version/s: (was: 0.6.0) > Improve the code format of Delete data demo in Quick-Start Guide > ---

[GitHub] [hudi] xushiyan commented on pull request #1997: [HUDI-781] Introduce HoodieTestTable for test preparation

2020-08-20 Thread GitBox
xushiyan commented on pull request #1997: URL: https://github.com/apache/hudi/pull/1997#issuecomment-677546876 @yanghua This is to redo #1871, mostly resolving conflicts in `TestCleaner.java`. Thanks. This is an automated me

[GitHub] [hudi] xushiyan opened a new pull request #1997: [HUDI-781] Introduce HoodieTestTable for test preparation

2020-08-20 Thread GitBox
xushiyan opened a new pull request #1997: URL: https://github.com/apache/hudi/pull/1997 Redo #1871 ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Nec

[jira] [Assigned] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-20 Thread Trevorzhang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trevorzhang reassigned HUDI-1103: - Assignee: Trevorzhang (was: trevor) > Improve the code format of Delete data demo in Quick-Start

[jira] [Assigned] (HUDI-1067) Replace the integer version field with HoodieLogBlockVersion data structure

2020-08-20 Thread Trevorzhang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trevorzhang reassigned HUDI-1067: - Assignee: Trevorzhang (was: trevor) > Replace the integer version field with HoodieLogBlockVersi

[jira] [Assigned] (HUDI-396) Provide an documentation to describe how to use test suite

2020-08-20 Thread Trevorzhang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trevorzhang reassigned HUDI-396: Assignee: Trevorzhang (was: trevor) > Provide an documentation to describe how to use test suite >

[jira] [Assigned] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-20 Thread trevor (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] trevor reassigned HUDI-1103: Assignee: trevor (was: Trevorzhang) > Improve the code format of Delete data demo in Quick-Start Guide > -

[jira] [Assigned] (HUDI-396) Provide an documentation to describe how to use test suite

2020-08-20 Thread trevor (Jira)
[ https://issues.apache.org/jira/browse/HUDI-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] trevor reassigned HUDI-396: --- Assignee: trevor (was: Trevorzhang) > Provide an documentation to describe how to use test suite > --

[jira] [Assigned] (HUDI-1067) Replace the integer version field with HoodieLogBlockVersion data structure

2020-08-20 Thread trevor (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] trevor reassigned HUDI-1067: Assignee: trevor (was: Trevorzhang) > Replace the integer version field with HoodieLogBlockVersion data st

[GitHub] [hudi] wangxianghu commented on a change in pull request #1993: [MINOR] Move HoodieUpgradeDowngradeException to exception package

2020-08-20 Thread GitBox
wangxianghu commented on a change in pull request #1993: URL: https://github.com/apache/hudi/pull/1993#discussion_r473878261 ## File path: hudi-client/src/main/java/org/apache/hudi/table/upgrade/UpgradeDowngrade.java ## @@ -27,6 +27,7 @@ import org.apache.hadoop.fs.FSDataOutp

[GitHub] [hudi] yanghua commented on a change in pull request #1993: [MINOR] Move HoodieUpgradeDowngradeException to exception package

2020-08-20 Thread GitBox
yanghua commented on a change in pull request #1993: URL: https://github.com/apache/hudi/pull/1993#discussion_r473865745 ## File path: hudi-client/src/main/java/org/apache/hudi/table/upgrade/UpgradeDowngrade.java ## @@ -27,6 +27,7 @@ import org.apache.hadoop.fs.FSDataOutputSt

[GitHub] [hudi] bvaradar opened a new pull request #1996: [BLOG] Efficient Migration of large Parquet tables

2020-08-20 Thread GitBox
bvaradar opened a new pull request #1996: URL: https://github.com/apache/hudi/pull/1996 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[jira] [Updated] (HUDI-1206) Remove unused variable in Compactor

2020-08-20 Thread vinoyang (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated HUDI-1206: --- Status: Open (was: New) > Remove unused variable in Compactor > --- > >
