[GitHub] [hudi] bvaradar commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

2020-07-15 Thread GitBox
bvaradar commented on issue #1833: URL: https://github.com/apache/hudi/issues/1833#issuecomment-659174736 Can you try setting the config "hoodie.bloom.index.bucketized.checking" to false? Kindly report back with your observations.
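A minimal sketch of how the suggested setting could be carried in a writer options map. Only the key "hoodie.bloom.index.bucketized.checking" comes from the comment above; the class name, helper name, and table name are hypothetical and used purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BloomIndexOptions {
    // Config key quoted in the comment above.
    static final String BUCKETIZED_CHECKING = "hoodie.bloom.index.bucketized.checking";

    /** Returns a copy of the given writer options with bucketized checking disabled. */
    static Map<String, String> withBucketizedCheckingDisabled(Map<String, String> base) {
        Map<String, String> opts = new LinkedHashMap<>(base);
        opts.put(BUCKETIZED_CHECKING, "false");
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("hoodie.table.name", "example_table"); // hypothetical table name
        Map<String, String> opts = withBucketizedCheckingDisabled(base);
        // Print every option so the change is easy to confirm before a run.
        opts.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

A map like this would typically be handed to the Spark writer through its `options(...)` call; that wiring is left out here so the sketch runs without Spark or Hudi on the classpath.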

[GitHub] [hudi] bvaradar edited a comment on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-15 Thread GitBox
bvaradar edited a comment on issue #1830: URL: https://github.com/apache/hudi/issues/1830#issuecomment-659163337 @umehrot2 @srsteinmetz : Thanks for the information. I have not seen similar issues but looking at the trend (increase) in number of file groups and partitions is a good angle

[GitHub] [hudi] bvaradar commented on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-15 Thread GitBox
bvaradar commented on issue #1830: URL: https://github.com/apache/hudi/issues/1830#issuecomment-659163337 @umehrot2 @srsteinmetz : Thanks for the information. I have not seen similar issues but looking at the trend (increase) in number of file groups and partitions is a good angle to

Build failed in Jenkins: hudi-snapshot-deployment-0.5 #340

2020-07-15 Thread Apache Jenkins Server
See Changes: -- [...truncated 2.35 KB...] /home/jenkins/tools/maven/apache-maven-3.5.4/conf: logging settings.xml toolchains.xml

[GitHub] [hudi] garyli1019 commented on a change in pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-15 Thread GitBox
garyli1019 commented on a change in pull request #1810: URL: https://github.com/apache/hudi/pull/1810#discussion_r455483211 ## File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ## @@ -255,6 +262,43 @@ private[hudi] object HoodieSparkSqlWriter {

[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-15 Thread GitBox
lw309637554 commented on a change in pull request #1756: URL: https://github.com/apache/hudi/pull/1756#discussion_r455475745 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java ## @@ -182,12 +162,12 @@ private

[GitHub] [hudi] lw309637554 commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-15 Thread GitBox
lw309637554 commented on a change in pull request #1756: URL: https://github.com/apache/hudi/pull/1756#discussion_r455474421 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java ## @@ -182,12 +162,12 @@ private

[GitHub] [hudi] Mathieu1124 commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-07-15 Thread GitBox
Mathieu1124 commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-659094743 Hi, @vinothchandar @yanghua @leesf @n3nash, ci is green, this pr is ready for review now :) This is an

[GitHub] [hudi] ssomuah commented on issue #1836: [SUPPORT] org.apache.hudi.exception.HoodieException: Unable to instantiate payload class

2020-07-15 Thread GitBox
ssomuah commented on issue #1836: URL: https://github.com/apache/hudi/issues/1836#issuecomment-659089308 Thank you Balaji, that was it. This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [hudi] ssomuah closed issue #1836: [SUPPORT] org.apache.hudi.exception.HoodieException: Unable to instantiate payload class

2020-07-15 Thread GitBox
ssomuah closed issue #1836: URL: https://github.com/apache/hudi/issues/1836

[GitHub] [hudi] nsivabalan commented on a change in pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-15 Thread GitBox
nsivabalan commented on a change in pull request #1819: URL: https://github.com/apache/hudi/pull/1819#discussion_r455440158 ## File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java ## @@ -176,11 +177,21 @@ public static KeyGenerator

[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
asheeshgarg commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-659064590 @bvaradar I was assuming that every time we write, the content will be merged into the existing file based on the size limits we have specified. Otherwise we will see a lot of small files. As

[GitHub] [hudi] umehrot2 commented on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-15 Thread GitBox
umehrot2 commented on issue #1830: URL: https://github.com/apache/hudi/issues/1830#issuecomment-659057299 @bvaradar thank you for taking a look at this. We had an internal meeting with @srsteinmetz and the team, and yes, at the outset it looks to me like the total time for lookup is

[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
bvaradar commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-659052418 @asheeshgarg : The 2 parquet files you have listed are essentially different versions of the same file (file_id : 65254296-10d0-49d4-b168-6708e6274712-0_0). Each time you write, only

[GitHub] [hudi] srsteinmetz commented on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-15 Thread GitBox
srsteinmetz commented on issue #1830: URL: https://github.com/apache/hudi/issues/1830#issuecomment-659049963 Yes, I was only able to capture a few runs in a single page of the Spark UI, but the processing time for the WorkloadProfile countByKey has an increasing pattern over subsequent

[GitHub] [hudi] bvaradar commented on issue #1836: [SUPPORT] org.apache.hudi.exception.HoodieException: Unable to instantiate payload class

2020-07-15 Thread GitBox
bvaradar commented on issue #1836: URL: https://github.com/apache/hudi/issues/1836#issuecomment-659046835 @ssomuah : If you look at other concrete implementations of HoodieRecordPayload, there are 2 constructors defined. For example : ``` public
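The point about two constructors can be illustrated with a small reflection sketch. It assumes that a payload class is instantiated reflectively by constructor signature; the classes below are stand-ins written for this example and are not Hudi's real `HoodieRecordPayload` interface or loader.

```java
import java.lang.reflect.Constructor;
import java.util.Optional;

// Stand-in payload with two constructor shapes (hypothetical, not Hudi's API).
class TwoCtorPayload {
    final Optional<String> record;
    final Comparable<?> orderingVal;

    // Shape 1: a record plus an ordering value.
    public TwoCtorPayload(String record, Comparable<?> orderingVal) {
        this.record = Optional.ofNullable(record);
        this.orderingVal = orderingVal;
    }

    // Shape 2: only an optional record.
    public TwoCtorPayload(Optional<String> record) {
        this.record = record;
        this.orderingVal = 0;
    }
}

// Stand-in payload that omits the (record, orderingVal) constructor.
class OneCtorPayload {
    final Optional<String> record;

    public OneCtorPayload(Optional<String> record) {
        this.record = record;
    }
}

class PayloadLoaderSketch {
    /** Instantiates clazz through the public constructor matching argTypes exactly. */
    static Object load(Class<?> clazz, Class<?>[] argTypes, Object... args)
            throws ReflectiveOperationException {
        Constructor<?> ctor = clazz.getConstructor(argTypes);
        return ctor.newInstance(args);
    }

    /** Returns false when no matching constructor exists or instantiation fails. */
    static boolean canLoad(Class<?> clazz, Class<?>[] argTypes, Object... args) {
        try {
            load(clazz, argTypes, args);
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Class<?>[] twoArgShape = {String.class, Comparable.class};
        System.out.println(canLoad(TwoCtorPayload.class, twoArgShape, "rec", 1));
        System.out.println(canLoad(OneCtorPayload.class, twoArgShape, "rec", 1));
    }
}
```

Under this assumption, a custom payload that defines only one of the expected constructors fails at lookup time with a `NoSuchMethodException`, which a caller may then wrap in a higher-level "unable to instantiate" error.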

[GitHub] [hudi] bvaradar commented on issue #1830: [SUPPORT] Processing time gradually increases while using Spark Streaming

2020-07-15 Thread GitBox
bvaradar commented on issue #1830: URL: https://github.com/apache/hudi/issues/1830#issuecomment-659044105 From the highlighted section in your spark UI image, it looks like there is an increase during index lookup. Between 2 runs, there is an increase of 10 sec (around 4%). Is this the

[GitHub] [hudi] umehrot2 commented on a change in pull request #1768: [HUDI-1054][Peformance] Several performance fixes during finalizing writes

2020-07-15 Thread GitBox
umehrot2 commented on a change in pull request #1768: URL: https://github.com/apache/hudi/pull/1768#discussion_r455390259 ## File path: hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java ## @@ -386,13 +389,26 @@ public void finalizeWrite(JavaSparkContext jsc,

[GitHub] [hudi] bvaradar commented on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-15 Thread GitBox
bvaradar commented on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-659036363 @zuyanton : This sounds like a general Spark/HMS query integration issue. Are we seeing similar behavior when running the same query over a non-Hudi table?

[GitHub] [hudi] umehrot2 commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

2020-07-15 Thread GitBox
umehrot2 commented on issue #1828: URL: https://github.com/apache/hudi/issues/1828#issuecomment-659028374 @kirkuz yes the AWS Athena support was just released yesterday. Please try out the official support and if you face this issue open a support case with AWS Support, and ping on this

[GitHub] [hudi] umehrot2 commented on pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-15 Thread GitBox
umehrot2 commented on pull request #1722: URL: https://github.com/apache/hudi/pull/1722#issuecomment-659010020 > @vinothchandar I agree we should use @umehrot2 RDD approach. > > > So you can also in parallel just proceed? > > Yes, I will change this PR in parallel. > >

[GitHub] [hudi] umehrot2 commented on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-15 Thread GitBox
umehrot2 commented on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-659005038 I think the finding by @zuyanton seems correct. Increasing the `num-threads` will not help because we just set the `basepath` of the table in the `inputpath` of `jobconf`. I believe
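The reasoning above can be sketched as a toy model: if listing work is split into one task per input path, a larger thread pool cannot help when only a single base path is supplied. This is an illustration of that argument only, not Hadoop's `FileInputFormat` code; the class, method names, and bucket paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ListingParallelismSketch {
    /** Upper bound on concurrent listing tasks when there is one task per input path. */
    static int effectiveParallelism(List<String> inputPaths, int numThreads) {
        return Math.min(numThreads, inputPaths.size());
    }

    /** Runs one stand-in "listing" task per input path on a fixed-size pool. */
    static List<String> listAll(List<String> inputPaths, int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String path : inputPaths) {
                // A real task would list the directory; here it only tags the path.
                futures.add(pool.submit(() -> "listed " + path));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> basePathOnly = Arrays.asList("s3://example-bucket/table"); // hypothetical
        List<String> perPartition = Arrays.asList(
                "s3://example-bucket/table/p1",
                "s3://example-bucket/table/p2",
                "s3://example-bucket/table/p3");
        System.out.println(effectiveParallelism(basePathOnly, 50));
        System.out.println(effectiveParallelism(perPartition, 50));
        System.out.println(listAll(perPartition, 2));
    }
}
```

In this model the thread count only matters once the input is expressed as multiple paths, which matches the observation that raising `num-threads` had no effect with a single `basepath`.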

[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
asheeshgarg commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-659001436 @bvaradar Balaji, I tried the mentioned property but don't see any impact; I still see parquet generated 2020-07-15 20:41:40 478.6 KiB

[GitHub] [hudi] zuyanton commented on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-15 Thread GitBox
zuyanton commented on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-658984768 @vinothchandar it didn't have any effect, and I believe it shouldn't, since it looks like that parameter only helps if you are trying to list statuses of multiple

[GitHub] [hudi] vinothchandar commented on a change in pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-07-15 Thread GitBox
vinothchandar commented on a change in pull request #1756: URL: https://github.com/apache/hudi/pull/1756#discussion_r455279709 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java ## @@ -182,12 +162,12 @@ private

[GitHub] [hudi] bvaradar removed a comment on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-15 Thread GitBox
bvaradar removed a comment on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-658902047 @zuyanton : HoodieParquetInputFormat relies on hadoop-mapreduce FileInputFormat listing implementation to perform listing. There is a knob in base FileInputFormat to tune

[GitHub] [hudi] bvaradar commented on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-15 Thread GitBox
bvaradar commented on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-658902047 @zuyanton : HoodieParquetInputFormat relies on hadoop-mapreduce FileInputFormat listing implementation to perform listing. There is a knob in base FileInputFormat to tune listing

[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
bvaradar commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-658896451 @asheeshgarg : This sounds like a tuning problem. Please see https://github.com/apache/hudi/issues/1583#issuecomment-622894674 and

[GitHub] [hudi] zuyanton opened a new issue #1837: [SUPPORT]S3 file listing causing compaction to get eventually slow

2020-07-15 Thread GitBox
zuyanton opened a new issue #1837: URL: https://github.com/apache/hudi/issues/1837 We are running incremental updates to our MoR table on S3. We are running updates every 10 minutes. We compact every 10 commits (every ~1.5 hours). We have noticed that if we want to keep history for longer

[GitHub] [hudi] ssomuah opened a new issue #1836: [SUPPORT] org.apache.hudi.exception.HoodieException: Unable to instantiate payload class

2020-07-15 Thread GitBox
ssomuah opened a new issue #1836: URL: https://github.com/apache/hudi/issues/1836 **Describe the problem you faced** It appears that I'm unable to instantiate my custom "hoodie.datasource.write.payload.class" I get an exception saying Caused by:

[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
asheeshgarg commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-658881320 @bvaradar I ran with the above understanding, setting the small file size limit to 500 MB to match the 500 datasets, but after the write I see no change in behavior; it still

[GitHub] [hudi] rajgowtham24 opened a new issue #1835: [SUPPORT] HoodieDeltaStreamer with Json as Source

2020-07-15 Thread GitBox
rajgowtham24 opened a new issue #1835: URL: https://github.com/apache/hudi/issues/1835 Hi all, I'm new to Hudi and looking to leverage Delta Streamer for JSON sources that are available in my S3 bucket. Below is the code snippet that I'm using to execute the same Source

[GitHub] [hudi] garyli1019 commented on pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-15 Thread GitBox
garyli1019 commented on pull request #1722: URL: https://github.com/apache/hudi/pull/1722#issuecomment-658842626 @vinothchandar I agree we should use @umehrot2 RDD approach. >So you can also in parallel just proceed? Yes, I will change this PR in parallel. >Do you just want

[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
asheeshgarg commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-658837587 @bvaradar Thanks for the quick response, Balaji. To understand it correctly, let me quickly run through an example. The data that is generated for a dataset will be in some range of 1MB

[GitHub] [hudi] vinothchandar edited a comment on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-15 Thread GitBox
vinothchandar edited a comment on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-658821557 @zuyanton this seems like a general issue with `FileInputFormat` ``` int numThreads = job .getInt(

[GitHub] [hudi] vinothchandar commented on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2020-07-15 Thread GitBox
vinothchandar commented on issue #1829: URL: https://github.com/apache/hudi/issues/1829#issuecomment-658821557 @zuyanton this seems like a general issue with `FileInputFormat` ``` int numThreads = job .getInt(

[GitHub] [hudi] ssomuah commented on issue #143: Tracking ticket for folks to be added to slack group

2020-07-15 Thread GitBox
ssomuah commented on issue #143: URL: https://github.com/apache/hudi/issues/143#issuecomment-658813676 akw...@gmail.com

[jira] [Assigned] (HUDI-1100) Docs: Update index type in Configuration section and fix typo in deployment page

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1100: Assignee: Balaji Varadarajan > Docs: Update index type in Configuration section

[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

2020-07-15 Thread GitBox
kirkuz commented on issue #1828: URL: https://github.com/apache/hudi/issues/1828#issuecomment-658808083 Thanks guys! I'll test it out.

[jira] [Updated] (HUDI-1087) Realtime Record Reader needs to handle decimal types

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1087: - Status: In Progress (was: Open) > Realtime Record Reader needs to handle decimal types >

[jira] [Resolved] (HUDI-1087) Realtime Record Reader needs to handle decimal types

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan resolved HUDI-1087. -- Resolution: Fixed > Realtime Record Reader needs to handle decimal types >

[GitHub] [hudi] bvaradar commented on issue #1790: [SUPPORT] Querying MoR tables with DecimalType columns via Spark SQL fails

2020-07-15 Thread GitBox
bvaradar commented on issue #1790: URL: https://github.com/apache/hudi/issues/1790#issuecomment-658804172 @zuyanton : Thanks for reporting this issue. We have merged @zhedoubushishi changes to master.

[GitHub] [hudi] bvaradar closed issue #1790: [SUPPORT] Querying MoR tables with DecimalType columns via Spark SQL fails

2020-07-15 Thread GitBox
bvaradar closed issue #1790: URL: https://github.com/apache/hudi/issues/1790

[GitHub] [hudi] bvaradar merged pull request #1831: [HUDI-1087] Handle decimal type for realtime record reader with SparkSQL

2020-07-15 Thread GitBox
bvaradar merged pull request #1831: URL: https://github.com/apache/hudi/pull/1831

[hudi] branch master updated: [HUDI-1087] Handle decimal type for realtime record reader with SparkSQL (#1831)

2020-07-15 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository. vbalaji pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new bf1d36f [HUDI-1087] Handle decimal type for

[GitHub] [hudi] bvaradar commented on issue #1788: [SUPPORT] typo on website https://hudi.apache.org/docs/deployment.html#deploying

2020-07-15 Thread GitBox
bvaradar commented on issue #1788: URL: https://github.com/apache/hudi/issues/1788#issuecomment-658801170 Added jira : https://issues.apache.org/jira/browse/HUDI-1100 @tooptoop4 : if you are interested, please assign yourself.

[GitHub] [hudi] bvaradar closed issue #1788: [SUPPORT] typo on website https://hudi.apache.org/docs/deployment.html#deploying

2020-07-15 Thread GitBox
bvaradar closed issue #1788: URL: https://github.com/apache/hudi/issues/1788

[jira] [Updated] (HUDI-1100) Docs: Update index type in Configuration section and fix typo in deployment page

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1100: - Summary: Docs: Update index type in Configuration section and fix typo in deployment page

[jira] [Assigned] (HUDI-1100) Docs: Add GLOBAL_BLOOM as one of index type in Configuration section

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1100: Assignee: (was: Balaji Varadarajan) > Docs: Add GLOBAL_BLOOM as one of index

[jira] [Assigned] (HUDI-1100) Docs: Add GLOBAL_BLOOM as one of index type in Configuration section

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1100: Assignee: Balaji Varadarajan > Docs: Add GLOBAL_BLOOM as one of index type in

[jira] [Updated] (HUDI-1100) Docs: Add GLOBAL_BLOOM as one of index type in Configuration section

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1100: - Description: 1. Add GLOBAL_BLOOM as one of index type in Configuration section :   

[jira] [Commented] (HUDI-1100) Docs: Add GLOBAL_BLOOM as one of index type in Configuration section

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158216#comment-17158216 ] Balaji Varadarajan commented on HUDI-1100: -- Another documentation update : 

[jira] [Updated] (HUDI-1100) Docs: Add GLOBAL_BLOOM as one of index type in Configuration section

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1100: - Description: 1. Add GLOBAL_BLOOM as one of index type in Configuration section 2. Fix

[GitHub] [hudi] bvaradar commented on issue #1785: [SUPPORT] please rollback greater commits first

2020-07-15 Thread GitBox
bvaradar commented on issue #1785: URL: https://github.com/apache/hudi/issues/1785#issuecomment-658797447 > @bhasudha what if i don't clean them up? ie if i just leave them there then what is the issue? @tooptoop4 : You should be cleaning these files before we archive to prevent

[GitHub] [hudi] bvaradar commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

2020-07-15 Thread GitBox
bvaradar commented on issue #1828: URL: https://github.com/apache/hudi/issues/1828#issuecomment-658794018 Yes, @kirkuz . ccing @umehrot2 who can also chime in

[GitHub] [hudi] bvaradar commented on issue #1782: Merge-On-Read performance degrades for single partition table [SUPPORT]

2020-07-15 Thread GitBox
bvaradar commented on issue #1782: URL: https://github.com/apache/hudi/issues/1782#issuecomment-658791725 @sam-wmt : Pinging to see if you can provide debugging information.

[GitHub] [hudi] kirkuz commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

2020-07-15 Thread GitBox
kirkuz commented on issue #1828: URL: https://github.com/apache/hudi/issues/1828#issuecomment-658790661 Hi @bvaradar thanks for that. Does it mean that it was released on AWS yesterday? Should I use the latest EMR cluster release?

[GitHub] [hudi] bvaradar commented on issue #1777: [SUPPORT] org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were

2020-07-15 Thread GitBox
bvaradar commented on issue #1777: URL: https://github.com/apache/hudi/issues/1777#issuecomment-658791168 Thanks @WaterKnight1998 : Looks like this is resolved. Please open a new ticket if there are any other issues you are seeing.

[GitHub] [hudi] bvaradar closed issue #1777: [SUPPORT] org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were

2020-07-15 Thread GitBox
bvaradar closed issue #1777: URL: https://github.com/apache/hudi/issues/1777

[GitHub] [hudi] RajasekarSribalan commented on issue #1766: [SUPPORT] Hudi COW - Bulk Insert followed by Upsert via Spark streaming job

2020-07-15 Thread GitBox
RajasekarSribalan commented on issue #1766: URL: https://github.com/apache/hudi/issues/1766#issuecomment-658789885 Hello Balaji, I still get incorrect entries when querying via Hive. In the earlier comment, I mentioned the Hive CDH version that we are using. I'm really

[GitHub] [hudi] bvaradar commented on issue #1775: INCREMETNAL QUERY-Null value Exception

2020-07-15 Thread GitBox
bvaradar commented on issue #1775: URL: https://github.com/apache/hudi/issues/1775#issuecomment-658788929 Closing this for now. @prashanthpdesai : Please reopen if this is still an issue.

[GitHub] [hudi] bvaradar closed issue #1775: INCREMETNAL QUERY-Null value Exception

2020-07-15 Thread GitBox
bvaradar closed issue #1775: URL: https://github.com/apache/hudi/issues/1775

[GitHub] [hudi] bvaradar closed issue #1773: [SUPPORT] NPE about MOR config?

2020-07-15 Thread GitBox
bvaradar closed issue #1773: URL: https://github.com/apache/hudi/issues/1773

[GitHub] [hudi] bvaradar commented on issue #1773: [SUPPORT] NPE about MOR config?

2020-07-15 Thread GitBox
bvaradar commented on issue #1773: URL: https://github.com/apache/hudi/issues/1773#issuecomment-658788259 Thanks @tooptoop4 : I am closing this ticket due to inactivity. Kindly reopen with more details if this is a real issue.

[GitHub] [hudi] bvaradar commented on issue #1771: [SUPPORT] https://hudi.apache.org/docs/configurations.html does not mention GLOBAL_BLOOM in hoodie.index.type section

2020-07-15 Thread GitBox
bvaradar commented on issue #1771: URL: https://github.com/apache/hudi/issues/1771#issuecomment-658787103 @tooptoop4 : Would you be interested in fixing the documentation (Jira : https://issues.apache.org/jira/browse/HUDI-1100) ? (FYI : @nsivabalan)

[jira] [Created] (HUDI-1100) Docs: Add GLOBAL_BLOOM as one of index type in Configuration section

2020-07-15 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1100: Summary: Docs: Add GLOBAL_BLOOM as one of index type in Configuration section Key: HUDI-1100 URL: https://issues.apache.org/jira/browse/HUDI-1100 Project:

[jira] [Updated] (HUDI-1100) Docs: Add GLOBAL_BLOOM as one of index type in Configuration section

2020-07-15 Thread Balaji Varadarajan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1100: - Status: Open (was: New) > Docs: Add GLOBAL_BLOOM as one of index type in Configuration

[GitHub] [hudi] bvaradar commented on issue #1766: [SUPPORT] Hudi COW - Bulk Insert followed by Upsert via Spark streaming job

2020-07-15 Thread GitBox
bvaradar commented on issue #1766: URL: https://github.com/apache/hudi/issues/1766#issuecomment-658782603 @RajasekarSribalan : Sorry for the delayed response. Is this still a problem ? I am not seeing any issue with the table definition and as you can see from the docker demo, hive

[GitHub] [hudi] bvaradar edited a comment on issue #1764: [SUPPORT] Commits stays INFLIGHT forever after S3 consistency check fails when Hudi tries to delete duplicate datafiles

2020-07-15 Thread GitBox
bvaradar edited a comment on issue #1764: URL: https://github.com/apache/hudi/issues/1764#issuecomment-658778968 We discussed this in yesterday's community weekly sync. We have opened a blocker in 0.6 (Please see https://issues.apache.org/jira/browse/HUDI-1098) to provide ways to control

[GitHub] [hudi] bvaradar commented on issue #1764: [SUPPORT] Commits stays INFLIGHT forever after S3 consistency check fails when Hudi tries to delete duplicate datafiles

2020-07-15 Thread GitBox
bvaradar commented on issue #1764: URL: https://github.com/apache/hudi/issues/1764#issuecomment-658778968 We discussed this in yesterday's community weekly sync. We have opened a blocker in 0.6 (Please see https://issues.apache.org/jira/browse/HUDI-1098) to provide ways to control this

[GitHub] [hudi] bvaradar closed issue #1764: [SUPPORT] Commits stays INFLIGHT forever after S3 consistency check fails when Hudi tries to delete duplicate datafiles

2020-07-15 Thread GitBox
bvaradar closed issue #1764: URL: https://github.com/apache/hudi/issues/1764

[jira] [Updated] (HUDI-1013) Bulk Insert w/o converting to RDD

2020-07-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1013: - Labels: pull-request-available (was: ) > Bulk Insert w/o converting to RDD >

[GitHub] [hudi] nsivabalan opened a new pull request #1834: [WIP][HUDI-1013] Adding Bulk Insert V2 implementation

2020-07-15 Thread GitBox
nsivabalan opened a new pull request #1834: URL: https://github.com/apache/hudi/pull/1834 ## What is the purpose of the pull request Adding support for "bulk_insert_dataset" which has better performance compared to existing "bulk_insert". ## Brief change log - Added

[GitHub] [hudi] bvaradar commented on issue #1828: [SUPPORT] Cannot force hudi to retain only last commit

2020-07-15 Thread GitBox
bvaradar commented on issue #1828: URL: https://github.com/apache/hudi/issues/1828#issuecomment-658764479 @kirkuz : AWS Athena support for Hudi is just out :

[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
bvaradar commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-658760384 @asheeshgarg Meanwhile, you can setup the configs as suggested in https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles
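A minimal sketch of what file-sizing settings might look like in a writer options map. The linked FAQ entry is not reproduced in this snippet, so the two keys used here ("hoodie.parquet.small.file.limit" and "hoodie.parquet.max.file.size") are an assumption about which settings it refers to, and the helper name and the sizes passed in are purely illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FileSizingOptions {
    // Assumed keys; verify against the Hudi configuration reference before use.
    static final String SMALL_FILE_LIMIT = "hoodie.parquet.small.file.limit";
    static final String MAX_FILE_SIZE = "hoodie.parquet.max.file.size";

    /** Builds byte-valued sizing options from megabyte inputs. */
    static Map<String, String> sizingOptions(long smallFileLimitMb, long maxFileSizeMb) {
        if (smallFileLimitMb < 0 || maxFileSizeMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (smallFileLimitMb > maxFileSizeMb) {
            // A "small file" threshold above the target size would never be satisfiable.
            throw new IllegalArgumentException("small file limit exceeds max file size");
        }
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(SMALL_FILE_LIMIT, Long.toString(smallFileLimitMb * 1024L * 1024L));
        opts.put(MAX_FILE_SIZE, Long.toString(maxFileSizeMb * 1024L * 1024L));
        return opts;
    }

    public static void main(String[] args) {
        // Illustrative values only: treat files under 100 MB as small, target 120 MB files.
        sizingOptions(100, 120).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Expressing the inputs in megabytes and converting once keeps the byte arithmetic in a single place, which makes unit mistakes easier to spot when tuning.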

[GitHub] [hudi] vinothchandar commented on pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-07-15 Thread GitBox
vinothchandar commented on pull request #1722: URL: https://github.com/apache/hudi/pull/1722#issuecomment-658757520 @garyli1019 actually @umehrot2 's approach there is wrapping the FileFormat as opposed to extending, which is great as well.. If you can wait, we can land the bootstrap

[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-15 Thread GitBox
bvaradar commented on issue #1825: URL: https://github.com/apache/hudi/issues/1825#issuecomment-658757612 @asheeshgarg : Clustering is planned for 0.7 release. We are currently working on getting 0.6 release out at the end of this month.

[GitHub] [hudi] vinothchandar commented on a change in pull request #1702: Bootstrap datasource changes

2020-07-15 Thread GitBox
vinothchandar commented on a change in pull request #1702: URL: https://github.com/apache/hudi/pull/1702#discussion_r444571139 ## File path: hudi-client/pom.xml ## @@ -101,6 +101,11 @@ org.apache.spark spark-sql_${scala.binary.version} + +

[GitHub] [hudi] lw309637554 commented on pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-15 Thread GitBox
lw309637554 commented on pull request #1810: URL: https://github.com/apache/hudi/pull/1810#issuecomment-658740024 @vinothchandar @garyli1019 I think the hudi-sync abstraction in this PR is ready, and I look forward to your review. Thanks. 1. The module abstraction is hudi-sync

[GitHub] [hudi] lw309637554 commented on a change in pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-15 Thread GitBox
lw309637554 commented on a change in pull request #1810: URL: https://github.com/apache/hudi/pull/1810#discussion_r455011023 ## File path: hudi-sync/hudi-dla-sync/src/main/java/org/apache/hudi/dla/DLASyncTool.java ## @@ -0,0 +1,211 @@ +/* + * Licensed to the Apache Software

[GitHub] [hudi] tooptoop4 opened a new issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

2020-07-15 Thread GitBox
tooptoop4 opened a new issue #1833: URL: https://github.com/apache/hudi/issues/1833 I have a single 700MB file containing 10mn rows (all unique keys, key is single column, single partition for all rows). 1. Create brand new table 2. Using spark datasource on the 700MB to write in

[GitHub] [hudi] yanghua commented on a change in pull request #1774: [HUDI-703]Add unit test for HoodieSyncCommand

2020-07-15 Thread GitBox
yanghua commented on a change in pull request #1774: URL: https://github.com/apache/hudi/pull/1774#discussion_r454987028 ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/HoodieTestHiveBase.java ## @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software

[GitHub] [hudi] yanghua commented on a change in pull request #1770: [HUDI-708]Add temps show and unit test for TempViewCommand

2020-07-15 Thread GitBox
yanghua commented on a change in pull request #1770: URL: https://github.com/apache/hudi/pull/1770#discussion_r454722777 ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/HoodieCLI.java ## @@ -115,4 +115,16 @@ public static synchronized TempViewProvider

[GitHub] [hudi] lw309637554 commented on a change in pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-15 Thread GitBox
lw309637554 commented on a change in pull request #1810: URL: https://github.com/apache/hudi/pull/1810#discussion_r454974889 ## File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ## @@ -255,6 +262,43 @@ private[hudi] object HoodieSparkSqlWriter {

[GitHub] [hudi] leesf commented on a change in pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-15 Thread GitBox
leesf commented on a change in pull request #1810: URL: https://github.com/apache/hudi/pull/1810#discussion_r454973961 ## File path: hudi-sync/hudi-dla-sync/src/main/java/org/apache/hudi/dla/DLASyncTool.java ## @@ -0,0 +1,211 @@ +/* + * Licensed to the Apache Software

[GitHub] [hudi] leesf commented on a change in pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-15 Thread GitBox
leesf commented on a change in pull request #1810: URL: https://github.com/apache/hudi/pull/1810#discussion_r454969135 ## File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ## @@ -255,6 +262,43 @@ private[hudi] object HoodieSparkSqlWriter {

[GitHub] [hudi] tooptoop4 edited a comment on issue #506: explodeRecordRDDWithFileComparisons is costly with HoodieBloomIndex/range pruning=on

2020-07-15 Thread GitBox
tooptoop4 edited a comment on issue #506: URL: https://github.com/apache/hudi/issues/506#issuecomment-658220450 I seem to have similar issue, running upsert of 700MB csv (twice, ie repeat the same csv upsert next day) with 16gb executor memory, 5 executors and shuffle parallelism of 16

[GitHub] [hudi] mabin001 closed pull request #1832: [HUDI-1099]: improve quality of the code calling the method.HiveSyncTool#syncPartitions

2020-07-15 Thread GitBox
mabin001 closed pull request #1832: URL: https://github.com/apache/hudi/pull/1832 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [hudi] mabin001 commented on pull request #1832: [HUDI-1099]: improve quality of the code calling the method.HiveSyncTool#syncPartitions

2020-07-15 Thread GitBox
mabin001 commented on pull request #1832: URL: https://github.com/apache/hudi/pull/1832#issuecomment-658655885 retest it please. This is an automated message from the Apache Git Service. To respond to the message, please log

[jira] [Updated] (HUDI-1099) improve quality of the code calling the method.HiveSyncTool#syncPartitions

2020-07-15 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1099: - Labels: pull-request-available (was: ) > improve quality of the code calling the

[GitHub] [hudi] mabin001 closed pull request #1832: [HUDI-1099]: improve quality of the code calling the method.HiveSyncTool#syncPartitions

2020-07-15 Thread GitBox
mabin001 closed pull request #1832: URL: https://github.com/apache/hudi/pull/1832 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[jira] [Updated] (HUDI-1099) improve quality of the code calling the method.HiveSyncTool#syncPartitions

2020-07-15 Thread linshan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linshan updated HUDI-1099: -- Summary: improve quality of the code calling the method.HiveSyncTool#syncPartitions (was: improve the

[jira] [Updated] (HUDI-1079) Cannot upsert on schema with Array of Record with single field

2020-07-15 Thread Adrian Tanase (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated HUDI-1079: Description: I am trying to trigger upserts on a table that has an array field with records of

[jira] [Updated] (HUDI-1099) improve the code。HiveSyncTool#syncPartitions

2020-07-15 Thread linshan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linshan updated HUDI-1099: -- Description: I will work on this ticket. Please assign it to me. When a list is passed as a parameter, we should determine if

[jira] [Updated] (HUDI-1099) improve the code。HiveSyncTool#syncPartitions

2020-07-15 Thread linshan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linshan updated HUDI-1099: -- Attachment: image-2020-07-15-15-48-08-674.png > improve the code。HiveSyncTool#syncPartitions >

[jira] [Updated] (HUDI-1099) improve the code。HiveSyncTool#syncPartitions

2020-07-15 Thread linshan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linshan updated HUDI-1099: -- Attachment: image-2020-07-15-15-48-55-821.png > improve the code。HiveSyncTool#syncPartitions >

[jira] [Updated] (HUDI-1099) improve the code。HiveSyncTool#syncPartitions

2020-07-15 Thread linshan (Jira)
[ https://issues.apache.org/jira/browse/HUDI-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linshan updated HUDI-1099: -- Attachment: image-2020-07-15-15-47-14-471.png > improve the code。HiveSyncTool#syncPartitions >

[GitHub] [hudi] mabin001 opened a new pull request #1832: improve the code。if the list is empty before calling the method.HiveSyncTool#syncPartitions

2020-07-15 Thread GitBox
mabin001 opened a new pull request #1832: URL: https://github.com/apache/hudi/pull/1832 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the

[GitHub] [hudi] lw309637554 commented on a change in pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-15 Thread GitBox
lw309637554 commented on a change in pull request #1810: URL: https://github.com/apache/hudi/pull/1810#discussion_r454853182 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java ## @@ -237,6 +237,9 @@ public Operation
