[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034536#comment-17034536 ]

ASF subversion and git services commented on IMPALA-8778:

Commit ea0e1def6160d596082b01365fcbbb6e24afb21d in impala's branch refs/heads/master from Yanjia Li
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ea0e1de ]

IMPALA-8778: Support Apache Hudi Read Optimized Table

A Hudi Read Optimized Table contains multiple versions of Parquet files. To load the table correctly, Impala needs to recognize a Hudi Read Optimized Table as an HdfsTable and load the latest version of each file using HoodieROTablePathFilter.

Tests
- Unit test for Hudi in FileMetadataLoader
- Create table tests in functional_schema_template.sql
- Query tests in hudi-parquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins

> Support read/write Apache Hudi tables
> -------------------------------------
>
> Key: IMPALA-8778
> URL: https://issues.apache.org/jira/browse/IMPALA-8778
> Project: IMPALA
> Issue Type: New Feature
> Reporter: Yuanbin Cheng
> Assignee: Yanjia Gary Li
> Priority: Major
>
> Apache Impala currently does not support Apache Hudi and cannot even pull metadata from Hive.
> Related issue:
> [https://github.com/apache/incubator-hudi/issues/179]
> [https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146|https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146?filter=allopenissues]

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017842#comment-17017842 ]

Vinoth Chandar commented on IMPALA-8778:

Great to see this making progress!!
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015657#comment-17015657 ]

Yanjia Gary Li commented on IMPALA-8778:

Hello, this PR is ready to review!
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006969#comment-17006969 ]

Zoltán Borók-Nagy commented on IMPALA-8778:

[~garyli1019] you might want to take a look at testdata/bin/create-load-data.sh. You'll probably need a function similar to 'load-custom-schemas()', which will upload your data files to the test-warehouse directory.

You also need to create the tables in the Hive Metastore. You probably want to do that as part of the data loading, in which case you'll need to invoke those CREATE TABLE statements from create-load-data.sh. Alternatively, you can create the tables during test execution.

I looked at the output of the Jenkins job. It failed during the RAT check, which means the files omit copyright information. You can either add copyright statements to your files or, more likely, include them in the RAT exclude list: [https://github.com/apache/impala/blob/master/bin/rat_exclude_files.txt]
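To illustrate why binary test data trips a license check unless it is excluded, here is a minimal Python sketch of a RAT-style scan. It is not Apache RAT and does not read the real rat_exclude_files.txt: the function name `files_missing_license`, the marker string, and the glob-style exclude patterns are all assumptions made for this example, and the real exclude file's pattern syntax may differ.

```python
import fnmatch
from pathlib import Path

# Assumption: a file counts as licensed if this phrase appears near its top.
LICENSE_MARKER = "Licensed to the Apache Software Foundation"


def files_missing_license(root, exclude_patterns):
    """Return relative paths under root that are neither excluded nor licensed.

    exclude_patterns are glob-style patterns (hypothetical syntax) matched
    against each file's path relative to root.
    """
    root = Path(root)
    missing = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Excluded files (for example binary test data) are skipped entirely.
        if any(fnmatch.fnmatch(rel, pattern) for pattern in exclude_patterns):
            continue
        # Only the first few KiB are inspected; headers sit at the top of a file.
        head = path.read_bytes()[:4096].decode("utf-8", errors="ignore")
        if LICENSE_MARKER not in head:
            missing.append(rel)
    return missing
```

With an empty exclude list, a Parquet file under testdata/data/hudicow would be reported because it cannot carry a header; adding a pattern such as `testdata/data/hudicow/*` removes it from the report.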
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000559#comment-17000559 ]

Yanjia Gary Li commented on IMPALA-8778:

[~boroknagyz] I just included the unit test in the PR. I have some issues on my VM when creating all the mini-clusters and loading test data, so I manually copied the folder /testdata/data/hudicow to HDFS at /test-warehouse/hudicow. I'm not sure if this is the right path when running the automated script. Is there a script that only handles copying files into test-warehouse? Not all of the mini-clusters are working, but my HDFS is.

It looks like Jenkins doesn't like the test data. Should I add it in a different way?
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999469#comment-16999469 ]

Yanjia Gary Li commented on IMPALA-8778:

[~boroknagyz] that's very helpful, thanks!
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993598#comment-16993598 ]

Zoltán Borók-Nagy commented on IMPALA-8778:

Hi Yanjia,

Sorry for the late answer, I wasn't watching this Jira.

We usually keep this kind of thing under the testdata/ directory. For example, under testdata/data there are a number of files written in different file formats. During data load or tests we copy these files to HDFS under /test-warehouse/ so the tests can see them.

If you need something more complex than copying files, e.g. a utility program written in Java, then you probably want to create your Java application under testdata/. An example of that is testdata/TableFlattener.

I hope this answer helps.
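The copy step described above can be sketched as follows. This is only an illustration in Python, not what create-load-data.sh actually does: the helper name `hdfs_copy_commands` is hypothetical, and the `hudicow` directory name is taken from elsewhere in this thread. The sketch only builds the `hdfs dfs` argument lists, so it can be inspected without a running cluster.

```python
def hdfs_copy_commands(local_dir, warehouse_root="/test-warehouse"):
    """Build the `hdfs dfs` invocations that copy local_dir under warehouse_root.

    Returns argument lists suitable for subprocess.run(); nothing is executed
    here. The default warehouse_root matches the path named in the thread.
    """
    return [
        # Make sure the warehouse directory exists (no error if it already does).
        ["hdfs", "dfs", "-mkdir", "-p", warehouse_root],
        # Copy the local directory into the warehouse, overwriting existing files.
        ["hdfs", "dfs", "-put", "-f", local_dir, warehouse_root],
    ]
```

For example, `hdfs_copy_commands("testdata/data/hudicow")` yields a `-mkdir -p /test-warehouse` command followed by a `-put -f testdata/data/hudicow /test-warehouse` command, which places the files under /test-warehouse/hudicow.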
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979693#comment-16979693 ]

Yanjia Gary Li commented on IMPALA-8778:

Thanks for all the feedback! It will definitely be very interesting to add real-time support in the future. I will focus on setting up the testing environment for now.

My idea for the testing environment is to add an independent folder to the HDFS test-warehouse during the test data preparation stage. Then I can either test FileMetadataLoader directly or send a complete Impala query.

For writing the data, I can use the test jar provided by Hudi. That way we can create a real-time data source later, but I would have to create a new Java module to write the test data into the HDFS mini-cluster. Where is the proper location for this module? Or is there a better way you would recommend?

I will be on vacation for the next few weeks, so apologies for the delay.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977099#comment-16977099 ]

Vinoth Chandar commented on IMPALA-8778:

[~garyli1019] there is merge logic involved that is pretty custom to Hudi, so we probably still want to work with the Hudi input formats per se. We can tackle this down the line, once we get this really working :)
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977018#comment-16977018 ]

Tim Armstrong commented on IMPALA-8778:

Yeah, an HDFS table can have a mix of input formats. The HdfsScanNodeOperator handles multiple file formats just fine.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976968#comment-16976968 ]

Yanjia Gary Li commented on IMPALA-8778:

[~vinoth] Makes sense to me. I think it could also be possible to add the Real Time table without changing anything in the backend, if the frontend could combine Avro + Parquet into the HdfsTable (based on the code I read, but I'm not 100% sure).
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976786#comment-16976786 ]

Vinoth Chandar commented on IMPALA-8778:

Took a pass at the patch. One suggestion: maybe name the format `HUDI_PARQUET` so it's clearer? We could eventually do RT tables, and even ORC when we have it.

Otherwise, the way you used pathFilters looks good to me.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975336#comment-16975336 ]

Yanjia Gary Li commented on IMPALA-8778:

Done. Same URL: [https://gerrit.cloudera.org/#/c/14711/]
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975266#comment-16975266 ]

Tim Armstrong commented on IMPALA-8778:

[~garyli1019] drafts in Gerrit are only visible to the reviewers list. Can you publish it? Include "WIP" in the commit message so we know it's a work in progress.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974711#comment-16974711 ]

Yanjia Gary Li commented on IMPALA-8778:

Hi guys, I made a draft: [https://gerrit.cloudera.org/#/c/14711/]. Would you take a look to see if my approach makes sense? I will implement the tests after we agree on this approach.

From my understanding, HdfsTable handles partitions itself, so we cannot directly use the HoodieParquetInputFormat class. However, we can use HoodieROTablePathFilter to filter the fileStatus entries when Impala loads each partition.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969633#comment-16969633 ]

Yanjia Gary Li commented on IMPALA-8778:

Thanks Tim and Vinoth. I will follow the first path then.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968698#comment-16968698 ]

Vinoth Chandar commented on IMPALA-8778:

> I think you need logic in Impala that understands slices and only uses the latest slice when querying a partition.

+1. In Hive/Spark/Presto, we make the query call HoodieInputFormat to do this.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968581#comment-16968581 ]

Tim Armstrong commented on IMPALA-8778:

I don't see how you could implement reading from a Hudi table without changing Impala (or Hive for that matter). With the original Hive table layout, the contents of a partition are determined by listing a directory, and it looks like if you list the directory of a Hudi partition, you will get back duplicated data from multiple slices. I.e. I think you need logic in Impala that understands slices and only uses the latest slice when querying a partition.

The only way to add or remove an individual file to a classic Hive table (Impala/Hive tables are the same thing) is to add or remove it from the partition directory.
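The "latest slice" logic described above can be sketched in Python as follows. This is a simplified illustration, not Hudi's HoodieROTablePathFilter: it assumes the hypothetical `<fileId>_v<N>.parquet` naming used as an example in this thread, whereas real Hudi file names encode the file ID and commit time differently. The function names `latest_versions` and `listing_delta` are made up for this sketch.

```python
import re

# Assumption: simplified names of the form <file_id>_v<version>.parquet.
_NAME = re.compile(r"^(?P<file_id>.+)_v(?P<version>\d+)\.parquet$")


def latest_versions(file_names):
    """Keep only the highest-versioned file for each file ID in a listing."""
    best = {}
    for name in file_names:
        match = _NAME.match(name)
        if match is None:
            continue  # not a data file under the assumed naming scheme
        file_id = match.group("file_id")
        version = int(match.group("version"))
        if file_id not in best or version > best[file_id][0]:
            best[file_id] = (version, name)
    return sorted(name for _, name in best.values())


def listing_delta(before, after):
    """Return (to_drop, to_load): how the visible file set changes between listings."""
    old_visible = set(latest_versions(before))
    new_visible = set(latest_versions(after))
    return sorted(old_visible - new_visible), sorted(new_visible - old_visible)
```

For a partition listing of file1_v1.parquet, file1_v2.parquet, and file2_v1.parquet, `latest_versions` keeps file1_v2.parquet and file2_v1.parquet, so a plain directory listing would have read file1's data twice. Comparing the listings before and after the update with `listing_delta` gives file1_v1.parquet to drop and file1_v2.parquet to load.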
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968038#comment-16968038 ]

Yanjia Gary Li commented on IMPALA-8778:

Hello [~tarmstrong], I'd like to resume the discussion on this topic. Yuanbin finished his internship a few months ago, so please assign this ticket to me.

After reading some code on both the Impala and Hudi sides, these are the approaches I can think of:
* As discussed above, create a new class similar to HdfsTable, with a Hudi dependency, to filter paths.
* Implement everything on the Hudi side and send a sequence of queries to the Impala server to ALTER the table. The Hive sync tool in the Hudi repo uses this method.

I think the second approach could be easier than the first, because we could follow a strategy similar to the Hive sync tool and would not need to wait until the next release to use this feature. To make sure this method is possible, I'd like to know what query could handle this situation:
* First stage: in HDFS partition year=2019/month=10/day=1, we have file1_v1.parquet and file2_v1.parquet.
* Second stage: after a Hudi job updates the partition year=2019/month=10/day=1, we have file1_v1.parquet, file1_v2.parquet, and file2_v1.parquet.

If we want to *drop* file1_v1.parquet and *load* file1_v2.parquet into the table, what query should I run? What will happen if another user submits a query while the metadata is updating?

Thanks
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901249#comment-16901249 ]

Vinoth Chandar commented on IMPALA-8778:

> Another small question about how to determine that the Hoodie specific path, it seems that I can use HoodiePartitionMetadata to check whether it is a valid dataset if invalid or dataset not found, I can treat it as a no hoodie path, am I correct?

The HoodieTableMetaClient already does those things for you. We can follow up on the HUDI ticket more (to keep this about Impala/Hudi integration alone).

Also, I'd suggest that we land this once we have renamed the packages in Hudi to org.apache.hudi and made the first release. Rough ETA: end of month. So you can keep working on the patch as is, test it, and finally we can just pick up the new artifact.
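As a rough illustration of what "is this a Hudi path" means, the Python sketch below walks up from a partition directory looking for the `.hoodie` metadata directory that Hudi keeps at the table base path. It is only a sketch under that assumption and is no substitute for HoodieTableMetaClient, which validates far more than directory existence; the function name `find_hudi_base_path` is hypothetical.

```python
from pathlib import Path


def find_hudi_base_path(path):
    """Return the nearest ancestor (or path itself) containing `.hoodie`, else None.

    Assumption: a directory holding a `.hoodie` subdirectory is treated as a
    Hudi table base path. No metadata inside `.hoodie` is checked, and the
    search continues up to the filesystem root.
    """
    current = Path(path).resolve()
    for candidate in (current, *current.parents):
        if (candidate / ".hoodie").is_dir():
            return candidate
    return None
```

A path for which this returns `None` would be treated as a regular, non-Hudi table location in this simplified model.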
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900393#comment-16900393 ]

Tim Armstrong commented on IMPALA-8778:

[~Yuanbin] I don't think we can avoid adding the Hudi dependency, that seems OK to add.
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900334#comment-16900334 ]

Yuanbin Cheng commented on IMPALA-8778:

[~tarmstrong] As discussed before, I am trying to treat a Hudi dataset as a kind of Hive table in Impala. Currently, in order to get the latest version of the files in a Hudi partition, it seems that I need to use the Hudi classes directly, which means that Impala needs to take a Hudi dependency.

Can I add the Hudi dependency to Impala? Or is there some other way that I can call the Hudi classes from Impala?
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900331#comment-16900331 ]

Yuanbin Cheng commented on IMPALA-8778:

[~vinoth]
1. I have read the HoodieInputFormat. I see that I can use the Hudi classes and the timeline to get and filter the latest version of each partition of the Hudi dataset. We need to ask Tim whether we can add the Hudi dependency to the Impala project.
2. Correct, the table has to have only one version; multiple file versions will give the wrong result. I am thinking of adding support in Impala so that it can recognize a Hudi-specific path and then get the latest version of the files.
3. Another small question, about how to detect a Hoodie-specific path: it seems that I can use HoodiePartitionMetadata to check whether it is a valid dataset, and if it is invalid or the dataset is not found, I can treat it as a non-Hoodie path. Am I correct?
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899514#comment-16899514 ] Vinoth Chandar commented on IMPALA-8778:
>> Do you have any idea about how to load the latest version of the Hudi dataset without using the InputFormat as Hive, or any related code about the Hive metadata in Hudi may help a lot?
There are a few options for doing this using the Hudi classes directly, but that would mean Impala will now take a Hudi dependency. Is that okay? In short, if you have a `List` then you can use either [HoodieROTablePathFilter|https://github.com/apache/incubator-hudi/blob/479908fd20a97c5f7007f06ba7ee3904967e1050/hoodie-spark/src/main/scala/com/uber/hoodie/DefaultSource.scala#L66] (like the Spark datasource) or instantiate the Timeline/FileSystemView classes (like the [HoodieInputFormat|https://github.com/apache/incubator-hudi/blob/129e4336413fd2290e137804cf16c515c502c2f7/hoodie-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java#L89] does).
>> Current I just add the `HoodieInputFormat` as a VALID_INPUT_FORMAT which will make the Impala read the Hudi as the regular Parquet table.
But the table will have to be purely inserts, right? With upserts (and multiple file versions), you will get incorrect results?
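To make the "multiple file versions" problem concrete, here is a minimal sketch of picking the newest file per file group from a partition listing. It is not HoodieROTablePathFilter and does not use any Hudi API: it assumes data file names of the form `fileId_writeToken_commitTime.parquet` with fixed-width commit timestamps (so string comparison orders them), and it ignores the timeline, so unlike the real filter it cannot tell completed commits from in-flight ones. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LatestFileVersionSketch {
  /** Returns, for each file id, the file name carrying the greatest commit time. */
  static List<String> latestVersions(List<String> fileNames) {
    Map<String, String> latestByFileId = new HashMap<>();
    for (String name : fileNames) {
      String[] parts = stripExtension(name).split("_");
      if (parts.length != 3) {
        continue; // not in the assumed naming scheme; skip it
      }
      String fileId = parts[0];
      String current = latestByFileId.get(fileId);
      if (current == null || commitTime(name).compareTo(commitTime(current)) > 0) {
        latestByFileId.put(fileId, name);
      }
    }
    List<String> result = new ArrayList<>(latestByFileId.values());
    Collections.sort(result); // deterministic order for callers
    return result;
  }

  // Only called on names that already passed the three-part check above.
  private static String commitTime(String name) {
    return stripExtension(name).split("_")[2];
  }

  private static String stripExtension(String name) {
    int dot = name.lastIndexOf('.');
    return dot < 0 ? name : name.substring(0, dot);
  }
}
```

Reading every file in the listing, as a plain Parquet table scan would, returns rows from both the old and the new version of an upserted file group; keeping only the latest version per file id is what avoids those duplicate or stale rows.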
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898457#comment-16898457 ] Yuanbin Cheng commented on IMPALA-8778:
[~vinoth] I have read the code in Apache Impala that relates to HdfsTable. Because Hudi partitioning is compatible with Hive partitioning, my current thought is to change the partition-loading part of the code in Apache Impala. That is the loadFileMetadataForPartitions method in the HdfsTable class. This method groups the partitions by path, creates a `FileMetadataLoader` for every path, and then calls the load method on them in parallel. Here is the load method in FileMetadataLoader: [https://github.com/apache/impala/blob/9ee4a5e1940afa47227a92e0f6fba6d4c9909f63/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L129]
Since Impala doesn't use the InputFormat classes the way Hive does, I think I need to modify this partition-loading method to teach Impala how to load the Hudi table. Do you have any idea how to load the latest version of the Hudi dataset without using the InputFormat as Hive does? Any related code about the Hive metadata in Hudi would also help a lot.
Another thing is that I have created a draft change in Impala's Gerrit: [https://gerrit.cloudera.org/#/c/13948/] Currently I just add `HoodieInputFormat` as a VALID_INPUT_FORMAT, which makes Impala read the Hudi table as a regular Parquet table. I am struggling to add tests in Impala to verify that this change actually lets Impala read the Hudi data successfully; it seems that I need to add Hudi dependencies to the test setup and prepare some test data.
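The group-by-path and parallel-load pattern described for loadFileMetadataForPartitions can be sketched roughly as follows. This is not Impala's code: `PartitionLoadSketch`, `groupByPath` and `loadAll` are hypothetical names, partitions and paths are plain strings, and the `lister` function stands in for whatever a per-path loader would do (this is the place where a Hudi-aware listing could be plugged in).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class PartitionLoadSketch {
  /** Groups partition names by directory so that each directory is listed only once. */
  static Map<String, List<String>> groupByPath(Map<String, String> partitionToPath) {
    Map<String, List<String>> byPath = new LinkedHashMap<>();
    partitionToPath.forEach((partition, path) ->
        byPath.computeIfAbsent(path, p -> new ArrayList<>()).add(partition));
    return byPath;
  }

  /** Runs one listing task per distinct path in parallel and collects results by path. */
  static Map<String, List<String>> loadAll(
      Iterable<String> paths, Function<String, List<String>> lister, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
      for (String path : paths) {
        Callable<List<String>> task = () -> lister.apply(path);
        pending.put(path, pool.submit(task));
      }
      Map<String, List<String>> result = new LinkedHashMap<>();
      for (Map.Entry<String, Future<List<String>>> entry : pending.entrySet()) {
        result.put(entry.getKey(), entry.getValue().get()); // blocks until that path is done
      }
      return result;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      throw new IllegalStateException("interrupted while loading file metadata", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("listing failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

With this shape, supporting another table type only changes the `lister` passed in; the grouping and the parallel fan-out stay the same.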
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897458#comment-16897458 ] Vinoth Chandar commented on IMPALA-8778:
[~Yuanbin] any early thoughts on reading the Impala code?
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892286#comment-16892286 ] Tim Armstrong commented on IMPALA-8778:
I am not all that familiar with this code myself, but I know enough to get you started. Impala doesn't use the same InputFormat classes as Hive. Rather, it recognises the Java class names and handles the format on its own. E.g. Impala is aware of the "MapredParquetInputFormat" class and refers to it internally as the PARQUET file format - see https://github.com/apache/impala/blob/94652d74521e95e8606ea2d22aabcaddde6fc471/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java#L62
I think the first step would be to add Hudi to the list of known file formats in HdfsFileFormat.java. Then you could teach Impala to load the table by making isHdfsInputFormatClass() return true here: https://github.com/apache/impala/blob/fc974f944a9266e68e6f1694eecdc2160fd52582/fe/src/main/java/org/apache/impala/catalog/Table.java#L327
Then you would need to teach Impala how to load the files and partitions for HdfsTable. If the partitioning is compatible, then maybe we just need to get the file metadata loading working. The file metadata is loaded here: https://github.com/apache/impala/blob/fc974f944a9266e68e6f1694eecdc2160fd52582/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L554
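As a rough illustration of the name-based recognition described in this comment, the sketch below maps input format class names to an internal format enum. It is not the real HdfsFileFormat: `SketchFileFormat`, its constants (including `HUDI_PARQUET`) and both methods are hypothetical stand-ins, and the only detail taken from the thread's links is that `HoodieInputFormat` lived in the `com.uber.hoodie.hadoop` package at the time.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a file format registry keyed by input format class name. */
enum SketchFileFormat {
  PARQUET, TEXT, HUDI_PARQUET;

  // Class names as stored in the Hive metastore, mapped to the internal format.
  private static final Map<String, SketchFileFormat> BY_INPUT_FORMAT = Map.of(
      "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat", PARQUET,
      "org.apache.hadoop.mapred.TextInputFormat", TEXT,
      "com.uber.hoodie.hadoop.HoodieInputFormat", HUDI_PARQUET);

  /** Looks up the internal format for a class name; empty if the name is unknown. */
  static Optional<SketchFileFormat> fromInputFormatClass(String className) {
    // Map.of rejects null lookups, so guard explicitly.
    return className == null
        ? Optional.empty()
        : Optional.ofNullable(BY_INPUT_FORMAT.get(className));
  }

  /** True when a table with this input format could be loaded as an HDFS-backed table. */
  static boolean isHdfsInputFormatClass(String className) {
    return fromInputFormatClass(className).isPresent();
  }
}
```

The point of the design is that the InputFormat class is never instantiated or invoked: the name alone selects the code path, so registering a new name is only the first step and the file listing logic still has to be taught separately.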
[jira] [Commented] (IMPALA-8778) Support read/write Apache Hudi tables
[ https://issues.apache.org/jira/browse/IMPALA-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892276#comment-16892276 ] Yuanbin Cheng commented on IMPALA-8778:
[~tarmstr...@cloudera.com] Hi Tim, I created this ticket for the task of adding Hudi support to Impala. From the patch that implemented Presto support in Apache Hudi, I found that they added the Hudi support in the following way:
"In presto, there was a point in the code that lists the DFS folders with the `inputFormat` object (HoodieInputFormat for Hudi tables) actually constructed already. All we did was check if the `inputFormat` object was an instance of HoodieInputFormat and call inputFormat.getSplits() to obtain the latest Hudi file slices for the presto query."
I also got a suggestion about this task:
"The `HoodieInputFormat` is annotated with a special annotation. All we need to do in Impala is find the place where it lists the file system for files and check for this condition and filter for the latest file versions by calling `HoodieInputFormat.getSplits()`. This will unblock your use-case and let you query RO view on Impala."
Is there any point in Impala that "lists the DFS folders with the `inputFormat` object"? It would be very helpful if you could help me determine whether or not I can use the same method for this task. I am currently searching the Impala code for this point, but since I am not familiar with the Impala source it is taking a lot of effort. Thanks so much!