Re: Support Apache Hudi

2019-07-19 Thread Tim Armstrong
I added you to the contributor role on JIRA.


RE: Support Apache Hudi

2019-07-19 Thread FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1)
Hi Tim,

Thanks so much for the information.
My Jira user name is Yuanbin.

Looking forward to making some contributions.

Best regards

Yuanbin Cheng
CR/PJ-AI-S1

Re: Support Apache Hudi

2019-07-19 Thread Tim Armstrong
Please feel free to create a JIRA. We can add you as a contributor on
Apache JIRA if you give us your username; then you can assign it to yourself.

You should be able to use our Jenkins instance to run tests on a draft
Gerrit patch:
https://cwiki.apache.org/confluence/display/IMPALA/Using+Gerrit+to+submit+and+review+patches#UsingGerrittosubmitandreviewpatches-Verifyingapatch(opentoallImpalacontributors)

Unfortunately we don't have a way to accelerate the initial local build. We
have a few tips for making incremental builds significantly faster here:
https://cloudera.atlassian.net/wiki/spaces/ENG/pages/100832437/Tips+for+Faster+Impala+Builds
It is a lot quicker to iterate on code changes if you follow some of the
tips there, e.g. use ccache and only rebuild the components of Impala that
you modified.

- Tim


RE: Support Apache Hudi

2019-07-19 Thread FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1)
Hi Tim,

The Hudi developers said that the Hudi partitioning is compatible with Hive
partitioning.
I think I got some ideas from the implementation of the Hive ACID support
tickets, and I am now trying to implement the Hudi support.

Could I create a Jira ticket for this task and use your Jenkins server for
builds? It takes me so much time waiting for the build process.

Thanks so much!

Best regards

Yuanbin Cheng
CR/PJ-AI-S1

Re: Support Apache Hudi

2019-07-16 Thread Tim Armstrong
Sorry I meant to refer to
./fe/src/main/java/org/apache/impala/catalog/local/LocalHbaseTable.java;
FeHdfsTable is an interface shared by those two classes.

There's a default catalog implementation that is based on all Impala
daemons holding a cached snapshot of metadata, and a reimplementation where
Impala daemons fetch metadata on demand from a catalog service. The design
doc for the reimplementation is here, although I suspect some details have
changed:
details have changed:
https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit

It may be helpful to look at some recent commits that added Hive ACID
support just to get an idea of how that was implemented:
https://gerrit.cloudera.org/#/q/acid

I guess one detail that may not work so well with HdfsTable is the
partitioning - it's unclear to me how compatible the Hudi partitioning is
with Hive's partitioning scheme.

- Tim
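
The partitioning concern above comes down to how partition values are encoded on disk. A minimal sketch, assuming Hudi uses the same `key=value` directory layout as Hive (which is exactly what the thread is trying to confirm); the class and method names here are illustrative, not Impala code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathSketch {
  // Parse a Hive-style relative partition path such as "year=2019/month=07"
  // into an ordered map of partition column -> value.
  static Map<String, String> parsePartitionPath(String relPath) {
    Map<String, String> values = new LinkedHashMap<>();
    for (String segment : relPath.split("/")) {
      int eq = segment.indexOf('=');
      if (eq > 0) {
        values.put(segment.substring(0, eq), segment.substring(eq + 1));
      }
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println(parsePartitionPath("year=2019/month=07/day=16"));
    // prints {year=2019, month=07, day=16}
  }
}
```

If Hudi really follows this scheme, an HdfsTable-style loader could reuse the same partition-discovery path with no changes.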





RE: Support Apache Hudi

2019-07-16 Thread FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1)
Hi Tim,

Thanks so much for the suggestion.
I also think that implementing a Hudi table as a variant of HdfsTable would
be a cleaner way.
I will focus on understanding HdfsTable now; it is really a big file.

Currently, our team only uses the Copy-on-Write mode, so I will try to
implement Copy-on-Write first.

Can you explain more about the two catalog implementations?
My understanding is that one is more for the metadata of the table and one
is for the frontend interface of the table; however, for the HdfsTable, I
only found HdfsTable, no FeHdfsTable.

Thanks so much!

Best regards

Yuanbin Cheng
CR/PJ-AI-S1  
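
For context on the Copy-on-Write mode discussed above: on read, a COW table exposes only the newest base file of each file group. A hedged sketch of that selection logic; the file-name pattern `<fileId>_<writeToken>_<commitTime>.parquet` follows Hudi's general naming convention but should be treated as an assumption here, and the class is illustrative, not Impala code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CowFileSelectionSketch {
  // Keep only the base file from the latest commit for each file group.
  static List<String> latestBaseFiles(List<String> files) {
    Map<String, String> newestName = new TreeMap<>();   // fileId -> file name
    Map<String, String> newestCommit = new HashMap<>(); // fileId -> commit time
    for (String name : files) {
      String stem = name.substring(0, name.lastIndexOf('.'));
      String[] parts = stem.split("_"); // <fileId>_<writeToken>_<commitTime>
      String fileId = parts[0];
      String commit = parts[2];
      if (!newestCommit.containsKey(fileId)
          || commit.compareTo(newestCommit.get(fileId)) > 0) {
        newestCommit.put(fileId, commit);
        newestName.put(fileId, name);
      }
    }
    return new ArrayList<>(newestName.values());
  }

  public static void main(String[] args) {
    List<String> files = List.of(
        "f1_0_20190716.parquet", "f1_0_20190719.parquet",
        "f2_0_20190716.parquet");
    System.out.println(latestBaseFiles(files));
    // prints [f1_0_20190719.parquet, f2_0_20190716.parquet]
  }
}
```

This filtering is the main extra step beyond what a plain HdfsTable load does, which is why COW read support is the simpler starting point.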





Re: Support Apache Hudi

2019-07-16 Thread Tim Armstrong
Hi Cheng,
  I think that is one way you could approach it. I'm not really familiar
enough with Hudi to know if that's the right way. I took a quick look at
https://hudi.incubator.apache.org/concepts.html and I'm wondering if it
would actually be cleaner to implement as a variant of HdfsTable. HdfsTable
is used for any Hive filesystem-based table, not just HDFS - e.g. S3 or
whatever. Hudi seems like it's similar to Hive ACID in a lot of ways, which
we're currently adding support for in that way.

Which Hudi features are you planning to implement? Copy-on-Write seems like
it would be simpler to implement - it might only require changes in the
frontend (i.e. Java code). Merge-on-read probably requires backend support
for merging the delta files with the base files. Write support also seems
more complex than read support.

Also another note - currently there are actually two catalog
implementations that require their own table implementation, e.g. see
fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java and
fe/src/main/java/org/apache/impala/catalog/HBaseTable.java
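
The two-catalog note above implies that any new Hudi table type needs a shared frontend interface plus one implementation per catalog mode, mirroring the FeHBaseTable / HBaseTable split. A minimal sketch of that shape; FeHudiTable, HudiTable, and LocalHudiTable are hypothetical names, not existing Impala classes:

```java
// Shared frontend interface, analogous in role to FeHBaseTable.
interface FeHudiTable {
  String tableType(); // e.g. "COPY_ON_WRITE" or "MERGE_ON_READ"
}

// Implementation for the default catalog (cached metadata snapshot).
class HudiTable implements FeHudiTable {
  public String tableType() { return "COPY_ON_WRITE"; }
}

// Implementation for the on-demand metadata ("local") catalog.
class LocalHudiTable implements FeHudiTable {
  public String tableType() { return "COPY_ON_WRITE"; }
}

public class CatalogSplitSketch {
  public static void main(String[] args) {
    // Planner-side code would program against the interface only.
    for (FeHudiTable t : new FeHudiTable[] {new HudiTable(), new LocalHudiTable()}) {
      System.out.println(t.tableType());
    }
  }
}
```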

On Tue, Jul 16, 2019 at 9:55 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
[email protected]> wrote:

> Hi,
>
> Our team is now using Apache Hudi to migrate our data pipeline from batch
> to incremental processing.
> However, we find that Apache Impala cannot pull the Hudi metadata from
> Hive.
> Here is the issue: https://github.com/apache/incubator-hudi/issues/179
> Now I am trying to fix this issue.
>
> After reading some code related to Impala's table objects, my current
> thought is to implement a new HudiTable class and add it to the
> fromMetastoreTable method in the Table class.
> Maybe adding some support methods to the current Table type could also
> solve this issue? I am not very familiar with the Impala source code.
> Here is the Jira ticket for this issue:
> https://issues.apache.org/jira/projects/HUDI/issues/HUDI-146
>
> Do you have any idea about how to solve this issue?
>
> I appreciate any help!
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
>
>
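
The fromMetastoreTable idea in the message above could look roughly like the sketch below: dispatch on what the Hive Metastore records for the table. Hudi tables typically register a Hoodie input format in the metastore, but the exact string used here and all class names are assumptions for illustration only:

```java
import java.util.Map;

public class TableFactorySketch {
  // Assumed marker: the Hoodie input format class that Hudi registers in the
  // Hive Metastore; the exact name may differ by Hudi version.
  static final String HUDI_INPUT_FORMAT =
      "org.apache.hudi.hadoop.HoodieParquetInputFormat";

  // Decide which table class a fromMetastoreTable-style factory would build,
  // given storage-descriptor fields fetched from the metastore.
  static String chooseTableClass(Map<String, String> storageDescriptor) {
    String inputFormat = storageDescriptor.getOrDefault("inputFormat", "");
    if (inputFormat.equals(HUDI_INPUT_FORMAT)) {
      return "HudiTable"; // hypothetical new class
    }
    return "HdfsTable";   // existing filesystem-backed table
  }

  public static void main(String[] args) {
    System.out.println(chooseTableClass(
        Map.of("inputFormat", HUDI_INPUT_FORMAT))); // prints HudiTable
  }
}
```

Whether a dedicated class or extra methods on the existing table type is cleaner is exactly the design question the thread goes on to discuss.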