[jira] [Commented] (HBASE-26909) hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes

2022-03-31 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515372#comment-17515372
 ] 

Bryan Beaudreault commented on HBASE-26909:
---

In investigating this further, I also realized that I don't think the profiles 
in hbase-shaded-client-byo-hadoop and hbase-shaded-mapreduce are working as 
expected. I think the intention is that these profiles ensure that dependents 
of (for example) hbase-shaded-mapreduce pull in the correct hadoop dependencies 
transitively.

The profiles do indeed remain in the dependency-reduced pom, but if you try to 
include one of those shaded artifacts as a dependency in a downstream project, 
maven doesn't seem to pick up the transitive dependencies provided by those 
profiles. I tried various ways to trigger the profiles in my downstream 
project, including using {{-Phadoop-3.0}} and {{-Dhadoop.profile=3.0}}. The 
former complains that there is no hadoop-3.0 profile in my project, and the 
latter does nothing. Even without triggering profiles explicitly, the hadoop-2 
profile should be active by default in branch-2 (or hadoop-3 in master), so I 
would expect something to come through. But nothing does.
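
For anyone who wants to check what the profiles in a dependency-reduced pom 
actually declare, here is a minimal sketch using only the Python standard 
library. It is not part of the HBase build; the function name and the sample 
coordinates in the usage note are hypothetical. It only lists what each 
profile declares, it says nothing about whether maven activates the profile 
for a downstream project.

```python
import xml.etree.ElementTree as ET

# Standard Maven POM namespace; poms without it would need plain tag names.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def profile_dependencies(pom_xml):
    """Map each profile id in a POM to the (groupId, artifactId) pairs it declares."""
    root = ET.fromstring(pom_xml)
    result = {}
    for profile in root.findall("m:profiles/m:profile", POM_NS):
        profile_id = profile.findtext("m:id", default="", namespaces=POM_NS)
        deps = [
            (
                dep.findtext("m:groupId", default="", namespaces=POM_NS),
                dep.findtext("m:artifactId", default="", namespaces=POM_NS),
            )
            for dep in profile.findall("m:dependencies/m:dependency", POM_NS)
        ]
        result[profile_id] = deps
    return result
```

To run it against a real file, pass the file's bytes, for example 
{{profile_dependencies(open(path, "rb").read())}}, where {{path}} points at 
whichever pom you want to inspect.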

I have a POC going which takes a different approach: it creates 2 new wrapper 
modules, hbase-shaded-client-provided-dependencies and 
hbase-shaded-mapreduce-provided-dependencies. The profiles are moved into these 
modules. Per this jira, hbase-shaded-mapreduce-provided-dependencies also 
includes hbase-shaded-client-byo-hadoop.

The resulting artifacts work per the intention in the first paragraph – 
including them in a downstream project results in hadoop dependencies (and in 
the case of mapreduce, the byo-hadoop client) being pulled in transitively. I'm 
going to push a draft PR to illustrate the changes.

> hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
> --
>
> Key: HBASE-26909
> URL: https://issues.apache.org/jira/browse/HBASE-26909
> Project: HBase
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Priority: Major
>
> We supply 2 primary artifacts for end-users to consume:
>  * hbase-shaded-client, which is for general use
>  * hbase-shaded-mapreduce, which is for use when you need to connect to hbase 
> via mapreduce. For example, TableInputFormat
> The problem is that these artifacts expose tons of duplicate classes. One 
> example (among many) is org.apache.hadoop.hbase.Cell, which appears in both 
> jars.
> This may not be a problem if your projects are always very isolated – either 
> doing mapreduce, or not. In that case you just depend on the one you need. 
> Many users might exist in much more complicated environments where 
> dependencies tend to bleed between projects. Here's an 
> illustration:
>  * Imagine a project FooService, which includes two modules FooServiceRestWeb 
> (for the rest http resources) and FooServiceData (which includes DAOs for 
> accessing data). FooServiceRestWeb depends on FooServiceData to access hbase. 
>  In this case, FooServiceData should depend on hbase-shaded-client.
>  * Now imagine another project FooPipeline, which has modules 
> FooPipelineHadoop (with M/R jobs for processing data) and FooPipelineData 
> (which has some DAOs for accessing data). In this case, FooPipelineData might 
> depend on hbase-shaded-mapreduce since the context is intended for M/R.
>  * The problem arises when suddenly we want to include some data from 
> FooService into our pipeline. The most straightforward way to achieve this is 
> by depending on FooServiceData, which has all of the DAOs for that data but 
> also depends on hbase-shaded-client. At this point you have a problem, 
> because FooPipelineHadoop now depends on both hbase-shaded-mapreduce and 
> hbase-shaded-client.
> (Note, this obviously skirts around potential microservice solutions like 
> only accessing FooService's data through the API... it's just for 
> illustration, and it does come up.)
> From a plain java perspective, having these 2 jars on the classpath is 
> somewhat wasteful but not a huge issue since the implementations are all the 
> same.
> From a maven perspective, it's problematic because the maven dependency 
> plugin will complain about the conflicting classes.
> One potential fix is to add exclusions to the FooServiceData dependency, to 
> avoid pulling in hbase-shaded-client. This works on a one-off basis but is 
> much more painful in a large and complicated environment where this may come 
> up hundreds of times.
> A better fix in my opinion is to make hbase-shaded-mapreduce depend on 
> hbase-shaded-client and then only expose the classes that aren't already 
> exposed by the shaded client.
> [~busbey] also mentioned a BOM being a potential solution, but I don't have 
> experience with that.

[jira] [Commented] (HBASE-26909) hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes

2022-03-30 Thread Sean Busbey (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514915#comment-17514915
 ] 

Sean Busbey commented on HBASE-26909:
-

That's an important distinction. If we want a refactoring to pull out the 
duplicate classes, hbase-shaded-mapreduce will need to show a runtime 
dependency on the hbase-shaded-client-byo-hadoop artifact and not the one that 
includes hadoop classes. The intended use of hbase-shaded-mapreduce is always 
with an existing hadoop; the hadoop classes should come from that existing 
install.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26909) hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes

2022-03-30 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514903#comment-17514903
 ] 

Bryan Beaudreault commented on HBASE-26909:
---

I've been diffing the contents of the jars and actually, hbase-shaded-mapreduce 
is _not_ a superset of hbase-shaded-client. It lacks all of the hadoop/hdfs 
classes; for example, it does not include FileSystem. Instead, it's more 
accurate to say that hbase-shaded-mapreduce is a superset of 
hbase-shaded-client-byo-hadoop.
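
A comparison of this kind can be reproduced with a short script. The following 
is only a sketch of one way to do it, not necessarily how the diff above was 
produced: it treats each jar as a zip file and compares the sets of .class 
entry names. The function names are made up for illustration, and any jar 
paths you pass in are your own.

```python
import zipfile


def class_entries(jar_path):
    """Return the set of .class entry names inside a jar (a jar is a zip file)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def compare_jars(left_path, right_path):
    """Split the class entries of two jars into left-only, shared, and right-only sets."""
    left = class_entries(left_path)
    right = class_entries(right_path)
    return {
        "only_left": left - right,
        "shared": left & right,
        "only_right": right - left,
    }


def is_superset(candidate_path, other_path):
    """True if every class entry in other_path also appears in candidate_path."""
    return class_entries(other_path) <= class_entries(candidate_path)
```

Calling {{compare_jars}} on the client and mapreduce jars would list the 
overlapping entries under "shared", and {{is_superset}} answers the superset 
question directly. Note that this compares entry names only, not the bytes of 
each class.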
