[jira] [Commented] (HBASE-26909) hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
[ https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515372#comment-17515372 ]

Bryan Beaudreault commented on HBASE-26909:
-------------------------------------------

While investigating this further, I also realized that the profiles in hbase-shaded-client-byo-hadoop and hbase-shaded-mapreduce don't seem to be working as expected. I think the intention is that these profiles ensure that dependents of (for example) hbase-shaded-mapreduce pull in the correct hadoop dependencies transitively. The profiles do indeed remain in the dependency-reduced pom, but if you include one of those shaded artifacts as a dependency in a downstream project, maven doesn't seem to pick up the transitive dependencies provided by those profiles.

I tried various ways to trigger the profiles in my downstream project, including using {{-Phadoop-3.0}} and {{-Dhadoop.profile=3.0}}. The former complains that there is no hadoop-3.0 profile in my project, and the latter does nothing. Even without triggering profiles explicitly, the hadoop-2 profile should be active by default in branch-2 (or hadoop-3 in master), so I would expect something to come through. But nothing does.

I have a POC going which takes a different approach: it creates 2 new wrapper modules, hbase-shaded-client-provided-dependencies and hbase-shaded-mapreduce-provided-dependencies. The profiles are moved into these modules. Per this jira, hbase-shaded-mapreduce-provided-dependencies also includes hbase-shaded-client-byo-hadoop. The resulting artifacts work per the intention in the first paragraph: including them in a downstream project results in hadoop dependencies (and in the case of mapreduce, the byo-hadoop client) being pulled in transitively.

I'm going to push a draft PR to illustrate the changes.

> hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-26909
>                 URL: https://issues.apache.org/jira/browse/HBASE-26909
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> We supply 2 primary artifacts for end-users to consume:
> * hbase-shaded-client, which is for general use
> * hbase-shaded-mapreduce, which is for use when you need to connect to hbase
> via mapreduce. For example, TableInputFormat
> The problem is that these artifacts expose tons of duplicate classes. One
> example (among many) is org.apache.hadoop.hbase.Cell, which appears in both
> jars.
> This may not be a problem if your projects are always very isolated, either
> doing mapreduce or not. In that case you just depend on the one you need.
> Many users might exist in much more complicated environments where
> dependencies tend to bleed more between projects. Here's an
> illustration:
> * Imagine a project FooService, which includes two modules FooServiceRestWeb
> (for the rest http resources) and FooServiceData (which includes DAOs for
> accessing data). FooServiceRestWeb depends on FooServiceData to access hbase.
> In this case, FooServiceData should depend on hbase-shaded-client.
> * Now imagine another project FooPipeline, which has modules
> FooPipelineHadoop (with M/R jobs for processing data) and FooPipelineData
> (which has some DAOs for accessing data). In this case, FooPipelineData might
> depend on hbase-shaded-mapreduce since the context is intended for M/R.
> * The problem arises when suddenly we want to include some data from
> FooService into our pipeline. The most straightforward way to achieve this is
> by depending on FooServiceData, which has all of the DAOs for that data but
> also depends on hbase-shaded-client. At this point you have a problem,
> because FooPipelineHadoop now depends on both hbase-shaded-mapreduce and
> hbase-shaded-client.
> (Note, this obviously skirts around potential microservice solutions like
> only accessing FooService's data through the API... it's just for
> illustration, and it does come up.)
> From a plain java perspective, having these 2 jars on the classpath is
> somewhat wasteful but not a huge issue since the implementations are all the
> same.
> From a maven perspective, it's problematic because the maven dependency
> plugin will complain about the conflicting classes.
> One potential fix is to add exclusions to the FooServiceData dependency, to
> avoid pulling in hbase-shaded-client. This works on a one-off basis but is
> much more painful in a large and complicated environment where this may come
> up hundreds of times.
> A better fix in my opinion is to make hbase-shaded-mapreduce depend on
> hbase-shaded-client and then only expose the classes that aren't already
> exposed by the shaded client.
> [~busbey] also mentioned a BOM being a potential solution, but I don't have
> experience with that.
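
For readers who want to see the overlap described in the issue for themselves, here is a minimal sketch that lists the class entries two jars have in common. The class name ShadedJarOverlap and its method names are made up for this illustration, the jar paths are whatever you pass on the command line, and this is not necessarily how the overlap was measured in the issue.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper, not part of HBase: reports class files present in two jars.
final class ShadedJarOverlap {

    /** Returns the names of all .class entries in the given jar, sorted. */
    static Set<String> classEntries(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream()
                    .map(JarEntry::getName)
                    .filter(name -> name.endsWith(".class"))
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    /** Returns the class entries that appear in both jars. */
    static Set<String> duplicates(Path first, Path second) throws IOException {
        Set<String> shared = classEntries(first);
        shared.retainAll(classEntries(second));
        return shared;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ShadedJarOverlap <first.jar> <second.jar>");
            return;
        }
        Set<String> shared = duplicates(Path.of(args[0]), Path.of(args[1]));
        System.out.println(shared.size() + " class entries appear in both jars");
        // Print only a sample so the output stays readable for large jars.
        shared.stream().limit(20).forEach(System.out::println);
    }
}
```

Pointing it at locally built copies of the hbase-shaded-client and hbase-shaded-mapreduce jars should surface entries such as org/apache/hadoop/hbase/Cell.class if the duplication described above is present.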
[jira] [Commented] (HBASE-26909) hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
[ https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514915#comment-17514915 ]

Sean Busbey commented on HBASE-26909:
-------------------------------------

That's an important distinction. If we want a refactoring to pull out the duplicate classes, hbase-shaded-mapreduce will need to show a runtime dependency on the hbase-shaded-client-byo-hadoop artifact and not the one that includes hadoop classes. The intended use of hbase-shaded-mapreduce is always with an existing hadoop; the hadoop classes should come from that existing install.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (HBASE-26909) hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
[ https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514903#comment-17514903 ]

Bryan Beaudreault commented on HBASE-26909:
-------------------------------------------

I've been diffing the contents of the jars and actually, hbase-shaded-mapreduce is _not_ a superset of hbase-shaded-client. It lacks all of the hadoop/hdfs classes; for example, it does not include FileSystem. Instead, it's more accurate to say that hbase-shaded-mapreduce is a superset of hbase-shaded-client-byo-hadoop.
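
The kind of jar comparison mentioned in the comment above can be approximated with a small set-difference check. This is only a sketch under assumptions: the class ShadedJarSupersetCheck and its method names are invented for illustration, the jar paths are supplied by the caller, and the commenter's actual diffing method is not stated. An empty result means the first jar contains every class entry of the second.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper, not part of HBase: checks whether one jar is a superset of another.
final class ShadedJarSupersetCheck {

    /** Returns the names of all .class entries in the given jar, sorted. */
    static Set<String> classEntries(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream()
                    .map(JarEntry::getName)
                    .filter(name -> name.endsWith(".class"))
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    /** Returns class entries of {@code other} that {@code candidateSuperset} lacks. */
    static Set<String> missingFrom(Path candidateSuperset, Path other) throws IOException {
        Set<String> missing = classEntries(other);
        missing.removeAll(classEntries(candidateSuperset));
        return missing;
    }

    /** Counts entries per containing directory, to show which packages are absent. */
    static Map<String, Long> countByDirectory(Set<String> entries) {
        return entries.stream().collect(Collectors.groupingBy(
                name -> name.contains("/") ? name.substring(0, name.lastIndexOf('/')) : "",
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ShadedJarSupersetCheck <candidate-superset.jar> <other.jar>");
            return;
        }
        Set<String> missing = missingFrom(Path.of(args[0]), Path.of(args[1]));
        System.out.println(missing.isEmpty()
                ? "first jar is a superset of the second"
                : missing.size() + " class entries are missing from the first jar");
        countByDirectory(missing).forEach((dir, count) -> System.out.println(count + "\t" + dir));
    }
}
```

Running it once with hbase-shaded-mapreduce as the candidate against hbase-shaded-client, and once against hbase-shaded-client-byo-hadoop, would be one way to test both claims in the comment.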