The main problem I see with a SerDe-based approach is that this abstraction cannot expose the full set of metadata needed for the target table. While the SerDe can return the schema (via getObjectInspector(), I presume), it has no provision for delivering the available partitions, or table and column statistics.
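To make the gap concrete, here is a minimal sketch in plain Java. It does not use any Hive classes: SchemaSource, TableMetadataSource, and every method name below are hypothetical stand-ins, assumed purely for illustration. The first models a contract that can only describe columns, the second the wider contract a remote table would seem to need.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class MetadataGapSketch {

    /** Hypothetical stand-in for a schema-only contract (the role the SerDe plays). */
    public interface SchemaSource {
        List<String> columnNames();
    }

    /** Hypothetical wider contract: schema plus partitions and basic statistics. */
    public interface TableMetadataSource extends SchemaSource {
        List<String> partitionNames();
        long rowCount();
    }

    /** Partitions are only obtainable when the source implements the wider contract. */
    public static Optional<List<String>> partitionsOf(SchemaSource source) {
        if (source instanceof TableMetadataSource) {
            return Optional.of(((TableMetadataSource) source).partitionNames());
        }
        return Optional.empty();
    }

    /** A source that knows its columns and nothing else. */
    public static SchemaSource schemaOnly() {
        return () -> Arrays.asList("id", "name");
    }

    /** A source that can also report partitions and a row count (made-up values). */
    public static TableMetadataSource fullMetadata() {
        return new TableMetadataSource() {
            @Override public List<String> columnNames() {
                return Arrays.asList("id", "name");
            }
            @Override public List<String> partitionNames() {
                return Arrays.asList("dt=2018-04-26", "dt=2018-04-27");
            }
            @Override public long rowCount() {
                return 1000L;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(partitionsOf(schemaOnly()).isPresent());
        System.out.println(partitionsOf(fullMetadata()).isPresent());
    }
}
```

With only the narrower contract, a caller has no way to ask for partitions or statistics, which is the limitation I am describing above.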
On a related note, I believe this might also preclude the SerDe from acting as the main integration point for an Iceberg integration (https://issues.apache.org/jira/browse/HIVE-19457), as this too will need to pass additional metadata that is stored outside of the metastore and does not fall within the scope of the SerDe interface. The org.apache.hadoop.hive.ql.metadata.MetastoreClientFactory integration point looks promising for both of these cases, but I can only find an operational implementation of this in EMR.

Cheers,

Elliot.

On 27 April 2018 at 17:32, Elliot West <tea...@gmail.com> wrote:

> Hi Johannes,
>
> We did not. I presume that your suggestion is that my use case could be
> implemented as a storage handler, and not that we access remote Hive data
> via JDBC (and by implication, HS2)?
>
> I must confess that I hadn't considered this approach, likely because for
> some time I'd assumed that a storage handler could not also be the source
> of table metadata. However, lately I've been externalizing schemas with the
> AvroSerDe and so I now have practical experience that demonstrates that
> isn't the case.
>
> It's a very good idea and I'm keen to look into the practicalities.
>
> Thank you for your helpful reply.
>
> Elliot.
>
>
> On 26 April 2018 at 17:28, Johannes Alberti <johan...@altiscale.com>
> wrote:
>
>> Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler
>> and discuss the pros/cons/similarities of the Qubole approach?
>>
>> On Thu, Apr 26, 2018 at 4:01 AM, Elliot West <tea...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
>>> Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for
>>> federating multiple Hive clusters and providing the illusion of a unified
>>> data catalog from disparate instances.
>>>
>>> We’ve been running Waggle Dance in production for well over a year and
>>> it has formed a critical part of our data platform architecture and
>>> infrastructure. We believe that this type of functionality will be of
>>> increasing importance as Hadoop and Hive workloads migrate to the cloud.
>>> While Waggle Dance is one solution, significant benefits could be realized
>>> if these kinds of abilities were an integral part of the Hive platform.
>>>
>>> If this sounds of interest, I've created a proposal on the Hive wiki.
>>> I've outlined why we think such a feature is needed in Hive, the benefits
>>> gained by offering it as a built-in feature, and a representation of a
>>> possible implementation. Our proposed implementation draws inspiration from
>>> the remote table features present in some traditional RDBMSes, which may
>>> already be familiar to you.
>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092
>>>
>>> Feedback gratefully accepted,
>>>
>>> Elliot.
>>>
>>> Senior Engineer
>>> Big Data Platform Team
>>> Hotels.com