lirui-apache commented on issue #8449: [FLINK-12235][hive] Support Hive
partition in HiveCatalog
URL: https://github.com/apache/flink/pull/8449#issuecomment-495051154
> @lirui-apache Thank you very much for the update!
>
> I just spotted that, currently how several APIs impl work like this: 1)
get a raw hive table 2) parse part of the raw table. The latter step actually
duplicate with logic in `instantiateHiveCatalogTable()`. E.g.
`ensureTableAndPartitionMatch()` parses `FLINK_PROPERTY_IS_GENERIC`,
`instantiateHivePartition()` parses partition keys, `ensurePartitionedTable()`
parses the raw table's partition key size, all of which we can get by just
parsing the raw table to a `CatalogTable` thru `instantiateHiveCatalogTable()`
in advance. The current duplication also means if we change some general logic
in parsing a hive table, we need to change two places. Thus I wonder if it
makes sense to just parse the raw table as whole at the beginning rather than
having scattered places each parsing only part of it themselves. And we can
remove util methods such as `getFieldNames()` which is only used to get the
partition keys which is already available in `CatalogTable`.
>
> For example, change
>
> ```
> public void createPartition(...) {
> Table hiveTable = getHiveTable(tablePath);
> ensureTableAndPartitionMatch(hiveTable, partition);
> ensurePartitionedTable(tablePath, hiveTable);
> try {
> client.add_partition(instantiateHivePartition(hiveTable,
partitionSpec, partition));
> } ...
> }
> ```
>
> to something like:
>
> ```
> public void createPartition(...) {
> Table hiveTable = getHiveTable(tablePath);
> CatalogBaseTable catalogTable = instantiateHiveCatalogTable(hiveTable);
> ... check whether catalogTabe and catalogPartition type matches would be
much easier here ...
> ... check whether catalogTable is partitioned would be easier here ...
> try {
> client.add_partition(
> instantiateHivePartition(catalogTable, partitionSpec, partition,
hiveTable.getSd()));
> } ...
> }
> ```
I'm not sure how much benefit this can bring us. It might make
`ensureTableAndPartitionMatch` a little easier -- we can check the type of
CatalogBaseTable instead of parsing a property. But I don't think we should
take the same approach for `ensurePartitionedTable`. `ensurePartitionedTable`
is already simple enough. And for APIs like `listPartitions`, creating a
CatalogBaseTable just to get num of partition cols seems an overkill to me. And
the same applies to `getFieldNames`. E.g. I think it'll be an overkill to
create a CatalogBaseTable to get partition col names in `dropPartition`.
Maybe an alternative approach to the problem you mentioned is to have more
util methods, in order to avoid duplication. For example, we should have a util
method to decide whether a Hive table is generic. And all the APIs needing this
logic can call the util method. What do you think?
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
With regards,
Apache Git Services