Hi, /cc Peter, as you might have some thoughts based on your experience with Iceberg :)
I've noticed another odd behavior with the "hive.io.file.readcolumn.names" property.

Consider this query that reads from two separate tables at once:

    SELECT *
    FROM (
      SELECT num as number, str_val as text FROM t1
      UNION ALL
      SELECT * FROM t2
    ) unioned_table
    ORDER BY number

When using the "mr" execution engine, the value of the "hive.io.file.readcolumn.names" property cannot be relied on, as it seems to be stuck on the fields of just one of the tables. As a workaround, I have to request all of the tables' columns when querying the external storage in my custom storage handler, which is unfortunately quite inefficient.

Interestingly, that issue doesn't occur with Tez. I've noticed that the Iceberg storage handler does this:

    jobConf.set("tez.mrreader.config.update.properties", "hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids");

Link: https://github.com/apache/hive/blob/3b3da9ed7f3813bae3e959670df55682fea648d3/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L538

However, it still works fine for me with Tez even without setting "tez.mrreader.config.update.properties".

Do you know what's causing this? Is there a workaround for the "mr" engine to consistently get the proper value for "hive.io.file.readcolumn.names"?

Thank you,

Julien

On 2022/05/16 04:03:11 Julien Phalip wrote:
> Also, I forgot to mention, I'm using Hive v3.1.2.
>
> On 2022/05/16 03:09:19 Julien Phalip wrote:
> > Hi,
> >
> > I've noticed an odd behavior with the 'hive.io.file.readcolumn.names' conf
> > property.
> >
> > Imagine a simple table "mytable" with two fields: "text" and "number".
> >
> > - If you run the query "SELECT * FROM mytable", then the
> > "hive.io.file.readcolumn.names" has the value: "text,number". Makes sense
> > so far.
> > - If you run the query "SELECT text FROM mytable", then the
> > "hive.io.file.readcolumn.names" has the value: "text". Still makes sense.
> >
> > However, if you add a predicate (WHERE clause), then the behavior of that
> > property seems strange to me:
> >
> > - If you run the query "SELECT * FROM mytable WHERE number = 999", then the
> > "hive.io.file.readcolumn.names" has the value: "text". The "number" column
> > is missing from the property.
> > - If you run the query "SELECT number FROM mytable WHERE number = 999",
> > then the "hive.io.file.readcolumn.names" has the value: "" (empty string).
> > The "number" column is still missing from the property.
> >
> > In other terms, it looks like if a column is part of a predicate, then it
> > is omitted from the "hive.io.file.readcolumn.names" property. Do you know
> > why that is?
> >
> > I'm writing a custom StorageHandler and so I would need to know exactly
> > what columns the user is requesting. Is there a way to consistently
> > retrieve all the requested columns either from the configuration or from
> > within the InputFormat class, even when there is a WHERE clause?
> >
> > Thanks,
> >
> > Julien