Hi,

/cc Peter, as you might have some thoughts based on your experience with
Iceberg :)

I've noticed another odd behavior with the "hive.io.file.readcolumn.names"
property.

Consider this query that reads from two separate tables at once:

SELECT * FROM (
    SELECT
            num as number,
            str_val as text
    FROM t1
    UNION ALL
    SELECT *
    FROM t2
) unioned_table ORDER BY number

When using the "mr" execution engine, the value of the
"hive.io.file.readcolumn.names" property cannot be relied on as it seems to
be stuck on the fields of just one of the tables. As a workaround, I have
to use all of the tables' columns when querying the external storage in my
custom storage handler, which is unfortunately quite inefficient.
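
To make the workaround concrete, here is a minimal sketch of a slightly
narrower variant: trust the property only when every column it names
belongs to the table being read, and otherwise fall back to reading all of
that table's columns. This is a heuristic and an assumption on my part, not
something Hive provides. It cannot detect a stale value when both tables
share column names, and it does not address columns dropped from the
property for other reasons. A plain Map stands in for the JobConf, and the
class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Hive.
final class ReadColumnsSketch {
    static final String READ_COLUMN_NAMES = "hive.io.file.readcolumn.names";

    // Returns the projected columns when all of them exist in the table;
    // otherwise returns every column of the table (the inefficient fallback).
    static List<String> columnsToRead(Map<String, String> conf,
                                      List<String> tableColumns) {
        String raw = conf.getOrDefault(READ_COLUMN_NAMES, "");
        List<String> projected = new ArrayList<>();
        for (String name : raw.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                projected.add(trimmed);
            }
        }
        // An empty or foreign projection cannot be trusted for this table.
        if (projected.isEmpty() || !tableColumns.containsAll(projected)) {
            return tableColumns;
        }
        return projected;
    }
}
```

With the property stuck on t1's fields ("num,str_val"), a call for t2 would
fall through to the full column list, which is the behavior described above.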

Interestingly, that issue doesn't occur with Tez.

I've noticed that the Iceberg storage handler does this:

jobConf.set("tez.mrreader.config.update.properties",
"hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids");

Link:
https://github.com/apache/hive/blob/3b3da9ed7f3813bae3e959670df55682fea648d3/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L538

However, it still works fine for me with Tez even without setting
"tez.mrreader.config.update.properties".

Do you know what's causing this? Is there a workaround for the "mr" engine
to consistently get the proper value for "hive.io.file.readcolumn.names"?

Thank you,

Julien

On 2022/05/16 04:03:11 Julien Phalip wrote:
> Also, I forgot to mention, I'm using Hive v3.1.2.
>
> On 2022/05/16 03:09:19 Julien Phalip wrote:
> > Hi,
> >
> > I've noticed an odd behavior with the 'hive.io.file.readcolumn.names'
> > conf property.
> >
> > Imagine a simple table "mytable" with two fields: "text" and "number".
> >
> > - If you run the query "SELECT * FROM mytable", then the
> > "hive.io.file.readcolumn.names" has the value: "text,number". Makes
> > sense so far.
> > - If you run the query "SELECT text FROM mytable", then the
> > "hive.io.file.readcolumn.names" has the value: "text". Still makes
> > sense.
> >
> > However, if you add a predicate (WHERE clause), then the behavior of
> > that property seems strange to me:
> >
> > - If you run the query "SELECT * FROM mytable WHERE number = 999", then
> > the "hive.io.file.readcolumn.names" has the value: "text". The "number"
> > column is missing from the property.
> > - If you run the query "SELECT number FROM mytable WHERE number = 999",
> > then the "hive.io.file.readcolumn.names" has the value: "" (empty
> > string). The "number" column is still missing from the property.
> >
> > In other terms, it looks like if a column is part of a predicate, then
> > it is omitted from the "hive.io.file.readcolumn.names" property. Do you
> > know why that is?
> >
> > I'm writing a custom StorageHandler and so I would need to know exactly
> > what columns the user is requesting. Is there a way to consistently
> > retrieve all the requested columns either from the configuration or from
> > within the InputFormat class, even when there is a WHERE clause?
> >
> > Thanks,
> >
> > Julien
> >
>
