Some questions about external tables in BeamSQL

Steve Niemitz Thu, 13 Jan 2022 10:29:44 -0800

I've been playing around with CREATE EXTERNAL TABLE (along with a custom
TableProvider) in BeamSQL and really love it.  I've accumulated a few
questions while using it that I'd like to ask.


- I'm a little confused about the need to define columns in the CREATE
EXTERNAL TABLE statement.  If I have a BeamSqlTable implementation that can
provide the schema on its own, it seems like the columns supplied to the
CREATE statement are ignored.  That is the behavior I'd want anyway, since
it's infeasible for users to provide the entire schema up front, especially
for more complicated sources.  Should the column list be optional here
instead?

- It seems like predicate pushdown only works if the schema is "flat" (has
no nested rows).  I understand the difficulty of pushing down more
complicated predicates on nested fields.  However, as long as the table
implementation doesn't actually attempt to push those down, it seems like
it would be fine to allow pushdown for such schemas?

- As a follow-up to the above, I'd like to expose a "virtual" field in my
schema that represents the partition the data has come from.  For example,
BigQuery has a similar concept called _PARTITIONTIME.  This would be picked
up by the predicate pushdown and used to filter the partitions being read.
I can't really figure out how I'd construct something similar here, even if
pushdown worked in all cases.  For example, for this query:

SELECT * FROM table
WHERE _PARTITIONTIME BETWEEN X AND Y

I'd want that filter to be pushed down to my IO, but I wouldn't want the
_PARTITIONTIME column to be returned in the select list.  I was hoping to
use BigQueryIO as an example of how to do this, but it doesn't seem to
expose the virtual _PARTITIONTIME column either.
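
To make the desired behavior concrete, here is a minimal plain-Java sketch
of the two halves of the request: the filter on the virtual column prunes
partitions at the source, and SELECT * expands to everything except that
column.  It does not use Beam's API.  The class and method names are
hypothetical, and representing partitions as dates is an assumption made
only for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: none of these names come from Beam or BigQueryIO. */
final class VirtualPartitionColumnSketch {
  // Name borrowed from the BigQuery pseudo-column mentioned above.
  static final String VIRTUAL_COLUMN = "_PARTITIONTIME";

  /** Pushdown side: keep only partitions in [lower, upper]; SQL BETWEEN is inclusive. */
  static List<LocalDate> prunePartitions(
      List<LocalDate> partitions, LocalDate lower, LocalDate upper) {
    List<LocalDate> kept = new ArrayList<>();
    for (LocalDate partition : partitions) {
      if (!partition.isBefore(lower) && !partition.isAfter(upper)) {
        kept.add(partition);
      }
    }
    return kept;
  }

  /** Projection side: the columns SELECT * would expand to, minus the virtual one. */
  static List<String> selectStarColumns(List<String> allColumns) {
    List<String> visible = new ArrayList<>();
    for (String column : allColumns) {
      if (!column.equals(VIRTUAL_COLUMN)) {
        visible.add(column);
      }
    }
    return visible;
  }

  public static void main(String[] args) {
    // Hypothetical daily partitions held by the source.
    List<LocalDate> partitions = Arrays.asList(
        LocalDate.of(2022, 1, 10), LocalDate.of(2022, 1, 11),
        LocalDate.of(2022, 1, 12), LocalDate.of(2022, 1, 13));

    // WHERE _PARTITIONTIME BETWEEN 2022-01-11 AND 2022-01-12
    System.out.println(prunePartitions(
        partitions, LocalDate.of(2022, 1, 11), LocalDate.of(2022, 1, 12)));

    // SELECT * should not surface the virtual column.
    System.out.println(selectStarColumns(Arrays.asList("id", "name", VIRTUAL_COLUMN)));
  }
}
```

In a real table implementation, the pruning step would presumably live in
whatever code builds the read, and the projection step in whatever code
reports the schema to the planner.  Where exactly those hooks belong in
BeamSQL is the open question here.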
