Hello,
I would like to understand a bit more about how the Presto integration
handles message timestamps, i.e. Publish Time and/or Event Time.
(1) Are the Publish Time and/or Event Time available as SQL
pseudo-columns? The example
<http://pulsar.apache.org/docs/en/sql-getting-started/> doesn't show
them when "select *" is used.
(2) Suppose they are available. Suppose also I have a topic which has
been accumulating messages for a very long period of time (years even),
and I write a query like this:
select * from events where data like '%foo%'
and publish_time between timestamp '2019-01-01 12:00:00' and
timestamp '2019-01-01 13:00:00';
Does Presto/Pulsar/BookKeeper only touch the segments where publish_time
is within those boundaries? Is there an index somewhere which says for
each segment what is the lowest and highest publish_time it contains?
(3) Ditto for event_time? (It's much more likely that event_time is not
monotonically increasing, whereas publish_time normally would be)
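To make questions (2) and (3) concrete, this is roughly the kind of per-segment min/max index I have in mind. It is only a sketch: SegmentStats and segments_to_scan are made-up names, the timestamps are toy integers, and I am not assuming Pulsar or BookKeeper actually work this way.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentStats:
    """Hypothetical index entry; not an actual Pulsar/BookKeeper structure."""
    segment_id: int
    min_time: int  # lowest timestamp stored in the segment
    max_time: int  # highest timestamp stored in the segment


def segments_to_scan(index: List[SegmentStats], lo: int, hi: int) -> List[int]:
    """Return ids of segments whose [min_time, max_time] overlaps [lo, hi]."""
    return [s.segment_id for s in index if s.max_time >= lo and s.min_time <= hi]


# publish_time is roughly monotonic, so segment ranges barely overlap.
publish_index = [
    SegmentStats(0, 0, 99),
    SegmentStats(1, 100, 199),
    SegmentStats(2, 200, 299),
]

# event_time may arrive out of order, so segment ranges widen and overlap.
event_index = [
    SegmentStats(0, 0, 250),
    SegmentStats(1, 40, 199),
    SegmentStats(2, 120, 299),
]

print(segments_to_scan(publish_index, 110, 150))  # [1]
print(segments_to_scan(event_index, 110, 150))    # [0, 1, 2]
```

With an index like this, a publish_time range query would skip most segments, whereas the same query on event_time could still end up touching every segment.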
I'm considering whether it's feasible to store and query log streams
this way. This would be extremely convenient because of the transparent
tiered storage that Pulsar offers.
Many thanks,
Brian Candler.