Hello,

I would like to understand a bit more about how the Presto integration handles message timestamps, i.e. Publish Time and/or Event Time.

(1) Are the Publish Time and/or Event Time available as SQL pseudo-columns?  The example <http://pulsar.apache.org/docs/en/sql-getting-started/> doesn't show them when "select *" is used.

(2) Suppose they are available.  Suppose also I have a topic which has been accumulating messages for a very long period of time (years even), and I write a query like this:

    select * from events where data like '%foo%'
        and publish_time between '2019-01-01T12:00:00' and '2019-01-01T13:00:00';

Does Presto/Pulsar/BookKeeper only touch the segments where publish_time falls within those boundaries?  Is there an index somewhere which records, for each segment, the lowest and highest publish_time it contains?
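
To make the question concrete, here is a rough Python sketch of the kind of per-segment min/max index I have in mind. The SegmentStats class and the segments_to_scan function are purely hypothetical names of my own; I am not claiming that Pulsar, Presto or BookKeeper actually works this way.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SegmentStats:
    # Hypothetical per-segment metadata; not an actual Pulsar/BookKeeper structure.
    segment_id: int
    min_publish_time: datetime
    max_publish_time: datetime


def segments_to_scan(index, start, end):
    """Return ids of segments whose [min, max] range overlaps [start, end]."""
    return [
        s.segment_id
        for s in index
        if s.min_publish_time <= end and s.max_publish_time >= start
    ]


# Made-up example data: four consecutive segments of one topic.
index = [
    SegmentStats(1, datetime(2018, 12, 31, 0, 0), datetime(2019, 1, 1, 11, 59)),
    SegmentStats(2, datetime(2019, 1, 1, 11, 59), datetime(2019, 1, 1, 12, 30)),
    SegmentStats(3, datetime(2019, 1, 1, 12, 30), datetime(2019, 1, 1, 14, 0)),
    SegmentStats(4, datetime(2019, 1, 1, 14, 0), datetime(2019, 1, 2, 0, 0)),
]

hits = segments_to_scan(
    index, datetime(2019, 1, 1, 12, 0), datetime(2019, 1, 1, 13, 0)
)
print(hits)
```

With an index like this, only the overlapping segments would need to be read, and the `like` filter would be applied to just those.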

(3) Ditto for event_time?  (It's much more likely that event_time is not monotonically increasing, whereas publish_time normally would be)

I'm considering whether it's feasible to store and query log streams this way.  This would be extremely convenient because of the transparent tiered storage that Pulsar offers.

Many thanks,

Brian Candler.
