Hi Brian,

I think it is already implemented. The Pulsar Presto Connector supports predicate pushdown based on publish time.

> Or it could include min/max/bloom filter on user data too, like ORC
> <https://orc.apache.org/docs/indexes.html> does

Alternatively, we can leverage Pulsar's tiered storage mechanism and implement a schema-aware columnar offloader to offload a row-based segment into a columnar segment using Parquet or ORC format. That's one item in our roadmap.

Thanks,
Sijie

On Wed, Oct 9, 2019 at 3:35 AM Brian Candler <[email protected]> wrote:

> On 08/10/2019 18:33, Brian Candler wrote:
>
> select * from events where data like "%foo%"
> and publish_time between "2019-01-01T12:00:00" and
> "2019-01-01T13:00:00";
>
> Does Presto/Pulsar/Bookkeeper only touch the segments where publish_time
> is within those boundaries? Is there an index somewhere which says for
> each segment what is the lowest and highest publish_time it contains?
>
> Ah, I found listed under Phase 2 features at
> https://github.com/apache/pulsar/wiki/PIP-19:-Pulsar-SQL
>
> 4. Time boxed queries
> 5. When doing a query over a subset of the data, based on publish time, we
> should be able to only scan the relevant data instead of everything stored
> in the topic
>
> So I guess this is an "upcoming feature".
>
> (Aside: it occurs to me that if every closed segment published its minimum
> and maximum publish time on a meta-topic, that would be an efficient way to
> locate the segments of interest. Or it could include min/max/bloom filter
> on user data too, like ORC <https://orc.apache.org/docs/indexes.html>
> does)
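
P.S. To make the min/max idea from the aside concrete, here is a minimal Python sketch of pruning segments by publish-time range. It is only an illustration under assumptions: `SegmentStats` and `prune_segments` are hypothetical names, not Pulsar, BookKeeper, or Presto APIs, and the segment boundaries are made-up sample values. It does not describe how the connector actually implements pushdown.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class SegmentStats:
    """Hypothetical per-segment metadata (not a Pulsar or BookKeeper type)."""
    segment_id: int
    min_publish_time: datetime
    max_publish_time: datetime


def prune_segments(segments: List[SegmentStats],
                   start: datetime,
                   end: datetime) -> List[SegmentStats]:
    """Keep only segments whose [min, max] publish-time range overlaps [start, end]."""
    return [
        s for s in segments
        if s.max_publish_time >= start and s.min_publish_time <= end
    ]


# Made-up sample segments covering one day of a topic.
segments = [
    SegmentStats(1, datetime(2019, 1, 1, 10, 0), datetime(2019, 1, 1, 11, 30)),
    SegmentStats(2, datetime(2019, 1, 1, 11, 30), datetime(2019, 1, 1, 12, 15)),
    SegmentStats(3, datetime(2019, 1, 1, 12, 15), datetime(2019, 1, 1, 13, 40)),
    SegmentStats(4, datetime(2019, 1, 1, 13, 40), datetime(2019, 1, 1, 15, 0)),
]

# Same time window as the query in the quoted message.
selected = prune_segments(
    segments,
    start=datetime(2019, 1, 1, 12, 0),
    end=datetime(2019, 1, 1, 13, 0),
)
selected_ids = [s.segment_id for s in selected]
print(selected_ids)
```

Only the segments that overlap the window would need to be scanned; the `like "%foo%"` filter would still have to be applied row by row within them.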
