Hi Brian,

I think this is already implemented: the Pulsar Presto connector supports
predicate pushdown on publish time.
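
To illustrate the idea only: the following is a hypothetical Python sketch,
not the connector's actual code, and `Segment` and `prune_segments` are
made-up names. It assumes each segment carries its minimum and maximum
publish time, so that a time-boxed query can skip any segment whose range
cannot overlap the requested window.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Segment:
    # Hypothetical per-segment metadata; not a Pulsar or BookKeeper type.
    segment_id: int
    min_publish_time: datetime
    max_publish_time: datetime


def prune_segments(segments, start, end):
    """Keep only segments whose publish-time range overlaps [start, end]."""
    return [
        s for s in segments
        if s.max_publish_time >= start and s.min_publish_time <= end
    ]


segments = [
    Segment(1, datetime(2019, 1, 1, 10), datetime(2019, 1, 1, 11, 59)),
    Segment(2, datetime(2019, 1, 1, 11, 59), datetime(2019, 1, 1, 12, 30)),
    Segment(3, datetime(2019, 1, 1, 12, 30), datetime(2019, 1, 1, 14)),
    Segment(4, datetime(2019, 1, 1, 14), datetime(2019, 1, 1, 15)),
]

# Query window taken from the example query quoted below.
selected = prune_segments(
    segments, datetime(2019, 1, 1, 12), datetime(2019, 1, 1, 13)
)
print([s.segment_id for s in selected])  # prints [2, 3]
```

Only segments 2 and 3 would need to be scanned; the `like` filter on the
payload would still have to be evaluated on every row inside them.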

> Or it could include min/max/bloom filter on user data too, like ORC
> <https://orc.apache.org/docs/indexes.html> does

Alternatively, we can leverage Pulsar's tiered storage mechanism and
implement a schema-aware columnar offloader to offload a row-based segment
into a columnar segment using Parquet or ORC format.
That's one item on our roadmap.
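
As a rough sketch of what such an offloader would do conceptually (plain
Python, with no Parquet or ORC library involved; `to_columnar` and
`may_match` are hypothetical names, and the rows and schema are invented for
illustration), a row-based segment can be pivoted into one list per field,
with min/max statistics recorded per column so that a reader can skip the
segment when a predicate cannot match:

```python
def to_columnar(rows, schema):
    """Pivot row dicts into one list per field, plus (min, max) per column.

    Assumes `rows` is non-empty and every row has every field in `schema`.
    """
    columns = {field: [row[field] for row in rows] for field in schema}
    stats = {
        field: (min(values), max(values)) for field, values in columns.items()
    }
    return columns, stats


def may_match(stats, field, low, high):
    """Return False only when no value of `field` can fall in [low, high]."""
    col_min, col_max = stats[field]
    return col_max >= low and col_min <= high


# Invented sample segment: publish_time in arbitrary integer units.
rows = [
    {"publish_time": 100, "user_id": 7},
    {"publish_time": 105, "user_id": 3},
    {"publish_time": 110, "user_id": 9},
]
columns, stats = to_columnar(rows, ["publish_time", "user_id"])

print(columns["user_id"])                    # prints [7, 3, 9]
print(may_match(stats, "user_id", 10, 20))   # prints False
print(may_match(stats, "publish_time", 104, 106))  # prints True
```

Real columnar formats add encoding, compression, and optional bloom filters
on top of this, but the per-column statistics are what make the segment
skippable for predicates on user data.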

Thanks,
Sijie

On Wed, Oct 9, 2019 at 3:35 AM Brian Candler <[email protected]> wrote:

> On 08/10/2019 18:33, Brian Candler wrote:
>
>     select * from events where data like "%foo%"
>         and publish_time between "2019-01-01T12:00:00" and "2019-01-01T13:00:00";
>
> Does Presto/Pulsar/Bookkeeper only touch the segments where publish_time
> is within those boundaries?  Is there an index somewhere which says for
> each segment what is the lowest and highest publish_time it contains?
>
> Ah, I found listed under Phase 2 features at
> https://github.com/apache/pulsar/wiki/PIP-19:-Pulsar-SQL
>
> 4. Time boxed queries
> 5. When doing a query over a subset of the data, based on publish time, we
> should be able to only scan the relevant data instead of everything stored
> in the topic
>
> So I guess this is an "upcoming feature".
>
> (Aside: it occurs to me that if every closed segment published its minimum
> and maximum publish time on a meta-topic, that would be an efficient way to
> locate the segments of interest.  Or it could include min/max/bloom filter
> on user data too, like ORC <https://orc.apache.org/docs/indexes.html>
> does)
>
