Re: Can Apache Drill perform streaming queries?

Saurabh Mahapatra Thu, 09 Nov 2017 18:42:13 -0800

Hi Anil,

I think the start and end offset feature would be very useful. Interestingly,
KSQL (Kafka SQL) has the concept of a tumbling window whose size is given in
units of time, such as seconds.


https://www.confluent.io/product/ksql/

I think we should have that concept as well because an end user will not
know what an offset really means unless they have deep knowledge of the
guts of the stream itself.
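
To make that concrete, here is a minimal Python sketch of how a tumbling
window lets a user think in seconds rather than offsets. The function names
and the sample timestamps are made up for illustration; this is not KSQL or
Drill code.

```python
from collections import Counter

def tumbling_window_start(event_time_s: float, size_s: int) -> int:
    """Return the start (in seconds) of the fixed, non-overlapping
    window that contains the given event time."""
    return int(event_time_s // size_s) * size_s

def count_per_window(event_times_s, size_s):
    """Count events per tumbling window, keyed by window start."""
    counts = Counter(tumbling_window_start(t, size_s) for t in event_times_s)
    return dict(sorted(counts.items()))

# Hypothetical event times in seconds; the window size is 10 seconds.
events = [0.5, 3.2, 9.9, 10.0, 14.7, 21.3]
print(count_per_window(events, 10))  # {0: 3, 10: 2, 20: 1}
```

The user only picks a window size; which offsets fall into which window is
worked out from the event times.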

Yep, incremental updates with windowing seem to be the right semantics. As
a user, I expect the data (complete or incomplete) to flow through this "SQL
transformer" and I should get a real-time view of this transformed data.
The heavier the SQL workload, the greater the latency between the input
and the transformed output.

Does continuous update mean you will introduce trigger semantics in Drill?

By the way, the above ideas seem to exist in SparkSQL (structured
streaming). They also seem to have the concept of an event time window:

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#operations-on-streaming-dataframesdatasets

Confluent's claim that they are turning the database inside out
through streams seems self-serving. From an analytics standpoint, the
accuracy of the data depends on whether the data in the stream is
complete, i.e. the SQL transformer is a time-based function operating on
an event stream. Data is typically defined as complete by the time it
enters the data warehouse.

Best,
Saurabh



On Thu, Nov 9, 2017 at 1:10 PM, AnilKumar B <[email protected]> wrote:

> You are correct, Kant.
>
> It would be great if you could raise a JIRA to discuss the *feasibility* of
> incremental query support for Drill. I can also see that this is a very
> good requirement for plugins like Kafka, HBase and Cassandra. Thanks for
> asking this question.
>
> Thanks & Regards,
> B Anil Kumar.
>
> On Thu, Nov 9, 2017 at 12:45 PM, kant kodali <[email protected]> wrote:
>
> > Hi Anil,
> >
> > Thanks a lot for your response. It looks like I am indeed looking for
> > incremental queries. So if I have a thread that polls every second to get
> > the latest updates, I just have to change the partition values to minimize
> > the scans, right?
> >
> > Also, I guess I can build some notification mechanism in case my older
> > partitions are updated.
> >
> > Thanks!
> >
> >
> >
> >
> > On Thu, Nov 9, 2017 at 11:58 AM, AnilKumar B <[email protected]>
> > wrote:
> >
> > > Hi Kant,
> > >
> > > If I understand your questions properly, you are looking for
> > > incremental queries.
> > >
> > > Drill supports predicate pushdown with most data sources. In your
> > > case, suppose you are generating hourly partitions in HDFS using a
> > > Spark application. Drill is then optimized to scan only the specific
> > > partitions selected by the query predicates (partition pruning); see,
> > > for example, https://issues.apache.org/jira/browse/DRILL-3121.
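
As a rough illustration of polling with partition pruning, the Python sketch
below only builds the query text. It assumes a hypothetical layout
/data/events/<yyyy-MM-dd>/<HH>/ under a dfs workspace, so that Drill's dir0
and dir1 directory columns carry the date and the hour. The table path and
the function names are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical table path; adjust to the actual workspace and directory.
TABLE = "dfs.`/data/events`"

def partitions_to_scan(checkpoint: datetime, now: datetime):
    """List (date, hour) directories from the checkpoint hour through
    the current hour.

    The checkpoint hour is included again because it may still have been
    receiving writes when it was last scanned.
    """
    cursor = checkpoint.replace(minute=0, second=0, microsecond=0)
    end = now.replace(minute=0, second=0, microsecond=0)
    result = []
    while cursor <= end:
        result.append((cursor.strftime("%Y-%m-%d"), cursor.strftime("%H")))
        cursor += timedelta(hours=1)
    return result

def build_query(day: str, hour: str) -> str:
    """Filter on the directory columns so that partition pruning can
    limit the scan to a single hourly directory."""
    return f"SELECT * FROM {TABLE} WHERE dir0 = '{day}' AND dir1 = '{hour}'"

# The caller keeps the checkpoint; Drill does not store it.
checkpoint = datetime(2017, 11, 9, 12, 59, 30)
now = datetime(2017, 11, 9, 13, 0, 1)
for day, hour in partitions_to_scan(checkpoint, now):
    print(build_query(day, hour))
```

Each poll then touches only the hours that could have changed since the last
checkpoint, instead of the whole table.
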
> > >
> > > But Drill will not manage any checkpointing. So if BI/dashboard tools
> > > like Tableau can support this checkpointing, then it's possible to
> > > query Drill incrementally.
> > >
> > > Coming to the latest Kafka storage plugin: in the first version we are
> > > targeting batch support. I mean, at query time it will fetch all the
> > > messages from the start to the end offsets for each topic partition and
> > > process the data. For now it will support JSON, and in the next version
> > > we are targeting Avro support with schema registry. We are also
> > > discussing the feasibility of specifying start and end offset ranges, so
> > > that we can achieve incremental support by managing checkpointing
> > > externally.
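
A minimal sketch of what externally managed checkpointing over offset ranges
could look like, in Python. An in-memory dict of lists stands in for the
topic partitions, and next_ranges and poll_once are hypothetical names; none
of this is the Kafka plugin's actual interface.

```python
from typing import Dict, List, Tuple

def next_ranges(checkpoint: Dict[int, int],
                latest: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """For each partition, return the half-open offset range [start, end)
    that has not been processed yet. Partitions with nothing new are skipped."""
    ranges = {}
    for partition, end in latest.items():
        start = checkpoint.get(partition, 0)
        if end > start:
            ranges[partition] = (start, end)
    return ranges

def poll_once(log: Dict[int, List[str]], checkpoint: Dict[int, int]) -> List[str]:
    """Read only the unprocessed messages and advance the checkpoint."""
    latest = {p: len(msgs) for p, msgs in log.items()}
    batch = []
    for p, (start, end) in next_ranges(checkpoint, latest).items():
        batch.extend(log[p][start:end])
        checkpoint[p] = end  # advance only after the slice has been read
    return batch

# Stand-in for two topic partitions; the checkpoint lives outside the engine.
log = {0: ["a", "b"], 1: ["c"]}
checkpoint: Dict[int, int] = {}
print(poll_once(log, checkpoint))  # ['a', 'b', 'c']
log[0].append("d")
print(poll_once(log, checkpoint))  # ['d']
```

The query engine stays stateless here: the caller supplies the start and end
offsets on every poll and is responsible for persisting the checkpoint.
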
> > >
> > > Thanks & Regards,
> > > B Anil Kumar.
> > >
> > > On Thu, Nov 9, 2017 at 11:14 AM, kant kodali <[email protected]> wrote:
> > >
> > > > Can someone elaborate on what happens underneath if I poll every
> > > > second (specifically related to the questions in my previous email)?
> > > >
> > > > Thanks!
> > > >
> > > > > On Thu, Nov 9, 2017 at 7:56 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > > Confluent has a non-Apache product, I think, for streaming SQL.
> > > > >
> > > > >
> > > > > > On Thu, Nov 9, 2017 at 4:50 PM, Saurabh Mahapatra <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > Isn't there the new Kafka plugin? What does that exactly do?
> > > > > >
> > > > > > Best,
> > > > > > Saurabh
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > On Nov 9, 2017, at 5:15 AM, kant kodali <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Tug,
> > > > > > >
> > > > > > > > It's Parquet data on HDFS, and the data is constantly written to
> > > > > > > > HDFS by Spark while consuming from Kafka.
> > > > > > > >
> > > > > > > > Is polling a common technique for, say, a real-time analytics
> > > > > > > > dashboard? More importantly, if I poll, does Drill do the scan
> > > > > > > > every time? If the answer is no, how does it know which data is
> > > > > > > > new, since the data is written to HDFS constantly as a stream?
> > > > > > > > (The query can be the same, but the new data will be appended or
> > > > > > > > updated in HDFS in Parquet format as a stream.)
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > >> On Thu, Nov 9, 2017 at 4:47 AM, Tugdual Grall <[email protected]> wrote:
> > > > > > >>
> > > > > > >> Hello,
> > > > > > >>
> > > > > > >>
> > > > > > > >> Today Drill cannot do continuous/streaming queries, so as you
> > > > > > > >> mentioned, you will have to use a polling technique.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Just out of curiosity, which data source are you planning to use?
> > > > > > >>
> > > > > > >> Regards
> > > > > > >> Tug
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > > >>> On Thu 9 Nov 2017 at 04:31, kant kodali <[email protected]> wrote:
> > > > > > >>>
> > > > > > >>> Hi All,
> > > > > > >>>
> > > > > > > >>> I am new to Apache Drill. I am wondering if Apache Drill can
> > > > > > > >>> perform streaming queries. For example, I have a constant stream
> > > > > > > >>> of data over a 24-hour period and I would like to get updates as
> > > > > > > >>> soon as I receive them.
> > > > > > > >>>
> > > > > > > >>> Do I need to have a polling thread that issues a Drill query
> > > > > > > >>> every second?
> > > > > > >>>
> > > > > > >>> Thanks!
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
