Re: Can Apache Drill perform streaming queries?

kant kodali Thu, 09 Nov 2017 19:01:23 -0800

Hi Saurabh,

Yes, those concepts do exist in Spark SQL, and Spark in general is awesome, but
what Spark SQL lacks is a REST interface where a user can submit normal or
streaming queries and get the results back. Right now, a user has to write
imperative code to achieve whatever they want, and whenever requirements
change, such as the addition of new queries, one needs to go back and change
the Spark code again. So it's not as simple as submitting a new query
via REST. I don't see any engine that can do this as of today.
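
For comparison, here is a minimal sketch of the polling approach discussed
further down this thread, done against Drill's REST endpoint (POST to
/query.json with a JSON body holding "queryType" and "query"). The host and
port, the table path, the partition naming, and the helper names are all
assumptions made for illustration; the checkpoint is kept by the caller, since
Drill does not manage one.

```python
import json
import urllib.request

# Assumption: a Drillbit web server reachable on the default port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_incremental_query(table, last_partition):
    """Build SQL that only selects partitions newer than the checkpoint.

    dir0 is Drill's implicit column for the first directory level, so a
    predicate on it lets partition pruning skip older directories.
    """
    return f"SELECT * FROM {table} WHERE dir0 > '{last_partition}'"


def build_request(sql, url=DRILL_URL):
    """Wrap the SQL in the JSON body expected by Drill's REST API."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def poll_once(table, last_partition, send=None):
    """Run one poll and return (rows, new_checkpoint).

    `send` is a hypothetical hook for swapping the HTTP call, e.g. in tests.
    """
    req = build_request(build_incremental_query(table, last_partition))
    if send is None:
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
    else:
        result = send(req)
    rows = result.get("rows", [])
    # Advance the checkpoint to the newest partition seen; keep the old
    # checkpoint if no row carries a dir0 value.
    newest = max((r["dir0"] for r in rows if "dir0" in r), default=last_partition)
    return rows, newest
```

A polling thread would call poll_once in a loop and carry the returned
checkpoint into the next call. Note that a strict "greater than" skips rows
that land later in the partition the checkpoint already points at, so a real
setup would probably re-read the newest partition and de-duplicate.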


Thanks!

On Thu, Nov 9, 2017 at 6:41 PM, Saurabh Mahapatra <
[email protected]> wrote:

> Hi Anil,
>
> I think the start and offset feature would be very useful. Interestingly,
> Kafka SQL has the concept of a tumbling window that is measured in seconds.
>
> https://www.confluent.io/product/ksql/
>
> I think we should have that concept as well because an end user will not
> know what an offset really means unless they have deep knowledge of the
> guts of the stream itself.
>
> Yep, incremental updates with windowing seem to be the right semantics. As
> a user, I expect the data (complete or incomplete) to flow through this "SQL
> transformer" and I should get a real-time view of this transformed data.
> The heavier the SQL workload, the greater the latency between the
> transformed output and the input.
>
> Does continuous update mean you will introduce trigger semantics in Drill?
>
> By the way, the above ideas seem to exist in SparkSQL (structured
> streaming). They also seem to have the concept of an event time window:
>
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#operations-on-streaming-dataframesdatasets
>
> Confluent's claim that they are turning the database inside out
> through streams seems self-serving, because from an analytics
> standpoint the accuracy of the data depends on whether the data is complete
> in the stream itself, i.e. the SQL transformer is a time-based function
> operating on an event stream. Data is typically defined as complete by the
> time it enters the data warehouse.
>
> Best,
> Saurabh
>
>
>
> On Thu, Nov 9, 2017 at 1:10 PM, AnilKumar B <[email protected]> wrote:
>
> > You are correct Kant.
> >
> > It would be great if you could raise a JIRA to discuss the *feasibility* of
> > incremental query support for Drill, because I can also see that this is a
> > very good requirement for plugins like Kafka, HBase and Cassandra. Thanks
> > for asking this question.
> >
> > Thanks & Regards,
> > B Anil Kumar.
> >
> > On Thu, Nov 9, 2017 at 12:45 PM, kant kodali <[email protected]> wrote:
> >
> > > Hi Anil,
> > >
> > > Thanks a lot for your response. It looks like I am indeed looking for
> > > incremental queries. So if I have a thread that polls every second to
> > > get the latest updates, I just have to change the partition values to
> > > minimize the scans, right?
> > >
> > > Also, I guess I can build some notification mechanism in case my
> > > older partitions have an update.
> > >
> > > Thanks!
> > >
> > >
> > >
> > >
> > > On Thu, Nov 9, 2017 at 11:58 AM, AnilKumar B <[email protected]>
> > > wrote:
> > >
> > > > Hi Kant,
> > > >
> > > > If I understand your questions properly, you are looking for
> > > > incremental queries.
> > > >
> > > > Drill supports predicate pushdown with most data sources. In
> > > > your case, suppose you are generating hourly partitions in HDFS using
> > > > a Spark application. Drill is then optimized to scan only specific
> > > > partitions based on the query predicates (by using partition
> > > > pruning); see, for example,
> > > > https://issues.apache.org/jira/browse/DRILL-3121.
> > > >
> > > > But Drill will not manage any checkpointing. So if BI/dashboard
> > > > tools like Tableau etc. can support this checkpointing, then it's
> > > > possible to connect with Drill incrementally.
> > > >
> > > > Coming to the latest Kafka storage plugin: in the first version we are
> > > > targeting batch support, I mean, at query time it will fetch all the
> > > > messages from start to end offsets for each topic partition and
> > > > process the data. Currently it will support JSON, and in the next
> > > > version we are targeting Avro support with schema registry. We are
> > > > also discussing the feasibility of specifying start and end offset
> > > > ranges, so that we can achieve incremental support by managing
> > > > checkpointing externally.
> > > >
> > > > Thanks,
> > > > B Anil Kumar.
> > > >
> > > > On Thu, Nov 9, 2017 at 11:14 AM, kant kodali <[email protected]>
> > wrote:
> > > >
> > > > > Can someone elaborate on what happens underneath if I poll every
> > > > > second (specifically related to the questions in my previous email)?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > On Thu, Nov 9, 2017 at 7:56 AM, Ted Dunning <[email protected]
> >
> > > > wrote:
> > > > >
> > > > > > Confluent has a non-Apache product, I think, for streaming SQL.
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 9, 2017 at 4:50 PM, Saurabh Mahapatra <
> > > [email protected]
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Isn't there the new Kafka plugin? What does that exactly do?
> > > > > > >
> > > > > > > Best,
> > > > > > > Saurabh
> > > > > > >
> > > > > > > Sent from my iPhone
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Nov 9, 2017, at 5:15 AM, kant kodali <[email protected]>
> > > > wrote:
> > > > > > > >
> > > > > > > > Hi Tug,
> > > > > > > >
> > > > > > > > > It's Parquet data on HDFS, and the data is constantly written
> > > > > > > > > to HDFS by Spark while consuming from Kafka.
> > > > > > > >
> > > > > > > > > Is polling a common technique for, say, a real-time analytics
> > > > > > > > > dashboard? More importantly, if I poll, does Drill do the scan
> > > > > > > > > every time? If the answer is no, how does it know which data is
> > > > > > > > > new, since the data is written to HDFS constantly as a stream?
> > > > > > > > > (The query can be the same; however, the new data will be
> > > > > > > > > appended or updated to HDFS in Parquet format as a stream.)
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > >> On Thu, Nov 9, 2017 at 4:47 AM, Tugdual Grall <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> Hello,
> > > > > > > >>
> > > > > > > >>
> > > > > > > > >> Today Drill cannot do continuous/streaming queries, so as you
> > > > > > > > >> mentioned you will have to use a polling technique.
> > > > > > > >>
> > > > > > > >>
> > > > > > > > >> Just out of curiosity, which data source are you planning to
> > > > > > > > >> use?
> > > > > > > >>
> > > > > > > >> Regards
> > > > > > > >> Tug
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>> On Thu 9 Nov 2017 at 04:31, kant kodali <
> [email protected]>
> > > > > wrote:
> > > > > > > >>>
> > > > > > > >>> Hi All,
> > > > > > > >>>
> > > > > > > > >>> I am new to Apache Drill. I am wondering if Apache Drill
> > > > > > > > >>> can perform streaming queries? For example, I have a constant
> > > > > > > > >>> stream of data in a 24-hour period and I would like to get
> > > > > > > > >>> updates as soon as I receive them.
> > > > > > > >>>
> > > > > > > > >>> Do I need to have a polling thread that issues a Drill
> > > > > > > > >>> query every second?
> > > > > > > >>>
> > > > > > > >>> Thanks!
> > > > > > > >>>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
