Jacques,

Alexander brought this problem up previously in one of the hangouts and
said that submitting a physical plan was not possible through ODBC. If he
is able to modify the driver code to make it possible to submit one, that
would be an option, as I believe the C++ client is capable of submitting
plans. The issue I recall him mentioning is that the ODBC driver runs some
sanity checks on the SQL query to try to prevent submitting complete
garbage queries to the server. I think he had concerns that a
JSON-formatted physical plan would fail these checks, and that he would
have to disable them in addition to supporting two types of queries from
ODBC.

On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau <[email protected]> wrote:

> Removing cross posting
>
> Alexander,
>
> There is currently no way for Drill to generate a large amount of data
> using SQL.  However, you can generate large generic data by using the
> MockStoragePlugin if you submit a plan.  You can find an example plan
> using this at [1].
>
> I heard someone might be working on extending the MockStoragePlugin to
> support SQL, which would provide the outcome you requested.
>
> [1]
> https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
>
> On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <[email protected]> wrote:
>
> >
> > Also, just doing a Cartesian join of three copies of 1000 records will
> > give you a billion records with negligible I/O.
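> >
> > As a rough sketch (the table name t is hypothetical, and this assumes
> > the planner will accept a Cartesian join), given a 1000-row table t:
> >
> > SELECT a.*
> > FROM t a, t b, t c;
> >
> > which returns 1000 * 1000 * 1000 = 1,000,000,000 rows.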
> >
> > Sent from my iPhone
> >
> > > On Jul 16, 2015, at 15:43, Jason Altekruse <[email protected]> wrote:
> > >
> > > @Alexander If you want to test the speed of the ODBC driver, you can
> > > do that without a new storage plugin.
> > >
> > > If you get the entire dataset into memory, it will be returned from
> > > Drill as quickly as we can possibly send it to the client. One way to
> > > do this is to insert a sort; we cannot send along any of the data
> > > until the sort is complete. As long as you don't read so much data
> > > that we start spilling the sort to disk, all of the records will be
> > > in memory. To take the read and sort time out of your test, just make
> > > sure to record the time you first receive data from Drill, not the
> > > query start time.
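> > >
> > > As a sketch (the file and column names here are made up), inserting a
> > > sort is as simple as adding an ORDER BY:
> > >
> > > SELECT *
> > > FROM dfs.`/data/lineitem.parquet`
> > > ORDER BY l_orderkey;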
> > >
> > > There is one gotcha here. To make the BI tools more responsive, we
> > > implemented a feature that will send along one empty batch of records
> > > with the schema information populated. This schema is generated by
> > > applying all of the transformations that happen throughout the query.
> > > For example, the join operator handles this schema population by
> > > sending along the schema merged from the two sides of the join;
> > > project will similarly add or remove columns based on the expressions
> > > and columns requested. You will want to make sure you record your
> > > start time when you receive the first batch with actual records. This
> > > can give you an accurate measurement of the ODBC performance,
> > > removing the bottleneck of the disk.
> > >
> > > On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <[email protected]> wrote:
> > >
> > >> Thanks for the answers.
> > >>
> > >> @Ted my only goal is to pump a large amount of data without having
> > >> to read from hard disk. I am measuring the ODBC driver performance
> > >> and I need a higher data transfer rate, so any method that helps
> > >> pump data out of Drill faster would help. log-synth seems like a
> > >> good way to generate data for testing. However, I'd need a RAM-only
> > >> option, which would hopefully provide higher throughput.
> > >>
> > >> @Jacques How involved is it to write a dummy plugin that returns one
> > >> hardcoded row repeatedly 12 million times?
> > >>
> > >> Thanks,
> > >> Alex
> > >>
> > >> On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <[email protected]> wrote:
> > >>
> > >>> It may be easy, but it is completely opaque about what really
> > >>> needs to happen.
> > >>>
> > >>> For instance,
> > >>>
> > >>> 1) how is schema exposed?
> > >>>
> > >>> 2) which classes do I really need to implement?
> > >>>
> > >>> 3) how do I express partitioning of a format?
> > >>>
> > >>> 4) how do I test it?
> > >>>
> > >>> Just a bit of documentation and comments would go a very, very long
> > >>> way.
> > >>>
> > >>> Even answers on the mailing list with more detail than "oh, that's
> > >>> easy" would help.  I would be happy to transcribe answers into the
> > >>> code if I could just get some.
> > >>>
> > >>>
> > >>>
> > >>> On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <[email protected]> wrote:
> > >>>
> > >>>> Creating an EasyFormatPlugin is pretty simple.  They were designed
> > >>>> to get rid of much of the scaffolding required for a standard
> > >>>> FormatPlugin.
> > >>>>
> > >>>> JSON
> > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> > >>>>
> > >>>> Text
> > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
> > >>>>
> > >>>> AVRO
> > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
> > >>>>
> > >>>> In all cases, the connection code is pretty light.  A fully
> > >>>> schematized format like log-synth should be even simpler to
> > >>>> implement.
> > >>>>
> > >>>> On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <[email protected]> wrote:
> > >>>>
> > >>>>> I don't think we need a full-on storage plugin.  I think a data
> > >>>>> format should be sufficient, basically CSV on steroids.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <[email protected]> wrote:
> > >>>>>
> > >>>>>> Yeah, we still lack documentation on how to write a storage
> > >>>>>> plugin.  One piece of advice I've seen a lot is to take a look
> > >>>>>> at the mongo-db plugin; it was basically added in one single
> > >>>>>> commit:
> > >>>>>> https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
> > >>>>>>
> > >>>>>> I think this will give some general ideas on what to expect when
> > >>>>>> writing a storage plugin.
> > >>>>>>
> > >>>>>> On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Hakim,
> > >>>>>>>
> > >>>>>>> Not yet.  Still very much in the stage of gathering feedback.
> > >>>>>>>
> > >>>>>>> I would think it very simple.  The biggest obstacles are
> > >>>>>>>
> > >>>>>>> 1) no documentation on how to write a data format
> > >>>>>>>
> > >>>>>>> 2) I need to release a jar for log-synth to Maven Central.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> @Ted, the log-synth storage format would be really useful.
> > >>>>>>>> I'm already seeing many unit tests that could benefit from
> > >>>>>>>> this.  Do you have a github repo for your ongoing work?
> > >>>>>>>>
> > >>>>>>>> Thanks!
> > >>>>>>>>
> > >>>>>>>> On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>>> Are you hard set on using common table expressions?
> > >>>>>>>>>
> > >>>>>>>>> I have discussed a bit off-list creating a data format that
> > >>>>>>>>> would allow tables to be read from a log-synth [1] schema.
> > >>>>>>>>> That would let you read as much data as you might like with
> > >>>>>>>>> an arbitrarily complex (or simple) query.
> > >>>>>>>>>
> > >>>>>>>>> Operationally, you would create a file containing a log-synth
> > >>>>>>>>> schema that has the extension .synth.  Your data source would
> > >>>>>>>>> have to be configured to connect that extension with the
> > >>>>>>>>> log-synth format.  At that point, you could select as much or
> > >>>>>>>>> as little data as you like from the file and you would see
> > >>>>>>>>> generated data rather than the schema.
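> > >>>>>>>>>
> > >>>>>>>>> As a sketch (this format does not exist yet, so the workspace
> > >>>>>>>>> and file names here are hypothetical), a query against such a
> > >>>>>>>>> file could look like:
> > >>>>>>>>>
> > >>>>>>>>> SELECT *
> > >>>>>>>>> FROM dfs.`/tmp/users.synth`
> > >>>>>>>>> LIMIT 1000000;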
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> [1] https://github.com/tdunning/log-synth
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi All,
> > >>>>>>>>>>
> > >>>>>>>>>> I am trying to come up with a query which returns a given
> > >>>>>>>>>> number of rows without having a real table on storage.
> > >>>>>>>>>>
> > >>>>>>>>>> I am hoping to achieve something like this:
> > >>>>>>>>>> http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
> > >>>>>>>>>>
> > >>>>>>>>>> DECLARE @start INT = 1;
> > >>>>>>>>>> DECLARE @end INT = 1000000;
> > >>>>>>>>>> WITH numbers AS (
> > >>>>>>>>>>     SELECT @start AS number
> > >>>>>>>>>>     UNION ALL
> > >>>>>>>>>>     SELECT number + 1
> > >>>>>>>>>>     FROM numbers
> > >>>>>>>>>>     WHERE number < @end)
> > >>>>>>>>>> SELECT *
> > >>>>>>>>>> FROM numbers
> > >>>>>>>>>> OPTION (MAXRECURSION 0);
> > >>>>>>>>>>
> > >>>>>>>>>> I do not actually need to create different values; returning
> > >>>>>>>>>> identical rows would work too.  I just need to bypass the
> > >>>>>>>>>> FROM clause in the query.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> Alex