Thanks for the elaboration, Ted, Jacques and Jason! @Ted that is a very cool idea. I tried the cross join but found that CROSS JOIN is not supported in Drill yet; we have DRILL-786 for it. The new method looks very promising. It seems to be an implicit cross join, doesn't it? I just tried it out and it worked like a charm. I will go on with this method.
@Jacques, yes, as Jason said, we discussed this before, and I have talked to my colleagues about helping me modify the ODBC driver so it sends a plan. Also thanks for the query. I tried it out for two tables and it worked fine, but extending it to three tables gives me a syntax error:

select *
from ((select column1, 1 as join_keyb
       from (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1
       join (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2
         on t1.join_key = t2.join_key) t12
      join (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
        on t12.join_keyb = t3.join_key)

The other syntax was easier for me to extend to three joins, so I could test it with:

select t1.column1
from (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1,
     (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2,
     (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
where t1.join_key = t2.join_key
  and t1.join_key = t3.join_key

Thank you very much for your time Ted, Jacques and Jason!

Thanks,
Alex

On Fri, Jul 17, 2015 at 2:09 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
> Jacques,
>
> Alexander has brought up this problem previously in one of the hangouts and
> said that submitting a physical plan was not possible through ODBC. If he
> is able to modify the driver code to make it possible to submit one, that
> would be an option, as I believe the C++ client is capable of submitting
> plans. The issue I seem to recall him mentioning is that the ODBC driver
> was running a little sanity checking on the sql query to try to prevent
> submitting complete garbage queries to a server. I think he had concerns
> that a JSON formatted physical plan would fail these checks and he would
> have to disable them along with trying to allow submitting two types of
> queries from ODBC.
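For anyone following along, the constant-join-key trick in the queries above can be sketched outside Drill. This is a minimal sketch using an in-memory SQLite database; the hypothetical 10-row `double_table` stands in for the Hive table from the thread.

```python
import sqlite3

# Minimal sketch of the constant-join-key trick: an in-memory SQLite
# stand-in (hypothetical 10-row "double_table") for the Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE double_table (column1 REAL)")
conn.executemany("INSERT INTO double_table VALUES (?)",
                 [(float(i),) for i in range(10)])

# Tagging every row with the same constant join_key turns the equi-join
# into an implicit cross join: 10 * 10 * 10 = 1000 rows from 10 rows.
query = """
SELECT t1.column1
FROM (SELECT column1, 1 AS join_key FROM double_table) t1,
     (SELECT column1, 1 AS join_key FROM double_table) t2,
     (SELECT column1, 1 AS join_key FROM double_table) t3
WHERE t1.join_key = t2.join_key
  AND t1.join_key = t3.join_key
"""
rows = conn.execute(query).fetchall()
print(len(rows))  # 1000
```

Each extra self-join multiplies the row count, which is why three copies of 1000 records yield a billion rows with negligible I/O.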
> On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau <jacq...@dremio.com> wrote:
> > Removing cross posting
> >
> > Alexander,
> >
> > There is currently no way for Drill to generate a large amount of data
> > using SQL. However, you can generate large generic data by using the
> > MockStoragePlugin if you submit a plan. You can find an example plan
> > using this at [1].
> >
> > I heard someone might be working on extending the MockStoragePlugin to
> > support SQL, which would provide the outcome you requested.
> >
> > [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
> >
> > On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > Also, just doing a Cartesian join of three copies of 1000 records will
> > > give you a billion records with negligible I/O.
> > >
> > > Sent from my iPhone
> > >
> > > On Jul 16, 2015, at 15:43, Jason Altekruse <altekruseja...@gmail.com> wrote:
> > > > @Alexander If you want to test the speed of the ODBC driver you can
> > > > do that without a new storage plugin.
> > > >
> > > > If you get the entire dataset into memory, it will be returned from
> > > > Drill as quickly as we can possibly send it to the client. One way
> > > > to do this is to insert a sort; we cannot send along any of the data
> > > > until the complete sort is done. As long as you don't read so much
> > > > data that we will start spilling the sort to disk, all of the
> > > > records will be in memory. To take the read and sort time out of
> > > > your test, just make sure to record the time you first receive data
> > > > from Drill, not the query start time.
> > > >
> > > > There is one gotcha here. To make the BI tools more responsive, we
> > > > implemented a feature that will send along one empty batch of
> > > > records with the schema information populated.
> > > > This schema is generated by applying all of the transformations
> > > > that happen throughout the query. For example, the join operator
> > > > handles this schema population by sending along the schema merged
> > > > from the two sides of the join; project will similarly add or
> > > > remove columns based on the expressions and columns requested. You
> > > > will want to make sure you record your start time when you receive
> > > > the first batch with actual records. This can give you an accurate
> > > > measurement of the ODBC performance, removing the bottleneck of the
> > > > disk.
> > > >
> > > > On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> > > > > Thanks for the answers.
> > > > >
> > > > > @Ted my only goal is to pump a large amount of data without
> > > > > having to read from Hard Disk. I am measuring the ODBC driver
> > > > > performance and I need a higher data transfer rate. So any method
> > > > > that helps pumping data out of Drill faster would help. The
> > > > > log-synth seems a good way to generate data for testing. However,
> > > > > I'd need a RAM-only option which hopefully provides a higher
> > > > > throughput.
> > > > >
> > > > > @Jacques How involved is it to write a dummy plugin that returns
> > > > > one hardcoded row repeatedly 12 million times?
> > > > >
> > > > > Thanks,
> > > > > Alex
> > > > >
> > > > > On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > > > It may be easy, but it is completely opaque about what really
> > > > > > needs to happen.
> > > > > >
> > > > > > For instance,
> > > > > >
> > > > > > 1) how is schema exposed?
> > > > > >
> > > > > > 2) which classes do I really need to implement?
> > > > > >
> > > > > > 3) how do I express partitioning of a format?
> > > > > >
> > > > > > 4) how do I test it?
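Jason's batch-timing advice quoted above can be illustrated with a small simulation. This is only a sketch: `fetch_batches` is a made-up stand-in for a real ODBC fetch loop, not a Drill or driver API.

```python
import time

# Toy illustration of the advice above: skip the empty schema-only batch
# and start the clock at the first batch carrying real records.
# fetch_batches() is a hypothetical stand-in, NOT a Drill API.
def fetch_batches():
    yield []                      # empty schema-only batch from the server
    yield [("a", 1), ("b", 2)]    # first batch with actual records
    yield [("c", 3)]

start = None
record_count = 0
for batch in fetch_batches():
    if not batch:
        continue                  # don't start timing on the schema batch
    if start is None:
        start = time.monotonic()  # measure from the first real data batch
    record_count += len(batch)
elapsed = time.monotonic() - start
print(record_count)  # 3
```

Starting the timer inside the loop rather than at query submission removes the read-and-sort time from the measurement, leaving mostly client transfer cost.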
> > > > > > Just a bit of documentation and comments would go a very, very
> > > > > > long way.
> > > > > >
> > > > > > Even answers on the mailing list that have more details than
> > > > > > "oh, that's easy". I would be happy to transcribe answers into
> > > > > > the code if I could just get some.
> > > > > >
> > > > > > On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > Creating an EasyFormatPlugin is pretty simple. They were
> > > > > > > designed to get rid of much of the scaffolding required for a
> > > > > > > standard FormatPlugin.
> > > > > > >
> > > > > > > JSON
> > > > > > > https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> > > > > > >
> > > > > > > Text
> > > > > > > https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
> > > > > > >
> > > > > > > AVRO
> > > > > > > https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
> > > > > > >
> > > > > > > In all cases, the connection code is pretty light. A fully
> > > > > > > schematized format like log-synth should be even simpler to
> > > > > > > implement.
> > > > > > >
> > > > > > > On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > > > > > I don't think we need a full-on storage plugin. I think a
> > > > > > > > data format should be sufficient, basically CSV on steroids.
> > > > > > > >
> > > > > > > > On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > > > > > > > Yeah, we still lack documentation on how to write a
> > > > > > > > > storage plugin.
> > > > > > > > > One piece of advice I've been seeing a lot is to take a
> > > > > > > > > look at the mongo-db plugin; it was basically added in one
> > > > > > > > > single commit:
> > > > > > > > > https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
> > > > > > > > >
> > > > > > > > > I think this will give some general ideas on what to
> > > > > > > > > expect when writing a storage plugin.
> > > > > > > > >
> > > > > > > > > On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > > > > > > > Hakim,
> > > > > > > > > >
> > > > > > > > > > Not yet. Still very much in the stage of gathering
> > > > > > > > > > feedback.
> > > > > > > > > >
> > > > > > > > > > I would think it very simple. The biggest obstacles are
> > > > > > > > > >
> > > > > > > > > > 1) no documentation on how to write a data format
> > > > > > > > > >
> > > > > > > > > > 2) I need to release a jar for log-synth to Maven
> > > > > > > > > > Central.
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > > > > > > > > > @Ted, the log-synth storage format would be really
> > > > > > > > > > > useful. I'm already seeing many unit tests that could
> > > > > > > > > > > benefit from this. Do you have a github repo for your
> > > > > > > > > > > ongoing work?
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > > > > > > > > > Are you hard set on using common table expressions?
> > > > > > > > > > > >
> > > > > > > > > > > > I have discussed a bit off-list creating a data
> > > > > > > > > > > > format that would allow tables to be read from a
> > > > > > > > > > > > log-synth [1] schema.
> > > > > > > > > > > > That would let you read as much data as you might
> > > > > > > > > > > > like with an arbitrarily complex (or simple) query.
> > > > > > > > > > > >
> > > > > > > > > > > > Operationally, you would create a file containing a
> > > > > > > > > > > > log-synth schema that has the extension .synth. Your
> > > > > > > > > > > > data source would have to be configured to connect
> > > > > > > > > > > > that extension with the log-synth format. At that
> > > > > > > > > > > > point, you could select as much or as little data as
> > > > > > > > > > > > you like from the file and you would see generated
> > > > > > > > > > > > data rather than the schema.
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://github.com/tdunning/log-synth
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am trying to come up with a query which returns
> > > > > > > > > > > > > a given number of rows without having a real table
> > > > > > > > > > > > > on Storage.
> > > >>>>>>>>>> > > > >>>>>>>>>> I am hoping to achieve something like this: > > > >> > > > > > > http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table > > > >>>>>>>>>> > > > >>>>>>>>>> DECLARE @start INT = 1;DECLARE @end INT = 1000000; > > > >>>>>>>>>> WITH numbers AS ( > > > >>>>>>>>>> SELECT @start AS number > > > >>>>>>>>>> UNION ALL > > > >>>>>>>>>> SELECT number + 1 > > > >>>>>>>>>> FROM numbers > > > >>>>>>>>>> WHERE number < @end)SELECT *FROM numbersOPTION > > > >>>> (MAXRECURSION > > > >>>>>> 0); > > > >>>>>>>>>> > > > >>>>>>>>>> I do not actually need to create different values and > > > >>> returning > > > >>>>>>>> identical > > > >>>>>>>>>> rows would work too.I just need to bypass the "from clause" > > > >>> in > > > >>>>> the > > > >>>>>>>> query. > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks, > > > >>>>>>>>>> Alex > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> -- > > > >>>>>>>> > > > >>>>>>>> Abdelhakim Deneche > > > >>>>>>>> > > > >>>>>>>> Software Engineer > > > >>>>>>>> > > > >>>>>>>> <http://www.mapr.com/> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> Now Available - Free Hadoop On-Demand Training > > > >>>>>>>> < > > > >> > > > > > > http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> -- > > > >>>>>> > > > >>>>>> Abdelhakim Deneche > > > >>>>>> > > > >>>>>> Software Engineer > > > >>>>>> > > > >>>>>> <http://www.mapr.com/> > > > >>>>>> > > > >>>>>> > > > >>>>>> Now Available - Free Hadoop On-Demand Training > > > >>>>>> < > > > >> > > > > > > http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available > > > >> > > > > > >