Good point.  In fact, you can just use a literal expression and some sample
data, such as the TPC-H lineitem table:

SELECT * FROM
  (SELECT l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 AS join_key
   FROM cp.`tpch/lineitem.parquet`) t1
JOIN
  (SELECT l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 AS join_key
   FROM cp.`tpch/lineitem.parquet`) t2
  ON t1.join_key = t2.join_key;

...SNIP...

3,621,030,625 rows selected
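
For scale: the self-join squares the input, and 3,621,030,625 = 60,175^2, so
the bundled lineitem sample has 60,175 rows.  If you want a specific row
count rather than a perfect square (e.g. the 12 million rows Alexander asked
about below), a LIMIT on top of the join caps the output.  A minimal,
untested sketch along the same lines:

SELECT t1.l_orderkey, t1.l_shipdate
FROM
  (SELECT l_orderkey, l_shipdate, 1 AS join_key
   FROM cp.`tpch/lineitem.parquet`) t1
JOIN
  (SELECT 1 AS join_key FROM cp.`tpch/lineitem.parquet`) t2
  ON t1.join_key = t2.join_key
LIMIT 12000000;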

On Sat, Jul 18, 2015 at 2:38 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Jacques, Alexander,
>
> The Cartesian join approach will produce an enormous stream of data with
> only a very small amount of disk read.  It works now.  If you start with a
> small file, you can increase the amount of output in small increments by
> simply adding another join. You do have to have some kind of equality join
> involved, since Drill will (anomalously, in my view) refuse to plan the
> query otherwise.
>
> Here is an example.  Note that I have deleted many lines of output and
> indicated that with an ellipsis (...).  The three queries here produce 4,
> 16, and 64 lines of output from a 4-line input file.  That could have been
> n, n^2, and n^3 just as easily for any value of n that you might like.
>
> ted:apache-drill-1.0.0$ cat > x.csv <<!
> > a,1
> > a,2
> > a,3
> > a,4
> > !
> ted:apache-drill-1.0.0$     bin/sqlline -u jdbc:drill:zk=local -n admin -p
> admin
> 0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
>  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x;
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | a       | 1       |
> | a       | 2       |
> | a       | 3       |
> | a       | 4       |
> +---------+---------+
> 4 rows selected (0.097 seconds)
> 0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
>  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y where x.columns[0] =
> y.columns[0];
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> | a       | 2       |
> ...
> | a       | 4       |
> | a       | 4       |
> +---------+---------+
> 16 rows selected (0.614 seconds)
> 0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
>  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z where x.columns[0] =
> y.columns[0] and x.columns[0] = z.columns[0];
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> ...
> | a       | 4       |
> | a       | 4       |
> | a       | 4       |
> | a       | 4       |
> +---------+---------+
> 64 rows selected (0.63 seconds)
> 0: jdbc:drill:zk=local>
>
>
>
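> If your rows share no common value to join on, a constant key satisfies the
> equality-join requirement just as well; an untested sketch against the same
> x.csv:
>
> select x.columns[0], x.columns[1]
> from (select columns, 1 as k
>       from dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv`) x
> join (select 1 as k
>       from dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv`) y
>   on x.k = y.k;
>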
> On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau <jacq...@dremio.com>
> wrote:
>
> > Removing cross-posting
> >
> > Alexander,
> >
> > There is currently no way for Drill to generate a large amount of data
> > using SQL.  However, you can generate large generic data by using the
> > MockStoragePlugin if you submit a plan.  You can find an example plan
> > using this at [1].
> >
> > I heard someone might be working on extending the MockStoragePlugin to
> > support SQL, which would provide the outcome you requested.
> >
> > [1]
> > https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
> >
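> > For a rough idea of the shape of such a plan, here is a sketch
> > reconstructed from memory of [1]; the field names are illustrative, so
> > treat the linked file as authoritative:
> >
> > {
> >   "head" : { "type" : "APACHE_DRILL_PHYSICAL", "version" : "1",
> >              "generator" : { "type" : "manual" } },
> >   "graph" : [
> >     { "@id" : 1, "pop" : "mock-scan", "url" : "http://apache.org",
> >       "entries" : [ { "records" : 1000000, "types" : [
> >         { "name" : "blue", "type" : "INT", "mode" : "REQUIRED" } ] } ] },
> >     { "@id" : 2, "pop" : "screen", "child" : 1 } ]
> > }
> >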
> > On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > >
> > > Also, just doing a Cartesian join of three copies of 1000 records will
> > > give you a billion records with negligible I/O.
> > >
> > > Sent from my iPhone
> > >
> > > > On Jul 16, 2015, at 15:43, Jason Altekruse <altekruseja...@gmail.com> wrote:
> > > >
> > > > @Alexander If you want to test the speed of the ODBC driver you can
> > > > do that without a new storage plugin.
> > > >
> > > > If you get the entire dataset into memory, it will be returned from
> > > > Drill as quickly as we can possibly send it to the client. One way to
> > > > do this is to insert a sort; we cannot send along any of the data
> > > > until the complete sort is done. As long as you don't read so much
> > > > data that we will start spilling the sort to disk, all of the records
> > > > will be in memory. To take the read and sort time out of your test,
> > > > just make sure to record the time you first receive data from Drill,
> > > > not the query start time.
> > > >
> > > > There is one gotcha here. To make the BI tools more responsive, we
> > > > implemented a feature that will send along one empty batch of records
> > > > with the schema information populated. This schema is generated by
> > > > applying all of the transformations that happen throughout the query.
> > > > For example, the join operator handles this schema population by
> > > > sending along the schema merged from the two sides of the join;
> > > > project will similarly add or remove columns based on the expressions
> > > > and columns requested. You will want to make sure you record your
> > > > start time when you receive the first batch with actual records. This
> > > > can give you an accurate measurement of the ODBC performance,
> > > > removing the bottleneck of the disk.
> > > >
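> > > > A minimal way to force that buffering, assuming the table fits in
> > > > memory so the sort does not spill (untested sketch):
> > > >
> > > > select * from cp.`tpch/lineitem.parquet` order by l_orderkey;
> > > >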
> > > > On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> > > >
> > > >> Thanks for the answers.
> > > >>
> > > >> @Ted my only goal is to pump a large amount of data without having
> > > >> to read from the hard disk. I am measuring the ODBC driver
> > > >> performance and I need a higher data transfer rate, so any method
> > > >> that helps pump data out of Drill faster would help. log-synth seems
> > > >> a good way to generate data for testing; however, I'd need a
> > > >> RAM-only option, which hopefully provides higher throughput.
> > > >>
> > > >> @Jacques How involved is it to write a dummy plugin that returns one
> > > >> hardcoded row 12 million times?
> > > >>
> > > >> Thanks,
> > > >> Alex
> > > >>
> > > >> On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > >>
> > > >>> It may be easy, but it is completely opaque about what really
> > > >>> needs to happen.
> > > >>>
> > > >>> For instance,
> > > >>>
> > > >>> 1) how is the schema exposed?
> > > >>>
> > > >>> 2) which classes do I really need to implement?
> > > >>>
> > > >>> 3) how do I express partitioning of a format?
> > > >>>
> > > >>> 4) how do I test it?
> > > >>>
> > > >>> Just a bit of documentation and comments would go a very, very long
> > > >>> way.
> > > >>>
> > > >>> Even answers on the mailing list with more details than "oh, that's
> > > >>> easy" would help.  I would be happy to transcribe answers into the
> > > >>> code if I could just get some.
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> > > >>>
> > > >>>> Creating an EasyFormatPlugin is pretty simple.  They were
> > > >>>> designed to get rid of much of the scaffolding required for a
> > > >>>> standard FormatPlugin.
> > > >>>>
> > > >>>> JSON
> > > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> > > >>>>
> > > >>>> Text
> > > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
> > > >>>>
> > > >>>> AVRO
> > > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
> > > >>>>
> > > >>>> In all cases, the connection code is pretty light.  A fully
> > > >>>> schematized format like log-synth should be even simpler to
> > > >>>> implement.
> > > >>>>
> > > >>>> On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > >>>>
> > > >>>>> I don't think we need a full-on storage plugin.  I think a data
> > > >>>>> format should be sufficient, basically CSV on steroids.
> > > >>>>>
> > > >>>>> On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > >>>>>
> > > >>>>>> Yeah, we still lack documentation on how to write a storage
> > > >>>>>> plugin.  One piece of advice I've seen a lot is to take a look
> > > >>>>>> at the mongo-db plugin; it was basically added in one single
> > > >>>>>> commit:
> > > >>>>>> https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
> > > >>>>>>
> > > >>>>>> I think this will give some general ideas on what to expect when
> > > >>>>>> writing a storage plugin.
> > > >>>>>>
> > > >>>>>> On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>> Hakim,
> > > >>>>>>>
> > > >>>>>>> Not yet.  Still very much in the stage of gathering feedback.
> > > >>>>>>>
> > > >>>>>>> I would think it very simple.  The biggest obstacles are
> > > >>>>>>>
> > > >>>>>>> 1) no documentation on how to write a data format
> > > >>>>>>>
> > > >>>>>>> 2) I need to release a jar for log-synth to Maven Central.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> @Ted, the log-synth storage format would be really useful. I'm
> > > >>>>>>>> already seeing many unit tests that could benefit from this.
> > > >>>>>>>> Do you have a GitHub repo for your ongoing work?
> > > >>>>>>>>
> > > >>>>>>>> Thanks!
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Are you hard set on using common table expressions?
> > > >>>>>>>>>
> > > >>>>>>>>> I have discussed a bit off-list creating a data format that
> > > >>>>>>>>> would allow tables to be read from a log-synth [1] schema.
> > > >>>>>>>>> That would let you read as much data as you might like with
> > > >>>>>>>>> an arbitrarily complex (or simple) query.
> > > >>>>>>>>>
> > > >>>>>>>>> Operationally, you would create a file containing a
> > > >>>>>>>>> log-synth schema that has the extension .synth.  Your data
> > > >>>>>>>>> source would have to be configured to connect that extension
> > > >>>>>>>>> with the log-synth format.  At that point, you could select
> > > >>>>>>>>> as much or as little data as you like from the file and you
> > > >>>>>>>>> would see generated data rather than the schema.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> [1] https://github.com/tdunning/log-synth
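> > > >>>>>>>>>
> > > >>>>>>>>> For a flavor of what a .synth file might contain: log-synth
> > > >>>>>>>>> schemas are JSON lists of field samplers.  A small sketch,
> > > >>>>>>>>> with sampler names as shown in the log-synth README
> > > >>>>>>>>> (illustrative only):
> > > >>>>>>>>>
> > > >>>>>>>>> [
> > > >>>>>>>>>   {"name": "id", "class": "id"},
> > > >>>>>>>>>   {"name": "who", "class": "name"},
> > > >>>>>>>>>   {"name": "amount", "class": "int", "min": 0, "max": 100}
> > > >>>>>>>>> ]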
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi All,
> > > >>>>>>>>>>
> > > >>>>>>>>>> I am trying to come up with a query which returns a given
> > > >>>>>>>>>> number of rows without having a real table on storage.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I am hoping to achieve something like this:
> > > >>>>>>>>>> http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
> > > >>>>>>>>>>
> > > >>>>>>>>>> DECLARE @start INT = 1;
> > > >>>>>>>>>> DECLARE @end INT = 1000000;
> > > >>>>>>>>>> WITH numbers AS (
> > > >>>>>>>>>>     SELECT @start AS number
> > > >>>>>>>>>>     UNION ALL
> > > >>>>>>>>>>     SELECT number + 1
> > > >>>>>>>>>>     FROM numbers
> > > >>>>>>>>>>     WHERE number < @end)
> > > >>>>>>>>>> SELECT * FROM numbers
> > > >>>>>>>>>> OPTION (MAXRECURSION 0);
> > > >>>>>>>>>>
> > > >>>>>>>>>> I do not actually need to create different values; returning
> > > >>>>>>>>>> identical rows would work too. I just need to bypass the
> > > >>>>>>>>>> "from clause" in the query.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks,
> > > >>>>>>>>>> Alex
