Ted Dunning wrote:
The cartesian join approach will produce an enormous stream of data with
only a very small amount of disk read.
You don't even need an external seed file (or to query a built-in table), and 
using WITH can extend multiplication to exponentiation:

WITH q(key) AS (
  WITH q(key) AS (
    VALUES 1, 1
  )
  SELECT q1.key
    FROM q q1
    INNER JOIN q q2  ON q1.key = q2.key
    INNER JOIN q q3  ON q2.key = q3.key
    INNER JOIN q q4  ON q3.key = q4.key
)
SELECT COUNT(*)
  FROM q q1
  INNER JOIN q q2 ON q1.key = q2.key
  INNER JOIN q q3 ON q2.key = q3.key
  INNER JOIN q q4 ON q3.key = q4.key ;

That generates 65536 ((2^4)^4) rows (before the COUNT(*)): the inner query's
four-way self-join over the two VALUES rows yields 2^4 = 16 rows, and the outer
four-way self-join over those yields 16^4 = 65536.

Augmenting the table value constructor (VALUES ...) with a third row (a third
"1") increases the resulting row count to 43,046,721 ((3^4)^4).

Going to four rows bumps that to about 4 billion ((4^4)^4):

0: jdbc:drill:drillbit=localhost> WITH q(key) AS (
    WITH q(key) AS (VALUES 1, 1, 1, 1)
    SELECT q1.key
      FROM q q1
      INNER JOIN q q2 ON q1.key = q2.key
      INNER JOIN q q3 ON q2.key = q3.key
      INNER JOIN q q4 ON q3.key = q4.key
  )
  SELECT COUNT(*)
    FROM q q1
    INNER JOIN q q2 ON q1.key = q2.key
    INNER JOIN q q3 ON q2.key = q3.key
    INNER JOIN q q4 ON q3.key = q4.key ;
+-------------+
|   EXPR$0    |
+-------------+
| 4294967296  |
+-------------+
1 row selected (1197.674 seconds)

!

Daniel


Ted Dunning wrote:
Jacques, Alexander,

The cartesian join approach will produce an enormous stream of data with
only a very small amount of disk read. It works now. If you start with a
small file, you can increase the amount of output in small increments by
simply adding another join. You do have to have some kind of equality join
involved, since Drill will (anomalously, in my view) refuse to plan the query
otherwise.

Here is an example. Note that I have deleted many lines of output and
indicated that with an ellipsis (...). The three queries here produce 4, 16,
and 64 lines of output from a 4-line input file. That could just as easily
have been n, n^2, and n^3 for any value of n you might like.

ted:apache-drill-1.0.0$ cat > x.csv <<!
a,1
a,2
a,3
a,4
!
ted:apache-drill-1.0.0$ bin/sqlline -u jdbc:drill:zk=local -n admin -p admin
0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 2       |
| a       | 3       |
| a       | 4       |
+---------+---------+
4 rows selected (0.097 seconds)
0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y
  where x.columns[0] = y.columns[0];
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 2       |
...
| a       | 4       |
| a       | 4       |
+---------+---------+
16 rows selected (0.614 seconds)
0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z
  where x.columns[0] = y.columns[0] and x.columns[0] = z.columns[0];
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
...
| a       | 4       |
| a       | 4       |
| a       | 4       |
| a       | 4       |
+---------+---------+
64 rows selected (0.63 seconds)
0: jdbc:drill:zk=local>
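
Following the same pattern, a fourth copy of the file multiplies the output by
another factor of 4. This is only a sketch (the alias w and the COUNT(*)
wrapper are not from the session above), but it should yield 4^4 = 256 rows:

select count(*) from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` w
  where x.columns[0] = y.columns[0]
  and x.columns[0] = z.columns[0]
  and x.columns[0] = w.columns[0];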



On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau <jacq...@dremio.com> wrote:

Removing cross-posting.

Alexander,

There is currently no way for Drill to generate a large amount of data using
SQL. However, you can generate large generic data by using the
MockStoragePlugin if you submit a plan. You can find an example plan using
this at [1].

I heard someone might be working on extending the MockStoragePlugin to
support SQL, which would provide the outcome you requested.

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Also, just doing a Cartesian join of three copies of 1000 records will
give you a billion records with negligible I/O.
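
A sketch of what that could look like in Drill (the file path is hypothetical,
and it assumes every row of seed.csv carries the same value in its first
column so the equality predicates keep all combinations):

select count(*) from
  dfs.`/tmp/seed.csv` x,
  dfs.`/tmp/seed.csv` y,
  dfs.`/tmp/seed.csv` z
  where x.columns[0] = y.columns[0]
  and x.columns[0] = z.columns[0];
-- 1000 * 1000 * 1000 = 1,000,000,000 rows feed the COUNT(*)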

Sent from my iPhone

On Jul 16, 2015, at 15:43, Jason Altekruse <altekruseja...@gmail.com> wrote:
@Alexander If you want to test the speed of the ODBC driver, you can do that
without a new storage plugin.

If you get the entire dataset into memory, it will be returned from Drill as
quickly as we can possibly send it to the client. One way to do this is to
insert a sort; we cannot send along any of the data until the complete sort
is done. As long as you don't read so much data that we start spilling the
sort to disk, all of the records will be in memory. To take the read and sort
time out of your test, just make sure to record the time you first receive
data from Drill, not the query start time.
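
For instance, a query along these lines (the path and column references are
placeholders, not taken from this thread) forces the whole result set to be
sorted, and therefore buffered, before the first data batch reaches the
client:

select t.columns[0], t.columns[1]
  from dfs.`/tmp/test-data.csv` t
  order by t.columns[1];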

There is one gotcha here. To make the BI tools more responsive, we
implemented a feature that sends along one empty batch of records with the
schema information populated. This schema is generated by applying all of the
transformations that happen throughout the query. For example, the join
operator handles this schema population by sending along the schema merged
from the two sides of the join; project will similarly add or remove columns
based on the expressions and columns requested. You will want to make sure
you record your start time when you receive the first batch with actual
records. This can give you an accurate measurement of the ODBC performance,
removing the bottleneck of the disk.

On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
Thanks for the answers.

@Ted my only goal is to pump a large amount of data without having to read
from the hard disk. I am measuring the ODBC driver's performance and I need a
higher data transfer rate, so any method that helps pump data out of Drill
faster would help. log-synth seems a good way to generate data for testing.
However, I'd need a RAM-only option which hopefully provides higher
throughput.

@Jacques How involved is it to write a dummy plugin that returns one
hardcoded row repeatedly 12 million times?

Thanks,
Alex

On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

It may be easy, but it is completely opaque about what really needs to
happen.

For instance,

1) how is schema exposed?

2) which classes do I really need to implement?

3) how do I express partitioning of a format?

4) how do I test it?

Just a bit of documentation and comments would go a very, very long way.
Even answers on the mailing list with more detail than "oh, that's easy"
would help. I would be happy to transcribe answers into the code if I could
just get some.



On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacq...@apache.org> wrote:

Creating an EasyFormatPlugin is pretty simple. They were designed to get rid
of much of the scaffolding required for a standard FormatPlugin.

JSON: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
Text: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
AVRO: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro

In all cases, the connection code is pretty light. A fully schematized format
like log-synth should be even simpler to implement.

On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

I don't think we need a full-on storage plugin. I think a data format should
be sufficient, basically CSV on steroids.





On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:

Yeah, we still lack documentation on how to write a storage plugin. One piece
of advice I've seen a lot is to take a look at the mongo-db plugin; it was
basically added in one single commit:
https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
I think this will give some general ideas on what to expect when writing a
storage plugin.

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Hakim,

Not yet.  Still very much in the stage of gathering feedback.

I would think it very simple.  The biggest obstacles are

1) no documentation on how to write a data format

2) I need to release a jar for log-synth to Maven Central.




On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:

@Ted, the log-synth storage format would be really useful. I'm already seeing
many unit tests that could benefit from this. Do you have a github repo for
your ongoing work?

Thanks!

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Are you hard set on using common table expressions?

I have discussed a bit off-list creating a data format that would allow
tables to be read from a log-synth [1] schema. That would let you read as
much data as you might like with an arbitrarily complex (or simple) query.

Operationally, you would create a file containing a log-synth schema that has
the extension .synth. Your data source would have to be configured to connect
that extension with the log-synth format. At that point, you could select as
much or as little data as you like from the file and you would see generated
data rather than the schema.



[1] https://github.com/tdunning/log-synth
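
As a sketch of the intended usage (nothing below exists yet; the file name
and row limit are made up for illustration), reading generated data would
then look like querying any other file:

select * from dfs.`/tmp/customers.synth` limit 1000000;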

On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
Hi All,

I am trying to come up with a query which returns a given number of rows
without having a real table in storage.

I am hoping to achieve something like this:
http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table

DECLARE @start INT = 1;
DECLARE @end INT = 1000000;

WITH numbers AS (
    SELECT @start AS number
    UNION ALL
    SELECT number + 1
    FROM numbers
    WHERE number < @end
)
SELECT *
FROM numbers
OPTION (MAXRECURSION 0);
I do not actually need to create different values, and returning identical
rows would work too. I just need to bypass the "from clause" in the query.
Thanks,
Alex




--
Daniel Barclay
MapR Technologies
