Ted Dunning wrote:
The cartesian join approach will produce an enormous stream of data with
only a very small amount of disk read.
You don't even need an external seed file (or to query a built-in table), and 
using WITH can extend multiplication to exponentiation:

WITH q(key) AS (
  WITH q(key) AS (
    VALUES 1, 1
  )
  SELECT q1.key
    FROM q q1
    INNER JOIN q q2  ON q1.key = q2.key
    INNER JOIN q q3  ON q2.key = q3.key
    INNER JOIN q q4  ON q3.key = q4.key
)
SELECT COUNT(*)
  FROM q q1
  INNER JOIN q q2 ON q1.key = q2.key
  INNER JOIN q q3 ON q2.key = q3.key
  INNER JOIN q q4 ON q3.key = q4.key ;

That generates 65536 ((2^4)^4) rows (before the COUNT(*)): the inner query's
four-way self-join over the two VALUES rows yields 2^4 = 16 rows, and the outer
four-way self-join over those yields 16^4 = 65536.

Augmenting the table value constructor (VALUES ...) with a third row (a third
"1") increases the resulting row count to 43,046,721 ((3^4)^4).

Going to four rows bumps that to about 4 billion ((4^4)^4):

0: jdbc:drill:drillbit=localhost> WITH q(key) AS (
    WITH q(key) AS (VALUES 1, 1, 1, 1)
    SELECT q1.key
      FROM q q1
      INNER JOIN q q2 ON q1.key = q2.key
      INNER JOIN q q3 ON q2.key = q3.key
      INNER JOIN q q4 ON q3.key = q4.key
  )
  SELECT COUNT(*)
    FROM q q1
    INNER JOIN q q2 ON q1.key = q2.key
    INNER JOIN q q3 ON q2.key = q3.key
    INNER JOIN q q4 ON q3.key = q4.key ;
+-------------+
|   EXPR$0    |
+-------------+
| 4294967296  |
+-------------+
1 row selected (1197.674 seconds)

!

Daniel


Ted Dunning wrote:
Jacques, Alexander,

The cartesian join approach will produce an enormous stream of data with
only a very small amount of disk read. It works now. If you start with a
small file, you can increase the amount of output in small increments by
simply adding another join. You do have to have some kind of equality join
involved, since Drill will (anomalously, in my view) refuse to plan the query
otherwise.

Here is an example. Note that I have deleted many lines of output and
indicated that with an ellipsis (...). The three queries here produce 4, 16,
and 64 lines of output from a 4-line input file. That could just as easily
have been n, n^2, and n^3 for any value of n you might like.

ted:apache-drill-1.0.0$ cat > x.csv <<!
a,1
a,2
a,3
a,4
!
ted:apache-drill-1.0.0$ bin/sqlline -u jdbc:drill:zk=local -n admin -p admin
0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 2       |
| a       | 3       |
| a       | 4       |
+---------+---------+
4 rows selected (0.097 seconds)
0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y
  where x.columns[0] = y.columns[0];
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 2       |
...
| a       | 4       |
| a       | 4       |
+---------+---------+
16 rows selected (0.614 seconds)
0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z
  where x.columns[0] = y.columns[0] and x.columns[0] = z.columns[0];
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
...
| a       | 4       |
| a       | 4       |
| a       | 4       |
| a       | 4       |
+---------+---------+
64 rows selected (0.63 seconds)
0: jdbc:drill:zk=local>
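
Following the same pattern, a fourth copy of the file multiplies the output by
another factor of 4. This is only a sketch (the alias w and the COUNT(*)
wrapper are not from the session above), but it should yield 4^4 = 256 rows:

select count(*) from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z,
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` w
  where x.columns[0] = y.columns[0]
  and x.columns[0] = z.columns[0]
  and x.columns[0] = w.columns[0];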



On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau <jacq...@dremio.com> wrote:

Removing cross-posting.

Alexander,

There is currently no way for Drill to generate a large amount of data using
SQL. However, you can generate large generic data by using the
MockStoragePlugin if you submit a plan. You can find an example plan using
this at [1].

I heard someone might be working on extending the MockStoragePlugin to
support SQL, which would provide the outcome you requested.

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Also, just doing a Cartesian join of three copies of 1000 records will
give you a billion records with negligible I/O.
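
A sketch of what that could look like in Drill (the file path is hypothetical,
and it assumes every row of seed.csv carries the same value in its first
column so the equality predicates keep all combinations):

select count(*) from
  dfs.`/tmp/seed.csv` x,
  dfs.`/tmp/seed.csv` y,
  dfs.`/tmp/seed.csv` z
  where x.columns[0] = y.columns[0]
  and x.columns[0] = z.columns[0];
-- 1000 * 1000 * 1000 = 1,000,000,000 rows feed the COUNT(*)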

Sent from my iPhone

On Jul 16, 2015, at 15:43, Jason Altekruse <altekruseja...@gmail.com> wrote:
@Alexander If you want to test the speed of the ODBC driver, you can do that
without a new storage plugin.

If you get the entire dataset into memory, it will be returned from Drill as
quickly as we can possibly send it to the client. One way to do this is to
insert a sort; we cannot send along any of the data until the complete sort
is done. As long as you don't read so much data that we start spilling the
sort to disk, all of the records will be in memory. To take the read and sort
time out of your test, just make sure to record the time you first receive
data from Drill, not the query start time.
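
For instance, a query along these lines (the path and column references are
placeholders, not taken from this thread) forces the whole result set to be
sorted, and therefore buffered, before the first data batch reaches the
client:

select t.columns[0], t.columns[1]
  from dfs.`/tmp/test-data.csv` t
  order by t.columns[1];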

There is one gotcha here. To make the BI tools more responsive, we
implemented a feature that sends along one empty batch of records with the
schema information populated. This schema is generated by applying all of the
transformations that happen throughout the query. For example, the join
operator handles this schema population by sending along the schema merged
from the two sides of the join; project will similarly add or remove columns
based on the expressions and columns requested. You will want to make sure
you record your start time when you receive the first batch with actual
records. This can give you an accurate measurement of the ODBC performance,
removing the bottleneck of the disk.

On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
Thanks for the answers.

@Ted my only goal is to pump a large amount of data without having to read
from the hard disk. I am measuring the ODBC driver's performance and I need a
higher data transfer rate, so any method that helps pump data out of Drill
faster would help. log-synth seems a good way to generate data for testing.
However, I'd need a RAM-only option which hopefully provides higher
throughput.

@Jacques How involved is it to write a dummy plugin that returns one
hardcoded row repeatedly 12 million times?

Thanks,
Alex

On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

It may be easy, but it is completely opaque about what really needs to
happen.

For instance,

1) how is schema exposed?

2) which classes do I really need to implement?

3) how do I express partitioning of a format?

4) how do I test it?

Just a bit of documentation and comments would go a very, very long way.
Even answers on the mailing list with more detail than "oh, that's easy"
would help. I would be happy to transcribe answers into the code if I could
just get some.



On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacq...@apache.org> wrote:

Creating an EasyFormatPlugin is pretty simple. They were designed to get rid
of much of the scaffolding required for a standard FormatPlugin.

JSON: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
Text: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
AVRO: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro

In all cases, the connection code is pretty light. A fully schematized format
like log-synth should be even simpler to implement.

On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

I don't think we need a full-on storage plugin. I think a data format should
be sufficient, basically CSV on steroids.





On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:

Yeah, we still lack documentation on how to write a storage plugin. One piece
of advice I've seen a lot is to take a look at the mongo-db plugin; it was
basically added in one single commit:
https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
I think this will give some general ideas on what to expect when writing a
storage plugin.

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Hakim,

Not yet.  Still very much in the stage of gathering feedback.

I would think it very simple.  The biggest obstacles are

1) no documentation on how to write a data format

2) I need to release a jar for log-synth to Maven Central.




On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:

@Ted, the log-synth storage format would be really useful. I'm already seeing
many unit tests that could benefit from this. Do you have a github repo for
your ongoing work?

Thanks!

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Are you hard set on using common table expressions?

I have discussed a bit off-list creating a data format that would allow
tables to be read from a log-synth [1] schema. That would let you read as
much data as you might like with an arbitrarily complex (or simple) query.

Operationally, you would create a file containing a log-synth schema that has
the extension .synth. Your data source would have to be configured to connect
that extension with the log-synth format. At that point, you could select as
much or as little data as you like from the file and you would see generated
data rather than the schema.



[1] https://github.com/tdunning/log-synth
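
As a sketch of the intended usage (nothing below exists yet; the file name
and row limit are made up for illustration), reading generated data would
then look like querying any other file:

select * from dfs.`/tmp/customers.synth` limit 1000000;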

On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
Hi All,

I am trying to come up with a query which returns a given number of rows
without having a real table in storage.

I am hoping to achieve something like this:
http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table

DECLARE @start INT = 1;
DECLARE @end INT = 1000000;

WITH numbers AS (
    SELECT @start AS number
    UNION ALL
    SELECT number + 1
    FROM numbers
    WHERE number < @end
)
SELECT *
FROM numbers
OPTION (MAXRECURSION 0);
I do not actually need to create different values, and returning identical
rows would work too. I just need to bypass the "from clause" in the query.
Thanks,
Alex




--
Daniel Barclay
MapR Technologies
