Re: Recursive CTE Support in Drill

2015-08-27 Thread Daniel Barclay

Ted Dunning wrote:

The cartesian join approach will produce an enormous stream of data with
only a very small amount of disk read.

You don't even need an external seed file (or to query a built-in table), and 
using WITH can extend multiplication to exponentiation:

WITH q(key) AS (
  WITH q(key) AS (
VALUES 1, 1
  )
  SELECT q1.key
FROM q q1
INNER JOIN q q2  ON q1.key = q2.key
INNER JOIN q q3  ON q2.key = q3.key
INNER JOIN q q4  ON q3.key = q4.key
)
SELECT COUNT(*)
  FROM q q1
  INNER JOIN q q2 ON q1.key = q2.key
  INNER JOIN q q3 ON q2.key = q3.key
  INNER JOIN q q4 ON q3.key = q4.key ;

That generates 65,536 ((2^4)^4) rows (before the COUNT(*)).

Augmenting the table value constructor (VALUES ...) to list a third row (a third
1) increases the resulting row count to 43,046,721 ((3^4)^4).

Going to four rows bumps that to about 4 billion ((4^4)^4):

0: jdbc:drill:drillbit=localhost> WITH q(key) AS (   WITH q(key) AS (VALUES 1, 
1, 1, 1)  SELECT q1.key  FROM q q1 INNER JOIN q q2  ON q1.key = q2.key INNER JOIN 
q q3  ON q2.key = q3.key INNER JOIN q q4  ON q3.key = q4.key   )   SELECT COUNT(*) 
 FROM q q1 INNER JOIN q q2 ON q1.key = q2.key INNER JOIN q q3 ON q2.key = q3.key 
INNER JOIN q q4 ON q3.key = q4.key ;
+--------------+
|    EXPR$0    |
+--------------+
| 4294967296   |
+--------------+
1 row selected (1197.674 seconds)
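(For reference, an untested sketch of the same pattern nested one level
deeper; each level here uses only two self-joins, so it merely squares the
row count at each step and the final COUNT stays small: ((2^2)^2)^2 = 256.)

WITH q(key) AS (
  WITH q(key) AS (
    WITH q(key) AS (
      VALUES 1, 1
    )
    SELECT q1.key
      FROM q q1
      INNER JOIN q q2 ON q1.key = q2.key
  )
  SELECT q1.key
    FROM q q1
    INNER JOIN q q2 ON q1.key = q2.key
)
SELECT COUNT(*)
  FROM q q1
  INNER JOIN q q2 ON q1.key = q2.key;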


Daniel


Ted Dunning wrote:

Jacques, Alexander,

The cartesian join approach will produce an enormous stream of data with
only a very small amount of disk read.  It works now.  If you start with
a small file, you can increase the amount of output in small increments by
simply adding another join. You do have to have some kind of equality join
involved since drill will (anomalously in my view) refuse to plan the query
otherwise.

Here is an example.  Note that I have deleted many lines of output and
indicated that with ellipsis (...).  The three queries here produce 4, 16
and 64 lines of output from a 4 line input file.  That could have been n,
n^2 and n^3 just as easily for any value of n that you might like.

ted:apache-drill-1.0.0$ cat > x.csv <<!

a,1
a,2
a,3
a,4
!

ted:apache-drill-1.0.0$ bin/sqlline -u jdbc:drill:zk=local -n admin -p
admin
0: jdbc:drill:zk=local> *select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x;*
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 2       |
| a       | 3       |
| a       | 4       |
+---------+---------+
4 rows selected (0.097 seconds)
0: jdbc:drill:zk=local> *select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y where x.columns[0] =
y.columns[0];*
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 2       |
...
| a       | 4       |
| a       | 4       |
+---------+---------+
16 rows selected (0.614 seconds)
0: jdbc:drill:zk=local> *select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z where x.columns[0] =
y.columns[0] and x.columns[0] = z.columns[0];*
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
...
| a       | 4       |
| a       | 4       |
| a       | 4       |
| a       | 4       |
+---------+---------+
64 rows selected (0.63 seconds)
0: jdbc:drill:zk=local>



On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau jacq...@dremio.com wrote:


Removing cross posting

Alexander,

There is currently no way for Drill to generate a large amount of data
using SQL.  However, you can generate large generic data by using the
MockStoragePlugin if you submit a plan.  You can find an example plan using
this at [1].

I heard someone might be working on extending the MockStoragePlugin to
support SQL which would provide the outcome you requested.

[1]

https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com
wrote:


Also, just doing a Cartesian join of three copies of 1000 records will
give you a billion records with negligible I/O.

Sent from my iPhone


On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com

wrote:

@Alexander If you want to test the speed of the ODBC driver you can do
that without a new storage plugin.

If you get the entire dataset into memory, it will be returned from
Drill as quickly as we can possibly send it to the client. One way to
do this is to insert a sort; we cannot send along any of the data until
the complete sort is done. As long as you don't read so much data that
we will start spilling the sort to disk, all of the records will 

Re: Recursive CTE Support in Drill

2015-07-20 Thread Alexander Zarei
Thanks for elaborating further, Ted, Jacques and Jason!

@Ted that is a very cool idea. I tried the cross join but found that cross
join is not supported in Drill yet; DRILL-786 tracks it. The new
method looks very promising. It is effectively an implicit cross join, isn't
it? I just tried it out and it worked like a charm. I will go on with this
method.

@Jacques, yes as Jason said, we discussed this before and I have asked my
colleagues to help me modify the ODBC driver so it can send a plan.
Also thanks for the query. I tried it out for two tables and it worked fine,
but extending it to three tables gives me a syntax error.

 select * from

((select column1, 1 as join_keyb from

 (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1

  Join

 (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2

  on t1.join_key=t2.join_key) t12

Join

(SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3

 on t12.join_keyb=t3.join_key)
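(For reference, an untested sketch of how that nested form might be written,
with an alias and an explicit projection on the inner join, reusing the same
hive43 table; whether Drill accepts this exact shape was not verified in the
thread:)

SELECT t12.column1
  FROM (SELECT t1.column1, t1.join_key
          FROM (SELECT column1, 1 AS join_key
                FROM `hive43`.`default`.`double_table`) t1
          JOIN (SELECT column1, 1 AS join_key
                FROM `hive43`.`default`.`double_table`) t2
            ON t1.join_key = t2.join_key) t12
  JOIN (SELECT column1, 1 AS join_key
        FROM `hive43`.`default`.`double_table`) t3
    ON t12.join_key = t3.join_key;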



*The other syntax was easier for me to use for joining three times, so I could
test it with:*


 select t1.column1 from

 (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1,

 (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2,

 (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3

 where

 t1.join_key=t2.join_key and t1.join_key=t3.join_key


Thank you very much for your time Ted, Jacques and Jason!

Thanks,
Alex

On Fri, Jul 17, 2015 at 2:09 PM, Jason Altekruse altekruseja...@gmail.com
wrote:

 Jacques,

 Alexander has brought up this problem previously in one of the hangouts and
 said that submitting a physical plan was not possible through ODBC. If he
 is able to modify the driver code to make it possible to submit one, that
 would be an option, as I believe the C++ client is capable of submitting
 plans. The issue I seem to recall him mentioning is that the ODBC driver
 was running a little sanity checking on the SQL query to try to prevent
 submitting complete garbage queries to a server. I think he had concerns
 that a JSON formatted physical plan would fail these checks and he would
 have to disable them along with trying to allow submitting two types of
 queries from ODBC.

 On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau jacq...@dremio.com
 wrote:

  Removing cross posting
 
  Alexander,
 
  There is currently no way for Drill to generate a large amount of data
  using SQL.  However, you can generate large generic data by using the
  MockStoragePlugin if you submit a plan.  You can find an example plan
 using
  this at [1].
 
  I heard someone might be working on extending the MockStoragePlugin to
  support SQL which would provide the outcome you requested.
 
  [1]
 
 
 https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
 
  On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  
   Also, just doing a Cartesian join of three copies of 1000 records will
   give you a billion records with negligible I/o.
  
   Sent from my iPhone
  
On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com
 
   wrote:
   
@Alexander If you want to test the speed of the ODBC driver you can do
that without a new storage plugin.

If you get the entire dataset into memory, it will be returned from
Drill as quickly as we can possibly send it to the client. One way to
do this is to insert a sort; we cannot send along any of the data until
the complete sort is done. As long as you don't read so much data that
we will start spilling the sort to disk, all of the records will be in
memory. To take the read and sort time out of your test, just make sure
to record the time you first receive data from Drill, not the query
start time.

There is one gotcha here. To make the BI tools more responsive, we
implemented a feature that will send along one empty batch of records
with the schema information populated. This schema is generated by
applying all of the transformations that happen throughout the query.
For example, the join operator handles this schema population by
sending along the schema merged from the two sides of the join; project
will similarly add or remove columns based on the expressions and
columns requested. You will want to make sure you record your start
time when you receive the first batch with actual records. This can
give you an accurate measurement of the ODBC performance, removing the
bottleneck of the disk.
   
On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei 
   alexanderz.si...@gmail.com
wrote:
   
Thanks for the answers.
   
@Ted my only goal is to pump a large amount of data without having to
read from Hard Disk. I am measuring the ODBC driver performance and I
need a higher data transfer rate. So any method that helps 

Re: Recursive CTE Support in Drill

2015-07-18 Thread Jacques Nadeau
Good point.  In fact, you can just use a literal expression and some sample
data such as tpch lineitem:

SELECT * FROM
(select l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 as join_key
from cp.`tpch/lineitem.parquet`) t1
JOIN
(select l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 as join_key
from cp.`tpch/lineitem.parquet`) t2 on t1.join_key = t2.join_key

...SNIP...

3,621,030,625 rows selected
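(An untested extension of the same pattern: adding a third copy of lineitem
cubes the base row count to roughly 2.2e14 rows, so running COUNT(*) over it
is more a thought experiment than something to wait for:)

SELECT COUNT(*)
FROM (SELECT 1 AS join_key FROM cp.`tpch/lineitem.parquet`) t1
JOIN (SELECT 1 AS join_key FROM cp.`tpch/lineitem.parquet`) t2
  ON t1.join_key = t2.join_key
JOIN (SELECT 1 AS join_key FROM cp.`tpch/lineitem.parquet`) t3
  ON t1.join_key = t3.join_key;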









On Sat, Jul 18, 2015 at 2:38 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Jacques, Alexander,

 The cartesian join approach will produce an enormous stream of data with
 only a very small amount of disk read.  It works now.  If you start with
 a small file, you can increase the amount of output in small increments by
 simply adding another join. You do have to have some kind of equality join
 involved since drill will (anomalously in my view) refuse to plan the query
 otherwise.

 Here is an example.  Note that I have deleted many lines of output and
 indicated that with ellipsis (...).  The three queries here produce 4, 16
 and 64 lines of output from a 4 line input file.  That could have been n,
 n^2 and n^3 just as easily for any value of n that you might like.

 ted:apache-drill-1.0.0$ cat  x.csv !
  a,1
  a,2
  a,3
  a,4
  !
 ted:apache-drill-1.0.0$ bin/sqlline -u jdbc:drill:zk=local -n admin -p
 admin
 0: jdbc:drill:zk=local *select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x;*
 +-+-+
 | EXPR$0  | EXPR$1  |
 +-+-+
 | a   | 1   |
 | a   | 2   |
 | a   | 3   |
 | a   | 4   |
 +-+-+
 4 rows selected (0.097 seconds)
 0: jdbc:drill:zk=local *select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
 dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y where x.columns[0] =
 y.columns[0];*
 +-+-+
 | EXPR$0  | EXPR$1  |
 +-+-+
 | a   | 1   |
 | a   | 1   |
 | a   | 1   |
 | a   | 1   |
 | a   | 2   |
 ...
 | a   | 4   |
 | a   | 4   |
 +-+-+
 16 rows selected (0.614 seconds)
 0: jdbc:drill:zk=local *select x.columns[0], x.columns[1] from
  dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
 dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
 dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z where x.columns[0] =
 y.columns[0] and x.columns[0] = z.columns[0];*
 +-+-+
 | EXPR$0  | EXPR$1  |
 +-+-+
 | a   | 1   |
 | a   | 1   |
 | a   | 1   |
 | a   | 1   |
 ...
 | a   | 4   |
 | a   | 4   |
 | a   | 4   |
 | a   | 4   |
 +-+-+
 64 rows selected (0.63 seconds)
 0: jdbc:drill:zk=local



 On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau jacq...@dremio.com
 wrote:

  Removing cross posting
 
  Alexander,
 
  There is currently no way for Drill to generate a large amount of data
  using SQL.  However, you can generate large generic data by using the
  MockStoragePlugin if you submit a plan.  You can find an example plan
 using
  this at [1].
 
  I heard someone might be working on extending the MockStoragePlugin to
  support SQL which would provide the outcome you requested.
 
  [1]
 
 
 https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
 
  On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  
   Also, just doing a Cartesian join of three copies of 1000 records will
   give you a billion records with negligible I/o.
  
   Sent from my iPhone
  
On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com
 
   wrote:
   
@Alexander If you want to test the speed of the ODBC driver you can
 do
   that
without a new storage plugin.
   
If you get the entire dataset into memory, it will be returned from
   Drill a
quickly as we can possibly send it to the client. One way to do this
 is
   to
insert a sort; we cannot send along any of the data until the compete
   sort
is done. As long as you don't read so much data that we will start
   spilling
the sort to disk, all of the records will be in memory. To take the
  read
and sort time out of your test, just make sure to record the time you
   first
receive data from Drill, not the query start time.
   
There is one gotcha here. To make the BI tools more responsive, we
implemented a feature that will send along one empty batch of records
   with
the schema information populated. This schema is generated by
 applying
   all
of the transformations that happen throughout the query. For example,
  the
join operator handles this schema population by sending along the
  schema
merged from the two sides of the join, project will similarly add or
   remove
column based on the expressions and columns requested. You will want
 to
make sure you 

Re: Recursive CTE Support in Drill

2015-07-16 Thread Ted Dunning

Also, just doing a Cartesian join of three copies of 1000 records will give you 
a billion records with negligible I/O. 

Sent from my iPhone

 On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com wrote:
 
 @Alexander If you want to test the speed of the ODBC driver you can do that
 without a new storage plugin.
 
 If you get the entire dataset into memory, it will be returned from Drill as
 quickly as we can possibly send it to the client. One way to do this is to
 insert a sort; we cannot send along any of the data until the complete sort
 is done. As long as you don't read so much data that we will start spilling
 the sort to disk, all of the records will be in memory. To take the read
 and sort time out of your test, just make sure to record the time you first
 receive data from Drill, not the query start time.
 
 There is one gotcha here. To make the BI tools more responsive, we
 implemented a feature that will send along one empty batch of records with
 the schema information populated. This schema is generated by applying all
 of the transformations that happen throughout the query. For example, the
 join operator handles this schema population by sending along the schema
 merged from the two sides of the join, project will similarly add or remove
 columns based on the expressions and columns requested. You will want to
 make sure you record your start time when you receive the first batch with
 actual records. This can give you an accurate measurement of the ODBC
 performance, removing the bottleneck of the disk.
 
 On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei alexanderz.si...@gmail.com
 wrote:
 
 Thanks for the answers.
 
 @Ted my only goal is to pump a large amount of data without having to read
 from Hard Disk. I am measuring the ODBC driver performance and I need a
 higher data transfer rate. So any method that helps pumping data out of
 Drill faster would help. The log-synth seems a good way to generate data
 for testing. However, I'd need a ram only option which hopefully provides a
 higher throughput.
 
 @Jacques How involved is it to write a dummy plugin that returns one
 hardcoded row repeatedly 12 million times?
 
 Thanks,
 Alex
 
 On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 It may be easy, but it is completely opaque about what really needs to
 happen.
 
 For instance,
 
 1) how is schema exposed?
 
 2) which classes do I really need to implement?
 
 3) how do I express partitioning of a format?
 
 4) how do I test it?
 
 Just a bit of documentation and comments would go a very, very long way.
 
 Even answers on the mailing list that have more details than oh, that's
 easy.  I would be happy to transcribe answers into the code if I could
 just get some.
 
 
 
 On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org
 wrote:
 
 Creating an EasyFormatPlugin is pretty simple.  They were designed to
 get
 rid of much of the scaffolding required for a standard FormatPlugin.
 
 JSON
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
 
 Text
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
 
 AVRO
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
 
 In all cases, the connection code is pretty light.  A fully schematized
 format like log-synth should be even simpler to implement.
 
 On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 I don't think we need a full on storage plugin.  I think a data
 format
 should be sufficient, basically CSV on steroids.
 
 
 
 
 
 On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:
 
 Yeah, we still lack documentation on how to write a storage plugin.
 One
 advice I've been seeing a lot is to take a look at the mongo-db
 plugin,
 it
 was basically added in one single commit:
 https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
 
 I think this will give some general ideas on what to expect when
 writing
 a
 storage plugin.
 
 On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning 
 ted.dunn...@gmail.com
 wrote:
 
 Hakim,
 
 Not yet.  Still very much in the stage of gathering feedback.
 
 I would think it very simple.  The biggest obstacles are
 
 1) no documentation on how to write a data format
 
 2) I need to release a jar for log-synth to Maven Central.
 
 
 
 
 On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:
 
 @Ted, the log-synth storage format would be really useful. I'm
 already
 seeing many unit tests that could benefit from this. Do you
 have
 a
 github
 repo for your ongoing work ?
 
 Thanks!
 
 On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning 
 ted.dunn...@gmail.com
 wrote:
 
 Are you hard set on using common table expressions?
 
 I have discussed a bit off-list creating a data format that
 would
 

Re: Recursive CTE Support in Drill

2015-07-16 Thread Alexander Zarei
Thanks for the answers.

@Ted my only goal is to pump a large amount of data without having to read
from Hard Disk. I am measuring the ODBC driver performance and I need a
higher data transfer rate. So any method that helps pumping data out of
Drill faster would help. The log-synth seems a good way to generate data
for testing. However, I'd need a RAM-only option, which hopefully provides a
higher throughput.

@Jacques How involved is it to write a dummy plugin that returns one
hardcoded row repeatedly 12 million times?
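(In hindsight, the WITH/VALUES trick in Daniel Barclay's 2015-08-27 message at
the top of this thread gets close to this with no plugin at all; an untested
sketch that should yield 10^7 identical rows from a ten-row VALUES list joined
against itself seven times on a constant key:)

WITH ten(key) AS (VALUES 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
SELECT 'x' AS col1
  FROM ten a
  JOIN ten b ON a.key = b.key
  JOIN ten c ON a.key = c.key
  JOIN ten d ON a.key = d.key
  JOIN ten e ON a.key = e.key
  JOIN ten f ON a.key = f.key
  JOIN ten g ON a.key = g.key;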

Thanks,
Alex

On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 It may be easy, but it is completely opaque about what really needs to
 happen.

 For instance,

 1) how is schema exposed?

 2) which classes do I really need to implement?

 3) how do I express partitioning of a format?

 4) how do I test it?

 Just a bit of documentation and comments would go a very, very long way.

 Even answers on the mailing list that have more details than oh, that's
 easy.  I would be happy to transcribe answers into the code if I could
 just get some.



 On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org
 wrote:

  Creating an EasyFormatPlugin is pretty simple.  They were designed to get
  rid of much of the scaffolding required for a standard FormatPlugin.
 
  JSON
 
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
 
  Text
 
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
 
  AVRO
 
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
 
  In all cases, the connection code is pretty light.  A fully schematized
  format like log-synth should be even simpler to implement.
 
  On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   I don't think we need a full on storage plugin.  I think a data format
   should be sufficient, basically CSV on steroids.
  
  
  
  
  
   On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche 
   adene...@maprtech.com
wrote:
  
Yeah, we still lack documentation on how to write a storage plugin.
 One
advice I've been seeing a lot is to take a look at the mongo-db
 plugin,
   it
was basically added in one single commit:
   
   
   
  
 
 https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
   
I think this will give some general ideas on what to expect when
  writing
   a
storage plugin.
   
On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
   
 Hakim,

 Not yet.  Still very much in the stage of gathering feedback.

 I would think it very simple.  The biggest obstacles are

 1) no documentation on how to write a data format

 2) I need to release a jar for log-synth to Maven Central.




 On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:

  @Ted, the log-synth storage format would be really useful. I'm
   already
  seeing many unit tests that could benefit from this. Do you have
 a
github
  repo for your ongoing work ?
 
  Thanks!
 
  On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning 
  ted.dunn...@gmail.com
  wrote:
 
   Are you hard set on using common table expressions?
  
   I have discussed a bit off-list creating a data format that
 would
allow
   tables to be read from a log-synth [1] schema.  That would let
  you
read
  as
   much data as you might like with an arbitrarily complex (or
  simple)
  query.
  
   Operationally, you would create a file containing a log-synth
   schema
 that
   has the extension .synth.  Your data source would have to be
configured
  to
   connect that extension with the log-synth format.  At that
 point,
   you
  could
   select as much or little data as you like from the file and you
   would
 see
   generated data rather than the schema.
  
  
  
   [1] https://github.com/tdunning/log-synth
  
   On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei 
   alexanderz.si...@gmail.com
wrote:
  
Hi All,
   
I am trying to come up with a query which returns a given
  number
   of
  rows
without having a real table on Storage.
   
I am hoping to achieve something like this:
   
   
   
  
 

   
  
 
 http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
   
DECLARE @start INT = 1;DECLARE @end INT = 100;
WITH numbers AS (
SELECT @start AS number
UNION ALL
SELECT number + 1
FROM  numbers
WHERE number < @end) SELECT * FROM numbers OPTION (MAXRECURSION 0);
   
I do not 

Re: Recursive CTE Support in Drill

2015-07-16 Thread Jason Altekruse
@Alexander If you want to test the speed of the ODBC driver you can do that
without a new storage plugin.

If you get the entire dataset into memory, it will be returned from Drill as
quickly as we can possibly send it to the client. One way to do this is to
insert a sort; we cannot send along any of the data until the complete sort
is done. As long as you don't read so much data that we will start spilling
the sort to disk, all of the records will be in memory. To take the read
and sort time out of your test, just make sure to record the time you first
receive data from Drill, not the query start time.

There is one gotcha here. To make the BI tools more responsive, we
implemented a feature that will send along one empty batch of records with
the schema information populated. This schema is generated by applying all
of the transformations that happen throughout the query. For example, the
join operator handles this schema population by sending along the schema
merged from the two sides of the join, project will similarly add or remove
columns based on the expressions and columns requested. You will want to
make sure you record your start time when you receive the first batch with
actual records. This can give you an accurate measurement of the ODBC
performance, removing the bottleneck of the disk.
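(A hedged illustration of that suggestion, assuming the cp tpch sample that
ships with Drill and assuming LIMIT is accepted inside the derived tables:
the constant-key self-join produces a million rows, and the ORDER BY forces
the whole result to be sorted, i.e. buffered, before any data rows are sent:)

SELECT t1.l_orderkey, t2.l_shipdate
  FROM (SELECT l_orderkey, l_shipdate, 1 AS join_key
          FROM cp.`tpch/lineitem.parquet` LIMIT 1000) t1
  JOIN (SELECT l_orderkey, l_shipdate, 1 AS join_key
          FROM cp.`tpch/lineitem.parquet` LIMIT 1000) t2
    ON t1.join_key = t2.join_key
 ORDER BY t1.l_orderkey;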

On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei alexanderz.si...@gmail.com
 wrote:

 Thanks for the answers.

 @Ted my only goal is to pump a large amount of data without having to read
 from Hard Disk. I am measuring the ODBC driver performance and I need a
 higher data transfer rate. So any method that helps pumping data out of
 Drill faster would help. The log-synth seems a good way to generate data
 for testing. However, I'd need a ram only option which hopefully provides a
 higher throughput.

 @Jacques How involved is it to write a dummy plugin that returns one
 hardcoded row repeatedly 12 million times?

 Thanks,
 Alex

 On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  It may be easy, but it is completely opaque about what really needs to
  happen.
 
  For instance,
 
  1) how is schema exposed?
 
  2) which classes do I really need to implement?
 
  3) how do I express partitioning of a format?
 
  4) how do I test it?
 
  Just a bit of documentation and comments would go a very, very long way.
 
  Even answers on the mailing list that have more details than oh, that's
  easy.  I would be happy to transcribe answers into the code if I could
  just get some.
 
 
 
  On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org
  wrote:
 
   Creating an EasyFormatPlugin is pretty simple.  They were designed to
 get
   rid of much of the scaffolding required for a standard FormatPlugin.
  
   JSON
  
  
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
  
   Text
  
  
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
  
   AVRO
  
  
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
  
   In all cases, the connection code is pretty light.  A fully schematized
   format like log-synth should be even simpler to implement.
  
   On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com
   wrote:
  
I don't think we need a full on storage plugin.  I think a data
 format
should be sufficient, basically CSV on steroids.
   
   
   
   
   
On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche 
adene...@maprtech.com
 wrote:
   
 Yeah, we still lack documentation on how to write a storage plugin.
  One
 advice I've been seeing a lot is to take a look at the mongo-db
  plugin,
it
 was basically added in one single commit:



   
  
 
 https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304

 I think this will give some general ideas on what to expect when
   writing
a
 storage plugin.

 On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning 
 ted.dunn...@gmail.com
 wrote:

  Hakim,
 
  Not yet.  Still very much in the stage of gathering feedback.
 
  I would think it very simple.  The biggest obstacles are
 
  1) no documentation on how to write a data format
 
  2) I need to release a jar for log-synth to Maven Central.
 
 
 
 
  On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche 
  adene...@maprtech.com
  wrote:
 
   @Ted, the log-synth storage format would be really useful. I'm
already
   seeing many unit tests that could benefit from this. Do you
 have
  a
 github
   repo for your ongoing work ?
  
   Thanks!
  
   On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning 
   ted.dunn...@gmail.com
   wrote:
  
Are you hard set on using common table expressions?
   
I have 

Re: Recursive CTE Support in Drill

2015-07-10 Thread Abdel Hakim Deneche
Yeah, we still lack documentation on how to write a storage plugin. One
piece of advice I've seen a lot is to take a look at the mongo-db plugin, it
was basically added in one single commit:

https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304

I think this will give some general ideas on what to expect when writing a
storage plugin.

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Hakim,

 Not yet.  Still very much in the stage of gathering feedback.

 I would think it very simple.  The biggest obstacles are

 1) no documentation on how to write a data format

 2) I need to release a jar for log-synth to Maven Central.




 On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:

  @Ted, the log-synth storage format would be really useful. I'm already
  seeing many unit tests that could benefit from this. Do you have a github
  repo for your ongoing work ?
 
  Thanks!
 
  On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   Are you hard set on using common table expressions?
  
   I have discussed a bit off-list creating a data format that would allow
   tables to be read from a log-synth [1] schema.  That would let you read
  as
   much data as you might like with an arbitrarily complex (or simple)
  query.
  
   Operationally, you would create a file containing a log-synth schema
 that
   has the extension .synth.  Your data source would have to be configured
  to
   connect that extension with the log-synth format.  At that point, you
  could
   select as much or little data as you like from the file and you would
 see
   generated data rather than the schema.
  
  
  
   [1] https://github.com/tdunning/log-synth
  
   On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei 
   alexanderz.si...@gmail.com
wrote:
  
Hi All,
   
I am trying to come up with a query which returns a given number of
  rows
without having a real table on Storage.
   
I am hoping to achieve something like this:
   
   
   
  
 
 http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
   
DECLARE @start INT = 1;DECLARE @end INT = 100;
WITH numbers AS (
SELECT @start AS number
UNION ALL
SELECT number + 1
FROM  numbers
WHERE number < @end) SELECT * FROM numbers OPTION (MAXRECURSION 0);
   
I do not actually need to create different values and returning identical
rows would work too. I just need to bypass the from clause in the query.
   
Thanks,
Alex
   
  
 
 
 
  --
 
  Abdelhakim Deneche
 
  Software Engineer
 
http://www.mapr.com/
 
 
  Now Available - Free Hadoop On-Demand Training
  
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
  
 




-- 

Abdelhakim Deneche

Software Engineer

  http://www.mapr.com/


Now Available - Free Hadoop On-Demand Training
http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available


Re: Recursive CTE Support in Drill

2015-07-10 Thread Ted Dunning
I don't think we need a full on storage plugin.  I think a data format
should be sufficient, basically CSV on steroids.





On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche adene...@maprtech.com
 wrote:

 Yeah, we still lack documentation on how to write a storage plugin. One
 advice I've been seeing a lot is to take a look at the mongo-db plugin, it
 was basically added in one single commit:


 https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304

 I think this will give some general ideas on what to expect when writing a
 storage plugin.

 On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  Hakim,
 
  Not yet.  Still very much in the stage of gathering feedback.
 
  I would think it very simple.  The biggest obstacles are
 
  1) no documentation on how to write a data format
 
  2) I need to release a jar for log-synth to Maven Central.
 
 
 
 
  On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche 
  adene...@maprtech.com
  wrote:
 
   @Ted, the log-synth storage format would be really useful. I'm already
   seeing many unit tests that could benefit from this. Do you have a
 github
   repo for your ongoing work ?
  
   Thanks!
  
   On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com
   wrote:
  
Are you hard set on using common table expressions?
   
I have discussed a bit off-list creating a data format that would
 allow
tables to be read from a log-synth [1] schema.  That would let you
 read
   as
much data as you might like with an arbitrarily complex (or simple)
   query.
   
Operationally, you would create a file containing a log-synth schema
  that
has the extension .synth.  Your data source would have to be
 configured
   to
connect that extension with the log-synth format.  At that point, you
   could
select as much or little data as you like from the file and you would
  see
generated data rather than the schema.
   
   
   
[1] https://github.com/tdunning/log-synth
   
On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei 
alexanderz.si...@gmail.com
 wrote:
   
 Hi All,

 I am trying to come up with a query which returns a given number of
   rows
 without having a real table on Storage.

 I am hoping to achieve something like this:



   
  
 
 http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table

 DECLARE @start INT = 1;DECLARE @end INT = 100;
 WITH numbers AS (
 SELECT @start AS number
 UNION ALL
 SELECT number + 1
 FROM  numbers
 WHERE number  @end)SELECT *FROM numbersOPTION (MAXRECURSION
 0);

 I do not actually need to create different values and returning
   identical
 rows would work too.I just need to bypass the from clause in the
   query.

 Thanks,
 Alex

   
  
  
  
   --
  
   Abdelhakim Deneche
  
   Software Engineer
  
 http://www.mapr.com/
  
  
   Now Available - Free Hadoop On-Demand Training
   
  
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
   
  
 



 --

 Abdelhakim Deneche

 Software Engineer

   http://www.mapr.com/


 Now Available - Free Hadoop On-Demand Training
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
 



Re: Recursive CTE Support in Drill

2015-07-10 Thread Ted Dunning
Hakim,

Not yet.  Still very much in the stage of gathering feedback.

I would think it very simple.  The biggest obstacles are

1) no documentation on how to write a data format

2) I need to release a jar for log-synth to Maven Central.
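(For context, a purely hypothetical sketch of how a .synth file might be
queried once the format described in the quoted message below existed; the
path and the extension-to-format mapping are assumptions, nothing here is
implemented:)

-- hypothetical: assumes a dfs workspace in which the .synth extension is
-- mapped to the proposed (not yet implemented) log-synth format
SELECT * FROM dfs.`/tmp/users.synth` LIMIT 1000000;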




On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com
wrote:

 @Ted, the log-synth storage format would be really useful. I'm already
 seeing many unit tests that could benefit from this. Do you have a github
 repo for your ongoing work ?

 Thanks!

 On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  Are you hard set on using common table expressions?
 
  I have discussed a bit off-list creating a data format that would allow
  tables to be read from a log-synth [1] schema.  That would let you read
 as
  much data as you might like with an arbitrarily complex (or simple)
 query.
 
  Operationally, you would create a file containing a log-synth schema that
  has the extension .synth.  Your data source would have to be configured
 to
  connect that extension with the log-synth format.  At that point, you
 could
  select as much or little data as you like from the file and you would see
  generated data rather than the schema.
 
 
 
  [1] https://github.com/tdunning/log-synth
 
  On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei 
  alexanderz.si...@gmail.com
   wrote:
 
   Hi All,
  
   I am trying to come up with a query which returns a given number of
 rows
   without having a real table on Storage.
  
   I am hoping to achieve something like this:
  
  
  
 
 http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
  
   DECLARE @start INT = 1;DECLARE @end INT = 100;
   WITH numbers AS (
   SELECT @start AS number
   UNION ALL
   SELECT number + 1
   FROM  numbers
    WHERE number < @end) SELECT * FROM numbers OPTION (MAXRECURSION 0);
  
   I do not actually need to create different values and returning identical
   rows would work too. I just need to bypass the from clause in the query.
  
   Thanks,
   Alex
  
 



 --

 Abdelhakim Deneche

 Software Engineer

   http://www.mapr.com/


 Now Available - Free Hadoop On-Demand Training
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
 



Re: Recursive CTE Support in Drill

2015-07-10 Thread Ted Dunning
It may be easy, but it is completely opaque about what really needs to
happen.

For instance,

1) how is schema exposed?

2) which classes do I really need to implement?

3) how do I express partitioning of a format?

4) how do I test it?

Just a bit of documentation and comments would go a very, very long way.

Even answers on the mailing list that have more details than "oh, that's
easy."  I would be happy to transcribe answers into the code if I could
just get some.



On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org wrote:

 Creating an EasyFormatPlugin is pretty simple.  They were designed to get
 rid of much of the scaffolding required for a standard FormatPlugin.

 JSON

 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json

 Text

 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text

 AVRO

 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro

 In all cases, the connection code is pretty light.  A fully schematized
 format like log-synth should be even simpler to implement.

 On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  I don't think we need a full on storage plugin.  I think a data format
  should be sufficient, basically CSV on steroids.
 
 
 
 
 
  On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche 
  adene...@maprtech.com
   wrote:
 
   Yeah, we still lack documentation on how to write a storage plugin. One
   advice I've been seeing a lot is to take a look at the mongo-db plugin,
  it
   was basically added in one single commit:
  
  
  
 
 https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
  
   I think this will give some general ideas on what to expect when
 writing
  a
   storage plugin.
  
   On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com
   wrote:
  
Hakim,
   
Not yet.  Still very much in the stage of gathering feedback.
   
I would think it very simple.  The biggest obstacles are
   
1) no documentation on how to write a data format
   
2) I need to release a jar for log-synth to Maven Central.
   
   
   
   
On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche 
adene...@maprtech.com
wrote:
   
 @Ted, the log-synth storage format would be really useful. I'm
  already
 seeing many unit tests that could benefit from this. Do you have a
   github
 repo for your ongoing work ?

 Thanks!

 On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning 
 ted.dunn...@gmail.com
 wrote:

  Are you hard set on using common table expressions?
 
  I have discussed a bit off-list creating a data format that would
   allow
  tables to be read from a log-synth [1] schema.  That would let
 you
   read
 as
  much data as you might like with an arbitrarily complex (or
 simple)
 query.
 
  Operationally, you would create a file containing a log-synth
  schema
that
  has the extension .synth.  Your data source would have to be
   configured
 to
  connect that extension with the log-synth format.  At that point,
  you
 could
  select as much or little data as you like from the file and you
  would
see
  generated data rather than the schema.
 
 
 
  [1] https://github.com/tdunning/log-synth
 
  On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei 
  alexanderz.si...@gmail.com
   wrote:
 
   Hi All,
  
   I am trying to come up with a query which returns a given
 number
  of
 rows
   without having a real table on Storage.
  
   I am hoping to achieve something like this:
  
  
  
 

   
  
 
 http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
  
   DECLARE @start INT = 1;DECLARE @end INT = 100;
   WITH numbers AS (
   SELECT @start AS number
   UNION ALL
   SELECT number + 1
   FROM  numbers
   WHERE number  @end)SELECT *FROM numbersOPTION
 (MAXRECURSION
   0);
  
   I do not actually need to create different values and returning
 identical
   rows would work too.I just need to bypass the from clause in
  the
 query.
  
   Thanks,
   Alex
  
 



 --

 Abdelhakim Deneche

 Software Engineer

   http://www.mapr.com/


 Now Available - Free Hadoop On-Demand Training
 

   
  
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
 

   
  
  
  
   --
  
   Abdelhakim Deneche
  
   Software Engineer
  
 http://www.mapr.com/
  
  
   Now Available - Free Hadoop On-Demand Training
   
  
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available