Good point. In fact, you can just use a literal expression and some sample data, such as the TPC-H lineitem table:
SELECT * FROM
  (select l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 as join_key
   from cp.`tpch/lineitem.parquet`) t1
JOIN
  (select l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 as join_key
   from cp.`tpch/lineitem.parquet`) t2
ON t1.join_key = t2.join_key

...SNIP...

3,621,030,625 rows selected

On Sat, Jul 18, 2015 at 2:38 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Jacques, Alexander,
>
> The Cartesian join approach will produce an enormous stream of data with
> only a very small amount of disk read. It works now. If you start with a
> small file, you can increase the amount of output in small increments by
> simply adding another join. You do have to have some kind of equality
> join involved, since Drill will (anomalously, in my view) refuse to plan
> the query otherwise.
>
> Here is an example. Note that I have deleted many lines of output and
> indicated that with an ellipsis (...). The three queries here produce 4,
> 16 and 64 lines of output from a 4-line input file. That could have been
> n, n^2 and n^3 just as easily, for any value of n that you might like.
>
> ted:apache-drill-1.0.0$ cat > x.csv <<!
> a,1
> a,2
> a,3
> a,4
> !
> ted:apache-drill-1.0.0$ bin/sqlline -u jdbc:drill:zk=local -n admin -p admin
> 0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x;
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | a       | 1       |
> | a       | 2       |
> | a       | 3       |
> | a       | 4       |
> +---------+---------+
> 4 rows selected (0.097 seconds)
> 0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y
> where x.columns[0] = y.columns[0];
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> | a       | 2       |
> ...
> | a       | 4       |
> | a       | 4       |
> +---------+---------+
> 16 rows selected (0.614 seconds)
> 0: jdbc:drill:zk=local> select x.columns[0], x.columns[1] from
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
> dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z
> where x.columns[0] = y.columns[0] and x.columns[0] = z.columns[0];
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> | a       | 1       |
> ...
> | a       | 4       |
> | a       | 4       |
> | a       | 4       |
> | a       | 4       |
> +---------+---------+
> 64 rows selected (0.63 seconds)
> 0: jdbc:drill:zk=local>
>
> On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> > Removing cross-posting.
> >
> > Alexander,
> >
> > There is currently no way for Drill to generate a large amount of data
> > using SQL. However, you can generate large generic data by using the
> > MockStoragePlugin if you submit a plan. You can find an example plan
> > using this at [1].
> >
> > I heard someone might be working on extending the MockStoragePlugin to
> > support SQL, which would provide the outcome you requested.
> >
> > [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
> >
> > On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > >
> > > Also, just doing a Cartesian join of three copies of 1000 records
> > > will give you a billion records with negligible I/O.
> > >
> > > Sent from my iPhone
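To see how the multiplication scales, a hedged sketch along the same lines: each additional equality self-join of the same 4-row file multiplies the output by another factor of 4, so a fourth copy would take the 64 rows above to 256 (4^4). The file path here is illustrative, not from the thread:

    -- every copy of the 4-row file has 'a' in columns[0], so each extra
    -- equality join multiplies the row count by 4; expect 256 rows here
    select x.columns[0], x.columns[1]
    from dfs.`/tmp/x.csv` x,
         dfs.`/tmp/x.csv` y,
         dfs.`/tmp/x.csv` z,
         dfs.`/tmp/x.csv` w
    where x.columns[0] = y.columns[0]
      and x.columns[0] = z.columns[0]
      and x.columns[0] = w.columns[0];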
> > > On Jul 16, 2015, at 15:43, Jason Altekruse <altekruseja...@gmail.com> wrote:
> > >
> > > > @Alexander If you want to test the speed of the ODBC driver you can
> > > > do that without a new storage plugin.
> > > >
> > > > If you get the entire dataset into memory, it will be returned from
> > > > Drill as quickly as we can possibly send it to the client. One way
> > > > to do this is to insert a sort; we cannot send along any of the
> > > > data until the complete sort is done. As long as you don't read so
> > > > much data that we will start spilling the sort to disk, all of the
> > > > records will be in memory. To take the read and sort time out of
> > > > your test, just make sure to record the time you first receive data
> > > > from Drill, not the query start time.
> > > >
> > > > There is one gotcha here. To make the BI tools more responsive, we
> > > > implemented a feature that will send along one empty batch of
> > > > records with the schema information populated. This schema is
> > > > generated by applying all of the transformations that happen
> > > > throughout the query. For example, the join operator handles this
> > > > schema population by sending along the schema merged from the two
> > > > sides of the join; project will similarly add or remove columns
> > > > based on the expressions and columns requested. You will want to
> > > > make sure you record your start time when you receive the first
> > > > batch with actual records. This can give you an accurate
> > > > measurement of the ODBC performance, removing the bottleneck of
> > > > the disk.
> > > >
> > > > On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> > > >
> > > > > Thanks for the answers.
> > > > >
> > > > > @Ted my only goal is to pump a large amount of data without
> > > > > having to read from the hard disk. I am measuring the ODBC driver
> > > > > performance and I need a higher data transfer rate. So any method
> > > > > that helps pump data out of Drill faster would help. log-synth
> > > > > seems like a good way to generate data for testing. However, I'd
> > > > > need a RAM-only option which hopefully provides a higher
> > > > > throughput.
> > > > >
> > > > > @Jacques How involved is it to write a dummy plugin that returns
> > > > > one hardcoded row repeatedly 12 million times?
> > > > >
> > > > > Thanks,
> > > > > Alex
> > > > >
> > > > > On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > >
> > > > > > It may be easy, but it is completely opaque about what really
> > > > > > needs to happen.
> > > > > >
> > > > > > For instance,
> > > > > >
> > > > > > 1) how is schema exposed?
> > > > > >
> > > > > > 2) which classes do I really need to implement?
> > > > > >
> > > > > > 3) how do I express partitioning of a format?
> > > > > >
> > > > > > 4) how do I test it?
> > > > > >
> > > > > > Just a bit of documentation and comments would go a very, very
> > > > > > long way.
> > > > > >
> > > > > > Even answers on the mailing list that have more details than
> > > > > > "oh, that's easy" would help. I would be happy to transcribe
> > > > > > answers into the code if I could just get some.
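Jason's buffering suggestion above can be made concrete. A hedged sketch, reusing the classpath TPC-H sample from the top of the thread: the ORDER BY forces Drill to hold the complete result in memory (assuming the sort does not spill to disk) before the first data batch is sent, so timing from the first non-empty batch measures client transfer speed rather than the read:

    -- the sort must finish before any rows are returned, so the whole
    -- result sits in memory when transfer to the ODBC client begins
    select l_orderkey, l_shipdate, l_commitdate, l_shipmode
    from cp.`tpch/lineitem.parquet`
    order by l_orderkey;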
> > > > > > On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > >
> > > > > > > Creating an EasyFormatPlugin is pretty simple. It was
> > > > > > > designed to get rid of much of the scaffolding required for a
> > > > > > > standard FormatPlugin.
> > > > > > >
> > > > > > > JSON
> > > > > > > https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> > > > > > >
> > > > > > > Text
> > > > > > > https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
> > > > > > >
> > > > > > > AVRO
> > > > > > > https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
> > > > > > >
> > > > > > > In all cases, the connection code is pretty light. A fully
> > > > > > > schematized format like log-synth should be even simpler to
> > > > > > > implement.
> > > > > > >
> > > > > > > On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > > > >
> > > > > > > > I don't think we need a full-on storage plugin. I think a
> > > > > > > > data format should be sufficient, basically CSV on
> > > > > > > > steroids.
> > > > > > > >
> > > > > > > > On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > > > > > >
> > > > > > > > > Yeah, we still lack documentation on how to write a
> > > > > > > > > storage plugin. One piece of advice I've been seeing a
> > > > > > > > > lot is to take a look at the MongoDB plugin; it was
> > > > > > > > > basically added in one single commit:
> > > > > > > > > https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
> > > > > > > > >
> > > > > > > > > I think this will give some general ideas on what to
> > > > > > > > > expect when writing a storage plugin.
> > > > > > > > >
> > > > > > > > > On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hakim,
> > > > > > > > > >
> > > > > > > > > > Not yet. Still very much in the stage of gathering
> > > > > > > > > > feedback.
> > > > > > > > > >
> > > > > > > > > > I would think it very simple. The biggest obstacles are
> > > > > > > > > >
> > > > > > > > > > 1) no documentation on how to write a data format
> > > > > > > > > >
> > > > > > > > > > 2) I need to release a jar for log-synth to Maven
> > > > > > > > > > Central.
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > @Ted, the log-synth storage format would be really
> > > > > > > > > > > useful. I'm already seeing many unit tests that could
> > > > > > > > > > > benefit from this. Do you have a GitHub repo for your
> > > > > > > > > > > ongoing work?
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Are you hard set on using common table expressions?
> > > > > > > > > > > >
> > > > > > > > > > > > I have discussed a bit off-list creating a data
> > > > > > > > > > > > format that would allow tables to be read from a
> > > > > > > > > > > > log-synth [1] schema. That would let you read as
> > > > > > > > > > > > much data as you might like with an arbitrarily
> > > > > > > > > > > > complex (or simple) query.
> > > > > > > > > > > >
> > > > > > > > > > > > Operationally, you would create a file containing
> > > > > > > > > > > > a log-synth schema that has the extension .synth.
> > > > > > > > > > > > Your data source would have to be configured to
> > > > > > > > > > > > connect that extension with the log-synth format.
> > > > > > > > > > > > At that point, you could select as much or as
> > > > > > > > > > > > little data as you like from the file and you would
> > > > > > > > > > > > see generated data rather than the schema.
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://github.com/tdunning/log-synth
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am trying to come up with a query which returns
> > > > > > > > > > > > > a given number of rows without having a real
> > > > > > > > > > > > > table on storage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am hoping to achieve something like this:
> > > > > > > > > > > > > http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
> > > > > > > > > > > > >
> > > > > > > > > > > > > DECLARE @start INT = 1;
> > > > > > > > > > > > > DECLARE @end INT = 1000000;
> > > > > > > > > > > > > WITH numbers AS (
> > > > > > > > > > > > >     SELECT @start AS number
> > > > > > > > > > > > >     UNION ALL
> > > > > > > > > > > > >     SELECT number + 1
> > > > > > > > > > > > >     FROM numbers
> > > > > > > > > > > > >     WHERE number < @end)
> > > > > > > > > > > > > SELECT * FROM numbers
> > > > > > > > > > > > > OPTION (MAXRECURSION 0);
> > > > > > > > > > > > >
> > > > > > > > > > > > > I do not actually need to create different
> > > > > > > > > > > > > values, and returning identical rows would work
> > > > > > > > > > > > > too. I just need to bypass the "from clause" in
> > > > > > > > > > > > > the query.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Alex
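Since Drill has no recursive CTE (the MAXRECURSION example above is SQL Server syntax), the literal-expression trick from the top of the thread is the closest equivalent: join two LIMITed subqueries on a constant key to get N identical rows while reading very little. A hedged sketch, with the row counts illustrative:

    -- 1,000 x 1,000 = 1,000,000 identical rows from only 2,000 rows read
    select a.join_key as number
    from (select 1 as join_key from cp.`tpch/lineitem.parquet` limit 1000) a
    join (select 1 as join_key from cp.`tpch/lineitem.parquet` limit 1000) b
      on a.join_key = b.join_key;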