Removing cross posting.

Alexander,
There is currently no way for Drill to generate a large amount of data using SQL. However, you can generate large generic data by using the MockStoragePlugin if you submit a plan. You can find an example of such a plan at [1]. I heard someone might be working on extending the MockStoragePlugin to support SQL, which would provide the outcome you requested.

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> Also, just doing a Cartesian join of three copies of 1000 records will give you a billion records with negligible I/O.
>
> Sent from my iPhone
>
> > On Jul 16, 2015, at 15:43, Jason Altekruse <altekruseja...@gmail.com> wrote:
> >
> > @Alexander If you want to test the speed of the ODBC driver you can do that without a new storage plugin.
> >
> > If you get the entire dataset into memory, it will be returned from Drill as quickly as we can possibly send it to the client. One way to do this is to insert a sort; we cannot send along any of the data until the complete sort is done. As long as you don't read so much data that we will start spilling the sort to disk, all of the records will be in memory. To take the read and sort time out of your test, just make sure to record the time you first receive data from Drill, not the query start time.
> >
> > There is one gotcha here. To make the BI tools more responsive, we implemented a feature that will send along one empty batch of records with the schema information populated. This schema is generated by applying all of the transformations that happen throughout the query. For example, the join operator handles this schema population by sending along the schema merged from the two sides of the join; project will similarly add or remove columns based on the expressions and columns requested.
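A minimal sketch of the sort trick Jason describes, with a hypothetical file path and column name (any dataset small enough that the sort never spills to disk will do):

```sql
-- Force Drill to buffer the entire result set in memory before returning
-- anything to the client: the sort cannot emit rows until it has consumed
-- all of its input. Table path and sort column are placeholder names.
SELECT *
FROM dfs.`/tmp/sample.parquet`
ORDER BY some_column;
```

When timing the client side, start the clock at the first batch that actually contains records, not at query submission, so that read and sort time stay out of the measurement.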
> > You will want to make sure you record your start time when you receive the first batch with actual records. This can give you an accurate measurement of the ODBC performance, removing the bottleneck of the disk.
> >
> > On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> >
> >> Thanks for the answers.
> >>
> >> @Ted my only goal is to pump a large amount of data without having to read from the hard disk. I am measuring the ODBC driver performance and I need a higher data transfer rate. So any method that helps pump data out of Drill faster would help. The log-synth seems a good way to generate data for testing. However, I'd need a RAM-only option which hopefully provides a higher throughput.
> >>
> >> @Jacques How involved is it to write a dummy plugin that returns one hardcoded row repeatedly 12 million times?
> >>
> >> Thanks,
> >> Alex
> >>
> >> On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >>> It may be easy, but it is completely opaque about what really needs to happen.
> >>>
> >>> For instance,
> >>>
> >>> 1) how is schema exposed?
> >>>
> >>> 2) which classes do I really need to implement?
> >>>
> >>> 3) how do I express partitioning of a format?
> >>>
> >>> 4) how do I test it?
> >>>
> >>> Just a bit of documentation and comments would go a very, very long way.
> >>>
> >>> Even answers on the mailing list that have more details than "oh, that's easy" would help. I would be happy to transcribe answers into the code if I could just get some.
> >>>
> >>> On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> >>>
> >>>> Creating an EasyFormatPlugin is pretty simple. They were designed to get rid of much of the scaffolding required for a standard FormatPlugin.
> >>>>
> >>>> JSON
> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> >>>>
> >>>> Text
> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
> >>>>
> >>>> AVRO
> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
> >>>>
> >>>> In all cases, the connection code is pretty light. A fully schematized format like log-synth should be even simpler to implement.
> >>>>
> >>>> On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>
> >>>>> I don't think we need a full-on storage plugin. I think a data format should be sufficient, basically CSV on steroids.
> >>>>>
> >>>>> On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> >>>>>
> >>>>>> Yeah, we still lack documentation on how to write a storage plugin. One piece of advice I've been seeing a lot is to take a look at the mongo-db plugin; it was basically added in one single commit:
> >>>>>> https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
> >>>>>>
> >>>>>> I think this will give some general ideas on what to expect when writing a storage plugin.
> >>>>>>
> >>>>>> On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hakim,
> >>>>>>>
> >>>>>>> Not yet. Still very much in the stage of gathering feedback.
> >>>>>>>
> >>>>>>> I would think it very simple. The biggest obstacles are
> >>>>>>>
> >>>>>>> 1) no documentation on how to write a data format
> >>>>>>>
> >>>>>>> 2) I need to release a jar for log-synth to Maven Central.
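A log-synth schema of the kind being discussed is a JSON list of field samplers. The sketch below is illustrative only; the exact sampler classes and their attributes should be checked against the log-synth repository:

```json
[
    {"name": "id", "class": "id"},
    {"name": "customer_name", "class": "name"},
    {"name": "age", "class": "int", "min": 18, "max": 90}
]
```

Saved with a .synth extension, such a file is what the proposed format plugin would read and expand into generated rows instead of returning the schema itself.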
> >>>>>>>
> >>>>>>> On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> >>>>>>>
> >>>>>>>> @Ted, the log-synth storage format would be really useful. I'm already seeing many unit tests that could benefit from this. Do you have a github repo for your ongoing work?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Are you hard set on using common table expressions?
> >>>>>>>>>
> >>>>>>>>> I have discussed a bit off-list creating a data format that would allow tables to be read from a log-synth [1] schema. That would let you read as much data as you might like with an arbitrarily complex (or simple) query.
> >>>>>>>>>
> >>>>>>>>> Operationally, you would create a file containing a log-synth schema that has the extension .synth. Your data source would have to be configured to connect that extension with the log-synth format. At that point, you could select as much or as little data as you like from the file and you would see generated data rather than the schema.
> >>>>>>>>>
> >>>>>>>>> [1] https://github.com/tdunning/log-synth
> >>>>>>>>>
> >>>>>>>>> On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi All,
> >>>>>>>>>>
> >>>>>>>>>> I am trying to come up with a query which returns a given number of rows without having a real table on storage.
> >>>>>>>>>>
> >>>>>>>>>> I am hoping to achieve something like this:
> >>>>>>>>>> http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
> >>>>>>>>>>
> >>>>>>>>>> DECLARE @start INT = 1;
> >>>>>>>>>> DECLARE @end INT = 1000000;
> >>>>>>>>>> WITH numbers AS (
> >>>>>>>>>>     SELECT @start AS number
> >>>>>>>>>>     UNION ALL
> >>>>>>>>>>     SELECT number + 1
> >>>>>>>>>>     FROM numbers
> >>>>>>>>>>     WHERE number < @end
> >>>>>>>>>> )
> >>>>>>>>>> SELECT * FROM numbers
> >>>>>>>>>> OPTION (MAXRECURSION 0);
> >>>>>>>>>>
> >>>>>>>>>> I do not actually need to create different values, and returning identical rows would work too. I just need to bypass the "from clause" in the query.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Alex
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Abdelhakim Deneche
> >>>>>>>>
> >>>>>>>> Software Engineer
> >>>>>>>>
> >>>>>>>> <http://www.mapr.com/>
> >>>>>>>>
> >>>>>>>> Now Available - Free Hadoop On-Demand Training
> >>>>>>>> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
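Ted's Cartesian-join suggestion from the top of the thread can be sketched as below. The table name `t1000` is hypothetical (any existing 1000-row table or view will do), and some Drill versions reject explicit Cartesian joins in the planner, in which case joining the copies on a constant-valued key is a common workaround:

```sql
-- Hypothetical: t1000 is any existing table or view with 1000 rows.
-- 1000 x 1000 x 1000 combinations yield a billion rows with negligible I/O.
SELECT a.*
FROM t1000 a
CROSS JOIN t1000 b
CROSS JOIN t1000 c;
```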