Removing cross-posting

Alexander,

There is currently no way for Drill to generate a large amount of data
using SQL.  However, you can generate a large amount of generic data with
the MockStoragePlugin by submitting a physical plan directly.  You can find
an example plan that uses it at [1].

I heard someone might be working on extending the MockStoragePlugin to
support SQL, which would provide the outcome you requested.

[1]
https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

>
> Also, just doing a Cartesian join of three copies of 1000 records will
> give you a billion records with negligible I/O.
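>
> For instance, something like this (an untested sketch; `t` stands in for
> any 1000-row table you already have, and it assumes the planner permits a
> Cartesian product):
>
>     -- 1000 x 1000 x 1000 = one billion rows
>     SELECT t1.*
>     FROM t t1, t t2, t t3;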
>
> Sent from my iPhone
>
> > On Jul 16, 2015, at 15:43, Jason Altekruse <altekruseja...@gmail.com> wrote:
> >
> > @Alexander If you want to test the speed of the ODBC driver, you can do
> > that without a new storage plugin.
> >
> > If you get the entire dataset into memory, it will be returned from
> > Drill as quickly as we can possibly send it to the client. One way to do
> > this is to insert a sort; we cannot send along any of the data until the
> > complete sort is done. As long as you don't read so much data that we
> > start spilling the sort to disk, all of the records will be in memory.
> > To take the read and sort time out of your test, just make sure to
> > record the time you first receive data from Drill, not the query start
> > time.
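> >
> > As a rough sketch (the file path and sort column are placeholders for
> > whatever dataset you point Drill at):
> >
> >     -- the ORDER BY forces Drill to buffer the full result in memory
> >     -- before the first data batch is sent to the client
> >     SELECT *
> >     FROM dfs.`/path/to/your_data.csv`
> >     ORDER BY columns[0];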
> >
> > There is one gotcha here. To make the BI tools more responsive, we
> > implemented a feature that will send along one empty batch of records
> > with the schema information populated. This schema is generated by
> > applying all of the transformations that happen throughout the query.
> > For example, the join operator handles this schema population by sending
> > along the schema merged from the two sides of the join; project will
> > similarly add or remove columns based on the expressions and columns
> > requested. You will want to make sure you record your start time when
> > you receive the first batch with actual records. This can give you an
> > accurate measurement of the ODBC performance, removing the bottleneck of
> > the disk.
> >
> > On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> >
> >> Thanks for the answers.
> >>
> >> @Ted, my only goal is to pump a large amount of data without having to
> >> read from the hard disk. I am measuring the ODBC driver performance and
> >> I need a higher data transfer rate, so any method that helps pump data
> >> out of Drill faster would help. log-synth seems a good way to generate
> >> data for testing; however, I'd need a RAM-only option, which hopefully
> >> provides a higher throughput.
> >>
> >> @Jacques, how involved is it to write a dummy plugin that returns one
> >> hardcoded row repeatedly 12 million times?
> >>
> >> Thanks,
> >> Alex
> >>
> >> On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunn...@gmail.com>
> >> wrote:
> >>
> >>> It may be easy, but it is completely opaque about what really needs to
> >>> happen.
> >>>
> >>> For instance,
> >>>
> >>> 1) how is schema exposed?
> >>>
> >>> 2) which classes do I really need to implement?
> >>>
> >>> 3) how do I express partitioning of a format?
> >>>
> >>> 4) how do I test it?
> >>>
> >>> Just a bit of documentation and comments would go a very, very long
> >>> way.
> >>>
> >>> Even answers on the mailing list with more details than "oh, that's
> >>> easy" would help.  I would be happy to transcribe answers into the
> >>> code if I could just get some.
> >>>
> >>> On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacq...@apache.org>
> >>> wrote:
> >>>
> >>>> Creating an EasyFormatPlugin is pretty simple.  They were designed to
> >>>> get rid of much of the scaffolding required for a standard
> >>>> FormatPlugin.
> >>>>
> >>>> JSON
> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
> >>>>
> >>>> Text
> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
> >>>>
> >>>> AVRO
> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
> >>>>
> >>>> In all cases, the connection code is pretty light.  A fully
> >>>> schematized format like log-synth should be even simpler to implement.
> >>>>
> >>>> On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunn...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> I don't think we need a full-on storage plugin.  I think a data
> >>>>> format should be sufficient, basically CSV on steroids.
> >>>>>
> >>>>>
> >>>>> On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> >>>>>
> >>>>>> Yeah, we still lack documentation on how to write a storage plugin.
> >>>>>> One piece of advice I've been seeing a lot is to take a look at the
> >>>>>> mongo-db plugin; it was basically added in one single commit:
> >>>>>> https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
> >>>>>>
> >>>>>> I think this will give some general ideas on what to expect when
> >>>>>> writing a storage plugin.
> >>>>>>
> >>>>>> On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hakim,
> >>>>>>>
> >>>>>>> Not yet.  Still very much in the stage of gathering feedback.
> >>>>>>>
> >>>>>>> I would think it very simple.  The biggest obstacles are
> >>>>>>>
> >>>>>>> 1) no documentation on how to write a data format
> >>>>>>>
> >>>>>>> 2) I need to release a jar for log-synth to Maven Central.
> >>>>>>>
> >>>>>>> On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> >>>>>>>
> >>>>>>>> @Ted, the log-synth storage format would be really useful. I'm
> >>>>>>>> already seeing many unit tests that could benefit from this. Do
> >>>>>>>> you have a GitHub repo for your ongoing work?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Are you hard set on using common table expressions?
> >>>>>>>>>
> >>>>>>>>> I have discussed a bit off-list creating a data format that
> >>>>>>>>> would allow tables to be read from a log-synth [1] schema.  That
> >>>>>>>>> would let you read as much data as you might like with an
> >>>>>>>>> arbitrarily complex (or simple) query.
> >>>>>>>>>
> >>>>>>>>> Operationally, you would create a file containing a log-synth
> >>>>>>>>> schema that has the extension .synth.  Your data source would
> >>>>>>>>> have to be configured to connect that extension with the
> >>>>>>>>> log-synth format.  At that point, you could select as much or as
> >>>>>>>>> little data as you like from the file, and you would see
> >>>>>>>>> generated data rather than the schema.
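> >>>>>>>>>
> >>>>>>>>> Roughly like this (a hypothetical sketch; the workspace and file
> >>>>>>>>> name are made up):
> >>>>>>>>>
> >>>>>>>>>     -- reads the generator schema in customers.synth and returns
> >>>>>>>>>     -- synthesized rows instead of the schema itself
> >>>>>>>>>     SELECT * FROM dfs.`/tmp/customers.synth` LIMIT 1000000;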
> >>>>>>>>>
> >>>>>>>>> [1] https://github.com/tdunning/log-synth
> >>>>>>>>>
> >>>>>>>>> On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.si...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi All,
> >>>>>>>>>>
> >>>>>>>>>> I am trying to come up with a query which returns a given
> >>>>>>>>>> number of rows without having a real table in storage.
> >>>>>>>>>>
> >>>>>>>>>> I am hoping to achieve something like this:
> >>>>>>>>>> http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
> >>>>>>>>>>
> >>>>>>>>>> DECLARE @start INT = 1;
> >>>>>>>>>> DECLARE @end INT = 1000000;
> >>>>>>>>>> WITH numbers AS (
> >>>>>>>>>>     SELECT @start AS number
> >>>>>>>>>>     UNION ALL
> >>>>>>>>>>     SELECT number + 1
> >>>>>>>>>>     FROM numbers
> >>>>>>>>>>     WHERE number < @end)
> >>>>>>>>>> SELECT * FROM numbers
> >>>>>>>>>> OPTION (MAXRECURSION 0);
> >>>>>>>>>>
> >>>>>>>>>> I do not actually need to create different values; returning
> >>>>>>>>>> identical rows would work too. I just need to bypass the "from
> >>>>>>>>>> clause" in the query.
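> >>>>>>>>>>
> >>>>>>>>>> (For a handful of rows a bare VALUES clause, if supported,
> >>>>>>>>>> would do it without any table, e.g.
> >>>>>>>>>>
> >>>>>>>>>>     SELECT * FROM (VALUES (1), (2), (3));
> >>>>>>>>>>
> >>>>>>>>>> but that does not scale to the millions of rows I need.)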
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Alex
