Also, just doing a Cartesian join of three copies of a 1,000-record table will give you a billion records with negligible I/O.
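A sketch of that trick in plain SQL (t1k stands in for any hypothetical 1,000-row table; note that Drill's planner may refuse an unrestricted cartesian join unless nested-loop joins are allowed, so treat this as the shape of the idea rather than a guaranteed-to-plan query):

    -- Self-join a hypothetical 1,000-row table three times:
    -- 1000 * 1000 * 1000 = 1,000,000,000 rows from one tiny scan.
    -- "id" is any column in t1k.
    SELECT a.id
    FROM t1k a
    CROSS JOIN t1k b
    CROSS JOIN t1k c;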
Sent from my iPhone

> On Jul 16, 2015, at 15:43, Jason Altekruse <[email protected]> wrote:
>
> @Alexander If you want to test the speed of the ODBC driver, you can do
> that without a new storage plugin.
>
> If you get the entire dataset into memory, it will be returned from Drill
> as quickly as we can possibly send it to the client. One way to do this is
> to insert a sort; we cannot send along any of the data until the complete
> sort is done. As long as you don't read so much data that we start
> spilling the sort to disk, all of the records will be in memory. To take
> the read and sort time out of your test, just make sure to record the time
> you first receive data from Drill, not the query start time.
>
> There is one gotcha here. To make the BI tools more responsive, we
> implemented a feature that sends along one empty batch of records with the
> schema information populated. This schema is generated by applying all of
> the transformations that happen throughout the query. For example, the
> join operator handles this schema population by sending along the schema
> merged from the two sides of the join; project will similarly add or
> remove columns based on the expressions and columns requested. You will
> want to make sure you record your start time when you receive the first
> batch with actual records. This gives you an accurate measurement of the
> ODBC performance, removing the bottleneck of the disk.
>
> On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <[email protected]> wrote:
>
>> Thanks for the answers.
>>
>> @Ted my only goal is to pump a large amount of data without having to
>> read from the hard disk. I am measuring the ODBC driver performance and I
>> need a higher data transfer rate, so any method that helps pump data out
>> of Drill faster would help. log-synth seems like a good way to generate
>> data for testing. However, I'd need a RAM-only option, which would
>> hopefully provide higher throughput.
>>
>> @Jacques How involved is it to write a dummy plugin that returns one
>> hardcoded row repeatedly 12 million times?
>>
>> Thanks,
>> Alex
>>
>> On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <[email protected]> wrote:
>>
>>> It may be easy, but it is completely opaque about what really needs to
>>> happen.
>>>
>>> For instance:
>>>
>>> 1) how is schema exposed?
>>>
>>> 2) which classes do I really need to implement?
>>>
>>> 3) how do I express partitioning of a format?
>>>
>>> 4) how do I test it?
>>>
>>> Just a bit of documentation and comments would go a very, very long way.
>>> Even answers on the mailing list with more detail than "oh, that's easy"
>>> would help. I would be happy to transcribe answers into the code if I
>>> could just get some.
>>>
>>> On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <[email protected]> wrote:
>>>
>>>> Creating an EasyFormatPlugin is pretty simple. They were designed to
>>>> get rid of much of the scaffolding required for a standard FormatPlugin.
>>>>
>>>> JSON:
>>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
>>>>
>>>> Text:
>>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
>>>>
>>>> Avro:
>>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
>>>>
>>>> In all cases, the connection code is pretty light. A fully schematized
>>>> format like log-synth should be even simpler to implement.
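For reference, the sort trick Jason describes at the top of this message can be as small as wrapping the scan in an ORDER BY. A minimal sketch, assuming a hypothetical file path (dfs.tmp.`big.json`) and column name (some_column):

    -- The sort cannot emit anything until it has consumed all input, so
    -- the whole result is buffered in memory and then streamed to the
    -- client at full speed. Start timing at the first non-empty batch,
    -- not at query submission.
    SELECT *
    FROM dfs.tmp.`big.json`
    ORDER BY some_column;  -- some_column is any column in the file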
>>>> On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> I don't think we need a full-on storage plugin. I think a data format
>>>>> should be sufficient, basically CSV on steroids.
>>>>>
>>>>> On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <[email protected]> wrote:
>>>>>
>>>>>> Yeah, we still lack documentation on how to write a storage plugin.
>>>>>> One piece of advice I've been seeing a lot is to take a look at the
>>>>>> mongo-db plugin; it was basically added in one single commit:
>>>>>> https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
>>>>>>
>>>>>> I think this will give some general idea of what to expect when
>>>>>> writing a storage plugin.
>>>>>>
>>>>>> On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <[email protected]> wrote:
>>>>>>
>>>>>>> Hakim,
>>>>>>>
>>>>>>> Not yet. Still very much in the stage of gathering feedback.
>>>>>>>
>>>>>>> I would think it very simple. The biggest obstacles are:
>>>>>>>
>>>>>>> 1) no documentation on how to write a data format
>>>>>>>
>>>>>>> 2) I need to release a jar for log-synth to Maven Central.
>>>>>>>
>>>>>>> On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <[email protected]> wrote:
>>>>>>>
>>>>>>>> @Ted, the log-synth storage format would be really useful. I'm
>>>>>>>> already seeing many unit tests that could benefit from this. Do you
>>>>>>>> have a github repo for your ongoing work?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Are you hard set on using common table expressions?
>>>>>>>>>
>>>>>>>>> I have discussed a bit off-list creating a data format that would
>>>>>>>>> allow tables to be read from a log-synth [1] schema. That would let
>>>>>>>>> you read as much data as you might like with an arbitrarily complex
>>>>>>>>> (or simple) query.
>>>>>>>>>
>>>>>>>>> Operationally, you would create a file containing a log-synth
>>>>>>>>> schema that has the extension .synth. Your data source would have
>>>>>>>>> to be configured to connect that extension with the log-synth
>>>>>>>>> format. At that point, you could select as much or as little data
>>>>>>>>> as you like from the file and you would see generated data rather
>>>>>>>>> than the schema.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/tdunning/log-synth
>>>>>>>>>
>>>>>>>>> On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I am trying to come up with a query which returns a given number
>>>>>>>>>> of rows without having a real table in storage.
>>>>>>>>>>
>>>>>>>>>> I am hoping to achieve something like this:
>>>>>>>>>> http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
>>>>>>>>>>
>>>>>>>>>> DECLARE @start INT = 1;
>>>>>>>>>> DECLARE @end INT = 1000000;
>>>>>>>>>> WITH numbers AS (
>>>>>>>>>>     SELECT @start AS number
>>>>>>>>>>     UNION ALL
>>>>>>>>>>     SELECT number + 1
>>>>>>>>>>     FROM numbers
>>>>>>>>>>     WHERE number < @end)
>>>>>>>>>> SELECT * FROM numbers
>>>>>>>>>> OPTION (MAXRECURSION 0);
>>>>>>>>>>
>>>>>>>>>> I do not actually need to create different values, and returning
>>>>>>>>>> identical rows would work too. I just need to bypass the "from
>>>>>>>>>> clause" in the query.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Alex
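To make the goal in the original question concrete: without recursive CTEs, rows can still be conjured from nothing by cross-joining small inline VALUES lists, in the same spirit as the cartesian-join suggestion at the top of the thread. A generic SQL sketch, not tested against Drill:

    -- Each VALUES list contributes 10 rows; the three-way cross join
    -- yields 10 * 10 * 10 = 1,000 identical rows with no table scan.
    -- Each extra copy multiplies the row count by 10.
    SELECT 1 AS dummy
    FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS a(d)
    CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS b(d)
    CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS c(d);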
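And for the .synth format Ted proposes above, a guess at what such a file might contain, based on the sampler specs in the log-synth README; the field names and classes here are illustrative, not a tested Drill integration:

    [
        {"name": "id", "class": "id"},
        {"name": "user", "class": "name", "type": "first_last"},
        {"name": "visits", "class": "int", "min": 1, "max": 100}
    ]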
