Also, just doing a Cartesian join of three copies of a 1,000-record table will give you a billion records with negligible I/O.
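A sketch of that trick in plain SQL (t1k stands in for any hypothetical 1,000-row table; note that Drill's planner may refuse an unrestricted cartesian join unless nested-loop joins are allowed, so treat this as the shape of the idea rather than a guaranteed-to-plan query):

    -- Self-join a hypothetical 1,000-row table three times:
    -- 1000 * 1000 * 1000 = 1,000,000,000 rows from one tiny scan.
    -- "id" is any column in t1k.
    SELECT a.id
    FROM t1k a
    CROSS JOIN t1k b
    CROSS JOIN t1k c;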
Sent from my iPhone

> On Jul 16, 2015, at 15:43, Jason Altekruse <[email protected]> wrote:
>
> @Alexander If you want to test the speed of the ODBC driver, you can do
> that without a new storage plugin.
>
> If you get the entire dataset into memory, it will be returned from Drill
> as quickly as we can possibly send it to the client. One way to do this is
> to insert a sort; we cannot send along any of the data until the complete
> sort is done. As long as you don't read so much data that we start
> spilling the sort to disk, all of the records will be in memory. To take
> the read and sort time out of your test, just make sure to record the time
> you first receive data from Drill, not the query start time.
>
> There is one gotcha here. To make the BI tools more responsive, we
> implemented a feature that sends along one empty batch of records with the
> schema information populated. This schema is generated by applying all of
> the transformations that happen throughout the query. For example, the
> join operator handles this schema population by sending along the schema
> merged from the two sides of the join; project will similarly add or
> remove columns based on the expressions and columns requested. You will
> want to make sure you record your start time when you receive the first
> batch with actual records. This gives you an accurate measurement of the
> ODBC performance, removing the bottleneck of the disk.
>
> On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <[email protected]> wrote:
>
>> Thanks for the answers.
>>
>> @Ted my only goal is to pump a large amount of data without having to
>> read from the hard disk. I am measuring the ODBC driver performance and I
>> need a higher data transfer rate, so any method that helps pump data out
>> of Drill faster would help. log-synth seems like a good way to generate
>> data for testing. However, I'd need a RAM-only option, which would
>> hopefully provide higher throughput.
>>
>> @Jacques How involved is it to write a dummy plugin that returns one
>> hardcoded row repeatedly 12 million times?
>>
>> Thanks,
>> Alex
>>
>> On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <[email protected]> wrote:
>>
>>> It may be easy, but it is completely opaque about what really needs to
>>> happen.
>>>
>>> For instance:
>>>
>>> 1) how is schema exposed?
>>>
>>> 2) which classes do I really need to implement?
>>>
>>> 3) how do I express partitioning of a format?
>>>
>>> 4) how do I test it?
>>>
>>> Just a bit of documentation and comments would go a very, very long way.
>>> Even answers on the mailing list with more detail than "oh, that's easy"
>>> would help. I would be happy to transcribe answers into the code if I
>>> could just get some.
>>>
>>> On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <[email protected]> wrote:
>>>
>>>> Creating an EasyFormatPlugin is pretty simple. They were designed to
>>>> get rid of much of the scaffolding required for a standard FormatPlugin.
>>>>
>>>> JSON:
>>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
>>>>
>>>> Text:
>>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
>>>>
>>>> Avro:
>>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
>>>>
>>>> In all cases, the connection code is pretty light. A fully schematized
>>>> format like log-synth should be even simpler to implement.
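For reference, the sort trick Jason describes at the top of this message can be as small as wrapping the scan in an ORDER BY. A minimal sketch, assuming a hypothetical file path (dfs.tmp.`big.json`) and column name (some_column):

    -- The sort cannot emit anything until it has consumed all input, so
    -- the whole result is buffered in memory and then streamed to the
    -- client at full speed. Start timing at the first non-empty batch,
    -- not at query submission.
    SELECT *
    FROM dfs.tmp.`big.json`
    ORDER BY some_column;  -- some_column is any column in the file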
>>>> On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> I don't think we need a full-on storage plugin. I think a data format
>>>>> should be sufficient, basically CSV on steroids.
>>>>>
>>>>> On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <[email protected]> wrote:
>>>>>
>>>>>> Yeah, we still lack documentation on how to write a storage plugin.
>>>>>> One piece of advice I've been seeing a lot is to take a look at the
>>>>>> mongo-db plugin; it was basically added in one single commit:
>>>>>> https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
>>>>>>
>>>>>> I think this will give some general idea of what to expect when
>>>>>> writing a storage plugin.
>>>>>>
>>>>>> On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <[email protected]> wrote:
>>>>>>
>>>>>>> Hakim,
>>>>>>>
>>>>>>> Not yet. Still very much in the stage of gathering feedback.
>>>>>>>
>>>>>>> I would think it very simple. The biggest obstacles are:
>>>>>>>
>>>>>>> 1) no documentation on how to write a data format
>>>>>>>
>>>>>>> 2) I need to release a jar for log-synth to Maven Central.
>>>>>>>
>>>>>>> On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <[email protected]> wrote:
>>>>>>>
>>>>>>>> @Ted, the log-synth storage format would be really useful. I'm
>>>>>>>> already seeing many unit tests that could benefit from this. Do you
>>>>>>>> have a github repo for your ongoing work?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Are you hard set on using common table expressions?
>>>>>>>>>
>>>>>>>>> I have discussed a bit off-list creating a data format that would
>>>>>>>>> allow tables to be read from a log-synth [1] schema. That would let
>>>>>>>>> you read as much data as you might like with an arbitrarily complex
>>>>>>>>> (or simple) query.
>>>>>>>>>
>>>>>>>>> Operationally, you would create a file containing a log-synth
>>>>>>>>> schema that has the extension .synth. Your data source would have
>>>>>>>>> to be configured to connect that extension with the log-synth
>>>>>>>>> format. At that point, you could select as much or as little data
>>>>>>>>> as you like from the file and you would see generated data rather
>>>>>>>>> than the schema.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/tdunning/log-synth
>>>>>>>>>
>>>>>>>>> On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I am trying to come up with a query which returns a given number
>>>>>>>>>> of rows without having a real table in storage.
>>>>>>>>>>
>>>>>>>>>> I am hoping to achieve something like this:
>>>>>>>>>> http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
>>>>>>>>>>
>>>>>>>>>> DECLARE @start INT = 1;
>>>>>>>>>> DECLARE @end INT = 1000000;
>>>>>>>>>> WITH numbers AS (
>>>>>>>>>>     SELECT @start AS number
>>>>>>>>>>     UNION ALL
>>>>>>>>>>     SELECT number + 1
>>>>>>>>>>     FROM numbers
>>>>>>>>>>     WHERE number < @end)
>>>>>>>>>> SELECT * FROM numbers
>>>>>>>>>> OPTION (MAXRECURSION 0);
>>>>>>>>>>
>>>>>>>>>> I do not actually need to create different values, and returning
>>>>>>>>>> identical rows would work too. I just need to bypass the "from
>>>>>>>>>> clause" in the query.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Alex
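To make the goal in the original question concrete: without recursive CTEs, rows can still be conjured from nothing by cross-joining small inline VALUES lists, in the same spirit as the cartesian-join suggestion at the top of the thread. A generic SQL sketch, not tested against Drill:

    -- Each VALUES list contributes 10 rows; the three-way cross join
    -- yields 10 * 10 * 10 = 1,000 identical rows with no table scan.
    -- Each extra copy multiplies the row count by 10.
    SELECT 1 AS dummy
    FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS a(d)
    CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS b(d)
    CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS c(d);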
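And for the .synth format Ted proposes above, a guess at what such a file might contain, based on the sampler specs in the log-synth README; the field names and classes here are illustrative, not a tested Drill integration:

    [
        {"name": "id", "class": "id"},
        {"name": "user", "class": "name", "type": "first_last"},
        {"name": "visits", "class": "int", "min": 1, "max": 100}
    ]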
