@Paul,
Do you think a format plugin is the right way to integrate this? My thought
was that we could create a folder for DFDL schemata, then the format plugin
config could specify which schema is used during read, i.e.:

    "dfdl": {
      "type": "dfdl",
      "file": "myschema.dfdl",
      "extensions": ["xml"]
    }
I was envisioning this working in much the same way as other format plugins
that use an external parser.
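
To make that concrete, the config class on the Drill side might look roughly
like the sketch below. This is only a sketch: DfdlFormatConfig and its field
names are made up, but the FormatPluginConfig interface and the Jackson
annotation are the real pieces other format plugins use.

    import java.util.Collections;
    import java.util.List;

    import org.apache.drill.common.logical.FormatPluginConfig;
    import com.fasterxml.jackson.annotation.JsonTypeName;

    // Hypothetical config class; class and field names are illustrative only.
    @JsonTypeName("dfdl")
    public class DfdlFormatConfig implements FormatPluginConfig {

      // Path to the DFDL schema used to parse matching files
      private String file;

      // File extensions this format should claim, e.g. ["xml"]
      private List<String> extensions = Collections.singletonList("xml");

      public String getFile() { return file; }

      public List<String> getExtensions() { return extensions; }
    }
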
-- C
> On Nov 7, 2019, at 1:35 PM, Paul Rogers <[email protected]> wrote:
>
> Hi All,
>
> One thought to add is that if DFDL defines the file schema, then it would be
> ideal to use that schema at plan time as well as run time. Drill's Calcite
> integration provides the means to do this, though I am personally a bit hazy
> on the details.
>
> Certainly getting the reader to work is the first step; thanks Charles for
> the excellent summary. Then, add the needed Calcite integration to make the
> schema available to the planner at plan time.
>
> Thanks,
> - Paul
>
>
>
> On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre
> <[email protected]> wrote:
>
> Hi Steve,
> Thanks for responding... Here's how Drill reads a file:
>
> Drill uses what are called "format plugins", which basically read the file in
> question and map fields to column vectors. Note: Drill supports nested data
> structures, so a column could contain a MAP or LIST.
>
> The basic steps are:
> 1. Open the input stream and read the file.
> 2. If the schema is known, it is advantageous to define it in advance using a
> SchemaBuilder object and create column writers for each column. In this case,
> since we'd be using DFDL, we do know the schema, so we could create it BEFORE
> the data actually gets read. If the schema is not known in advance (JSON, for
> instance), Drill can discover it as the data is read by dynamically adding
> column vectors as data is ingested, but that's not the case here...
> 3. Once the schema is defined, Drill will then read the file row by row,
> parse the data, and assign values to each column vector.
>
> There are a few more details but that's the essence.
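>
> To make steps 2 and 3 a bit more concrete, here is a rough sketch of what
> defining a schema up front and writing a row looks like with Drill's row set
> framework. The column names and types are made up for illustration, and the
> ResultSetLoader would be handed to the reader by the framework:
>
>    // Build a fixed schema before reading any data (step 2)
>    TupleMetadata schema = new SchemaBuilder()
>        .addNullable("name", MinorType.VARCHAR)
>        .addNullable("count", MinorType.INT)
>        .buildSchema();
>
>    // Write one row per parsed record (step 3)
>    RowSetLoader rowWriter = resultSetLoader.writer();
>    rowWriter.start();
>    rowWriter.scalar("name").setString("example");
>    rowWriter.scalar("count").setInt(42);
>    rowWriter.save();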
>
> What would be great is if we could create a function that maps a DFDL schema
> directly to a Drill SchemaBuilder. (Docs here [1].) Drill does natively
> support JSON; however, it would probably be more effective and efficient if
> there were an InfosetOutputter customized for Drill. Ideally, we need some
> sort of Iterable object so that Drill can map the parsed fields to the
> schema.
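>
> Just to sketch the kind of mapping I have in mind, the Drill half might look
> something like the snippet below. The DFDL side is entirely hypothetical
> here; how we would actually enumerate elements and types from a compiled
> DFDL schema is exactly the open question:
>
>    // Hypothetical helper: map an element's XSD simple type name to a Drill
>    // column on a SchemaBuilder. A real DFDL-to-Drill mapping would need to
>    // cover many more types as well as nested structures.
>    void addColumn(SchemaBuilder builder, String elementName, String xsdType) {
>      switch (xsdType) {
>        case "xs:int":
>          builder.addNullable(elementName, MinorType.INT);
>          break;
>        case "xs:long":
>          builder.addNullable(elementName, MinorType.BIGINT);
>          break;
>        case "xs:double":
>          builder.addNullable(elementName, MinorType.FLOAT8);
>          break;
>        default:
>          builder.addNullable(elementName, MinorType.VARCHAR);
>      }
>    }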
>
> If you want to see a relatively simple format plugin, take a look here: [2].
> This file is the BatchReader, which is where most of the heavy lifting takes
> place. This plugin is for ESRI Shapefiles and has a mix of pre-defined
> fields, nested fields, and fields that are defined after reading starts.
>
>
> [1]:
> https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
>
> [2]:
> https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
>
>
>
> I can start a draft PR on the Drill side over the weekend and will share the
> link to this list.
> Respectfully,
> -- C
>
>
>> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <[email protected]>
>> wrote:
>>
>> I definitely agree. Apache Drill seems like a logical place to add
>> Daffodil support. And I'm sure many of us, including myself, would be
>> happy to provide some time towards this effort.
>>
>> The Daffodil API is actually fairly simple and usually straightforward to
>> integrate--most of the complexity comes from the DFDL schemas. There's a
>> good "hello world" available [1] that shows more API
>> functionality/errors/etc., but the gist of it is:
>>
>> 1) Compile a DFDL schema to a data processor:
>>
>> Compiler c = Daffodil.compiler();
>> ProcessorFactory pf = c.compileFile(file);
>> DataProcessor dp = pf.onPath("/");
>>
>> 2) Create an input source for the data
>>
>> InputStream is = ...
>> InputSourceDataInputStream in = new InputSourceDataInputStream(is);
>>
>> 3) Create an infoset outputter (we have a handful of different kinds)
>>
>> JDOMInfosetOutputter out = new JDOMInfosetOutputter();
>>
>> 4) Use the DataProcessor to parse the input data to the infoset outputter
>>
>> ParseResult pr = dp.parse(in, out);
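>>
>> 5) Optionally, check the result for errors (a rough sketch only; the hello
>> world example [1] shows the fuller error handling, so treat these method
>> names as approximate)
>>
>> if (pr.isError()) {
>>   // Each Diagnostic describes a parse error or warning
>>   for (Diagnostic d : pr.getDiagnostics()) {
>>     System.err.println(d);
>>   }
>> }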
>>
>> So I guess the part where we would need more Drill understanding is what
>> the InfosetOutputter (step 3) needs to look like to better integrate
>> into Drill. Is there a standard data structure that Drill expects data to
>> be represented in, with Drill doing the querying on that data structure?
>> And is there some sort of schema that Daffodil would need to create to
>> describe what this structure looks like so Drill could query it? Perhaps
>> we'd have a custom Drill InfosetOutputter that creates this data
>> structure, unless Drill already supports XML or JSON.
>>
>> Or is it completely up to the Storage Plugin (is that the right term?) to
>> determine how to take a Drill query and find the appropriate data from
>> the data store?
>>
>> - Steve
>>
>> [1]
>> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
>>
>>
>> On 11/3/19 9:31 AM, Charles Givre wrote:
>>> Hi Julian,
>>> It seems like there is the beginning of a convergence of minds here. I went
>>> to the Apache Roadshow in DC, which is where I learned about DFDL and
>>> immediately thought this was a really interesting possibility.
>>>
>>> I'd love to see if we could foster some collaboration between the various
>>> projects on this. From the Drill side of things, it would make it SO much
>>> easier to get Drill to read (and by extension query) various data types.
>>> I'd be willing to contribute time from the Drill side, but I definitely
>>> will need help understanding how DFDL works.
>>>
>>> --C
>>>
>>>
>>>
>>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <[email protected]> wrote:
>>>>
>>>> Hi Charles,
>>>> This is an interesting idea, and in fact we also discussed the same matter
>>>> for Calcite at ApacheCon NA.
>>>> But I agree that it would be really powerful together with a complete
>>>> runtime like Drill.
>>>> Julian
>>>> From: Charles Givre <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Wednesday, October 30, 2019 at 19:38
>>>> To: "Costello, Roger L." <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Subject: Re: Use cases for DFDL
>>>> +1
>>>>
>>>>
>>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <[email protected]> wrote:
>>>>> Excellent! Okay, here's the use case:
>>>>> A Daffodil extension could be created for Apache Drill so that you could
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you
>>>>> could use ANSI SQL to query the data, join it with other data, do
>>>>> analysis, etc., just as if it came from a database. So, instead of parsing
>>>>> data to XML and then using XPath to pull out data, you could instead parse
>>>>> data to Apache Drill's data representation and then use ANSI SQL to pull
>>>>> out data, and even combine it with other non-Daffodil data types. The
>>>>> advantage of this would be that it would make it very easy to enable Drill
>>>>> to query new data types (i.e., simply by using a DFDL schema), and it
>>>>> would enable users to easily query this data without having to load it
>>>>> into another system.
>>>>> How's that, Charles?
>>>>> /Roger
>>>>> From: Charles Givre <[email protected]>
>>>>> Sent: Wednesday, October 30, 2019 2:28 PM
>>>>> To: Costello, Roger L. <[email protected]>
>>>>> Cc: [email protected]
>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax; it
>>>>> is regular ANSI SQL. IMHO, this would be a really great collaboration
>>>>> between the two communities.
>>>>> --C
>>>>>
>>>>>
>>>>>
>>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <[email protected]> wrote:
>>>>>> Thanks again, Charles. Is the following use case description correct?
>>>>>> A Daffodil extension could be created for Apache Drill so that you could
>>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you
>>>>>> could use Apache Drill's query-like syntax and rich capabilities to query
>>>>>> parts of that data, join it with other data, do analysis, etc., just as
>>>>>> if it came from a database. So, instead of parsing data to XML and then
>>>>>> using XPath to pull out data, you could instead parse data to Apache
>>>>>> Drill's data representation and then use Drill's rich data-query
>>>>>> capabilities to pull out data, and even combine it with other
>>>>>> non-Daffodil data types. The advantage of this would be that it would
>>>>>> make it very easy to enable Drill to query new data types (i.e., simply
>>>>>> by using a DFDL schema), and it would enable users to easily query this
>>>>>> data without having to load it into another system.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> From: Charles Givre <[email protected]>
>>>>>> Sent: Wednesday, October 30, 2019 12:19 PM
>>>>>> To: Costello, Roger L. <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>>> Not exactly...
>>>>>> I was thinking of using DFDL to enable Drill to create a schema for data
>>>>>> that Drill cannot read. If DFDL can be used to describe the schema, a
>>>>>> plugin could be written for Drill that mirrors this schema and ultimately
>>>>>> reads the data files. Drill wouldn't be populating any database, but
>>>>>> rather directly querying the data.
>>>>>> The advantage of this would be that it would make it very easy to enable
>>>>>> Drill to query new data types (i.e., simply by using a DFDL schema), and
>>>>>> it would enable users to easily query this data w/o having to load it
>>>>>> into another system. Does that make sense?
>>>>>> -- C
>>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <[email protected]> wrote:
>>>>>>> Thanks, Charles. Let me see if I understand the use case correctly:
>>>>>>> use DFDL to parse data to populate a database, and then use Apache Drill
>>>>>>> to query the database.
>>>>>>> Is that correct?
>>>>>>> /Roger
>>>>>>> From: Charles Givre <[email protected]>
>>>>>>> Sent: Wednesday, October 30, 2019 12:01 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill. I think a
>>>>>>> compelling use case for DFDL would be enabling Drill to query data based
>>>>>>> on a DFDL schema. This same concept could be applied to other SQL query
>>>>>>> engines such as Presto and/or Impala.
>>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>>> -- C
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <[email protected]> wrote:
>>>>>>>> Thanks Mike! I updated the slide:
>>>>>>>> <image002.png>
>>>>>>>> From: Beckerle, Mike <[email protected]>
>>>>>>>> Sent: Wednesday, October 30, 2019 11:45 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>>> Parsing data to populate a database (of any variety) is the actual use
>>>>>>>> case. The fact that we did one project involving RDF is why I cited
>>>>>>>> that example in particular, but pulling data into any data
>>>>>>>> store/database begins with the ability to parse the data and then
>>>>>>>> process it into a suitable form.
>>>>>>>> This is an incomplete list, so perhaps the slide title should be
>>>>>>>> "Example Use Cases for DFDL"?
>>>>>>>> ...mikeb
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> From: Costello, Roger L. <[email protected]>
>>>>>>>> Sent: Monday, October 28, 2019 10:41 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Use cases for DFDL
>>>>>>>> Hi Folks,
>>>>>>>> I created a slide of use cases. See below. Do you agree with the slide?
>>>>>>>> Anything you would add, delete, or change?
>>>>>>>> /Roger
>>>>>>>> <image003.png>
>>>
>>