@Paul,
Do you think a format plugin is the right way to integrate this? My thought
was that we could create a folder for DFDL schemata, then the format plugin
config could specify which schema is used during read, i.e.:

    "dfdl": {
      "type": "dfdl",
      "file": "myschema.dfdl",
      "extensions": ["xml"]
    }
I was envisioning this working in much the same way as other format plugins
that use an external parser.
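
To make that concrete, the config class on the Drill side might look roughly
like the sketch below. This is only a sketch: DfdlFormatConfig and its field
names are made up, but the FormatPluginConfig interface and the Jackson
annotation are the real pieces other format plugins use.

    import java.util.Collections;
    import java.util.List;

    import org.apache.drill.common.logical.FormatPluginConfig;
    import com.fasterxml.jackson.annotation.JsonTypeName;

    // Hypothetical config class; class and field names are illustrative only.
    @JsonTypeName("dfdl")
    public class DfdlFormatConfig implements FormatPluginConfig {

      // Path to the DFDL schema used to parse matching files
      private String file;

      // File extensions this format should claim, e.g. ["xml"]
      private List<String> extensions = Collections.singletonList("xml");

      public String getFile() { return file; }

      public List<String> getExtensions() { return extensions; }
    }
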
-- C
> On Nov 7, 2019, at 1:35 PM, Paul Rogers <[email protected]> wrote:
>
> Hi All,
>
> One thought to add is that if DFDL defines the file schema, then it would be
> ideal to use that schema at plan time as well as run time. Drill's Calcite
> integration provides the means to do this, though I am personally a bit hazy
> on the details.
>
> Certainly getting the reader to work is the first step; thanks Charles for
> the excellent summary. Then, add the needed Calcite integration to make the
> schema available to the planner at plan time.
>
> Thanks,
> - Paul
>
>
>
> On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre
> <[email protected]> wrote:
>
> Hi Steve,
> Thanks for responding... Here's how Drill reads a file:
>
> Drill uses what are called "format plugins", which basically read the file in
> question and map fields to column vectors. Note: Drill supports nested data
> structures, so a column could contain a MAP or LIST.
>
> The basic steps are:
> 1. Open the input stream and read the file.
> 2. If the schema is known, it is advantageous to define it in advance using a
> SchemaBuilder object and create column writers for each column. In this case,
> since we'd be using DFDL, we do know the schema, so we could create it BEFORE
> the data actually gets read. If the schema is not known in advance (JSON, for
> instance), Drill can discover it as the data is read by dynamically adding
> column vectors as data is ingested, but that's not the case here...
> 3. Once the schema is defined, Drill will then read the file row by row,
> parse the data, and assign values to each column vector.
>
> There are a few more details but that's the essence.
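>
> To make steps 2 and 3 a bit more concrete, here is a rough sketch of what
> defining a schema up front and writing a row looks like with Drill's row set
> framework. The column names and types are made up for illustration, and the
> ResultSetLoader would be handed to the reader by the framework:
>
>    // Build a fixed schema before reading any data (step 2)
>    TupleMetadata schema = new SchemaBuilder()
>        .addNullable("name", MinorType.VARCHAR)
>        .addNullable("count", MinorType.INT)
>        .buildSchema();
>
>    // Write one row per parsed record (step 3)
>    RowSetLoader rowWriter = resultSetLoader.writer();
>    rowWriter.start();
>    rowWriter.scalar("name").setString("example");
>    rowWriter.scalar("count").setInt(42);
>    rowWriter.save();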
>
> What would be great is if we could create a function that maps a DFDL schema
> directly to a Drill SchemaBuilder. (Docs here [1].) Drill does natively
> support JSON; however, it would probably be more effective and efficient if
> there were an InfosetOutputter customized for Drill. Ideally, we need some
> sort of Iterable object so that Drill can map the parsed fields to the
> schema.
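>
> Just to sketch the kind of mapping I have in mind, the Drill half might look
> something like the snippet below. The DFDL side is entirely hypothetical
> here; how we would actually enumerate elements and types from a compiled
> DFDL schema is exactly the open question:
>
>    // Hypothetical helper: map an element's XSD simple type name to a Drill
>    // column on a SchemaBuilder. A real DFDL-to-Drill mapping would need to
>    // cover many more types as well as nested structures.
>    void addColumn(SchemaBuilder builder, String elementName, String xsdType) {
>      switch (xsdType) {
>        case "xs:int":
>          builder.addNullable(elementName, MinorType.INT);
>          break;
>        case "xs:long":
>          builder.addNullable(elementName, MinorType.BIGINT);
>          break;
>        case "xs:double":
>          builder.addNullable(elementName, MinorType.FLOAT8);
>          break;
>        default:
>          builder.addNullable(elementName, MinorType.VARCHAR);
>      }
>    }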
>
> If you want to see a relatively simple format plugin, take a look here: [2].
> This file is the BatchReader, which is where most of the heavy lifting takes
> place. This plugin is for ESRI Shapefiles and has a mix of pre-defined
> fields, nested fields, and fields that are defined after reading starts.
>
>
> [1]:
> https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
>
> [2]:
> https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java
>
>
>
> I can start a draft PR on the Drill side over the weekend and will share the
> link to this list.
> Respectfully,
> -- C
>
>
>> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <[email protected]>
>> wrote:
>>
>> I definitely agree. Apache Drill seems like a logical place to add
>> Daffodil support. And I'm sure many of us, including myself, would be
>> happy to provide some time towards this effort.
>>
>> The Daffodil API is actually fairly simple and usually straightforward to
>> integrate--most of the complexity comes from the DFDL schemas. There's a
>> good "hello world" available [1] that shows more API
>> functionality/errors/etc., but the gist of it is:
>>
>> 1) Compile a DFDL schema to a data processor:
>>
>> Compiler c = Daffodil.compiler();
>> ProcessorFactory pf = c.compileFile(file);
>> DataProcessor dp = pf.onPath("/");
>>
>> 2) Create an input source for the data
>>
>> InputStream is = ...
>> InputSourceDataInputStream in = new InputSourceDataInputStream(is);
>>
>> 3) Create an infoset outputter (we have a handful of different kinds)
>>
>> JDOMInfosetOutputter out = new JDOMInfosetOutputter();
>>
>> 4) Use the DataProcessor to parse the input data to the infoset outputter
>>
>> ParseResult pr = dp.parse(in, out);
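>>
>> 5) Optionally, check the result for errors (a rough sketch only; the hello
>> world example [1] shows the fuller error handling, so treat these method
>> names as approximate)
>>
>> if (pr.isError()) {
>>   // Each Diagnostic describes a parse error or warning
>>   for (Diagnostic d : pr.getDiagnostics()) {
>>     System.err.println(d);
>>   }
>> }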
>>
>> So I guess the part where we would need more Drill understanding is what
>> the InfosetOutputter (step 3) needs to look like to better integrate
>> into Drill. Is there a standard data structure that Drill expects data to
>> be represented in, with Drill doing the querying on that data structure?
>> And is there some sort of schema that Daffodil would need to create to
>> describe what this structure looks like so Drill could query it? Perhaps
>> we'd have a custom Drill InfosetOutputter that creates this data
>> structure, unless Drill already supports XML or JSON.
>>
>> Or is it completely up to the Storage Plugin (is that the right term?) to
>> determine how to take a Drill query and find the appropriate data from
>> the data store?
>>
>> - Steve
>>
>> [1]
>> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
>>
>>
>> On 11/3/19 9:31 AM, Charles Givre wrote:
>>> Hi Julian,
>>> It seems like there is the beginning of a convergence of minds here. I went
>>> to the Apache Roadshow in DC, which is where I learned about DFDL and
>>> immediately thought this was a really interesting possibility.
>>>
>>> I'd love to see if we could foster some collaboration between the various
>>> projects on this. From the Drill side of things, it would make it SO much
>>> easier to get Drill to read (and by extension query) various data types.
>>> I'd be willing to contribute time from the Drill side, but I definitely
>>> will need help understanding how DFDL works.
>>>
>>> --C
>>>
>>>
>>>
>>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <[email protected]> wrote:
>>>>
>>>> Hi Charles,
>>>> This is an interesting idea, and in fact we also discussed the same matter
>>>> for Calcite at ApacheCon NA.
>>>> But I agree that it would be really powerful together with a complete
>>>> runtime like Drill.
>>>> Julian
>>>> From: Charles Givre <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Wednesday, October 30, 2019 at 19:38
>>>> To: "Costello, Roger L." <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Subject: Re: Use cases for DFDL
>>>> +1
>>>>
>>>>
>>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <[email protected]> wrote:
>>>>> Excellent! Okay, here's the use case:
>>>>> A Daffodil extension could be created for Apache Drill so that you could
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you
>>>>> could use ANSI SQL to query the data, join it with other data, do
>>>>> analysis, etc., just as if it came from a database. So, instead of parsing
>>>>> data to XML and then using XPath to pull out data, you could instead parse
>>>>> data to Apache Drill's data representation and then use ANSI SQL to pull
>>>>> out data, and even combine it with other non-Daffodil data types. The
>>>>> advantage of this would be that it would make it very easy to enable Drill
>>>>> to query new data types (i.e., simply by using a DFDL schema), and it
>>>>> would enable users to easily query this data without having to load it
>>>>> into another system.
>>>>> How's that, Charles?
>>>>> /Roger
>>>>> From: Charles Givre <[email protected]>
>>>>> Sent: Wednesday, October 30, 2019 2:28 PM
>>>>> To: Costello, Roger L. <[email protected]>
>>>>> Cc: [email protected]
>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax; it
>>>>> is regular ANSI SQL. IMHO, this would be a really great collaboration
>>>>> between the two communities.
>>>>> --C
>>>>>
>>>>>
>>>>>
>>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <[email protected]> wrote:
>>>>>> Thanks again, Charles. Is the following use case description correct?
>>>>>> A Daffodil extension could be created for Apache Drill so that you could
>>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you
>>>>>> could use Apache Drill's query-like syntax and rich capabilities to query
>>>>>> parts of that data, join it with other data, do analysis, etc., just as
>>>>>> if it came from a database. So, instead of parsing data to XML and then
>>>>>> using XPath to pull out data, you could instead parse data to Apache
>>>>>> Drill's data representation and then use Drill's rich data-query
>>>>>> capabilities to pull out data, and even combine it with other
>>>>>> non-Daffodil data types. The advantage of this would be that it would
>>>>>> make it very easy to enable Drill to query new data types (i.e., simply
>>>>>> by using a DFDL schema), and it would enable users to easily query this
>>>>>> data without having to load it into another system.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> From: Charles Givre <[email protected]>
>>>>>> Sent: Wednesday, October 30, 2019 12:19 PM
>>>>>> To: Costello, Roger L. <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>>> Not exactly...
>>>>>> I was thinking of using DFDL to enable Drill to create a schema for data
>>>>>> that Drill cannot read. If DFDL can be used to describe the schema, a
>>>>>> plugin could be written for Drill that mirrors this schema and ultimately
>>>>>> reads the data files. Drill wouldn't be populating any database, but
>>>>>> rather directly querying the data.
>>>>>> The advantage of this would be that it would make it very easy to enable
>>>>>> Drill to query new data types (i.e., simply by using a DFDL schema), and
>>>>>> it would enable users to easily query this data w/o having to load it
>>>>>> into another system. Does that make sense?
>>>>>> -- C
>>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <[email protected]> wrote:
>>>>>>> Thanks, Charles. Let me see if I understand the use case correctly:
>>>>>>> use DFDL to parse data to populate a database, and then use Apache Drill
>>>>>>> to query the database.
>>>>>>> Is that correct?
>>>>>>> /Roger
>>>>>>> From: Charles Givre <[email protected]>
>>>>>>> Sent: Wednesday, October 30, 2019 12:01 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill. I think a
>>>>>>> compelling use case for DFDL would be enabling Drill to query data based
>>>>>>> on a DFDL schema. This same concept could be applied to other SQL query
>>>>>>> engines such as Presto and/or Impala.
>>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>>> -- C
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <[email protected]> wrote:
>>>>>>>> Thanks Mike! I updated the slide:
>>>>>>>> <image002.png>
>>>>>>>> From: Beckerle, Mike <[email protected]>
>>>>>>>> Sent: Wednesday, October 30, 2019 11:45 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: [EXT] Re: Use cases for DFDL
>>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>>> Parsing data to populate a database (of any variety) is the actual use
>>>>>>>> case. The fact that we did one project involving RDF is why I cited
>>>>>>>> that example in particular, but pulling data into any data
>>>>>>>> store/database begins with the ability to parse the data and then
>>>>>>>> process it into a suitable form.
>>>>>>>> This is an incomplete list, so perhaps the slide title should be
>>>>>>>> "Example Use Cases for DFDL"?
>>>>>>>> ...mikeb
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> From: Costello, Roger L. <[email protected]>
>>>>>>>> Sent: Monday, October 28, 2019 10:41 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Use cases for DFDL
>>>>>>>> Hi Folks,
>>>>>>>> I created a slide of use cases. See below. Do you agree with the slide?
>>>>>>>> Anything you would add, delete, or change?
>>>>>>>> /Roger
>>>>>>>> <image003.png>
>>>
>>