Hi Mike,

The Wikipedia description is pretty complete in terms of what we're looking to 
include in the paper.  I am interested in including an example workflow and 
possibly implementation gotchas, especially those that appear across tools.  
For example, one of the questions that has come up for Kaitai is how the dangers 
of making the description language Turing-complete weigh against the benefits.

The other thing that would be helpful would be more specific examples, where 
possible.  I'm not sure how feasible that is if most of the use cases are 
cybersecurity related, but if anyone has examples they're able to share, that 
would be compelling!

When you say it would be helpful to compare and contrast approaches, do you 
mean DFDL and ECN?

Thanks so much for your response!

Amy

________________________________
From: Mike Beckerle <mbecke...@apache.org>
Sent: Friday, April 19, 2024 10:09 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: paper on tools and use cases for parsing binary data

Hi Amy,

What would you be looking for beyond what is already in our Wikipedia 
description https://en.wikipedia.org/wiki/Data_Format_Description_Language ?

One thing I would emphasize about DFDL is that most people think data parsing 
is the hard problem, and data serialization is relatively simple, but if the 
serializer actually solves the problem of computing all the stored lengths for 
you, so that the user doesn't have to know anything about the actual format, 
then data serialization turns out to be far more complex than parsing, 
particularly if you want streaming behavior. I gave a talk about this back at 
the ApacheCon conference in 2018. The slides are available here: 
https://s.apache.org/apacheconNA2018-dfdl
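
To make the point about stored lengths concrete, here is a minimal Python 
sketch. It is not DFDL and not how Daffodil is implemented; the toy format (a 
4-byte body length followed by records that each carry a 2-byte payload length) 
and the function names are assumptions made up for illustration. The parser 
can read each length and move on, while the unparser cannot write the outer 
length until every inner record has been serialized, which is what gets in the 
way of streaming.

```python
import struct


def parse_message(buf: bytes) -> list[bytes]:
    """Parse the hypothetical format: 4-byte big-endian body length,
    then records of (2-byte big-endian payload length, payload)."""
    (body_len,) = struct.unpack_from(">I", buf, 0)
    end = 4 + body_len
    if end != len(buf):
        raise ValueError("stored body length does not match the data")
    records, offset = [], 4
    while offset < end:
        # The stored length tells the parser exactly how far to read.
        (length,) = struct.unpack_from(">H", buf, offset)
        start = offset + 2
        payload = buf[start:start + length]
        if len(payload) != length:
            raise ValueError("truncated record")
        records.append(payload)
        offset = start + length
    if offset != end:
        raise ValueError("record overruns the message body")
    return records


def unparse_message(records: list[bytes]) -> bytes:
    """Serialize from the logical data alone; the caller supplies no lengths.
    The whole body is buffered first because the outer length field
    precedes the data whose size it describes."""
    body = b"".join(struct.pack(">H", len(r)) + r for r in records)
    return struct.pack(">I", len(body)) + body


if __name__ == "__main__":
    data = unparse_message([b"cat", b"horse"])
    # Round trip: parsing the serialized bytes gives back the logical data.
    assert parse_message(data) == [b"cat", b"horse"]
```

Even in this two-level toy, the serializer has to buffer the body before it 
can emit the first byte; with deeper nesting or lengths expressed in other 
units the bookkeeping grows quickly.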

To date, the primary use case for DFDL is cybersecurity related. Data must be 
parsed/validated and "unparsed" back to original form in order to ensure that 
the data is, in fact, in that format and will not crash applications. The 
threat is not so much malware as "just bad data" causing denial-of-service.

For your study, I would suggest you also look at ASN.1 Encoding Control 
Notation (ECN). This has been an ISO standard since 2008. We normally think of 
ASN.1 as a prescriptive data format, but ECN extends it so that you specify 
the representation of the data. See: 
https://en.wikipedia.org/wiki/Encoding_Control_Notation

I think it would be very helpful if a paper really compared/contrasted these 
approaches.


On Fri, Apr 19, 2024 at 10:24 AM Roberts, Amy L2 
<amy.robe...@ucdenver.edu> wrote:
Hello!

I am working with a team on a tool, Awkward Kaitai, that gives people tools to 
work with binary data once that data has been described with a custom language. 
If you're interested in more details, the project is currently hosted at 
https://github.com/ManasviGoyal/kaitai_struct_awkward_runtime and is meant to 
integrate with a larger project, https://kaitai.io/.

I am writing because DFDL is a tool that solves a similar problem.

We are currently writing a paper that provides examples of different 
custom-data problems in different domains and gives an overview of tools 
that help scientists work with such data. I wanted to reach out to the DFDL 
community to see if anyone would be interested in joining our paper as an 
author.

I'd be delighted to have you contribute in any way you'd like as an author, and 
am particularly interested in having you:

- Contribute a section about your tool
- Show how your tool deals with a toy data file (I'm suggesting 
https://github.com/det-lab/dataReaderWriter/blob/master/kaitai/ksy/animal.ksy 
but would be happy to consider other options!)
- Help identify any similar tools that we should include in our review
- Help identify any use cases that we could include in our "Use Cases" section

Thanks so much for your work in this area!

Best,

Amy


Amy Roberts

Assistant Professor of Physics

amy.robe...@ucdenver.edu
