Hi Mike,

The Wikipedia description is pretty complete in terms of what we're looking to include in the paper. I am interested in including an example workflow and possibly implementation gotchas, especially those that appear across tools. For example, one of the questions that has come up for Kaitai is the dangers of making the description language Turing-complete vs. the benefits.

The other thing that would be helpful would be more specific examples, where possible. I'm not sure how feasible that is if most of the use cases are cybersecurity related, but if anyone has examples they're able to share, that would be compelling!

When you say it would be helpful to compare and contrast approaches, do you mean DFDL and ECN?

Thanks so much for your response!
Amy

________________________________
From: Mike Beckerle <mbecke...@apache.org>
Sent: Friday, April 19, 2024 10:09 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: paper on tools and use cases for parsing binary data

[External Email - Use Caution]

Hi Amy,

What would you be looking for beyond what is already in our Wikipedia description https://en.wikipedia.org/wiki/Data_Format_Description_Language ?

One thing I would emphasize about DFDL is that most people think data parsing is the hard problem and data serialization is relatively simple. But if the serializer actually solves the problem of computing all the stored lengths for you, so that the user doesn't have to know anything about the actual format, then data serialization turns out to be far more complex than parsing, particularly if you want streaming behavior.

I gave a talk about this back at the ApacheCon conference in 2018. The slides are available here: https://s.apache.org/apacheconNA2018-dfdl

To date, the primary use case for DFDL is cybersecurity related. Data must be parsed/validated and "unparsed" back to its original form in order to ensure that the data is, in fact, in that format and will not crash applications. The threat is not so much malware as "just bad data" causing denial-of-service.

For your study, I would suggest you also look at ASN.1 Encoding Control Notation. This has been an ISO standard since 2008. We normally think of ASN.1 as a prescriptive data format, but ECN extends it so that you specify the representation of the data.
See: https://en.wikipedia.org/wiki/Encoding_Control_Notation

I think it would be very helpful if a paper really compared/contrasted these approaches.

On Fri, Apr 19, 2024 at 10:24 AM Roberts, Amy L2 <amy.robe...@ucdenver.edu> wrote:

Hello!

I am working with a team on a tool, Awkward Kaitai, that gives people tools to work with binary data once that data has been described with a custom language. If you're interested in more details, the project is currently hosted at https://github.com/ManasviGoyal/kaitai_struct_awkward_runtime and is meant to integrate with a larger project, https://kaitai.io/.

I am writing because DFDL is a tool that solves a similar problem. We are currently writing a paper that provides examples of different custom-data problems in different domains and provides an overview of tools that help scientists work with such data, and I wanted to reach out to the DFDL community to see if anyone would be interested in joining our paper as an author.

I'd be delighted to have you contribute in any way you'd like as an author, and am particularly interested in having you:

- Contribute a section about your tool
- Show how your tool deals with a toy data file (I'm suggesting https://github.com/det-lab/dataReaderWriter/blob/master/kaitai/ksy/animal.ksy but would be happy to consider other options!)
- Help identify any similar tools that we should include in our review
- Help identify any use cases that we could include in our "Use Cases" section

Thanks so much for your work in this area!

Best,
Amy

Amy Roberts
Assistant Professor of Physics
amy.robe...@ucdenver.edu