As far as I am aware, no one has created a DFDL schema for PDF. That is likely because it would be a difficult task, if it is even possible.
A major hurdle is that, I believe, the PDF format uses byte offsets, which DFDL and Daffodil do not support. It also stores metadata about how to interpret the data at the *end* of the file, which DFDL and Daffodil also do not support (though this is arguably a form of offset, so it is the same problem). So without some new offset features in DFDL/Daffodil, there is a chance that parsing PDF is impossible.

That said, it is also possible that you don't actually need the offsets to parse the data. The offsets might only be needed to *display* a PDF. So it may well be possible to parse all the PDF elements, the offset table, and the trailing metadata without needing an offset feature. There would still likely be challenges with unparsing if you need to update offsets, but maybe that could be handled by something else.

If you do try to create a DFDL schema for PDF, please report back. Any issues that cause difficulties could lead to new Daffodil extensions or new features in the next version of DFDL.

On 8/22/19 8:09 AM, Costello, Roger L. wrote:
> Hello DFDL community,
>
> I may soon be involved in a project that needs to parse and mine PDF
> documents.
>
> Has anyone created a DFDL schema for PDF?
>
> /Roger
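
To make the offset issue from the reply above concrete, here is a minimal Python sketch (not DFDL) of how a reader typically locates objects in a PDF, as I understand the format: it reads the `startxref` value near the end of the file, jumps to that byte offset to find the cross-reference table, and then uses the offsets in that table to find each object. The tiny file built here is a simplified illustration, not a complete valid PDF, and the function names (`build_tiny_pdf`, `find_xref_offset`, `object_offsets`) are hypothetical. It assumes a classic uncompressed `xref` table with a single subsection and `\n` line endings.

```python
import re


def build_tiny_pdf() -> bytes:
    """Build a minimal PDF-like byte string (illustration only)."""
    header = b"%PDF-1.4\n"
    obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    body = header + obj1
    xref_offset = len(body)  # byte position where the xref table starts
    xref = (
        b"xref\n"
        b"0 2\n"
        b"0000000000 65535 f \n"               # entry 0: free-list head
        + b"%010d 00000 n \n" % len(header)    # entry 1: offset of object 1
    )
    trailer = b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    tail = b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
    return body + xref + trailer + tail


def find_xref_offset(data: bytes) -> int:
    """Read the byte offset of the xref table from the end of the file."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if not matches:
        raise ValueError("no startxref marker found near end of file")
    return int(matches[-1])  # the last one wins if there are several


def object_offsets(data: bytes) -> dict[int, int]:
    """Map object number -> byte offset, using the xref table."""
    lines = data[find_xref_offset(data):].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref table")
    first, count = (int(x) for x in lines[1].split())
    offsets = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":  # 'n' marks an in-use object, 'f' a free entry
            offsets[first + i] = int(offset)
    return offsets


pdf = build_tiny_pdf()
for number, offset in object_offsets(pdf).items():
    # Jump straight to the object using the offset from the table.
    print(number, offset, pdf[offset:offset + 7])
```

Note that the objects themselves are laid out sequentially and are self-delimiting (`N G obj` ... `endobj`), which is why a purely sequential parse that never follows the offsets may still be workable.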
