Re: Anyone created a DFDL schema for PDF documents?

Steve Lawrence Thu, 22 Aug 2019 05:39:49 -0700

As far as I know, no one has created a DFDL schema for PDF. That is
likely because it would be a difficult task, if it is possible at all.

A major hurdle is that, I believe, the PDF format locates data by byte
offset, which DFDL & Daffodil do not support. PDF also puts the metadata
describing how to interpret the data at the *end* of the file, which
DFDL & Daffodil do not support either (though this is arguably a form of
offset, so it is the same problem).
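
To make the offset problem concrete, here is a rough Python sketch (not
DFDL) of how a reader typically starts on a PDF: it looks near the end
of the file for the "startxref" keyword, reads the byte offset that
follows, and jumps back to the cross-reference table at that offset.
The function names and the tiny hand-built sample are made up for
illustration, and the sample is far simpler than a real PDF.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    tail = data[-1024:]  # assumption: the keyword sits close to the end of the file
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)", tail[idx:])
    if m is None:
        raise ValueError("startxref is not followed by an offset")
    return int(m.group(1))


def build_sample() -> bytes:
    """Build a tiny PDF-like byte string (illustrative, not a complete PDF)."""
    body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_offset = len(body)  # the table starts right after the body
    xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    trailer = (
        b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
        + str(xref_offset).encode("ascii")
        + b"\n%%EOF\n"
    )
    return body + xref + trailer


sample = build_sample()
offset = find_startxref(sample)
# The offset read from the end of the file points back at the table.
print(sample[offset:offset + 4])
```

The backwards jump (end of file, then an absolute offset into the
middle) is the part that has no obvious counterpart in a schema that
describes data front to back.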

So without some new offset features in DFDL/Daffodil, it may be
impossible to parse PDF.

Though, it's also possible that you don't actually need the offsets to
parse the data. The offsets might only be needed to *display* a PDF. So
it very well might be possible to parse all the PDF elements + offset
table + trailing metadata without actually needing the offset feature.
There would still likely be challenges related to unparsing if you need
to update offsets, but maybe that could be handled by something else.
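
As a rough illustration of that idea, the Python sketch below (again,
not DFDL) reads the objects front to back without consulting any
offsets, and shows that the offsets can be recomputed afterwards from
where each object was found. Everything here is an assumption made for
the example: the regular expression is naive (stream data that happens
to contain "endobj" would break it), and the names scan_objects and
xref_entry are hypothetical.

```python
import re

# Naive pattern: "<number> <generation> obj ... endobj", read sequentially.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def scan_objects(data: bytes):
    """Return (number, generation, byte offset, body) for each object found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.start(), m.group(3).strip())
        for m in OBJ_RE.finditer(data)
    ]


def xref_entry(offset: int, generation: int) -> bytes:
    """Format one in-use cross-reference entry (20 bytes, fixed width)."""
    return f"{offset:010d} {generation:05d} n \n".encode("ascii")


SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
)

objects = scan_objects(SAMPLE)
for number, generation, offset, body in objects:
    # The offset comes from the scan itself, not from the file's own table.
    print(number, generation, offset, xref_entry(offset, generation))
```

Recomputing the table like this after the fact is the kind of step
that could live outside the schema when unparsing.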

If you do try to create a DFDL schema for PDF, please report back. Any
issues that cause difficulties could lead to new Daffodil extensions or
new features in the next version of DFDL.

On 8/22/19 8:09 AM, Costello, Roger L. wrote:
> Hello DFDL community,
> 
> I may soon be involved in a project that needs to parse and mine PDF 
> documents.
> 
> Has anyone created a DFDL schema for PDF?
> 
> /Roger
>
