Re: Is Apache PDFBox based on the Arlington PDF Model? ...

Albretch Mueller Sat, 08 Oct 2022 08:00:20 -0700

On 10/8/22, Tim Allison <[email protected]> wrote:
>> But I am not sure if it makes any functional sense anyway.
> There's far more to parsing and the capabilities to what PDFBox and
> other PDF tools offer than just validating compliance with the spec.


 Well, I figured, but the kind of functionality the APDFM XML offers,
so easily exploitable as some sort of SAX listener interface linked to
an array of command objects through hash tables keyed by XPath, is
dreamy to me ;-) ... to the point that I feel like starting work on a
PoC right now, to be added or merged into the PDFBox code base later.
I would like to have at least some PDFBox and/or Tika folks
participate in or watch over what I do.
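
 A minimal sketch of that idea, not a proposal for the actual PoC: a
SAX handler keeps a stack of open elements, builds a simple XPath-like
path for each start tag, and looks that path up in a hash table of
command objects. The element and attribute names (model, object, key,
name, required) and the class and method names are made up for
illustration; they are an assumption and not the real APDFM XML
layout.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathDispatchSketch {

    /** SAX handler that dispatches each start tag to a command found by its path. */
    static final class Dispatcher extends DefaultHandler {
        // Names of the currently open elements, root first.
        private final Deque<String> stack = new ArrayDeque<>();
        // Hash table: simple absolute path -> command object.
        private final Map<String, Consumer<Attributes>> commands;

        Dispatcher(Map<String, Consumer<Attributes>> commands) {
            this.commands = commands;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            stack.addLast(qName);
            Consumer<Attributes> command = commands.get("/" + String.join("/", stack));
            if (command != null) {
                command.accept(atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            stack.removeLast();
        }
    }

    /** Collects the names of keys marked required in the (hypothetical) model XML. */
    static List<String> requiredKeys(String xml) throws Exception {
        List<String> found = new ArrayList<>();
        Map<String, Consumer<Attributes>> commands = new HashMap<>();
        // Hypothetical path; the real model files may be structured differently.
        commands.put("/model/object/key", atts -> {
            if ("true".equals(atts.getValue("required"))) {
                found.add(atts.getValue("name"));
            }
        });
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new Dispatcher(commands));
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<model><object name='Catalog'>"
                + "<key name='Type' required='true'/>"
                + "<key name='Pages' required='true'/>"
                + "<key name='Lang' required='false'/>"
                + "</object></model>";
        System.out.println(requiredKeys(xml)); // prints [Type, Pages]
    }
}
```

 Keying the table by the full path keeps the handler itself generic:
new behaviour is added by registering another command, not by editing
the parser callbacks.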

 I am more of a data analyst, corpora research kind of guy, and I may
have to turn my attention elsewhere once in a while. I think this
would be important code that deserves permanent attention.

 If anyone runs into this thread, I would recommend Peter Wyatt's
paper (April 5th, 2021):

// __ Work in progress: Demystifying PDF through a machine-readable definition

 
https://raw.githubusercontent.com/gangtan/LangSec-papers-and-slides/main/langsec21/papers/Wyatt_LangSec21.pdf
~
 lbrtchx
