On 10/8/22, Tim Allison <[email protected]> wrote:
>> But I am not sure if it makes any functional sense anyway.
> There's far more to parsing and the capabilities to what PDFBox and
> other PDF tools offer than just validating compliance with the spec.

Well, I figured. But the kind of functionality the APDFM XML offers looks so easy to exploit, e.g. through some sort of SAX listener interface linked to an array of command objects via hash tables keyed by their XPaths, that it is a dream to me ;-) ... to the point that I feel like starting work on a PoC right now, to be added or merged into the PDFBox code base later.

I would like to have at least some PDFBox and/or Tika folks participate in, or watch over, what I do. I am more of a data analyst, corpora research kind of guy, and I may have to turn my mind to something else once in a while. I think this would be important code that deserves permanent attention.

To anyone who runs into this thread, I would recommend Peter Wyatt's paper (April 5th, 2021):

// __ Work in progress: Demystifying PDF through a machine-readable definition

 https://raw.githubusercontent.com/gangtan/LangSec-papers-and-slides/main/langsec21/papers/Wyatt_LangSec21.pdf

 ~
 lbrtchx
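
To make the SAX idea above a bit more concrete, here is a minimal sketch using only the JDK's built-in SAX parser. It is not PDFBox code and it does not use the real APDFM schema: the class name `PathDispatchSketch`, the element names `Model`/`Object`/`Key`, and the `name` attribute are all made up for illustration. The handler keeps a stack of open elements, builds a simple slash-separated path (a stand-in for a full XPath), and looks that path up in a hash table of command objects.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathDispatchSketch {

    /** SAX handler that tracks the current element path and runs the command registered for it. */
    static class PathDispatcher extends DefaultHandler {
        private final Deque<String> stack = new ArrayDeque<>();
        private final Map<String, Consumer<Attributes>> commands = new HashMap<>();

        /** Register a command object under a simple path such as "/Model/Object/Key". */
        void on(String path, Consumer<Attributes> command) {
            commands.put(path, command);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            stack.addLast(qName);
            // Root-first path of the element that was just opened.
            String path = "/" + String.join("/", stack);
            Consumer<Attributes> command = commands.get(path);
            if (command != null) {
                command.accept(atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            stack.removeLast();
        }
    }

    /** Collects the "name" attribute of every element found at /Model/Object/Key (hypothetical layout). */
    public static List<String> collectKeys(String xml) throws Exception {
        List<String> keys = new ArrayList<>();
        PathDispatcher dispatcher = new PathDispatcher();
        dispatcher.on("/Model/Object/Key", atts -> keys.add(atts.getValue("name")));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), dispatcher);
        return keys;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, not actual APDFM content.
        String xml = "<Model><Object><Key name=\"Type\"/><Key name=\"Pages\"/></Object></Model>";
        System.out.println(collectKeys(xml)); // prints [Type, Pages]
    }
}
```

A real PoC would presumably register one command per rule in the model and would need proper XPath handling (predicates, namespaces); the exact-path lookup here is only the simplest thing that shows the dispatch pattern.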
