I have developed a system (AMI) which extracts graphics in PDF to semantic form (SVG) and thence to data (e.g. CSV) or other objects (e.g. plots, tables, molecules) [1]. It uses AMIPageDrawer (a subclass of PageDrawer) to trap the graphics stream and extract paths and their attributes. It's nearly working: the coordinates and the primitives (Move, Line, Curve) seem correct, but the fill and stroke are not always correctly extracted.
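To make the target representation concrete, here is a minimal sketch of how trapped primitives and paint attributes could be turned into one SVG path element. It is not the AMI code and makes no PDFBox calls: the class SvgPathSketch and all of its methods are hypothetical names chosen for illustration. It assumes the colours arrive as RGB components in the range 0..1 and that a path which is not filled (or not stroked) should get the value "none" for that attribute.

```java
import java.util.Locale;

/** Hypothetical illustration only; not part of AMI or PDFBox. */
final class SvgPathSketch {
    // Accumulates the SVG "d" attribute as primitives arrive.
    private final StringBuilder d = new StringBuilder();

    SvgPathSketch moveTo(double x, double y) { return append("M", x, y); }

    SvgPathSketch lineTo(double x, double y) { return append("L", x, y); }

    // Cubic Bezier: two control points followed by the end point.
    SvgPathSketch curveTo(double x1, double y1, double x2, double y2,
                          double x3, double y3) {
        return append("C", x1, y1, x2, y2, x3, y3);
    }

    SvgPathSketch close() {
        d.append("Z ");
        return this;
    }

    private SvgPathSketch append(String cmd, double... coords) {
        d.append(cmd);
        for (double c : coords) {
            d.append(' ').append(fmt(c));
        }
        d.append(' ');
        return this;
    }

    // Fixed two-decimal formatting, independent of the default locale.
    static String fmt(double v) {
        return String.format(Locale.ROOT, "%.2f", v);
    }

    // Converts RGB components in 0..1 (an assumption) to an SVG hex colour.
    static String rgb(float r, float g, float b) {
        return String.format(Locale.ROOT, "#%02x%02x%02x",
                Math.round(r * 255), Math.round(g * 255), Math.round(b * 255));
    }

    // A null fill or stroke means the path was not painted that way.
    String toSvg(String fill, String stroke, double strokeWidth) {
        return "<path d=\"" + d.toString().trim() + "\""
                + " fill=\"" + (fill == null ? "none" : fill) + "\""
                + " stroke=\"" + (stroke == null ? "none" : stroke) + "\""
                + " stroke-width=\"" + fmt(strokeWidth) + "\"/>";
    }

    public static void main(String[] args) {
        // A filled, unstroked red triangle.
        String svg = new SvgPathSketch()
                .moveTo(0, 0).lineTo(10, 0).lineTo(10, 5).close()
                .toSvg(rgb(1f, 0f, 0f), null, 1.0);
        System.out.println(svg);
    }
}
```

The fill and stroke arguments are kept separate from the geometry because the coordinates are already coming out correctly and only the two paint attributes are in doubt.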
The query is quite long, so I have posted it as a question on StackOverflow ( https://stackoverflow.com/questions/59534091/extracting-fill-and-stroke-attributes-in-pdfbox ), with links to my code ( https://github.com/petermr/ami3 ). I hope this is an appropriate thing to do; if not, I can post a longer mail here.

The goal is to create a system which automatically extracts data from graphs and tables, relying on authors (unconsciously) incorporating their vector graphics into the PDF. It is not completely deterministic, but it has reasonable success, especially with modern PDFs. SVG is a very useful intermediate because it works as a modelling language.

Peter

[1] Originally "pdf2svg", using PDFBox 1.

-- 
"I always retain copyright in my papers, and nothing in any contract I sign with any publisher will override that fact. You should do the same".

Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069