I have developed a system (AMI) that extracts graphics from PDF into
semantic form (SVG) and thence into data (e.g. CSV) or other objects (e.g.
plots, tables, molecules). [1] It uses AMIPageDrawer (a subclass of
PageDrawer) to trap the graphics stream and extract paths and their
attributes. It's nearly working: the coordinates and the primitives (Move,
Line, Curve) seem correct, but the fill and stroke are not always correctly
extracted.
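
To make the target representation concrete, here is a minimal sketch of how
trapped primitives and their paint attributes might be serialised as an SVG
path. It is not the AMI code and uses no PDFBox calls: the class
SvgPathSketch and its methods (moveTo, lineTo, curveTo, toPathData, toSvg)
are hypothetical names invented for illustration, and the fill and stroke
colours are assumed to have already been resolved to CSS strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: none of these names come from AMI or PDFBox.
class SvgPathSketch {

    /** One trapped primitive: 'M' (move), 'L' (line) or 'C' (cubic curve). */
    static final class Primitive {
        final char op;
        final double[] coords;

        Primitive(char op, double... coords) {
            this.op = op;
            this.coords = coords;
        }
    }

    private final List<Primitive> primitives = new ArrayList<>();

    SvgPathSketch moveTo(double x, double y) {
        primitives.add(new Primitive('M', x, y));
        return this;
    }

    SvgPathSketch lineTo(double x, double y) {
        primitives.add(new Primitive('L', x, y));
        return this;
    }

    SvgPathSketch curveTo(double x1, double y1, double x2, double y2,
                          double x3, double y3) {
        primitives.add(new Primitive('C', x1, y1, x2, y2, x3, y3));
        return this;
    }

    /** Builds the value of the SVG "d" attribute, e.g. "M 0.00 0.00 L 10.00 0.00". */
    String toPathData() {
        StringBuilder sb = new StringBuilder();
        for (Primitive p : primitives) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(p.op);
            for (double c : p.coords) {
                // Locale.ROOT keeps '.' as the decimal separator on every machine.
                sb.append(' ').append(String.format(Locale.ROOT, "%.2f", c));
            }
        }
        return sb.toString();
    }

    /** Wraps the path data in a <path> element; a null paint becomes "none". */
    String toSvg(String fill, String stroke) {
        return "<path d=\"" + toPathData() + "\""
                + " fill=\"" + (fill == null ? "none" : fill) + "\""
                + " stroke=\"" + (stroke == null ? "none" : stroke) + "\"/>";
    }
}
```

In this sketch a path that is stroked but not filled would be written with
toSvg(null, "#ff0000"), so a missing paint shows up explicitly as "none"
rather than being silently dropped.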

The question is quite long, so I have posted it on StackOverflow (
https://stackoverflow.com/questions/59534091/extracting-fill-and-stroke-attributes-in-pdfbox)
with links to my code ( https://github.com/petermr/ami3 ). I hope this is
an appropriate thing to do. If not, I can post a longer mail here.

The goal is to create a system which extracts data from graphs and tables
automatically, relying on authors (unconsciously) incorporating their
vector graphics into the PDF. It's not completely deterministic but has
reasonable success, especially for modern PDFs. SVG is a very useful
intermediate, as it works as a modelling language.

Peter



[1] originally "pdf2svg" using PDFBox 1.


-- 
"I always retain copyright in my papers, and nothing in any contract I sign
with any publisher will override that fact. You should do the same".

Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
