The image in the PDF file is already in SVG format, i.e. in XML format. I need to find it and extract this XML part as a file with an .svg extension.

On Wed, Mar 6, 2019 at 12:57 AM Peter Murray-Rust <[email protected]> wrote:

> I have been doing a lot of graphical extraction of scientific "images",
> but in general there is no algorithmic way. (I'd be happy to see if there
> is an overlap of our interests.)
>
> To simplify: the PDF stream consists of bitmaps (images), glyphs
> (characters with code points) and paths (a mixture of Move, Line, Quadratic
> and Cubic curves, with Close (Z)). I tend to use "image" for bitmaps and
> "plots", "diagrams" or "graphics" for non-bitmap graphics. A "plot"
> generally consists of characters and paths (and sometimes small
> images/bitmaps). But paths can occur anywhere, and a diagram is only defined
> by convention: either a whitespace border or a rectangular path surround.
> Characters can also be created by paths (cursive glyphs), which are
> difficult to interpret, and small paths can be embedded within runs of
> glyphs. I convert these to SVG.
>
> In practice I attempt to identify diagrams by whitespace surrounds,
> borders, and formal identification such as "Figure 2." But some diagrams
> don't have captions (e.g. chemical reaction schemes). In other places paths
> are used as page decoration (e.g. thin lines, publisher icons, etc.).
>
> So the simple answer is that there is no formal way, but there are
> heuristics. I am making useful progress with this and can extract certain
> types of diagrams into SVG.
>
> See https://github.com/petermr/normami (warning: it's complex and mostly
> created as a library).
>
>
> On Tue, Mar 5, 2019 at 10:34 PM European Neuroscience Center <
> [email protected]> wrote:
>
> > Hi,
> >
> > What is the way to extract an embedded image, which is in SVG format,
> > from a PDF file using PDFBox?
> >
> > If there is no such option, how to determine from where the embedded SVG
> > image starts and extract this XML part of the PDF file?
> >
> >
> > Regards,
> > Miro.
> >
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
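
To make the "paths converted to SVG" step in Peter's message concrete, here is a minimal sketch in Python. It is not PDFBox or normami code: the (operator, coordinates) tuples, the function names path_to_d and paths_to_svg, and the fixed stroke styling are all assumptions made for illustration. It assumes the path segments have already been pulled out of the page by some other tool and only shows how Move, Line, Quadratic, Cubic and Close segments map onto an SVG path.

```python
# Hypothetical sketch: these names are illustrative, not from PDFBox or normami.
from xml.sax.saxutils import quoteattr

# Number of coordinates each operator takes (letters follow SVG path syntax).
ARITY = {"M": 2, "L": 2, "Q": 4, "C": 6, "Z": 0}


def path_to_d(segments):
    """Build the SVG 'd' attribute from a list of (operator, coords) tuples."""
    parts = []
    for op, coords in segments:
        if op not in ARITY:
            raise ValueError(f"unknown path operator: {op!r}")
        if len(coords) != ARITY[op]:
            raise ValueError(f"{op} expects {ARITY[op]} numbers, got {len(coords)}")
        # "g" formatting drops trailing zeros, so 10.0 is written as 10.
        parts.append(" ".join([op] + [format(c, "g") for c in coords]))
    return " ".join(parts)


def paths_to_svg(paths, width, height):
    """Wrap several paths in a minimal standalone SVG document."""
    body = "\n".join(
        f'  <path d={quoteattr(path_to_d(p))} fill="none" stroke="black"/>'
        for p in paths
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">\n'
        f"{body}\n</svg>\n"
    )


# Assumed input: one closed triangle, already in SVG-style (y down) coordinates.
triangle = [("M", [10, 10]), ("L", [90, 10]), ("L", [50, 80]), ("Z", [])]
svg_text = paths_to_svg([triangle], 100, 100)
```

Writing svg_text to a file with an .svg extension gives something a browser can open. A real converter would also have to handle the coordinate system (PDF user space has its origin at the bottom left, SVG at the top left), stroke and fill attributes, and the glyphs, none of which this sketch attempts.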

