Thanks Andrey, Here are the SVG files from the two versions (I apologize for the verbosity, but people may want to inspect the paths):
1.7.0 paths only (glyphs) <svg fill-opacity="1" xmlns:xlink="http://www.w3.org/1999/xlink" color-rendering="auto" color-interpolation="auto" stroke="black" text-rendering="auto" stroke-linecap="square" stroke-miterlimit="10" stroke-opacity="1" shape-rendering="auto" fill="black" stroke-dasharray="none" font-weight="normal" stroke-width="1" xmlns=" http://www.w3.org/2000/svg" font-family="'Dialog'" font-style="normal" stroke-linejoin="miter" font-size="12" stroke-dashoffset="0" image-rendering="auto"> <!--Generated by the Batik Graphics2D SVG Generator--> <defs id="genericDefs" /> <g> <defs id="defs1"> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath1"> <path d="M0 0 L60.9419 0 L60.9419 81.2217 L0 81.2217 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath2"> <path d="M0 0 L57.9843 0 L57.9843 77.2799 L0 77.2799 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath3"> <path d="M0 0 L64.9174 0 L64.9174 86.5201 L0 86.5201 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath4"> <path d="M0 0 L81.2564 0 L81.2564 121.8482 L0 121.8482 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath5"> <path d="M0 0 L74.6531 0 L74.6531 99.4956 L0 99.4956 L0 0 Z" /> </clipPath> </defs> <g text-rendering="optimizeLegibility" transform="matrix(9.7634,0,0,9.7634,0,0)"> <path d="M6.2598 9.3956 L6.2598 9.6299 C6.2598 9.7549 6.2598 9.8956 6.2598 9.9424 L6.3379 9.9737 L6.3379 9.9893 C6.2911 9.9893 6.2598 9.9893 6.2129 9.9893 C6.1661 9.9893 6.1348 9.9893 6.0879 9.9893 L6.0879 9.9737 L6.1661 9.9424 C6.1661 9.8643 6.1661 9.8018 6.1661 9.7393 L6.1661 9.7393 C6.1348 9.7706 6.0879 9.7862 6.0567 9.7862 C5.9473 9.7862 5.8379 9.6924 5.8379 9.5518 C5.8379 9.4268 5.9629 9.3174 6.0879 9.3174 C6.1192 9.3174 6.1504 9.3331 6.1817 9.3643 C6.1973 9.3331 6.2598 9.3174 6.2911 9.3174 L6.2911 9.3174 C6.2754 9.3487 6.2598 9.3799 6.2598 9.3956 ZM6.1661 9.7081 L6.1661 9.3956 C6.1661 9.3799 6.1348 9.3643 6.1036 9.3643 
C6.0098 9.3643 5.9317 9.4424 5.9317 9.5518 C5.9317 9.6612 6.0098 9.7237 6.0879 9.7237 C6.1192 9.7237 6.1504 9.7237 6.1661 9.7081 Z" clip-path="url(#clipPath1)" stroke="none" /> <path d="M6.455 9.3799 L6.4081 9.3643 L6.4081 9.3487 C6.4394 9.3331 6.5175 9.3331 6.5488 9.3331 C6.5488 9.3643 6.5488 9.4424 6.5488 9.6143 C6.5488 9.6612 6.5488 9.6768 6.5644 9.6924 C6.58 9.7081 6.5956 9.7237 6.6269 9.7237 C6.6738 9.7237 6.7206 9.6924 6.7363 9.6612 L6.7363 9.3956 C6.7363 9.3799 6.7206 9.3799 6.6738 9.3643 L6.6894 9.3331 C6.7206 9.3331 6.7988 9.3174 6.83 9.3331 C6.83 9.3799 6.83 9.4424 6.83 9.6924 C6.8456 9.7237 6.8769 9.7393 6.9081 9.7549 L6.8925 9.7706 C6.8925 9.7706 6.8613 9.7862 6.8456 9.7862 C6.7988 9.7862 6.7519 9.7706 6.7519 9.7081 L6.7363 9.7081 C6.705 9.7549 6.6581 9.7862 6.5956 9.7862 C6.5644 9.7862 6.5175 9.7706 6.5019 9.7393 C6.4706 9.7237 6.455 9.6768 6.455 9.6299 C6.455 9.5518 6.455 9.4581 6.455 9.3799 Z" clip-path="url(#clipPath1)" stroke="none" /> and 1.6.0 <svg fill-opacity="1" xmlns:xlink="http://www.w3.org/1999/xlink" color-rendering="auto" color-interpolation="auto" stroke="black" text-rendering="auto" stroke-linecap="square" stroke-miterlimit="10" stroke-opacity="1" shape-rendering="auto" fill="black" stroke-dasharray="none" font-weight="normal" stroke-width="1" xmlns=" http://www.w3.org/2000/svg" font-family="'Dialog'" font-style="normal" stroke-linejoin="miter" font-size="12" stroke-dashoffset="0" image-rendering="auto"> <!--Generated by the Batik Graphics2D SVG Generator--> <defs id="genericDefs" /> <g> <defs id="defs1"> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath1"> <path d="M0 0 L60.9419 0 L60.9419 81.2217 L0 81.2217 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath2"> <path d="M0 0 L57.9843 0 L57.9843 77.2799 L0 77.2799 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath3"> <path d="M0 0 L64.9174 0 L64.9174 86.5201 L0 86.5201 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" 
id="clipPath4"> <path d="M0 0 L81.2564 0 L81.2564 121.8482 L0 121.8482 L0 0 Z" /> </clipPath> <clipPath clipPathUnits="userSpaceOnUse" id="clipPath5"> <path d="M0 0 L74.6531 0 L74.6531 99.4956 L0 99.4956 L0 0 Z" /> </clipPath> </defs> <g text-rendering="optimizeLegibility" font-size="1" font-family="'null'" transform="matrix(9.7634,0,0,9.7634,0,0)"> <text xml:space="preserve" x="5.8067" y="9.7706" clip-path="url(#clipPath1)" stroke="none">q</text> <text xml:space="preserve" x="6.3769" y="9.7706" clip-path="url(#clipPath1)" stroke="none">u</text>

(This is part of the string "question".)

I have verified that the glyphs correspond to "q" and "u". There is a useful heuristic here: the clipPaths appear to be coupled to the fonts (including, I think, different font sizes), so it effectively records the fonts, their glyphs and their metrics. I assume that, if I knew how, I could dump the font information (presumably through the COSDictionary). That would give me most of what I need:

* the character (from 1.6.0)
* the character position (from 1.6.0)
* the glyph (from 1.7.0), giving (i) the coordinate origin, (ii) the width and height and (iii) an indication of italic (neither 1.7.0 nor 1.6.0 decodes the glyph as italic, so I will have to use heuristics)

This is very tedious, but at least it's possible. However, I would suggest to the PDFBox developers that they preserve the character info when transmitting to the drawing surface (Graphics2D). This would allow different fonts to be substituted, even if the result is not as beautiful.

On Tue, May 8, 2012 at 5:44 AM, Andrey Kuznetsov <imag...@gmx.de> wrote:
> Hi Peter,
>
> >When I use 1.7.0 NO text is written. Instead the characters are replaced
> >by outline glyphs using <svg:path>. The visual layout is effectively the
> >same as the input PDF but there are no explicit characters.
>
> Wow! They managed to implement it like Adobe suggested!

Good. We understand each other, I think!
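For the glyph half of the list above (origin, width and height taken from the 1.7.0 paths), here is a minimal Java sketch of how a bounding box might be read off a `d` attribute. It assumes the paths use only the absolute M, L, C and Z commands with the command letter repeated each time, as in the excerpt above; the class and method names (`GlyphBounds`, `bounds`) are hypothetical and not part of PDFBox or Batik.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Batik.
class GlyphBounds {
    // A token is either one of the absolute commands M, L, C, Z or a number.
    private static final Pattern TOKEN =
            Pattern.compile("[MLCZ]|-?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?");

    // Builds a Path2D from the path data and returns its bounding box.
    // Assumption: every segment carries its own command letter; relative
    // commands and implicit repeats are not handled and raise an exception.
    static Rectangle2D bounds(String d) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(d);
        while (m.find()) {
            tokens.add(m.group());
        }
        Path2D.Double path = new Path2D.Double();
        int i = 0;
        while (i < tokens.size()) {
            String cmd = tokens.get(i++);
            switch (cmd) {
                case "M":
                    path.moveTo(num(tokens, i), num(tokens, i + 1));
                    i += 2;
                    break;
                case "L":
                    path.lineTo(num(tokens, i), num(tokens, i + 1));
                    i += 2;
                    break;
                case "C":
                    path.curveTo(num(tokens, i), num(tokens, i + 1),
                                 num(tokens, i + 2), num(tokens, i + 3),
                                 num(tokens, i + 4), num(tokens, i + 5));
                    i += 6;
                    break;
                case "Z":
                    path.closePath();
                    break;
                default:
                    throw new IllegalArgumentException("Unexpected token: " + cmd);
            }
        }
        // For curved segments the box may include control points,
        // depending on the Java version, so treat it as an upper bound.
        return path.getBounds2D();
    }

    private static double num(List<String> tokens, int index) {
        return Double.parseDouble(tokens.get(index));
    }
}
```

The x and y of the returned rectangle give a candidate glyph origin in the same (pre-transform) coordinates as the `<text>` positions from 1.6.0, so the two outputs could in principle be matched glyph by glyph.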
> >I guess that in 1.7.0 NO characters are transmitted to drawString and
> >that everything is drawShape(), with the precomputed glyphs.
>
> Yes, you have to trace from where g.draw(Shape) / g.fill(Shape) is coming.
>
> Not easy task however, since paths are also drawn with same methods.

That's my problem. If I could get in and add simple attributes to carry the text info through to the surface, that would be great. As it is, hacking the glyph stream is very difficult (though not impossible). We have to assume that text mainly occurs in rows of glyphs with the same y-coordinate, but this varies slightly because of the different glyph origins.

> May be I’ll download new version and look deeper…

I, for one, would be grateful if you did!

I thought I was miscompiling or omitting some resource, which caused the different output. It took a while to realise it was the version. Having used (by mistake) a 0.7.0 version, I have seen the marked progress and want to thank and congratulate the PDFBox community for the current position and momentum.

[Why don't I ask the authors for XML and SVG directly? That's a different and political issue. If anyone is interested in helping liberate the scientific literature legally, then hacking PDFs is a major strategy. Volunteers welcome!]

P.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
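A rough Java sketch (Java 8 or later) of the row heuristic mentioned above: glyphs are grouped into rows when their y-coordinates agree within a tolerance, then ordered left to right within each row. The `Glyph` record, the `groupIntoRows` name and the tolerance value are all assumptions for illustration; the right tolerance would depend on the font size and on how far the glyph origins actually drift.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class GlyphRows {
    // Minimal glyph record: an origin plus a label for inspection.
    static final class Glyph {
        final double x;
        final double y;
        final String label;

        Glyph(double x, double y, String label) {
            this.x = x;
            this.y = y;
            this.label = label;
        }
    }

    // Groups glyphs into rows. A glyph joins the current row if its y is
    // within `tolerance` of the first glyph of that row; comparing against
    // the first glyph (not the previous one) avoids gradual drift.
    static List<List<Glyph>> groupIntoRows(List<Glyph> glyphs, double tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> rows = new ArrayList<>();
        List<Glyph> current = null;
        double rowY = 0.0;
        for (Glyph g : sorted) {
            if (current == null || Math.abs(g.y - rowY) > tolerance) {
                current = new ArrayList<>();
                rows.add(current);
                rowY = g.y;
            }
            current.add(g);
        }
        // Within each row, order glyphs left to right.
        for (List<Glyph> row : rows) {
            row.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        return rows;
    }
}
```

With the two positions from the 1.6.0 output (x = 5.8067 and x = 6.3769, both at y = 9.7706), any small positive tolerance would put "q" and "u" in the same row, in that order.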