Hi,
I can only answer for PDFBox: neither PDF contains anything bold, but
both contain something italic. The PDF without fontspec doesn't have the
"é". The PDF with fontspec can be converted to HTML with
"ExtractText -html", and this is the HTML I get:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title></title>
<meta http-equiv="Content-Type" content="text/html; charset="UTF-8">
</head> <body> <div style="page-break-before:always;
page-break-after:always"><div><p>Wiel, Jérôme aan de (2018).
‘Irish intelligence, 1880s-1922’. In: <i>Cultures of
Intelligence in the Era of the World Wars</i>. Ed. by Simon Ball et al.
Oxford: Oxford University Press.</p> </div></div> </body></html>
So the italic is there.
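If you want to check this mechanically rather than by reading the markup, here is a minimal sketch in Python (standard library only) that lists the text found inside <i> and <b> elements of an HTML file. The names StyleSpanCollector, styled_spans and SAMPLE are made up for illustration and are not part of PDFBox; SAMPLE is a shortened copy of the output above.

```python
from html.parser import HTMLParser


class StyleSpanCollector(HTMLParser):
    """Collect the text found inside <i> and <b> elements."""

    def __init__(self):
        # convert_charrefs defaults to True, so &#233; arrives as "é".
        super().__init__()
        self.depth = {"i": 0, "b": 0}    # nesting depth per tracked tag
        self.spans = {"i": [], "b": []}  # collected text per tracked tag

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            if self.depth[tag] == 0:
                # Start a new span only for the outermost tag.
                self.spans[tag].append("")
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        # Append the text to every tracked tag that is currently open.
        for tag, level in self.depth.items():
            if level > 0:
                self.spans[tag][-1] += data


def styled_spans(html_text):
    """Return {"i": [...], "b": [...]} for the given HTML string."""
    parser = StyleSpanCollector()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.spans


# Shortened copy of the HTML shown above (assumption: body fragment only).
SAMPLE = (
    "<p>Wiel, J&#233;r&#244;me aan de (2018). "
    "&#8216;Irish intelligence, 1880s-1922&#8217;. In: "
    "<i>Cultures of Intelligence in the Era of the World Wars</i>. "
    "Ed. by Simon Ball et al.</p>"
)

if __name__ == "__main__":
    print(styled_spans(SAMPLE))
```

An empty "b" list together with a non-empty "i" list would match what I see here: italics survive, and there is no bold to find.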
Tilman
On 25.01.2018 at 16:47, Flynn, Peter wrote:
I have a very large number of bibliographic references in BibTeX format which
we need to make available individually in formal reference formats within web
pages (as HTML, not as embedded images).
I experimented a couple of years ago with Apache PDFBox and found that it could
extract the text from a PDF and preserve bold and italics. This would let us
use LaTeX to typeset each PDF in the required format, and then have PDFBox
extract the text with bold and italics in all the right places.
Regular pdflatex with old-style bibtex is insufficient, as it doesn't handle
all the UTF-8 characters we need, and the reference formats supported are out
of date; XeLaTeX with biblatex and biber do all this just fine...but...
...if I do this using the fontspec package (the standard way to provide XeLaTeX
with the font data for handling UTF-8 diacritics), the output has all accented
characters, but PDFBox doesn't recognise the bold or italic. If I omit the
fontspec package, PDFBox can get the bold and italics, but XeLaTeX will omit
the diacritics.
Examples of both PDFs and both HTML files are at
http://epu.ucc.ie/latex/pdfbox-xelatex-fontspec-error.zip
As I don't know the internals either of fontspec or of PDFBox, I am hoping that
someone on the pdfbox mailing list or the comp.text.tex newsgroup may have a
lead.
///Peter