Hi,
I am new to Apache POI and am using the
"org.apache.poi.hwpf.Word2Forrest" class to extract text from an MS Word 2003
document. The Word document contains text as well as hyperlinks, equations
and graphs. The normal text is extracted correctly. However, an extracted
hyperlink looks like this, for example:
Extracted hyperlink:
<p>“Java Native Access (JNA)” --- call DLL functions from Java, HYPERLINK
"https://jna.dev.java.net/" https://jna.dev.java.net/.
</p>
Original hyperlink:
“Java Native Access (JNA)” --- call DLL functions from Java,
https://jna.dev.java.net/.
The hyperlink address is duplicated in the extracted text: the field code
(HYPERLINK "...") appears in addition to the visible link text. Moreover,
when equations are extracted, placeholders such as "EMBED Equation.3"
appear in the extracted text. Furthermore, graphs produce no output at all
in the extracted text.
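A possible workaround would be to post-process the extracted text. Below is a
minimal sketch, assuming the field codes always appear in the same form as in
the examples above. The class name FieldCleaner and both regular expressions
are hypothetical and written for illustration; they are not part of Apache POI.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache POI: removes field-code residue
// from text that has already been extracted.
public class FieldCleaner {
    // Assumption: a hyperlink field code looks like HYPERLINK "<url>" and is
    // followed by the visible link text, which we want to keep.
    private static final Pattern HYPERLINK_CODE =
            Pattern.compile("HYPERLINK\\s+\"[^\"]*\"\\s*");

    // Assumption: embedded objects show up as EMBED <ProgId>, e.g. EMBED Equation.3.
    private static final Pattern EMBED_CODE =
            Pattern.compile("EMBED\\s+[A-Za-z]+(?:\\.[A-Za-z0-9]+)*\\s*");

    public static String clean(String extracted) {
        // Drop the hyperlink field code first, then the embed placeholders.
        String withoutLinks = HYPERLINK_CODE.matcher(extracted).replaceAll("");
        return EMBED_CODE.matcher(withoutLinks).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "call DLL functions from Java, HYPERLINK\n"
                + "\"https://jna.dev.java.net/\" https://jna.dev.java.net/.";
        System.out.println(clean(sample));
    }
}
```

This only hides the symptoms in the output text; it does not recover the
equations or graphs themselves.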
Is this the best Apache POI can do when parsing an MS Word document? Is
there any configuration that would let Apache POI handle hyperlinks,
equations and graphs in a better way?
Thanks for any suggestions.
Lawrence