All,
Over on Tika, I recently added an experimental SAX parser to process the
document.xml component within .docx files. That parser lets the user choose
whether or not to include text within "moveFrom" regions. Does anyone know
how to do the same with .doc files?
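For context, here is a minimal sketch of the .docx side of this. It is not
the actual Tika code: the class name MoveFromFilter, the method extractText
and the includeMoveFrom flag are hypothetical names chosen for illustration.
It assumes the moved-away runs are wrapped in a w:moveFrom element in the
WordprocessingML namespace and simply drops w:t text while inside one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not Tika's implementation).
public class MoveFromFilter {

    // WordprocessingML main namespace used by document.xml
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of all w:t elements, one line per w:p.
    // When includeMoveFrom is false, text inside w:moveFrom is skipped.
    public static String extractText(String documentXml, boolean includeMoveFrom) {
        StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(documentXml)),
                    new DefaultHandler() {
                        int moveFromDepth = 0;   // > 0 while inside w:moveFrom
                        boolean inText = false;  // true while inside w:t

                        @Override
                        public void startElement(String uri, String local,
                                                 String qName, Attributes atts) {
                            if (!W_NS.equals(uri)) return;
                            if ("moveFrom".equals(local)) moveFromDepth++;
                            else if ("t".equals(local)) inText = true;
                        }

                        @Override
                        public void endElement(String uri, String local, String qName) {
                            if (!W_NS.equals(uri)) return;
                            if ("moveFrom".equals(local)) moveFromDepth--;
                            else if ("t".equals(local)) inText = false;
                            else if ("p".equals(local)) out.append('\n');
                        }

                        @Override
                        public void characters(char[] ch, int start, int length) {
                            if (inText && (includeMoveFrom || moveFromDepth == 0)) {
                                out.append(ch, start, length);
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document.xml", e);
        }
        return out.toString();
    }
}
```

What I am after is the equivalent switch for the binary .doc format.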
A test document is available here [1]. If we could skip the "moveFrom" run,
"second paragraph here" would not appear twice in the extracted text.
Thank you!
Cheers,
Tim
[1]
https://git-wip-us.apache.org/repos/asf?p=tika.git;a=blob;f=tika-parsers/src/test/resources/test-documents/testWORD_2006ml.doc;h=c8f509aea483006d40de9c2970df7988ff058b51;hb=fe20ecd83ea43e5ec6ad0e9fded9d803cb011251