On Thu, 29 Jan 2015, [email protected] wrote:
> When extracting plain text from Word, these empty paragraphs are
> completely removed (although they stay in the XHTML representation).
> Any suggestions for preserving these empty paragraphs (in the extracted
> string they would appear as a double \n\n) without getting and parsing
> the XHTML?

What's wrong with parsing at the xhtml level?
I'd suggest you write a custom handler which normally just looks at the
characters and whitespace (much as the to-text handlers do), but also
adds a tiny bit of logic to detect empty paragraphs, which then triggers
the "this is a new block" behaviour in your code.

Custom handlers are surprisingly easy to write. Take a look at the Tika
Examples package for a few.
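
As a rough sketch of the idea (not taken from the Tika Examples package),
something like the handler below buffers the text of each <p> element and
writes a blank line when the paragraph turns out to be empty. The class name
EmptyParagraphTextHandler and the extract() helper are made up for
illustration, and the helper uses the JDK's own SAX parser on an XHTML string
so that it runs without Tika on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text, keeping empty <p> elements as blank lines.
final class EmptyParagraphTextHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final StringBuilder paragraph = new StringBuilder();
    private boolean inParagraph = false;

    // Match on either name so it works with and without namespace awareness.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            inParagraph = true;
            paragraph.setLength(0); // start buffering a new paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text inside a paragraph is buffered; anything else is passed straight through.
        (inParagraph ? paragraph : out).append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isParagraph(localName, qName)) {
            return;
        }
        inParagraph = false;
        String text = paragraph.toString().trim();
        if (text.isEmpty()) {
            out.append('\n'); // empty paragraph: keep it as a blank line
        } else {
            out.append(text).append('\n');
        }
    }

    @Override
    public String toString() {
        return out.toString();
    }

    // Illustration-only helper: runs the handler over an XHTML string with the JDK SAX parser.
    static String extract(String xhtml) {
        EmptyParagraphTextHandler handler = new EmptyParagraphTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.toString();
    }
}
```

Since Tika's Parser.parse(stream, handler, metadata, context) accepts any SAX
ContentHandler, a handler along these lines could be passed in place of the
usual text handler. I haven't run this one against real Word output, so the
whitespace handling may need adjusting.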
Nick