On Thu, 29 Jan 2015, [email protected] wrote:
When extracting plain text from Word, these empty paragraphs are completely removed (although they stay in the xhtml representation).

Any suggestions for preserving these empty paragraphs - in the extracted string they would appear as \n\n - without getting and parsing the xhtml?

What's wrong with parsing at the xhtml level?

I'd suggest writing a custom handler that normally just looks at the characters and whitespace (much as the to-text handlers do), but also adds a tiny bit of logic to detect empty paragraphs, which then triggers the "this is a new block" behaviour in your code.

Custom handlers are surprisingly easy to write; take a look at the Tika Examples package for a few.
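
As a rough illustration of the idea, here is a minimal sketch of such a handler. It uses only the SAX classes that ship with the JDK, so it runs without Tika on the classpath. Since Tika parsers accept any org.xml.sax.ContentHandler, a handler along these lines could be passed to a parser's parse() call in place of a to-text handler. The class name EmptyParagraphTextHandler, the onEmptyParagraph() hook and the extract() helper are made up for this example, and it assumes the empty paragraphs arrive as empty (or whitespace-only) p elements, as described in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text like a to-text handler would,
// but keeps a blank line for every empty <p> element.
class EmptyParagraphTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder paragraph = new StringBuilder();
    private boolean inParagraph = false;

    // Works with both namespace-aware and non-namespace-aware SAX events.
    private static boolean isParagraph(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "p".equals(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            inParagraph = true;
            paragraph.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Buffer paragraph text so emptiness can be checked at the end tag.
        if (inParagraph) {
            paragraph.append(ch, start, length);
        } else {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isParagraph(localName, qName)) {
            return;
        }
        if (paragraph.toString().trim().isEmpty()) {
            onEmptyParagraph();
        } else {
            text.append(paragraph).append('\n');
        }
        inParagraph = false;
    }

    // The "this is a new block" behaviour: here, just an extra newline.
    protected void onEmptyParagraph() {
        text.append('\n');
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Helper for trying the handler on an XHTML string with the JDK parser.
    static String extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        EmptyParagraphTextHandler handler = new EmptyParagraphTextHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>First</p><p/><p>Second</p></body></html>";
        System.out.print(extract(xhtml));
    }
}
```

Each non-empty paragraph is written out followed by a single newline, and an empty paragraph adds one more, so the empty paragraph between "First" and "Second" shows up as \n\n in the resulting string. Overriding onEmptyParagraph() is one way to plug in different block-handling behaviour.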

Nick
