Re: Question .. how NOT to skip empty paragraphs

Nick Burch Fri, 30 Jan 2015 01:52:13 -0800

On Thu, 29 Jan 2015, [email protected] wrote:

When extracting plain text from Word, these empty paragraphs are completely removed (although they stay in the xhtml representation).
Any suggestion for preserving these empty paragraphs - in the extracted string they would appear as double \n\n - without getting and parsing the xhtml?


What's wrong with parsing at the xhtml level?

I'd suggest you do something like a custom handler, which normally just looks at the characters and whitespace (much as the to-text handlers do), but also adds a tiny bit of logic to detect empty paragraphs, which then triggers the "this is a new block" behaviour in your code.

Custom handlers are surprisingly easy to write; take a look at the Tika Examples package for a few.
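As a rough illustration of the idea, here is a minimal sketch of such a handler. It is not taken from the Tika examples; the class name EmptyParagraphTextHandler and the extract helper are made up for this sketch, and it assumes empty paragraphs arrive as <p> elements with no non-whitespace text. It only uses the JDK's SAX classes, so the helper feeds it an XHTML string directly, but since it is an ordinary SAX handler the same object could be handed to a Tika parser's parse method as the ContentHandler argument.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text like a plain to-text handler,
// but emits a bare "\n" for every empty <p> so the blank line survives.
class EmptyParagraphTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder paragraph = new StringBuilder();
    private boolean inParagraph = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            inParagraph = true;
            paragraph.setLength(0); // start buffering this paragraph's text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Buffer text inside a paragraph; pass everything else straight through.
        (inParagraph ? paragraph : text).append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isParagraph(localName, qName)) {
            return;
        }
        if (paragraph.toString().trim().isEmpty()) {
            text.append('\n'); // empty paragraph: the "this is a new block" case
        } else {
            text.append(paragraph).append('\n'); // normal paragraph plus terminator
        }
        inParagraph = false;
    }

    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Convenience helper for trying the handler on an XHTML string (assumed input).
    static String extract(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            EmptyParagraphTextHandler handler = new EmptyParagraphTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

With this sketch, an empty paragraph between two non-empty ones shows up as a double \n\n in the resulting string. Text outside <p> elements (headings, table cells and so on) is simply passed through, so real documents would probably need a few more element checks.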


Nick
