Question .. how NOT to skip empty paragraphs

[email protected] Thu, 29 Jan 2015 14:08:28 -0800

We use Tika for processing Word documents (among others).

In a specific application we rely on empty paragraphs to recognize specific parts of the text: in the source documents, blocks are separated by empty paragraphs (OK, I know, not the best way to use Word, but this is what we have, as part of an old legacy system).

When extracting plain text from Word, these empty paragraphs are completely removed (although they remain in the XHTML representation).

Any suggestions for preserving these empty paragraphs, which would appear as a double newline (\n\n) in the extracted string, without getting and parsing the XHTML?
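
To make the idea concrete, here is a minimal sketch of one possible direction, not a confirmed Tika feature. It assumes, as described above, that the empty paragraphs still arrive as empty <p> elements in the SAX events Tika produces. ParagraphTextHandler and extract are hypothetical names made up for this illustration; the class uses only the JDK's SAX API and writes one \n at the end of every paragraph, empty or not. The extract helper only feeds it a hand-written XHTML string so the behaviour can be checked without Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: collects character data and
// appends one newline whenever a <p> element closes, even an empty one.
final class ParagraphTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // localName is empty when the event source is not namespace-aware,
        // so fall back to qName in that case.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("p".equals(name)) {
            text.append('\n');
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Demonstration only: drives the handler from an XHTML string using the
    // JDK SAX parser. In real use the handler would receive events directly
    // from the document parser instead.
    static String extract(String xhtml) throws Exception {
        ParagraphTextHandler handler = new ParagraphTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }
}
```

With Tika, the handler would presumably be passed straight to Parser.parse(stream, handler, metadata, context), so no XHTML string is ever built or re-parsed. Wrapping it in a BodyContentHandler may be needed to keep text from the document head out of the result; both points are untested assumptions here.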


Any help welcome.

LG
