We use Tika to process Word documents (among others).
In one application we need to rely on empty paragraphs to recognize specific parts of the text; in the source documents these empty paragraphs separate blocks (OK, I know, not the best way to use Word, but this is what we have, as part of an old legacy system).
When extracting plain text from Word, these empty paragraphs are completely removed (although they remain in the XHTML representation).
Any suggestions for preserving these empty paragraphs (in the extracted string they would appear as a double newline, \n\n) without getting and parsing the XHTML?
Any help is welcome. LG
