On Thu, 21 Oct 2010, qubit wrote:
When passing a text file (file.txt) through Tika and looking at the raw output, I see that Tika inserts essentially no markup for line breaks or paragraphs.
Most of the logic in TXTParser deals with languages and types; there's not much on the markup.
Also, a blank line (2 consecutive newlines, possibly with whitespace between them) should be treated as a paragraph break.
There could be some fun with more than 2 consecutive newlines, but otherwise I don't see why we shouldn't do this.
I'd suggest you open an enhancement bug in JIRA and, if you can, upload a patch with a first-pass implementation. I can't see us losing anything by outputting multiple paragraphs.
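For the sake of discussion, here's a rough sketch of the blank-line splitting, using only the standard library. It is not taken from TXTParser; the class and method names (ParagraphSketch, splitParagraphs) are made up for illustration, and it assumes the whole text is already in memory as a String, which a real streaming parser would probably not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: splits plain text into paragraphs.
final class ParagraphSketch {
    // One newline followed by one or more lines that hold only spaces or tabs.
    // Runs of more than 2 consecutive newlines collapse into a single break.
    private static final Pattern BLANK_LINES = Pattern.compile("\\n(?:[ \\t]*\\n)+");

    static List<String> splitParagraphs(String text) {
        // Normalise CRLF and bare CR to LF so the pattern only has to handle "\n".
        String normalised = text.replace("\r\n", "\n").replace('\r', '\n');

        List<String> paragraphs = new ArrayList<>();
        for (String chunk : BLANK_LINES.split(normalised)) {
            String trimmed = chunk.trim();
            // Skip empty chunks caused by leading or trailing blank lines.
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }
}
```

Each returned string would then be written out as its own paragraph element instead of the whole file going out as one block. Single newlines inside a paragraph are left alone here; whether those should become line breaks is a separate question.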
We could maybe even have support for detecting things like "* this", "* or this" or "# or this" as lists, but I'm not sure what others might think of that?

Nick
