Re: question and possible error about output xhtml

Nick Burch Fri, 22 Oct 2010 08:19:05 -0700

On Thu, 21 Oct 2010, qubit wrote:

When translating a text file -- file.txt -- through tika and looking at the
raw output, tika is essentially inserting no markup for line breaks or
paragraphs.

Most of the logic in TXTParser is around languages and types; there's not much on the markup.

Also, a blank line (2 consecutive newlines, possibly including whitespace) should be treated as a paragraph break.

There could be some fun with more than 2 consecutive newlines, but otherwise I don't see why we shouldn't do this.
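
As a rough sketch of the splitting (this is not existing Tika code; the ParagraphSplitter class and its split method are made-up names for illustration), something along these lines would treat any run of blank or whitespace-only lines as a single paragraph break, which is one possible answer to the "more than 2 newlines" question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: splits plain text into paragraphs.
public class ParagraphSplitter {
    // A break is a newline followed by one or more lines that are empty
    // or contain only spaces/tabs. Three or more newlines collapse into one break.
    private static final Pattern BREAK =
            Pattern.compile("\\r?\\n(?:[ \\t]*\\r?\\n)+");

    public static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : BREAK.split(text)) {
            String trimmed = chunk.trim();
            // Skip empty chunks, e.g. when the text starts with blank lines.
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }
}
```

Each returned entry would then be written out as its own p element. Single newlines inside a paragraph are left alone here; whether those should become line breaks is a separate decision.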

I'd suggest you open an enhancement bug in jira and, if you can, upload a patch to do a first pass implementation of it. I can't see us losing anything by outputting multiple paragraphs.


We could maybe even have support for detecting things like
* this
 * or this
# or this
as lists, but I'm not sure what others might think of that?
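
If we did go that way, the detection could be as simple as the sketch below. Again this is only an illustration, not Tika code: ListLineDetector, isListItem and itemText are hypothetical names, and the choice to require whitespace after the marker (so that something like *bold* is not mistaken for a list item) is my assumption.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: spots lines that look like list items.
public class ListLineDetector {
    // Optional indentation, a '*' or '#' marker, at least one space or tab,
    // then some non-blank text.
    private static final Pattern ITEM =
            Pattern.compile("^[ \\t]*[*#][ \\t]+\\S.*$");

    // Optional indentation plus the marker and the whitespace after it.
    private static final Pattern MARKER =
            Pattern.compile("^[ \\t]*[*#][ \\t]+");

    // Expects a single line without a trailing newline.
    public static boolean isListItem(String line) {
        return ITEM.matcher(line).matches();
    }

    // Returns the item text with the marker and surrounding whitespace removed.
    public static String itemText(String line) {
        return MARKER.matcher(line).replaceFirst("").trim();
    }
}
```

Consecutive lines for which isListItem returns true could then be grouped into one list, with itemText supplying the content of each item.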

Nick
