Thank you for your reply -- I will look into making the patch; it will get me immersed in the code so I understand it better. But I do not know what jira is or how to submit a bug or patch. Perhaps you could point me to a page.
As for doing anything beyond the newline translation, I think that should be left to the user. I am also wondering about the parsing of xhtml codes like & or other things that can be lost in translation and that shouldn't be processed. I want to avoid any surprises.

For example, I got bitten by this once when sending the text of an html tutorial to someone. He received my mail, but his aol mailer interpreted it as html rather than plain text, so when I talked to him we were looking at something completely different. What's worse, I couldn't make him understand that the mail had been mistranslated, so he thought his copy was right after all.

Aren't there guidelines for parsing text/plain versus text/html or xhtml, as the case may be? Anyway, please point me to the URL where I can enter the bug report.

TIA
--le

----- Original Message -----
From: "Nick Burch" <[email protected]>
To: <[email protected]>
Sent: Friday, October 22, 2010 10:18 AM
Subject: Re: question and possible error about output xhtml

On Thu, 21 Oct 2010, qubit wrote:

> When translating a text file -- file.txt -- through tika and looking at
> the raw output, tika is essentially inserting no markup for line breaks
> or paragraphs.

Most of the logic in TXTParser is around languages and types; there's not much on the markup.

> Also, a blank line (2 consecutive newlines possibly including
> whitespace) should be treated as paragraphs.

There could be some fun with more than 2 consecutive newlines, but otherwise I don't see why we shouldn't do this. I'd suggest you open an enhancement bug in jira and, if you can, upload a patch to do a first pass implementation of it. I can't see us losing anything by outputting multiple paragraphs.

We could maybe even have support for detecting things like
 * this
 * or this
 # or this
as lists, but I'm not sure what others might think of that?

Nick
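
For reference, the blank-line rule discussed in this thread could be sketched roughly as below. This is not Tika's TXTParser code, just a standalone Python illustration; the function name text_to_xhtml_paragraphs is hypothetical. It assumes that any run of blank lines (possibly containing spaces or tabs) is a single paragraph break, that single newlines inside a paragraph are left alone, and that characters such as & and < are escaped rather than interpreted.

```python
import re
from xml.sax.saxutils import escape


def text_to_xhtml_paragraphs(text):
    """Wrap blank-line-separated blocks of plain text in <p> elements.

    Hypothetical helper for illustration only; not part of Tika.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # A paragraph break is a newline followed by one or more lines that
    # contain only spaces or tabs. Three or more consecutive newlines
    # therefore collapse into a single break.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())

    # Escape &, < and > so plain text is never interpreted as markup.
    return "\n".join(
        "<p>" + escape(block.strip()) + "</p>"
        for block in blocks
        if block.strip()
    )


if __name__ == "__main__":
    sample = "first paragraph\nstill first\n\nsecond & last\n \n\n\nthird"
    print(text_to_xhtml_paragraphs(sample))
```

Escaping everything keeps the plain-text content literal, which avoids the kind of surprise described above where a mailer rendered the text of an html tutorial as html.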
