Thank you for your reply -- I will look into making the patch; it will get 
me immersed in the code so I understand it better.
But I do not know what Jira is, or how to submit a bug or patch. Perhaps you 
could point me to a page.

As for doing anything beyond the newline translation, I think that should be 
left to the user.
I am wondering also about the parsing of xhtml codes like &amp; or other 
things that can be lost in translation that shouldn't be processed.  I want 
to avoid any surprises.  For example, I got bitten by this once sending the 
text of an HTML tutorial to someone.  He received my mail, but his AOL 
mailer interpreted it as HTML rather than plain text, so when I talked to 
him we were looking at something completely different.  What's worse, I 
couldn't make him understand that the mail had been mistranslated, so he 
thought his copy was right after all.
Aren't there guidelines for parsing text/plain versus text/html or xhtml as 
the case may be?
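
To make the concern concrete: when text/plain content is carried into XHTML 
output, characters such as & and < have to be escaped so that they are shown 
literally instead of being interpreted as markup. Below is a minimal sketch 
in Python, not Tika's actual code; the function name is hypothetical and 
only the standard-library html.escape call is relied on.

```python
from html import escape


def escape_plain_text(text: str) -> str:
    """Escape &, < and > so plain text survives inside XHTML element content.

    Hypothetical helper for illustration only.
    quote=False leaves " and ' alone, since they only need escaping
    inside attribute values, not inside element content.
    """
    return escape(text, quote=False)


if __name__ == "__main__":
    print(escape_plain_text("Use <p> & </p> to mark a paragraph"))
```

With this, a tutorial line that mentions a tag is displayed as typed rather 
than rendered as a paragraph.
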
Anyway, please point me to the URL where I can enter the bug report.
TIA
--le

----- Original Message ----- 
From: "Nick Burch" <[email protected]>
To: <[email protected]>
Sent: Friday, October 22, 2010 10:18 AM
Subject: Re: question and possible error about output xhtml


On Thu, 21 Oct 2010, qubit wrote:
> When translating a text file -- file.txt -- through tika and looking at the
> raw output, tika is essentially inserting no markup for line breaks or
> paragraphs.

Most of the logic in TXTParser is around languages and types; there's not
much on the markup.

> Also, a blank line (2 consecutive newlines possibly including
> whitespace) should be treated as paragraphs.

There could be some fun with more than 2 consecutive newlines, but
otherwise I don't see why we shouldn't do this.
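
A rough sketch of the rule being discussed, written in Python rather than as
a TXTParser patch: any run of two or more newlines (with optional spaces or
tabs in between) is treated as a single paragraph break, so three or four
blank lines do not produce empty paragraphs. The function names are
hypothetical and this is only one possible reading of the proposal.

```python
import re
from html import escape

# Two or more line endings in a row, each optionally followed by spaces/tabs.
# A single newline inside a paragraph does not match.
_PARAGRAPH_BREAK = re.compile(r"(?:\r?\n[ \t]*){2,}")


def split_paragraphs(text: str) -> list[str]:
    """Split plain text into paragraphs on blank (or whitespace-only) lines."""
    parts = _PARAGRAPH_BREAK.split(text.strip())
    return [p.strip() for p in parts if p.strip()]


def to_xhtml_paragraphs(text: str) -> str:
    """Wrap each paragraph in <p>...</p>, escaping &, < and > in the content."""
    return "\n".join(
        f"<p>{escape(p, quote=False)}</p>" for p in split_paragraphs(text)
    )


if __name__ == "__main__":
    print(to_xhtml_paragraphs("first line\nsame paragraph\n\n \n\nsecond & last"))
```

Single newlines inside a paragraph are left untouched here; whether those
should become line breaks is a separate decision.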

I'd suggest you open an enhancement bug in jira and, if you can, upload a
patch to do a first pass implementation of it. I can't see us losing
anything by outputting multiple paragraphs.

We could maybe even have support for detecting things like
* this
  * or this
# or this
as lists, but I'm not sure what others might think of that?
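
For discussion only, here is a hypothetical Python sketch of what such
detection might look like: a block counts as a list only if every line starts
with a * or # marker (optionally indented) followed by whitespace. The
function name and the returned tuple shape are assumptions, not anything
Tika provides.

```python
import re

# Optional indentation, a "*" or "#" marker, whitespace, then the item text.
_LIST_ITEM = re.compile(r"^([ \t]*)([*#])[ \t]+(\S.*)$")


def detect_list_items(lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (indent_width, marker, text) per line, or [] if not a list.

    The block is rejected as soon as one line does not look like an item,
    so ordinary prose that merely contains a "*" is left alone.
    """
    items = []
    for line in lines:
        match = _LIST_ITEM.match(line)
        if match is None:
            return []
        indent, marker, body = match.groups()
        items.append((len(indent), marker, body.rstrip()))
    return items


if __name__ == "__main__":
    print(detect_list_items(["* this", "  * or this", "# or this"]))
```

The indent width is kept so that a caller could decide whether to nest the
items; the sketch itself makes no nesting decision.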

Nick 
