On Fri, 22 Oct 2010, qubit wrote:
Thank you for your reply -- I will look into making the patch; it will get
me immersed in the code so I understand it better.

The code you probably want to look at is TXTParser in the tika-parser package. The parser quickstart guide at http://tika.apache.org/0.7/parser_guide.html is probably worth a read too.

But I do not know what jira is or how to submit a bug or patch. Perhaps you
could point me to a page.

Our JIRA instance is at
        https://issues.apache.org/jira/browse/TIKA
And there's a link on the dashboard there to create a new issue. I don't think we have a guide to submitting patches etc. on the Tika site, but Nutch (another Apache project) does seem to have a good guide at
        http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
much of which will apply to Tika and may be of help!

As for doing anything beyond the newline translation, I think that should be left to the user.

I'd lean towards doing some basic bits if we're doing anything at all, but I suspect others will have a much stronger opinion!

I am wondering also about the parsing of xhtml codes like &amp; or other
things that can be lost in translation that shouldn't be processed.

Hopefully things will be properly escaped, but I suspect this would be worth adding a unit test for!
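For what it's worth, here is a rough sketch of the kind of check such a test could make. It does not use Tika at all: it assumes the parser's output ends up as SAX character events, and uses the JDK's own TransformerHandler as a stand-in for whatever handler serializes them. The class and method names (EscapingSketch, toXhtmlFragment) are made up for illustration; a real test would feed the text through TXTParser instead.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: serializes plain text as the body
// of a <p> element using SAX events, the way a parser would emit it.
public class EscapingSketch {

    static String toXhtmlFragment(String text) throws Exception {
        // The JDK's default TransformerFactory also implements SAXTransformerFactory.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Raw characters go in unescaped; the serializer is responsible
        // for turning '&' and '<' into entities on the way out.
        char[] chars = text.toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = toXhtmlFragment("AT&T <tag>");
        System.out.println(xml);
        // The checks a unit test would make:
        if (!xml.contains("AT&amp;T")) throw new AssertionError("ampersand not escaped");
        if (xml.contains("<tag>")) throw new AssertionError("markup in text not escaped");
    }
}
```

The point is only that the text is passed in raw and the assertions are made on the serialized output, so a double-escaped or unescaped ampersand would show up straight away.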

Nick
