On Fri, 22 Oct 2010, qubit wrote:
> Thank you for your reply -- I will look into making the patch; it will get
> me immersed in the code so I understand it better.
The code you probably want to look at is TXTParser in the tika-parser
package. The parser quickstart guide at
http://tika.apache.org/0.7/parser_guide.html is probably worth a read
too.
> But I do not know what jira is or how to submit a bug or patch. Perhaps you
> could point me to a page.
Our JIRA instance is at
https://issues.apache.org/jira/browse/TIKA
There's a link on the dashboard there to create a new issue. I don't
think we have a guide to submitting patches etc. on the Tika site, but
Nutch (another Apache project) does seem to have a good guide at
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
much of which will apply to Tika and may be of help!
> As for doing anything beyond the newline translation, I think that
> should be left to the user.
I'd lean towards doing some basic bits if we're doing anything at all, but
I suspect others will have a much stronger opinion!
> I am wondering also about the parsing of xhtml codes like &amp; or other
> things that can be lost in translation that shouldn't be processed.
Hopefully things will be properly escaped, but I suspect this would be
worth adding a unit test for!
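
As a rough idea of what such a check could look at, here is a minimal
sketch that uses only the JDK's own SAX serializer, not Tika itself. The
class name EscapeCheck and the method serialize are made up for
illustration; a real test would feed text through the parser and inspect
what reaches the ContentHandler instead.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: pushes raw text through a SAX
// serializer so we can see how special characters come out the other end.
public class EscapeCheck {

    static String serialize(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Emit <p>text</p> as SAX events; the text is passed unescaped,
        // the way a parser would hand over plain character data.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The ampersand and the angle bracket should come out as entities.
        System.out.println(serialize("AT&T <tag"));
    }
}
```

The point is that escaping happens when the SAX events are serialized, so
a parser that passes a raw "&" to characters() should still yield valid
XHTML.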
Nick