question and possible error about output xhtml

qubit Thu, 21 Oct 2010 06:31:37 -0700

Greetings --
I hope this is the first time I posted this question...
Others on my project and I have been passing various files through tika, and 
I have the following request:
When a plain text file -- file.txt -- is passed through tika, the raw output 
contains essentially no markup for line breaks or paragraphs.
This project assumes that newlines in a non-marked-up text file are 
intended to be there by the author of the file.  Also, a blank line (2 
consecutive newlines, possibly with whitespace between them) should be 
treated as a paragraph break.  So the following file should translate as 
shown:


=== file.txt ===
This is a text file.  It should be rendered as displayed with paragraph and 
line break tags inserted as appropriate...
For example, this is after a line break, and

this is a new paragraph.
=== end of file.txt ===

=== translated to ===
<!-- a bunch of header markup -->
<body><p>This is a text file.  It should be rendered as displayed with paragraph 
and line break tags inserted as appropriate...<br/>
For example, this is after a line break, and</p>
<p>this is a new paragraph.</p>
</body>
=== end of translation ===
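
The rule described above (single newlines become line breaks, blank lines 
become paragraph breaks) can be sketched in a few lines.  This is only an 
illustration of the requested behavior, not tika code: it is written in 
Python for brevity, and the function name text_to_xhtml_body is made up for 
this example.

```python
import re
from html import escape


def text_to_xhtml_body(text):
    """Render plain text as an XHTML <body> using <p> and <br/> markup.

    Hypothetical helper for illustration; it is not part of tika.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank lines (which may contain spaces or tabs)
    # separate paragraphs.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    rendered = []
    for paragraph in paragraphs:
        # Skip chunks that contain nothing but whitespace.
        if not paragraph.strip():
            continue
        # Every newline left inside a paragraph is an intended line break.
        # Escape &, < and > so the text cannot be mistaken for markup.
        lines = [escape(line, quote=False) for line in paragraph.split("\n")]
        rendered.append("<p>" + "<br/>\n".join(lines) + "</p>")
    return "<body>" + "\n".join(rendered) + "</body>"
```

Calling text_to_xhtml_body on the contents of file.txt would produce one 
<p> element per blank-line-separated block, with <br/> at each remaining 
newline.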


Alternatively, the text content could be translated into a <pre> block, to 
preserve formatting.
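
A rough sketch of the <pre> alternative, again in Python and again only as 
an illustration rather than tika code (text_to_pre_block is a made-up name). 
The text is passed through unchanged except that markup characters are 
escaped:

```python
from html import escape


def text_to_pre_block(text):
    """Wrap plain text in a <pre> element, keeping newlines and spacing.

    Hypothetical helper for illustration; it is not part of tika.
    """
    # Escape &, < and > so the file's content cannot be read as markup.
    # Newlines and runs of spaces are left exactly as they are.
    return "<pre>" + escape(text, quote=False) + "</pre>"
```

This keeps the author's layout intact, but it carries no paragraph 
structure, so a consumer would have to infer paragraphs from the blank 
lines itself.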

This is what is being requested by the head of the group working on 
brailleblaster.
My question is, is there a reason tika's text parser doesn't insert the 
markup as shown? I.e., is there some reason this is a feature and not a 
bug...?
Would it be appropriate to change tika's code to insert the tags, or would 
it be better to define a separate parser that could render .txt files as 
needed?

Note that the project I am on is a tool that uses a braille translation 
library and hopefully tika to transcribe books and other files to braille, 
producing what is called UTDML, or "unified tactile document markup 
language".  UTDML is essentially DAISY extended to support special braille 
markup for braille math and tactile images.
There is a large number of books in text format that need to be translated 
and tika would have to return the needed markup.

So please let me know how I should proceed.  I wouldn't mind rolling up my 
sleeves and looking at tika's parser where it handles pure text.  Presumably 
this may be complicated by the autodetect feature for files -- or does that 
not apply if the file name is .txt?
Thank you in advance.
--le
