Greetings -- I hope this is the first time I have posted this question...

Others on my project and I have been passing various files through Tika, and I have the following request. When translating a text file (file.txt) through Tika and looking at the raw output, Tika inserts essentially no markup for line breaks or paragraphs. Our project assumes that a newline in a non-marked-up text file is there because the author of the file intended it. Also, a blank line (2 consecutive newlines, possibly with whitespace between them) should be treated as a paragraph break. So the following file should translate as shown:
=== file.txt ===
This is a text file. It should be rendered as displayed with paragraph and line break tags inserted as appropriate...
For example, this is after a line break, and

this is a new paragraph.
=== end of file.txt ===

=== translated as ===
<!-- a bunch of header markup -->
<body><p>This is a text file. It should be rendered as displayed with paragraph and line break tags inserted as appropriate...<br />For example, this is after a line break, and</p><p>this is a new paragraph.</p></body>
=== end of translation ===

Alternatively, the text content could be translated into a <pre> block to preserve formatting. This is what the head of the group working on brailleblaster is requesting.

My question is: is there a reason Tika's text parser doesn't insert the markup as shown? I.e., is there some reason this is a feature and not a bug? Would it be appropriate to change Tika's code to insert the tags, or would it be better to define a separate parser that renders .txt files as needed?

Note that the project I am on is a tool that uses a braille translation library and, hopefully, Tika to transcribe books and other files to braille, producing what is called UTDML, or "unified tactile document markup language". UTDML is essentially DAISY extended to support special braille markup for braille math and tactile images. There are a large number of books in text format that need to be translated, and Tika would have to return the needed markup.

So please let me know how I should proceed. I wouldn't mind rolling up my sleeves and looking at Tika's parser where it handles pure text. Presumably this may be complicated by the autodetect feature for files -- or does that not apply if the file name ends in .txt?

Thank you in advance.

--le
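
P.S. To make the requested mapping concrete, here is a minimal sketch in plain Java with no Tika dependency. The class and method names (PlainTextMarkup, toXhtmlBody, toPreBlock) are hypothetical and are not part of Tika; the sketch only illustrates the rules described above (single newline becomes a line break, blank line becomes a paragraph break), and it assumes the whole file fits in memory as one String.

```java
// Hypothetical helper, NOT part of Tika: illustrates the requested text-to-XHTML rules.
final class PlainTextMarkup {

    // Escape the characters that are significant in XHTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Blank lines (possibly containing spaces or tabs) separate paragraphs;
    // single newlines inside a paragraph become <br />.
    static String toXhtmlBody(String text) {
        // Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        StringBuilder out = new StringBuilder("<body>");
        for (String para : normalized.split("\\n(?:[ \\t]*\\n)+")) {
            String trimmed = para.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty chunks from leading blank lines
            }
            out.append("<p>");
            String[] lines = trimmed.split("\\n");
            for (int i = 0; i < lines.length; i++) {
                if (i > 0) {
                    out.append("<br />");
                }
                out.append(escape(lines[i]));
            }
            out.append("</p>");
        }
        return out.append("</body>").toString();
    }

    // The alternative mentioned above: keep the text verbatim inside <pre>.
    static String toPreBlock(String text) {
        return "<body><pre>" + escape(text) + "</pre></body>";
    }

    public static void main(String[] args) {
        String sample = "First line,\nafter a line break, and\n\nthis is a new paragraph.\n";
        System.out.println(toXhtmlBody(sample));
        System.out.println(toPreBlock(sample));
    }
}
```

Whether logic like this belongs in Tika's existing text parser or in a separate parser registered for text/plain is exactly the open question; the sketch takes no position on that.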
