Hi Malcolm,

Do you know how or with what software that RTF file was generated?
Looking in the file, there are extra line breaks in the places where
you say the words are broken up, for example "The appara\r\ntus" and
"plur\r\na\r\nlity".

The Tika RTF parser filters the content before passing it to
RTFEditorKit, because RTFEditorKit doesn't know how to handle some
encodings and special characters. That filtering might be converting
these line breaks into spaces.

If I save the RTF again through Word it gets rid of those line breaks,
so it would be good to know what actually adds them and how often this
can occur (this is the first time I have seen something like this).
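
To get a rough idea of how often this occurs in a given file, one could
scan the raw RTF source for line breaks that fall inside a word. Below is
a minimal Scala sketch of that idea. RtfLineBreakScan and
countMidWordBreaks are hypothetical names, not part of Tika, and the scan
is deliberately simplified: it skips control words and \'hh escapes but
ignores \bin data and group structure, so treat the count as an estimate.

```scala
object RtfLineBreakScan {
  private def isAsciiLetter(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

  private def isLineBreak(c: Char): Boolean = c == '\r' || c == '\n'

  /** Hypothetical helper (not a Tika API): counts raw line breaks in RTF
    * source that sit between two letters of plain text. */
  def countMidWordBreaks(rtf: String): Int = {
    val n = rtf.length
    var count = 0
    var i = 0
    var prevIsTextLetter = false // was the last character a plain-text letter?
    while (i < n) {
      val c = rtf.charAt(i)
      if (c == '\\' && i + 1 < n) {
        val next = rtf.charAt(i + 1)
        i += 2
        if (isAsciiLetter(next)) {
          // Control word: letters, optional signed number, optional one space.
          while (i < n && isAsciiLetter(rtf.charAt(i))) i += 1
          if (i + 1 < n && rtf.charAt(i) == '-' && rtf.charAt(i + 1).isDigit) i += 1
          while (i < n && rtf.charAt(i).isDigit) i += 1
          if (i < n && rtf.charAt(i) == ' ') i += 1
        } else if (next == '\'') {
          // \'hh escape: skip the two hex digits as well.
          i = math.min(n, i + 2)
        }
        // Any other control symbol (\\, \{, \}, escaped CR/LF) is already skipped.
        prevIsTextLetter = false
      } else if (isLineBreak(c)) {
        // Skip the whole run of CR/LF, then check what follows it.
        while (i < n && isLineBreak(rtf.charAt(i))) i += 1
        if (prevIsTextLetter && i < n && rtf.charAt(i).isLetter) count += 1
      } else {
        prevIsTextLetter = c.isLetter
        i += 1
      }
    }
    count
  }
}
```

Called on the raw file contents as a String, a non-zero result would
point at files with the same kind of mid-word breaks as the attached one.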

Regards,
Cristian Vat

On Sun, Jun 19, 2011 at 08:20, Malcolm Robbins
<[email protected]> wrote:
>>
>> I'm not sure if I should use this mail list to raise a defect but I couldn't 
>> see another route.
>>
>> I have encountered an issue with RTF parsing in TIKA (0.9) introducing 
>> spurious space characters.
>> The attached RTF contains a paragraph of text that, when viewed through the 
>> TIKA App's Structured Text view (or parsed programmatically), comes out as 
>> follows:
>>
>> <?xml version="1.0" encoding="UTF-8"?><html 
>> xmlns="http://www.w3.org/1999/xhtml"><head><meta name="Content-Length" 
>> content="8497"/><meta name="Content-Type" content="application/rtf"/><meta 
>> name="resourceName" content="abriWithSpaces.doc"/><title/></head><body><p>  
>> Patent 565934     Apparatus for programming parameters of a power driven 
>> wheelchair for a plurality of drive modes is disclosed. The appara tus is 
>> coupled to the control system of the power driven wheelchair and comprises a 
>> display; a controller for interacting with the display and operative to 
>> display a menu image on a screen of the display. The menu image includes 
>> names and values of a plur a lity of wheelchair parameters for a plurality 
>> of drive modes of the wheelchair.  The apparatus further comprises one or 
>> more input devices operative to interact with the controller and the display 
>> to select a wheelchair parameter value for a drive mode fr o m the displayed 
>> menu image, and to program the value of the selected wheelchair parameter to 
>> a desired value.  The controller is operative to open a window on the 
>> display for programming the selected wheelchair parameter value in response 
>> to selection of  the  wheelchair parameter value such that names and values 
>> of at least the select wheelchair parameter for the plurality of drive modes 
>> continue to be displayed in the menu image.  </p>
>> </body></html>
>>
>> In the above you can see that the following (quoted) words have had space 
>> characters introduced:
>>
>> "appara tus"
>> "plur a lity"
>> "fr o m"
>>
>> Interestingly, if I cut Tika out and use RTFEditorKit directly (and 
>> assume UTF-8 encoding) in the following (Scala) fragment, the document comes 
>> out fine.  So does this suggest TIKA is not sensing the encoding correctly?
>>
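>>    import java.io.{FileInputStream, InputStreamReader}
>>    import javax.swing.text.DefaultStyledDocument
>>    import javax.swing.text.rtf.RTFEditorKit
>>
>>    // f: the RTF file to read (defined elsewhere)
>>    val rtf = new RTFEditorKit
>>    val document = new DefaultStyledDocument
>>    val fis = new FileInputStream(f)
>>    try {
>>      // Assume UTF-8 encoding...
>>      rtf.read(new InputStreamReader(fis, "UTF-8"), document, 0)
>>      document.getText(0, document.getLength)
>>    } finally {
>>      fis.close()
>>    }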
>>
>> What can I do to get this defect addressed?
>> --malcolm
>>