Problem with RTF parsing in TIKA introducing spurious space characters

Malcolm Robbins Sat, 18 Jun 2011 22:21:48 -0700
> 
> I'm not sure whether this mailing list is the right place to raise a defect, 
> but I couldn't see another route.
> 
> I have encountered an issue with RTF parsing in TIKA (0.9) introducing 
> spurious space characters.
> The attached RTF contains a paragraph of text that, when viewed through the 
> TIKA App's Structured Text view (or parsed programmatically), comes out as 
> follows:
> 
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml"><head><meta name="Content-Length" 
> content="8497"/><meta name="Content-Type" content="application/rtf"/><meta 
> name="resourceName" content="abriWithSpaces.doc"/><title/></head><body><p>  
> Patent 565934     Apparatus for programming parameters of a power driven 
> wheelchair for a plurality of drive modes is disclosed. The appara tus is 
> coupled to the control system of the power driven wheelchair and comprises a 
> display; a controller for interacting with the display and operative to 
> display a menu image on a screen of the display. The menu image includes 
> names and values of a plur a lity of wheelchair parameters for a plurality of 
> drive modes of the wheelchair.  The apparatus further comprises one or more 
> input devices operative to interact with the controller and the display to 
> select a wheelchair parameter value for a drive mode fr o m the displayed 
> menu image, and to program the value of the selected wheelchair parameter to 
> a desired value.  The controller is operative to open a window on the display 
> for programming the selected wheelchair parameter value in response to 
> selection of  the  wheelchair parameter value such that names and values of 
> at least the select wheelchair parameter for the plurality of drive modes 
> continue to be displayed in the menu image.  </p>
> </body></html>
> 
> In the above you can see that the following words (quoted) contain spurious 
> space characters:
> 
> "appara tus" 
> "plur a lity" 
> "fr o m"
> 
> Interestingly, if I bypass Tika and use RTFEditorKit directly (assuming 
> UTF-8 encoding), as in the following Scala fragment, the document comes out 
> fine.  Does this suggest that TIKA is not detecting the encoding correctly?
> 
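>    import java.io.{FileInputStream, InputStreamReader}
>    import javax.swing.text.DefaultStyledDocument
>    import javax.swing.text.rtf.RTFEditorKit
>
>    // f is the java.io.File pointing at the RTF document
>    val rtf = new RTFEditorKit
>    val document = new DefaultStyledDocument
>    val fis = new FileInputStream(f)
>    try {
>      // Assume UTF-8 encoding...
>      rtf.read(new InputStreamReader(fis, "UTF-8"), document, 0)
>      // The extracted plain text is the value of the try block
>      document.getText(0, document.getLength)
>    } finally {
>      fis.close()
>    }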
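
To pinpoint the affected words without reading the output by eye, the text from the RTFEditorKit fragment above could be used as a reference and compared against the text TIKA returns. The following is only a rough sketch of that idea, not anything TIKA provides: `SplitFinder`, `spuriousSplits` and `maxParts` are hypothetical names, and the heuristic simply assumes that a run of adjacent tokens is suspicious when it joins into a word of the reference text and at least one of its parts is not a reference word itself.

```scala
object SplitFinder {
  /** Returns runs of adjacent tokens in `suspect` that join into a single
    * word of `reference`, i.e. words that look as if spaces were inserted.
    * Hypothetical helper; `maxParts` caps how many fragments are rejoined.
    */
  def spuriousSplits(reference: String, suspect: String, maxParts: Int = 3): Seq[String] = {
    val refWords = reference.split("\\s+").filter(_.nonEmpty).toSet
    val tokens   = suspect.split("\\s+").filter(_.nonEmpty).toVector
    for {
      n      <- 2 to maxParts
      window <- tokens.sliding(n) if window.size == n
      joined  = window.mkString
      // Skip runs made up entirely of genuine words (e.g. "in" + "to")
      if refWords.contains(joined) && !window.forall(refWords.contains)
    } yield window.mkString(" ")
  }
}
```

Called with the RTFEditorKit text as `reference` and the TIKA text as `suspect`, it should report fragments such as "appara tus", "plur a lity" and "fr o m". Being a heuristic, it can miss splits whose parts all happen to be real words.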
> 
> What can I do to get this defect addressed?
--malcolm
>
abriWithSpaces.doc
Description: MS-Word document