>
> I'm not sure if I should use this mailing list to raise a defect, but I
> couldn't see another route.
>
> I have encountered an issue with RTF parsing in Tika (0.9) that introduces
> spurious space characters. The attached RTF contains a paragraph of text.
> When you view it through the Tika App's Structured Text view (or parse it
> programmatically), the output is:
>
> <?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml"><head><meta name="Content-Length"
> content="8497"/><meta name="Content-Type" content="application/rtf"/><meta
> name="resourceName" content="abriWithSpaces.doc"/><title/></head><body><p>
> Patent 565934 Apparatus for programming parameters of a power driven
> wheelchair for a plurality of drive modes is disclosed. The appara tus is
> coupled to the control system of the power driven wheelchair and comprises a
> display; a controller for interacting with the display and operative to
> display a menu image on a screen of the display. The menu image includes
> names and values of a plur a lity of wheelchair parameters for a plurality of
> drive modes of the wheelchair. The apparatus further comprises one or more
> input devices operative to interact with the controller and the display to
> select a wheelchair parameter value for a drive mode fr o m the displayed
> menu image, and to program the value of the selected wheelchair parameter to
> a desired value. The controller is operative to open a window on the display
> for programming the selected wheelchair parameter value in response to
> selection of the wheelchair parameter value such that names and values of
> at least the select wheelchair parameter for the plurality of drive modes
> continue to be displayed in the menu image.
> </p>
> </body></html>
>
> In the output above, the following words (quoted) have had space characters
> introduced into them:
>
>   "appara tus"
>   "plur a lity"
>   "fr o m"
>
> Interestingly, if I bypass Tika and use RTFEditorKit directly (assuming
> UTF-8 encoding), as in the following Scala fragment, the document comes out
> fine. Does this suggest that Tika is not detecting the encoding correctly?
>
>   import java.io.{FileInputStream, InputStreamReader}
>   import javax.swing.text.DefaultStyledDocument
>   import javax.swing.text.rtf.RTFEditorKit
>
>   // f is the java.io.File pointing at the RTF document
>   val rtf = new RTFEditorKit
>   val document = new DefaultStyledDocument
>   val fis = new FileInputStream(f)
>   try {
>     // Assume UTF-8 encoding...
>     rtf.read(new InputStreamReader(fis, "UTF-8"), document, 0)
>     // The extracted plain text is the value of the try block
>     document.getText(0, document.getLength)
>   } finally {
>     fis.close()
>   }
>
> What can I do to get this defect addressed?
>
> --malcolm
>
abriWithSpaces.doc
Description: MS-Word document
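
To list every affected word rather than spotting them by eye, the text from RTFEditorKit can be compared token by token with the text from Tika. The following is only a minimal sketch: `SplitWordFinder` and `splitWords` are hypothetical names (not part of Tika or the JDK), and it assumes the two extractions differ solely by inserted whitespace. If they diverge in any other way, the alignment drifts and the result is unreliable.

```scala
object SplitWordFinder {

  // Splits text on runs of whitespace, dropping empty tokens.
  private def tokens(s: String): Vector[String] =
    s.trim.split("\\s+").filter(_.nonEmpty).toVector

  /** Returns (intact word, split form) pairs for every word of `reference`
    * that appears in `suspect` broken up by extra whitespace.
    * Assumption: both texts are identical once whitespace is ignored.
    */
  def splitWords(reference: String, suspect: String): List[(String, String)] = {
    val refTokens = tokens(reference)
    val susTokens = tokens(suspect)
    val found = List.newBuilder[(String, String)]
    var i = 0 // index of the next unconsumed suspect token

    for (word <- refTokens) {
      val start = i
      val joined = new StringBuilder
      // Consume suspect tokens until they cover the length of the reference word.
      while (i < susTokens.length && joined.length < word.length) {
        joined.append(susTokens(i))
        i += 1
      }
      // More than one token was needed to rebuild the word: it was split.
      if (joined.toString == word && i - start > 1) {
        val splitForm = susTokens.slice(start, i).mkString(" ")
        found += ((word, splitForm))
      }
    }
    found.result()
  }
}
```

Here `reference` would be the string returned by `document.getText(...)` in the fragment above, and `suspect` the body text taken from Tika's output.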
