Re: Problem with RTF parsing in TIKA introducing spurious space characters

Malcolm Robbins Wed, 22 Jun 2011 03:14:49 -0700

Cristian,

My understanding is that this file (and others like it) was created in Word 
by users at the Intellectual Property Office of New Zealand. The files are 
"patent abstracts": staff regularly go to overseas websites (i.e. the origin 
of the patent) and cut and paste the text from there, for inclusion in an 
application in NZ, so I guess it is the copy/paste that brings over the 
CR/LF characters.


For some strange reason they save the files as RTF (presumably for historical 
reasons), and I guess that preserves the CR/LF characters.  I'm not sure why, 
as you say, saving the file again through Word makes it work OK (did you save 
as Word first and then save as RTF?).

The above is based on what I've been told rather than what I have observed 
directly, however.  I could try to observe the actual behaviour if that would 
help solve the problem.  Clearly there are workarounds, but it would be nice 
to know what the problem is, and whether Tika can handle it or whether these 
are very unusual circumstances that simply need to be avoided.
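
As a rough illustration of one such workaround (a sketch only, not anything 
Tika itself does), the raw CR/LF characters could be dropped from the RTF 
source before it is parsed. This assumes your reading is right and that the 
stray breaks are what end up as spaces. The object and method names below are 
made up for the example, and it does not try to handle embedded \bin data.

```scala
object RtfLineBreaks {
  private def isLetter(c: Char) = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  private def isDigit(c: Char) = c >= '0' && c <= '9'

  /** Hypothetical helper: removes raw CR/LF from RTF source text.
    * A break right after a control word is replaced by a space so the
    * control word stays delimited; a backslash-escaped break is kept. */
  def stripRawLineBreaks(rtf: String): String = {
    val out = new StringBuilder(rtf.length)
    // 0 = plain text, 1 = control-word letters, 2 = its numeric parameter
    var state = 0
    var i = 0
    while (i < rtf.length) {
      val c = rtf.charAt(i)
      if (c == '\\' && i + 1 < rtf.length) {
        // Copy the backslash and the next character together, so an
        // escaped line break or an escaped backslash is left untouched.
        val next = rtf.charAt(i + 1)
        out.append(c).append(next)
        state = if (isLetter(next)) 1 else 0
        i += 2
      } else {
        if (c == '\r' || c == '\n') {
          // Inside running text the break is dropped entirely.
          if (state != 0) out.append(' ')
          state = 0
        } else {
          out.append(c)
          state =
            if (state == 1 && isLetter(c)) 1
            else if (state == 1 && (c == '-' || isDigit(c))) 2
            else if (state == 2 && isDigit(c)) 2
            else 0
        }
        i += 1
      }
    }
    out.toString
  }
}
```

With this, "The appara\r\ntus" becomes "The apparatus", and the cleaned string 
could then be handed to the parser in place of the original file contents.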

regards

On 20/06/2011, at 12:38 AM, Cristian Vat wrote:

> Hi Malcolm,
> 
> Do you know how or with what software that RTF file was generated?
> Looking in the file there are some extra line breaks in places where
> you say the words are broken up. Example: "The appara\r\ntus",
> "plur\r\na\r\nlity".
> 
> The Tika RTF parser does some parsing/filtering of the content before
> it sends it to RTFEditorKit because RTFEditorKit doesn't know how to
> handle some encodings and special characters. That filtering might
> convert these linebreaks into spaces.
> 
> If I save the RTF again through Word it gets rid of those linebreaks,
> so it would be good to know what actually adds them and how often this
> can occur (first time I saw something like this).
> 
> Regards,
> Cristian Vat
> 
> On Sun, Jun 19, 2011 at 08:20, Malcolm Robbins
> <[email protected]> wrote:
>>> 
>>> I'm not sure if I should use this mail list to raise a defect but I 
>>> couldn't see another route.
>>> 
>>> I have encountered an issue with RTF parsing in TIKA (0.9) introducing 
>>> spurious space characters.
>>> The attached RTF contains a paragraph of text that, when viewed through 
>>> the TIKA App's Structured Text view (or parsed programmatically), 
>>> comes out as:
>>> 
>>> <?xml version="1.0" encoding="UTF-8"?><html 
>>> xmlns="http://www.w3.org/1999/xhtml"><head><meta name="Content-Length" 
>>> content="8497"/><meta name="Content-Type" content="application/rtf"/><meta 
>>> name="resourceName" content="abriWithSpaces.doc"/><title/></head><body><p>  
>>> Patent 565934     Apparatus for programming parameters of a power driven 
>>> wheelchair for a plurality of drive modes is disclosed. The appara tus is 
>>> coupled to the control system of the power driven wheelchair and comprises 
>>> a display; a controller for interacting with the display and operative to 
>>> display a menu image on a screen of the display. The menu image includes 
>>> names and values of a plur a lity of wheelchair parameters for a plurality 
>>> of drive modes of the wheelchair.  The apparatus further comprises one or 
>>> more input devices operative to interact with the controller and the 
>>> display to select a wheelchair parameter value for a drive mode fr o m the 
>>> displayed menu image, and to program the value of the selected wheelchair 
>>> parameter to a desired value.  The controller is operative to open a window 
>>> on the display for programming the selected wheelchair parameter value in 
>>> response to selection of  the  wheelchair parameter value such that names 
>>> and values of at least the select wheelchair parameter for the plurality of 
>>> drive modes continue to be displayed in the menu image.  </p>
>>> </body></html>
>>> 
>>> In the above you can see that the following words (quoted) have had space 
>>> characters introduced:
>>> 
>>> "appara tus"
>>> "plur a lity"
>>> "fr o m"
>>> 
>>> Interestingly, if I bypass Tika and use RTFEditorKit directly (assuming 
>>> UTF-8 encoding) in the following (Scala) fragment, the document comes 
>>> out fine.  So does this suggest TIKA is not sensing the encoding 
>>> correctly?
>>> 
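>>>    import java.io.{FileInputStream, InputStreamReader}
>>>    import javax.swing.text.DefaultStyledDocument
>>>    import javax.swing.text.rtf.RTFEditorKit
>>> 
>>>    // f is the java.io.File pointing at the RTF document
>>>    val rtf = new RTFEditorKit
>>>    val document = new DefaultStyledDocument
>>>    val fis = new FileInputStream(f)
>>>    try {
>>>      // Assume UTF-8 encoding...
>>>      rtf.read(new InputStreamReader(fis, "UTF-8"), document, 0)
>>>      // The extracted plain text is the value of the try block
>>>      document.getText(0, document.getLength)
>>>    } finally {
>>>      fis.close()
>>>    }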
>>> 
>>> What can I do to get this defect addressed?
>> --malcolm
>>> 
>> 
>> 
>> 
>>
