Hi, Nick. I'm not sure exactly where the problem is because I'm not
very adept at Java.

From what I can tell, the rest of the .rtf files are parsed without a
problem. Only those with embedded Visio diagrams cause trouble. The
weird thing is that I just tried parsing a .doc document with an
embedded .vsd, and it was parsed without a problem.

I am still not convinced that the problem is in the RTFParser itself.
As far as I know, java.lang.ArrayIndexOutOfBoundsException is the
exception thrown when parsing .vsd files with POI 3.6 (the exception
was different in the previous POI version). Unfortunately, I am not
allowed to send the problem files for inspection.
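
One way I could narrow this down is to feed a problem file straight to
the JDK's RTF reader, with no Tika or POI involved. The following is
only a rough sketch: the class name RtfCheck and the extractText helper
are made up for illustration, and it assumes that the failing code path
goes through javax.swing.text.rtf.RTFEditorKit, as the RTFReader frame
in the stack trace suggests.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class: parses RTF using only the JDK's Swing RTF support.
class RtfCheck {

    // Reads an RTF stream into a Swing Document and returns its plain text.
    static String extractText(InputStream in)
            throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0); // insert the parsed content at offset 0
        return doc.getText(0, doc.getLength());
    }

    // Usage: java RtfCheck path/to/file.rtf
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(extractText(in));
        }
    }
}
```

If this throws the same ArrayIndexOutOfBoundsException on the files with
embedded Visio diagrams, the fault would be in the JDK's RTF reader
rather than in POI.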

I did try to build Tika with POI 3.7 beta1, but the compilation failed
because of missing constructors and methods and incompatible types,
mostly in the HSMF-related code. I read the thread where you mention
that some changes should be made in Tika's HSMF classes. Could you
tell me what to change, so I can try to test the new POI version?
Here are the error messages:

/tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java:[291,59]
incompatible types
found   : org.apache.poi.hssf.record.common.UnicodeString
required: org.apache.poi.hssf.record.UnicodeString

tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java:[42,26]
cannot find symbol
symbol  : constructor POIFSChunkParser(org.apache.poi.poifs.filesystem.POIFSFileSystem)
location: class org.apache.poi.hsmf.parsers.POIFSChunkParser

tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java:[43,32]
cannot find symbol
symbol  : method identifyChunks()
location: class org.apache.poi.hsmf.parsers.POIFSChunkParser

tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java:[88,25]
cannot find symbol
symbol  : method getDocumentNode(org.apache.poi.hsmf.datatypes.StringChunk)
location: class org.apache.poi.hsmf.parsers.POIFSChunkParser



On Thu, Jun 24, 2010 at 11:10 AM, Nick Burch <[email protected]> wrote:
> On Wed, 23 Jun 2010, Mango wrote:
>>
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 42
>>        at
>> javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(Unknown
>> Source)
>
> This looks like a fault in the core java rtf parser :/
>
> Do you know how your rtf file was created?
>
>> I suppose it's the same problem as when parsing .vsd files directly.
>
> Visio files should (as of yesterday) be parsing fine - the POI related fixes
> for visio files are now being used. Have you tried from a svn checkout from
> late yesterday / today and seen if your visio files now work?
>
> Nick
>
