Hi, Nick. I'm not sure exactly where the problem is, because I'm not very adept at Java.
From what I can tell, the rest of the .rtf files are parsed without a problem. Only those with embedded Visio diagrams cause problems. The weird thing is that I just tried parsing a .doc document with an embedded .vsd, and it was parsed without a problem.

I am still not convinced that the problem is in the RTFParser itself. java.lang.ArrayIndexOutOfBoundsException is supposed to be the exception returned when parsing .vsd files with POI 3.6 (the exception was different in the previous POI version). Unfortunately, I am not allowed to send the problem files for inspection.

I did try to build Tika with POI 3.7 beta1, but the compilation failed because of missing constructors and incompatible types, all in hsmf. I read the thread where you mention that some changes should be made in Tika's hsmf classes. Could you tell me what to change, so I can test the new POI version?

Here are the error messages:

/tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java:[291,59] incompatible types
found   : org.apache.poi.hssf.record.common.UnicodeString
required: org.apache.poi.hssf.record.UnicodeString

tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java:[42,26] cannot find symbol
symbol  : constructor POIFSChunkParser(org.apache.poi.poifs.filesystem.POIFSFileSystem)
location: class org.apache.poi.hsmf.parsers.POIFSChunkParser

tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java:[43,32] cannot find symbol
symbol  : method identifyChunks()
location: class org.apache.poi.hsmf.parsers.POIFSChunkParser

tika_build/tika-0.7/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java:[88,25] cannot find symbol
symbol  : method getDocumentNode(org.apache.poi.hsmf.datatypes.StringChunk)
location: class org.apache.poi.hsmf.parsers.POIFSChunkParser

On Thu, Jun 24, 2010 at 11:10 AM, Nick Burch <[email protected]> wrote:
> On Wed, 23 Jun 2010, Mango wrote:
>>
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 42
>>        at
>> javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(Unknown
>> Source)
>
> This looks like a fault in the core java rtf parser :/
>
> Do you know how your rtf file was created?
>
>> I suppose it's the same problem as when parsing .vsd files directly.
>
> Visio files should (as of yesterday) be parsing fine - the POI related fixes
> for visio files are now being used. Have you tried from a svn checkout from
> late yesterday / today and seen if your visio files now work?
>
> Nick
>
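
Since the stack trace above ends in javax.swing.text.rtf.RTFReader, one way to check whether the JDK's RTF reader fails independently of Tika and POI might be to feed the same file straight to javax.swing.text.rtf.RTFEditorKit. The sketch below is only an assumption about how such a check could look: the class name RtfCheck and the helper extractText are hypothetical and not part of Tika or POI, and the built-in sample string is just a stand-in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class for illustration; not part of Tika or POI.
class RtfCheck {

    // Parses RTF from the stream with the JDK's own RTF reader and returns the plain text.
    static String extractText(InputStream in) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Use the file given on the command line, or a tiny built-in sample if none is given.
        InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream(
                        "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII));
        try {
            System.out.println(extractText(in));
        } catch (RuntimeException e) {
            // A failure here would point at the JDK reader rather than at Tika or POI.
            System.out.println("JDK RTF reader failed: " + e);
        } finally {
            in.close();
        }
    }
}
```

If an ArrayIndexOutOfBoundsException also shows up when running this against one of the problem files, that would suggest the fault is in the JDK reader; if the file reads cleanly, the problem is more likely elsewhere in the parsing chain.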
