Figured it out. Looks like I was incorrect. I opened up a few files in a hex editor and discovered that many of our ".doc" files are actually RTF files saved with the .doc extension. Tika is able to process them just fine, so I'm off and running again.
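
For anyone who hits the same exception, here is a minimal sketch of the hex-editor check done in code: it peeks at the first few bytes of a stream and reports which container format they look like. `FormatSniffer` and `sniff` are hypothetical names for illustration, not part of POI or Tika. The signatures used are the standard ones: OLE2 files start with D0 CF 11 E0 A1 B1 1A E1, ZIP-based files (which includes OOXML) start with "PK\x03\x04", and RTF files start with "{\rtf". A ZIP signature alone does not prove the file is OOXML, so treat that result as a hint.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a POI or Tika class.
class FormatSniffer {
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] RTF = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // Returns "OLE2", "ZIP", "RTF" or "UNKNOWN" and leaves the stream
    // positioned where it was, so it can still be handed to a parser.
    static String sniff(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        byte[] head = new byte[8];
        int n = 0;
        try {
            in.mark(head.length);
            // read() may return fewer bytes than asked for, so loop until full or EOF
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break;
                n += r;
            }
            in.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (startsWith(head, n, OLE2)) return "OLE2";
        if (startsWith(head, n, ZIP)) return "ZIP";
        if (startsWith(head, n, RTF)) return "RTF";
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] head, int len, byte[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }
}
```

Calling `FormatSniffer.sniff(new BufferedInputStream(stream))` before `ExtractorFactory.createExtractor` would show whether a ".doc" is really RTF. It would also catch the case where a URL returns an HTML page instead of the document, since that comes back as "UNKNOWN".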
Thanks!

On Wednesday, March 19, 2014 5:09 AM, Nick Burch <[email protected]> wrote:

On Tue, 18 Mar 2014, optimusfan wrote:
> Hello. I am currently trying to use POI to extract text to be indexed
> in SOLR. The sources include Word .doc and .docx files stored in a
> Sharepoint repository and accessed via a URL.
>
> My issue is that whenever I
> call ExtractorFactory.createExtractor(inputstream) I receive the
> following exception:
>
> java.lang.IllegalArgumentException: Your InputStream was neither an OLE2
> stream, nor an OOXML stream

Can you get a real url for the documents, without any sharepoint wrapping?
(Something that OpenOffice can open for example)

Otherwise, try using something like CMIS (Apache Chemistry provides a
library) to fetch the real file, which you can pass to POI

Nick
