Hello. I am trying to use POI to extract text to be indexed in Solr. The sources include Word .doc and .docx files stored in a SharePoint repository and accessed via a URL.
My issue is that whenever I call ExtractorFactory.createExtractor(inputstream), I receive the following exception:

    java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

I know the documents are valid Word docs: I can open them manually, and the HTTPClient call ContentType.get(responseEntity).getMimeType() returns the correct "application/msword". I am also able to load these documents if I first open them in Word, do a "Save As" to remove the SharePoint hooks, and then load the saved copy from either an HTTP URL or the local file system.

Looking at the POI source, it appears that the file headers aren't recognized. So I would like to know whether SharePoint-controlled Word documents are supported by POI. If not, is there an alternative way to parse these files and extract their text?

Thanks,
Ian
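
For reference, one way to see what the stream actually starts with is to compare its first bytes against the two well-known signatures: the OLE2 compound-file signature (.doc) and the ZIP local-file-header signature (.docx is a ZIP package). The following is only a minimal sketch using the JDK alone; the class and method names (HeaderSniffer, classify, readHead) are hypothetical and are not part of POI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of POI.
final class HeaderSniffer {
    // OLE2 compound-file signature, the container format of binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature; .docx files are ZIP packages.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // True if the first 'len' valid bytes of 'data' begin with 'prefix'.
    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies the leading bytes as "OLE2", "ZIP" or "UNKNOWN".
    static String classify(byte[] head, int len) {
        if (startsWith(head, len, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, len, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    // Reads up to 8 bytes from the stream. This consumes them, so it should
    // be run on a copy of the data, not on the stream later handed to POI.
    static String readHead(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before 8 bytes were available
            }
            n += r;
        }
        return classify(head, n);
    }

    public static void main(String[] args) throws IOException {
        // Sample input only: an HTML page is neither OLE2 nor ZIP.
        byte[] html = "<!DOCTYPE html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readHead(new ByteArrayInputStream(html)));
    }
}
```

If the bytes fetched from the SharePoint URL come back as "UNKNOWN" while the "Save As" copy comes back as "OLE2" or "ZIP", that would suggest the response body is not the raw document, whatever the reported content type says. Because readHead consumes the bytes it inspects, the response would need to be buffered into a byte array first (an assumption about how the caller is structured) so that a fresh stream can still be passed to the extractor.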
