Extracting text from Sharepoint-controlled Word documents

optimusfan Tue, 18 Mar 2014 17:07:39 -0700

Hello.  I am trying to use POI to extract text for indexing in Solr.  The 
sources include Word .doc and .docx files stored in a Sharepoint repository 
and accessed via a URL.


My issue is that whenever I call ExtractorFactory.createExtractor(inputstream) 
I receive the following exception:

java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 
stream, nor an OOXML stream


Now, I know the documents are valid Word docs.  I can open them manually, and 
the HTTPClient ContentType.get(responseEntity).getMimeType() returns the 
correct "application/msword".  POI is also able to load these documents if I 
open them in Word, do a "Save As" to remove the Sharepoint hooks, and then 
read the saved copy from either an HTTP URL or the local file system.

Looking at the POI source, it appears that the file headers aren't 
recognized.  So I would like to know whether POI supports Sharepoint-controlled 
Word documents.  If not, is there an alternate way to parse the contents of 
these files to extract text?
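
For reference, here is a minimal sketch of how the leading bytes of the 
stream could be checked by hand, using only the JDK.  HeaderSniffer and its 
methods are hypothetical names of my own, not part of POI.  The two signatures 
are the standard OLE2 compound-document magic number and the ZIP local-file 
header that OOXML packages start with.  If the bytes match neither, the hex 
dump may hint at what the server is really sending (an HTML page, for 
instance, though that is only a guess at a possible cause).

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical diagnostic helper, not part of POI.
final class HeaderSniffer {
    // OLE2 compound document signature (binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature; .docx files are ZIP packages.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Reads up to 8 bytes from the stream and classifies them.
    // The stream is consumed, so use a fresh stream for the real extraction.
    static String sniff(InputStream in) throws IOException {
        byte[] buf = new byte[8];
        int read = 0;
        while (read < buf.length) {
            int n = in.read(buf, read, buf.length - read);
            if (n < 0) {
                break; // end of stream before 8 bytes were available
            }
            read += n;
        }
        byte[] head = new byte[read];
        System.arraycopy(buf, 0, head, 0, read);
        return classify(head);
    }

    // Classifies a header: "OLE2", "ZIP", or "UNKNOWN: <hex of first bytes>".
    static String classify(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN: " + toHex(head);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Hex-dumps at most the first 8 bytes, space separated.
    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        int len = Math.min(bytes.length, 8);
        for (int i = 0; i < len; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

Calling HeaderSniffer.sniff(responseEntity.getContent()) on the Sharepoint 
response, and again on the "Save As" copy, would show whether the two streams 
really begin with different bytes.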

Thanks,
Ian
