Re: Extracting text from Sharepoint-controlled Word documents

Nick Burch Wed, 19 Mar 2014 03:10:16 -0700

On Tue, 18 Mar 2014, optimusfan wrote:

Hello. I am currently trying to use POI to extract text to be indexed in SOLR. The sources include Word .doc and .docx files stored in a Sharepoint repository and accessed via a URL.
My issue is that whenever I call ExtractorFactory.createExtractor(inputstream) I receive the following exception:
java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 
stream, nor an OOXML stream

Can you get a real URL for the documents, without any SharePoint wrapping? (Something that OpenOffice can open, for example.)

Otherwise, try using something like CMIS (Apache Chemistry provides a library) to fetch the real file, which you can then pass to POI.
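
As a quick way to check whether a given URL returns the real document or a SharePoint wrapper, something like the sketch below could be used to look at the first bytes of the stream. OLE2 files (.doc) start with the signature D0 CF 11 E0 A1 B1 1A E1, and OOXML files (.docx) are ZIP archives starting with 50 4B 03 04. The class and method names (StreamSniffer, classifyHead, classifyStream) are made up for illustration and are not part of POI.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical helper, not part of POI: guesses the container type from leading bytes.
final class StreamSniffer {
    // OLE2 compound document signature (.doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature (OOXML files such as .docx are ZIP archives)
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies the first bytes of a file as "OLE2", "ZIP" or "UNKNOWN".
    public static String classifyHead(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return "OLE2";
        if (startsWith(head, ZIP_MAGIC)) return "ZIP";
        return "UNKNOWN";
    }

    // Reads up to 8 bytes, classifies them, then pushes them back so the
    // same stream can still be handed on to an extractor afterwards.
    // The PushbackInputStream must have a pushback buffer of at least 8 bytes.
    public static String classifyStream(PushbackInputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 8 bytes were available
            n += r;
        }
        if (n > 0) in.unread(head, 0, n);
        return classifyHead(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

If you wrap the URL's stream as new PushbackInputStream(urlStream, 8) and the result is "UNKNOWN", the server is probably sending something other than the document itself (an HTML page, for instance), which would explain the exception above.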


Nick
