Re: Extracting text from Sharepoint-controlled Word documents

optimusfan Thu, 20 Mar 2014 07:16:37 -0700

Figured it out.  Looks like I was incorrect.  I opened up a few files in a hex 
editor and discovered that many of our ".doc" files are actually RTF files 
saved with the .doc extension.  Tika is able to process them just fine, so I'm 
off and running again.
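
For anyone hitting the same exception, the hex-editor check can be approximated in code before a file is handed to an extractor. This is only a minimal sketch: the class name FormatSniffer and its method names are hypothetical and are not part of POI or Tika. It assumes Java 11+ and looks only at the leading signature bytes, so a ZIP header suggests, but does not prove, an OOXML document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not a POI or Tika class): guesses the container
// format from the leading "magic" bytes, as one would in a hex editor.
final class FormatSniffer {
    // OLE2 compound document header (binary .doc, .xls, .ppt)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header; OOXML files such as .docx are ZIP packages
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };
    // RTF documents begin with the literal text "{\rtf"
    private static final byte[] RTF = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // Classify a buffer holding the first bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) return "OLE2";
        if (startsWith(head, ZIP)) return "ZIP/OOXML";
        if (startsWith(head, RTF)) return "RTF";
        return "UNKNOWN";
    }

    // Convenience overload: read the first 8 bytes of a file and classify them.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8)); // readNBytes(int) needs Java 11+
        }
    }

    // True if data begins with the given prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Files that come back as "RTF" can then be routed to Tika (or another RTF-capable parser) rather than to POI's ExtractorFactory.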


Thanks!



On Wednesday, March 19, 2014 5:09 AM, Nick Burch <[email protected]> wrote:
 
On Tue, 18 Mar 2014, optimusfan wrote:

> Hello.  I am currently trying to use POI to extract text to be indexed 
> in SOLR.  The sources include Word .doc and .docx files stored in a 
> Sharepoint repository and accessed via a URL.
>
> My issue is that whenever I call 
> ExtractorFactory.createExtractor(inputstream) I receive the following 
> exception:
>
> java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 
> stream, nor an OOXML stream

Can you get a real URL for the documents, without any SharePoint wrapping? 
(Something that OpenOffice can open, for example.)

Otherwise, try using something like CMIS (Apache Chemistry provides a 
library) to fetch the real file, which you can then pass to POI.

Nick


