Re: Indexing PDF and MS Office files

Terry Rhodes Tue, 14 Apr 2015 21:06:07 -0700

Perhaps the PDF is protected and the content cannot be extracted?
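
One rough way to check for protection before blaming Solr is to look for an /Encrypt entry in the PDF file itself, since encrypted PDFs reference their encryption dictionary from the trailer. The sketch below is only a heuristic, and the class and method names (PdfEncryptCheck, looksEncrypted) are made up for illustration. A match does not prove that extraction is blocked, because some protected files can still be read, and the token could in principle appear in unrelated data. It assumes Java 11 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PdfEncryptCheck {

    // Heuristic only: report whether the raw bytes contain the "/Encrypt" key
    // that encrypted PDFs use to reference their encryption dictionary.
    static boolean looksEncrypted(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary content
        // survives the conversion and a plain substring search is safe.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more PDF paths on the command line.
        for (String name : args) {
            boolean flagged = looksEncrypted(Files.readAllBytes(Path.of(name)));
            System.out.println(name + ": "
                    + (flagged ? "possibly encrypted" : "no /Encrypt entry found"));
        }
    }
}
```

If a file is flagged, opening it in a PDF viewer and checking its security properties would confirm whether content copying is actually restricted.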

I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may not support some or all Office 2013 document formats.

On 4/14/2015 8:18 PM, Jack Krupansky wrote:

Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
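
As a rough illustration of the manual request described above, the sketch below posts a file straight to the /update/extract handler with extractOnly=true, bypassing SolrJ, so the response shows whatever Tika managed to extract without indexing anything. It assumes Java 11 or later (for java.net.http) and the default example server URL from the original message. The class and method names are hypothetical, and extractFormat, wt, and indent are optional extras added for readable output.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class ExtractOnlyCheck {

    // Build the Solr Cell URL. extractOnly=true asks Solr to return the
    // extracted content in the response instead of indexing the document.
    static URI buildUri(String solrBaseUrl) {
        return URI.create(solrBaseUrl
                + "/update/extract?extractOnly=true&extractFormat=text&wt=json&indent=true");
    }

    // Send the raw file as the request body and return Solr's response text.
    static String extract(String solrBaseUrl, Path file, String contentType)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(buildUri(solrBaseUrl))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ExtractOnlyCheck <file> <content-type>");
            return;
        }
        // Assumed base URL: the default example server from the original post.
        System.out.println(extract("http://localhost:8983/solr", Path.of(args[0]), args[1]));
    }
}
```

Running it with a PDF path and application/pdf as the content type should show whether any body text comes back. An empty body alongside populated metadata would point at the file or at Tika rather than at the SolrJ code.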

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
Please let me know what is going wrong with the indexing process.

I am using Solr 4.10.2 with the default example server configuration
that comes with the Solr distribution.

PDF files - Indexing itself works fine, and when I query using *:* in the
Solr query console, the metadata is displayed properly. However, the PDF
content field is empty. This happens for every PDF file I have tried,
including some proprietary files and PDF eBooks. Whatever the PDF file,
the content is not displayed.

MS Office files - For some Office files, everything works perfectly and the
extracted content is visible in the query console. However, for others, I
see the error message below during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

    static String solrServerURL = "http://localhost:8983/solr";
    static SolrServer solrServer = new HttpSolrServer(solrServerURL);
    static ContentStreamUpdateRequest indexingReq =
            new ContentStreamUpdateRequest("/update/extract");

    indexingReq.addFile(file, fileType);
    indexingReq.setParam("literal.id", literalId);
    indexingReq.setParam("uprefix", "attr_");
    indexingReq.setParam("fmap.content", "content");
    indexingReq.setParam("literal.fileurl", fileURL);
    indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solrServer.request(indexingReq);

Thanks & Regards
Vijay

