Re: Indexing PDF and MS Office files

Siegfried Goeschl Thu, 16 Apr 2015 04:54:13 -0700

Hi Vijay,

I know this road too well :-)


For PDF you can fall back to other tools for text extraction:

* Ghostscript's ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)

If you start command line tools from your JVM please have a look at commons-exec :-)
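
A minimal sketch of shelling out to pdftotext from the JVM. To stay dependency-free it uses the JDK's ProcessBuilder rather than commons-exec (which covers the same ground and adds an ExecuteWatchdog for timeouts). The class name PdfTextFallback, the method names and the timeout handling are assumptions for illustration only, and it assumes pdftotext is on the PATH.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class PdfTextFallback {

    // Builds the command line: pdftotext -enc UTF-8 <pdf> <txt>
    static List<String> buildCommand(Path pdf, Path txt) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), txt.toString());
    }

    // Runs pdftotext and returns true only if it exits with status 0
    // within the given timeout; a hanging process is killed.
    static boolean extract(Path pdf, Path txt, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, txt))
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // avoid blocking on a full pipe
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

Something like PdfTextFallback.extract(Paths.get("broken.pdf"), Paths.get("broken.txt"), 30) could be called when the Tika parser throws, and the resulting text file indexed instead.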


Cheers,

Siegfried Goeschl

PS: one more thing - please tell your management that you will never ever successfully parse all real-world PDFs, and cater for that fact in your requirements :-)


On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Erick,

I tried indexing both ways: SolrJ with Tika's AutoDetectParser as well as
Solr Cell's ExtractingRequestHandler. The majority of the PDF and Word
documents are getting parsed properly and indexed into Solr. However, a
minority of them keep failing with either a PDFParser or an OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks & Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson <erickerick...@gmail.com> wrote:

There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But I personally am not a huge fan of pushing all the work onto Solr. In a
production environment the Solr server is then responsible for indexing,
parsing the docs through Tika, perhaps searching, etc. This doesn't scale
all that well.

So an alternative is to use SolrJ with Tika, which is totally independent of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
<vijaya.bhoomire...@whishworks.com> wrote:

Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's
AutoDetectParser and PDF functionality is working fine. However, the error
with some MS Office Word documents still persists.

The error message is "java.lang.IllegalArgumentException: This paragraph is
not the first one in the table", which eventually results in "Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser".

Upon some reading, it looks like it's a bug in Tika 1.5 that seems to have
been fixed in Tika 1.6 (https://issues.apache.org/jira/browse/TIKA-1251).

I am new to Solr / Tika and hence wondering whether I can change the Tika
library alone to v1.6 without impacting any of the libraries within Solr
4.10.2? Please let me know your response and how to get around this
issue.

Many thanks in advance.

Thanks & Regards
Vijay


On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:

Vijay,

You could try different Excel files with different formats to rule out that
the issue is with the Tika version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes...@gmail.com>
wrote:

Perhaps the PDF is protected and the content cannot be extracted?

I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
not support some/all Office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:

Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.


See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
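
A rough sketch of such an extractOnly request from plain Java (JDK 11+ HttpClient, no SolrJ involved). The base URL is the one from the original post; the class and method names are made up for illustration, and extractFormat=text and wt=json are optional parameters added here only to make the response easier to read.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

final class ExtractOnlyCheck {

    // extractOnly=true asks Solr Cell to return the extracted content
    // instead of indexing it.
    static URI extractOnlyUri(String solrBaseUrl) {
        return URI.create(solrBaseUrl
                + "/update/extract?extractOnly=true&extractFormat=text&wt=json");
    }

    // POSTs the raw file bytes to Solr and returns the response body,
    // which should contain the extracted text if Tika could read the file.
    static String extract(String solrBaseUrl, Path file, String contentType)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(extractOnlyUri(solrBaseUrl))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

For a PDF this might be called as ExtractOnlyCheck.extract("http://localhost:8983/solr", Paths.get("sample.pdf"), "application/pdf"). An empty content section in the response would point at the extraction step rather than at the schema or field mapping.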

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
<vijaya.bhoomire...@whishworks.com> wrote:

Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
Please let me know what is going wrong with the indexing process.

I am using Solr 4.10.2 with the default example server configuration
that comes with the Solr distribution.

PDF files - Indexing as such works fine, and when I query using *:* in the
Solr Query console, metadata information is displayed properly. However,
the PDF content field is empty. This is happening for all PDF files I have
tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
the PDF file, content is not being displayed.

MS Office files - For some Office files, everything works perfectly and the
extracted content is visible in the query console. However, for others, I
see the below error message during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.


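import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
// Send the file to Solr Cell's extracting handler
static ContentStreamUpdateRequest indexingReq =
        new ContentStreamUpdateRequest("/update/extract");

indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
// Prefix fields that are not defined in the schema with attr_
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
// Commit immediately, waiting for flush and for a new searcher
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);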

Thanks & Regards
Vijay

--
The contents of this e-mail are confidential and for the exclusive use of
the intended recipient. If you receive this e-mail in error please delete
it from your system immediately and notify us either by e-mail or
telephone. You should not copy, forward or otherwise disclose the content
of the e-mail. The views expressed in this communication may not
necessarily be the view held by WHISHWORKS.





