RE: Problem with pdf, upgrading Cell

Sandhya Agarwal Tue, 04 May 2010 04:59:02 -0700

Both the files work for me, Praveen.

Thanks,
Sandhya

From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Another one here.
On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal 
<pkal...@gmail.com> wrote:
It bounced because of the attachment size.
Attaching them one by one now.

On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal 
<pkal...@gmail.com> wrote:
I noticed the following relationship between the PDF producer/creator and
content extraction. Not sure if it is helpful (as Grant said earlier, PDFs
are notorious):

Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not
registered)
Creator: PScript5.dll Version 5.2.2
Extraction: no content ("installing Solr in Tomcat.pdf", attached; I
generated it)
---------------------

Producer: Acrobat Distiller 7.0.5 (Windows)
Creator: PScript5.dll Version 5.2.2
Extraction: one line of content
----------------------

Producer: Acrobat Distiller 8.1.0 (Windows)
Creator: Acrobat PDFMaker 8.1 for Word
Extraction: one line of content (Free_Two_way_Radio_Guide.pdf, attached; it
was freely available on their website)
-------------------------

Producer: FOP 0.20.5
Extraction: full content ("/docs/features.pdf", "linkmap.pdf", etc.)
--------------
Thanks.
Praveen

On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal 
<pkal...@gmail.com> wrote:
Yes Sandhya,
I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
what you were asking.
Thanks.

On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal 
<sagar...@opentext.com> wrote:
Praveen,

Along with the Tika core and parser jars, did you run "mvn
dependency:copy-dependencies" to generate all the dependencies too?

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 4:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell
I seem to have mixed results:

Here is what I did:
I copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j etc. jars into
contrib/extraction/lib (after removing the old ones, of course), as well as
into WEB-INF/lib of the Solr web app in Tomcat.
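
When jars are swapped by hand like this, a leftover older copy of a library next to the new one is easy to miss. The following is only a minimal sketch for spotting that, not something from this thread: it assumes jar files are named "<artifact>-<version>.jar", and the function name find_version_conflicts is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: jars follow the "<artifact>-<version>.jar" naming convention,
# with the version starting at the first "-<digit>".
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(lib_dir):
    """Return {artifact: sorted versions} for artifacts found in more than one version."""
    versions = defaultdict(set)
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_NAME.match(jar.name)
        if match:  # jars without a recognizable version are skipped
            versions[match["artifact"]].add(match["version"])
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

It could be run once against each of the two directories mentioned above (contrib/extraction/lib and the web app's WEB-INF/lib); an empty result only means that no artifact appears in two versions, not that the set of jars is complete.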

Now it extracts content from some PDFs, but from others it extracts either
no content or only one line. For example, "/docs/Installing Solr in
Tomcat.pdf" still shows no content. I have two other PDFs for which it
extracts only one line of content.

Also, I'm now getting a single value in the 'title' field for some PDFs and
two values for others. Where it can extract the full content, the title is
what I gave as a literal while submitting the PDF. For the PDF where no
content was extracted, it shows one empty title plus mine. For the PDFs where
it extracted only one line of content, it shows that line as a title plus
mine. The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?

On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb 
<dekay...@hotmail.com> wrote:

>
> Hey,
> I got it to work. I just redid my steps; I had forgotten several libraries
> that were imported through the XML. PDF extraction seems to work once again;
> I have yet to find one that raises an exception!
>
> Thanks for the investigation, at least we now have a fix :)
> Marc
