RE: Problem with pdf, upgrading Cell

Marc Ghorayeb Wed, 05 May 2010 00:39:33 -0700
Hey,
I have the same list, plus the extraction library (the Apache Solr Cell jar), 
though you might not need that one inside the war file itself.
Marc
> From: sagar...@opentext.com
> To: solr-user@lucene.apache.org
> Date: Wed, 5 May 2010 10:21:36 +0530
> Subject: RE: Problem with pdf, upgrading Cell
> 
> It looks like the highlighting may not show up here. Here is the list of jars 
> I copied:
> 
> asm-3.1.jar
> bcmail-jdk15-1.45.jar
> bcprov-jdk15-1.45.jar
> commons-compress-1.0.jar
> commons-logging-1.1.1.jar
> dom4j-1.6.1.jar
> fontbox-1.1.0.jar
> geronimo-stax-api_1.0_spec-1.0.1.jar
> jempbox-1.1.0.jar
> log4j-1.2.14.jar
> metadata-extractor-2.4.0-beta-1.jar
> pdfbox-1.1.0.jar
> poi-3.6.jar
> poi-ooxml-3.6.jar
> poi-ooxml-schemas-3.6.jar
> poi-scratchpad-3.6.jar
> tagsoup-1.2.jar
> tika-core-0.7.jar
> tika-parsers-0.7.jar
> xml-apis-1.0.b2.jar
> xmlbeans-2.3.0.jar
> 
> Thanks,
> Sandhya
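
The two jar lists in this thread can be compared mechanically. Below is a minimal Python sketch, assuming the shorter list is the one just above and the longer list is the one quoted further down; the names `SHORT_LIST`, `LONG_LIST`, and `diff_jars` are illustrative only and do not come from Solr or Tika.

```python
# Jar names copied from the two lists in this thread.
SHORT_LIST = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar
log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar
poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()

LONG_LIST = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar jempbox-1.1.0.jar junit-3.8.1.jar log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar mockito-core-1.7.jar nekohtml-1.9.9.jar
objenesis-1.0.jar ooxml-schemas-1.0.jar pdfbox-1.1.0.jar poi-3.6.jar
poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar poi-scratchpad-3.6.jar
tagsoup-1.2.jar tika-core-0.7.jar tika-parsers-0.7.jar
xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()


def diff_jars(short, long):
    """Return (jars only in long, jars only in short), each sorted by name."""
    return sorted(set(long) - set(short)), sorted(set(short) - set(long))


extra, missing = diff_jars(SHORT_LIST, LONG_LIST)
print("Only in the longer list:", extra)
print("Only in the shorter list:", missing)
```

The jars that appear only in the longer list include junit, mockito-core, hamcrest-core and objenesis, which are testing libraries. That suggests they came along as test-time dependencies, but the thread does not confirm whether they affect extraction.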
> 
> 
> 
> -----Original Message-----
> From: Sandhya Agarwal [mailto:sagar...@opentext.com] 
> Sent: Wednesday, May 05, 2010 10:06 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problem with pdf, upgrading Cell
> 
> Praveen,
> 
> I copied only the highlighted jars. I am not sure whether we need the others. 
> Also, I copied the jars directly into solr\WEB-INF\lib, like you did.
> 
> Thanks,
> Sandhya
> 
> 
> 
> -----Original Message-----
> From: Praveen Agrawal [mailto:pkal...@gmail.com]
> Sent: Tuesday, May 04, 2010 8:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Problem with pdf, upgrading Cell
> 
> 
> 
> Hi Sandhya,
> 
> I must be missing something. I copied all the dependency jars to both the
> contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
> copied:
> 
> 
> 
> asm-3.1.jar
> bcmail-jdk15-1.45.jar
> bcprov-jdk15-1.45.jar
> commons-compress-1.0.jar
> commons-logging-1.1.1.jar
> dom4j-1.6.1.jar
> fontbox-1.1.0.jar
> geronimo-stax-api_1.0_spec-1.0.1.jar
> hamcrest-core-1.1.jar
> jempbox-1.1.0.jar
> junit-3.8.1.jar
> log4j-1.2.14.jar
> metadata-extractor-2.4.0-beta-1.jar
> mockito-core-1.7.jar
> nekohtml-1.9.9.jar
> objenesis-1.0.jar
> ooxml-schemas-1.0.jar
> pdfbox-1.1.0.jar
> poi-3.6.jar
> poi-ooxml-3.6.jar
> poi-ooxml-schemas-3.6.jar
> poi-scratchpad-3.6.jar
> tagsoup-1.2.jar
> tika-core-0.7.jar
> tika-parsers-0.7.jar
> xml-apis-1.0.b2.jar
> xmlbeans-2.3.0.jar
> 
> 
> 
> I still get the same result.
> 
> Marc,
> I'm on Windows, and I copied the above jars directly into the already extracted
> folder webapps/solr/WEB-INF/lib, in addition to what was already there. I
> didn't explicitly un-jar and re-jar solr.war, but do you think that could be
> the issue? I think Tomcat extracts the war and uses the folder in webapps (I
> didn't put the war file in webapps; I put the extracted solr folder there
> directly).
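
The copy step described above can be scripted so that both target folders receive exactly the same set of jars. This is a minimal sketch, not the procedure anyone in the thread actually used; `copy_jars` is a hypothetical helper built only on the Python standard library, and the directory names in the note below are assumptions.

```python
import shutil
from pathlib import Path


def copy_jars(src_dir, dest_dirs):
    """Copy every *.jar found directly in src_dir into each directory in dest_dirs.

    Destination directories are created if they do not exist.
    Returns the sorted list of jar file names that were copied.
    """
    jars = sorted(Path(src_dir).glob("*.jar"))
    for dest in dest_dirs:
        dest_path = Path(dest)
        dest_path.mkdir(parents=True, exist_ok=True)
        for jar in jars:
            # copy2 keeps file timestamps, which helps when comparing folders later
            shutil.copy2(jar, dest_path / jar.name)
    return [jar.name for jar in jars]
```

A call might look like `copy_jars("tika-deps", ["contrib/extraction/lib", "webapps/solr/WEB-INF/lib"])`, where `tika-deps` stands for whatever folder holds the collected dependency jars. The sketch does not remove older versions of the same libraries; that still has to be done separately.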
> 
> 
> 
> If it has worked for you, especially with my two PDFs, then that's really
> great. Please let me know your exact procedure, including everything you
> copied and where, or whether you see that I missed something obvious.
> 
> Thanks,
> Praveen
> 
> 
> 
> 
> 
> On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
> 
> > Both the files work for me, Praveen.
> >
> > Thanks,
> > Sandhya
> >
> > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > Sent: Tuesday, May 04, 2010 5:22 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Problem with pdf, upgrading Cell
> >
> > Another one here.
> > On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> > It bounced because of the attachments' size.
> > Attaching them one by one now.
> >
> >
> > On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> 
> > I noticed the following pattern between producer/creator and content
> > extraction; not sure if it is helpful (as Grant said earlier, PDFs are
> > notorious):
> >
> > Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not
> > registered)
> > Creator: PScript5.dll Version 5.2.2
> > Extraction: no content ("installing Solr in Tomcat.pdf", attached; I
> > generated it)
> > ---------------------
> >
> > Producer: Acrobat Distiller 7.0.5 (Windows)
> > Creator: PScript5.dll Version 5.2.2
> > Extraction: one line of content
> > ----------------------
> >
> > Producer: Acrobat Distiller 8.1.0 (Windows)
> > Creator: Acrobat PDFMaker 8.1 for Word
> > Extraction: one line of content (Free_Two_way_Radio_Guide.pdf, attached;
> > it was freely available on their website)
> > -------------------------
> >
> > Producer: FOP 0.20.5
> > Extraction: full content ("/docs/features.pdf | linkmap.pdf" etc.)
> > --------------
> > Thanks.
> > Praveen
> 
> >
> 
> >
> 
> > On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
> > Yes Sandhya,
> > I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
> > what you were asking.
> > Thanks.
> >
> >
> > On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
> > Praveen,
> >
> > Along with the Tika core and parsers jars, did you run "mvn
> > dependency:copy-dependencies" to generate all the dependencies too?
> >
> > Thanks,
> > Sandhya
> >
> > -----Original Message-----
> > From: Praveen Agrawal [mailto:pkal...@gmail.com]
> > Sent: Tuesday, May 04, 2010 4:52 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Problem with pdf, upgrading Cell
> 
> > I seem to have mixed results.
> >
> > Here is what I did:
> > I copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j etc. jars into
> > contrib/extraction/lib (and of course removed the old ones), as well as into
> > WEB-INF/lib of the Solr web app in Tomcat.
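
Since old and new versions of the same library in one lib folder can conflict, it may help to check for leftovers after copying. The following is a rough sketch under the assumption that jar names follow the usual `artifact-version.jar` pattern; `find_version_conflicts` and the regular expression are illustrative and not part of Solr or Tika.

```python
import re
from collections import defaultdict

# artifact name, then "-", then a version that starts with a digit
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names):
    """Return {artifact: [versions]} for artifacts present in more than one version."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_NAME.match(name)
        if match:  # names that do not fit the pattern are skipped
            versions[match.group("artifact")].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


# Made-up folder listing: the two pdfbox entries are a hypothetical leftover.
example = [
    "pdfbox-0.7.3.jar",
    "pdfbox-1.1.0.jar",
    "tika-core-0.7.jar",
    "geronimo-stax-api_1.0_spec-1.0.1.jar",
]
conflicts = find_version_conflicts(example)
print(conflicts)
```

In practice the input would be the file names found in WEB-INF/lib, for instance from `os.listdir`. An empty result only means no duplicate artifact names were detected; it does not prove the classpath is clean.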
> 
> >
> 
> > Now it extracts content from some PDFs, but from others it extracts either
> > no content or only one line. For example, "/docs/Installing Solr in Tomcat.pdf"
> > still shows no content. I have two other PDFs for which it extracts only one
> > line of content.
> >
> > Also, I'm now getting a single value in the 'title' field for some PDFs, and
> > two for others. Where it can extract the full content, it shows the title I
> > gave as a literal while submitting the PDF. For a PDF where no content was
> > extracted, it shows one empty title plus mine. For a PDF where it extracted
> > only one line of content, it shows that line as a title along with mine.
> > The 'title' field is defined as multivalued in the schema.
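
One way to see what the extraction itself returns for a given PDF, separately from what ends up indexed, is to call the extract handler with `extractOnly=true`. The sketch below only builds the request URL. It assumes the handler is mapped at `/update/extract` as in the example solrconfig.xml and that the unique key field is `id`; the host, port, document id, and the helper name `build_extract_url` are all hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, title=None, extract_only=False):
    """Build a Solr Cell extract URL with literal.* fields.

    base_url     -- Solr webapp root, e.g. "http://localhost:8080/solr" (assumed)
    doc_id       -- value sent as literal.id (assumes the unique key is "id")
    title        -- optional value sent as literal.title
    extract_only -- if True, ask Solr to return the extracted content
                    instead of indexing it
    """
    params = [("literal.id", doc_id)]
    if title is not None:
        params.append(("literal.title", title))
    if extract_only:
        params.append(("extractOnly", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


url = build_extract_url(
    "http://localhost:8080/solr", "doc1", title="My title", extract_only=True
)
print(url)
```

The PDF itself would then be POSTed to that URL as the request body or as a multipart upload. Comparing the returned metadata with the indexed document may show whether the second 'title' value comes from the PDF's own metadata, though the thread does not establish that.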
> 
> >
> 
> > Any idea what's going on? Or am I missing something?
> 
> >
> 
> >
> 
> >
> 
> > On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb <dekay...@hotmail.com> wrote:
> >
> > >
> > > Hey,
> > > I got it to work. I just redid my steps; I had forgotten several libraries
> > > that were imported through the xml. PDF extraction seems to work once again,
> > > and I have yet to find a PDF that raises an exception!
> > >
> > > Thanks for the investigation, at least we now have a fix :)
> > > Marc
> 
                                          