Hey,
I have the same list, and I added the extraction library (the Apache Solr
Cell jar) to it, though you might not specifically need it inside the war file.
Marc
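
A quick way to confirm that the jars from the list quoted below actually ended up in the webapp's lib folder is a small check script. This is only a sketch: the `missing_jars` helper is a hypothetical name, the expected list is a PDF-related subset of the quoted list (shortened for brevity), and the folder path in the usage comment is an assumption about the Tomcat layout.

```python
from pathlib import Path

# Subset of the jar list quoted in this thread (the PDF-related ones),
# shortened for brevity; extend it with the rest of the list as needed.
EXPECTED_JARS = [
    "tika-core-0.7.jar",
    "tika-parsers-0.7.jar",
    "pdfbox-1.1.0.jar",
    "fontbox-1.1.0.jar",
    "jempbox-1.1.0.jar",
]

def missing_jars(lib_dir, expected=EXPECTED_JARS):
    """Return the expected jar names that are not present in lib_dir."""
    present = {p.name for p in Path(lib_dir).glob("*.jar")}
    return sorted(name for name in expected if name not in present)

# Example call (hypothetical path, adjust to your Tomcat install):
# print(missing_jars("webapps/solr/WEB-INF/lib"))
```

An empty result means every jar in the expected list was found in that folder.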
> From: sagar...@opentext.com
> To: solr-user@lucene.apache.org
> Date: Wed, 5 May 2010 10:21:36 +0530
> Subject: RE: Problem with pdf, upgrading Cell
>
> Looks like the highlighting may not have come through here. Here is the list of jars
> I copied:
>
> asm-3.1.jar
> bcmail-jdk15-1.45.jar
> bcprov-jdk15-1.45.jar
> commons-compress-1.0.jar
> commons-logging-1.1.1.jar
> dom4j-1.6.1.jar
> fontbox-1.1.0.jar
> geronimo-stax-api_1.0_spec-1.0.1.jar
> jempbox-1.1.0.jar
> log4j-1.2.14.jar
> metadata-extractor-2.4.0-beta-1.jar
> pdfbox-1.1.0.jar
> poi-3.6.jar
> poi-ooxml-3.6.jar
> poi-ooxml-schemas-3.6.jar
> poi-scratchpad-3.6.jar
> tagsoup-1.2.jar
> tika-core-0.7.jar
> tika-parsers-0.7.jar
> xml-apis-1.0.b2.jar
> xmlbeans-2.3.0.jar
>
> Thanks,
> Sandhya
>
>
>
> -----Original Message-----
> From: Sandhya Agarwal [mailto:sagar...@opentext.com]
> Sent: Wednesday, May 05, 2010 10:06 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problem with pdf, upgrading Cell
>
> Praveen,
>
>
>
> I only have the highlighted jars copied. Not sure if we need the other jars.
> Also, I copied the jars directly into solr\WEB-INF\lib, like you did.
>
>
>
> Thanks,
>
> Sandhya
>
>
>
> -----Original Message-----
> From: Praveen Agrawal [mailto:pkal...@gmail.com]
> Sent: Tuesday, May 04, 2010 8:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Problem with pdf, upgrading Cell
>
>
>
> Hi Sandhya..
>
> I must be missing something. I copied all the dependency jars to both the
>
> contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
>
> I copied:
>
>
>
> asm-3.1.jar
>
> bcmail-jdk15-1.45.jar
>
> bcprov-jdk15-1.45.jar
>
> commons-compress-1.0.jar
>
> commons-logging-1.1.1.jar
>
> dom4j-1.6.1.jar
>
> fontbox-1.1.0.jar
>
> geronimo-stax-api_1.0_spec-1.0.1.jar
>
> hamcrest-core-1.1.jar
>
> jempbox-1.1.0.jar
>
> junit-3.8.1.jar
>
> log4j-1.2.14.jar
>
> metadata-extractor-2.4.0-beta-1.jar
>
> mockito-core-1.7.jar
>
> nekohtml-1.9.9.jar
>
> objenesis-1.0.jar
>
> ooxml-schemas-1.0.jar
>
> pdfbox-1.1.0.jar
>
> poi-3.6.jar
>
> poi-ooxml-3.6.jar
>
> poi-ooxml-schemas-3.6.jar
>
> poi-scratchpad-3.6.jar
>
> tagsoup-1.2.jar
>
> tika-core-0.7.jar
>
> tika-parsers-0.7.jar
>
> xml-apis-1.0.b2.jar
>
> xmlbeans-2.3.0.jar
>
>
>
> Still the same result for me.
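
For comparison, the difference between this 27-jar list and the shorter 21-jar list posted in the later reply can be computed with a set difference. This is a sketch only: the names `SHORT_LIST` and `LONG_LIST` are made up for illustration, and the jar names are copied from the two lists in this thread.

```python
# The 21 jars from the shorter list in the later reply.
SHORT_LIST = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar
log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar
poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()

# The longer list is the shorter one plus six additional jars.
LONG_LIST = SHORT_LIST + """
hamcrest-core-1.1.jar junit-3.8.1.jar mockito-core-1.7.jar
nekohtml-1.9.9.jar objenesis-1.0.jar ooxml-schemas-1.0.jar
""".split()

only_in_long = sorted(set(LONG_LIST) - set(SHORT_LIST))
only_in_short = sorted(set(SHORT_LIST) - set(LONG_LIST))

print("only in the longer list:", only_in_long)
print("only in the shorter list:", only_in_short)
```

Of the six extra jars, junit, hamcrest, mockito and objenesis are testing libraries, so they are unlikely to matter for extraction at runtime.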
>
>
>
> Marc,
>
> I'm on Windows, and I copied the above jars directly into the already extracted
>
> folder webapps/solr/WEB-INF/lib, in addition to what was already there. I
>
> didn't explicitly un-jar and re-jar the solr.war, but do you think that
>
> could be the issue? I think Tomcat extracts the war and uses the folder in
>
> webapps (I didn't put the war file in webapps; instead I put the extracted
>
> solr folder there directly).
>
>
>
> If it has worked for you guys, especially with my two PDFs, then that's
>
> really great. Please let me know your exact procedure, including everything
>
> you copied and where, or whether you see that I missed something obvious.
>
>
>
> Thanks,
>
> Praveen
>
>
>
>
>
> On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
>
>
>
> > Both the files work for me, Praveen.
>
> >
>
> > Thanks,
>
> > Sandhya
>
> >
>
> > From: Praveen Agrawal [mailto:pkal...@gmail.com]
>
> > Sent: Tuesday, May 04, 2010 5:22 PM
>
> > To: solr-user@lucene.apache.org
>
> > Subject: Re: Problem with pdf, upgrading Cell
>
> >
>
> > another one here..
>
> > On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
>
> > It bounced because of the attachments' size.
>
> > Attaching them one by one now.
>
> >
>
> >
>
> > On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
>
> > I noticed the following pattern/relationship between producer/creator and content
>
> > extraction. Not sure if it is helpful (as Grant said earlier, PDFs are notorious):
>
> >
>
> > Producer: Bullzip PDF Printer / www.bullzip.com /
>
> > Freeware Edition (not registered)
>
> > Creator: PScript5.dll Version 5.2.2
>
> > Extraction: no content -- "installing Solr in Tomcat.pdf" (attached; I
>
> > generated it)
>
> > ---------------------
>
> >
>
> > Producer: Acrobat Distiller 7.0.5 (Windows)
>
> > creator: PScript5.dll Version 5.2.2
>
> > Extraction: One line content
>
> > ----------------------
>
> >
>
> > Producer: Acrobat Distiller 8.1.0 (Windows)
>
> > creator: Acrobat PDFMaker 8.1 for Word
>
> > Extraction: one line of content (Free_Two_way_Radio_Guide.pdf -
>
> > attached - was available freely on their website)
>
> > -------------------------
>
> >
>
> > Producer: FOP 0.20.5
>
> > Extraction: full content "/docs/features.pdf | linkmap.pdf" etc
>
> > --------------
>
> > Thanks.
>
> > Praveen
>
> >
>
> >
>
> > On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal <pkal...@gmail.com> wrote:
>
> > Yes Sandhya,
>
> > I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
>
> > what you were asking.
>
> > Thanks.
>
> >
>
> >
>
> > On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal <sagar...@opentext.com> wrote:
>
> > Praveen,
>
> >
>
> > Along with the Tika core and parsers jars, did you run
>
> > "mvn dependency:copy-dependencies" to generate all the dependencies too?
>
> >
>
> > Thanks,
>
> > Sandhya
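
By default, `mvn dependency:copy-dependencies` writes the jars to `target/dependency` under the module it is run in. Copying them from there into both lib folders can be scripted. The following is a minimal sketch: the helper name `copy_jars` and all paths in the usage comment are hypothetical and depend on where Tika was built and how Solr is deployed.

```python
import shutil
from pathlib import Path

def copy_jars(source_dir, target_dirs):
    """Copy every *.jar in source_dir into each directory in target_dirs."""
    copied = []
    for jar in sorted(Path(source_dir).glob("*.jar")):
        for target in target_dirs:
            dest = Path(target) / jar.name
            # Create the target folder if it does not exist yet.
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(jar, dest)
            copied.append(dest)
    return copied

# Example call (hypothetical paths, adjust to your checkout and Tomcat layout):
# copy_jars("tika-parsers/target/dependency",
#           ["contrib/extraction/lib", "webapps/solr/WEB-INF/lib"])
```

Running the Maven goal with `-DincludeScope=runtime` should leave out test-only dependencies such as junit, so they are not copied along.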
>
> >
>
> > -----Original Message-----
>
> > From: Praveen Agrawal [mailto:pkal...@gmail.com]
>
> > Sent: Tuesday, May 04, 2010 4:52 PM
>
> > To: solr-user@lucene.apache.org
>
> > Subject: Re: Problem with pdf, upgrading Cell
>
> > I seem to have mixed results:
>
> >
>
> > Here is what I did:
>
> > I copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into
>
> > contrib/extraction/lib (after removing the old ones, of course), as well as into
>
> > WEB-INF/lib of the Solr web app in Tomcat.
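
Since the old jars have to be removed when the new ones are copied in, it can help to check that no library is left in a folder in two versions. This is a sketch under assumptions: `duplicate_versions` is a hypothetical helper, the name/version split is a simple heuristic (first hyphen followed by a digit), and the sample folder contents are invented for illustration.

```python
import re
from collections import defaultdict

# Split "name-version.jar" at the first hyphen that is followed by a digit.
JAR_NAME = re.compile(r"^(?P<name>.+?)-(?P<version>\d.*)\.jar$")

def duplicate_versions(jar_names):
    """Map each library name to its versions when more than one is present."""
    versions = defaultdict(set)
    for jar in jar_names:
        match = JAR_NAME.match(jar)
        if match:
            versions[match["name"]].add(match["version"])
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}

# Hypothetical folder contents: an older pdfbox left next to the new one.
jars = ["pdfbox-0.7.3.jar", "pdfbox-1.1.0.jar", "fontbox-1.1.0.jar", "tika-core-0.7.jar"]
print(duplicate_versions(jars))
```

For a real folder, the input list could come from something like `[p.name for p in Path(lib_dir).glob("*.jar")]`.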
>
> >
>
> > Now it extracts content from some PDFs, but either no content or only
>
> > a single line of content from others. For example, "/docs/Installing Solr in Tomcat.pdf"
>
> > still shows no content. I have two other PDFs for which it extracts only
>
> > one line of content.
>
> >
>
> > Also, I'm now getting a single value in the 'title' field for some PDFs, and two
>
> > for others. Where it can extract the full content, it shows the title
>
> > I gave as a literal when submitting the PDF. For a PDF where no content was
>
> > extracted, it shows one empty title and mine. For a PDF where it extracted
>
> > only one line of content, it shows that line as a title along with mine.
>
> > The 'title' field is defined as multivalued in the schema.
>
> >
>
> > Any idea what's going on? Or am I missing something?
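
For reference, a title given "as literal" is normally passed to Solr Cell as a `literal.title` request parameter. The sketch below only builds such a request URL. The base URL, the `/update/extract` handler path, the `literal.id` value and the `extract_url` helper name are all assumptions, not details taken from this thread. One plausible reading of the two-title case is that the second value comes from metadata Tika extracted from the PDF itself, which a multivalued field would store alongside the literal.

```python
from urllib.parse import urlencode

def extract_url(base_url, doc_id, title, commit=True):
    """Build an extraction request URL that sets id and title as literals."""
    params = {"literal.id": doc_id, "literal.title": title}
    if commit:
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)

# Hypothetical Tomcat deployment on port 8080; the PDF itself would be
# sent as the body of a POST to this URL.
print(extract_url("http://localhost:8080/solr", "doc1", "Installing Solr in Tomcat"))
```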
>
> >
>
> >
>
> >
>
> > On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb <dekay...@hotmail.com> wrote:
>
> >
>
> > >
>
> > > Hey,
>
> > > I got it to work. I just redid my steps; I had forgotten several libraries
>
> > > that were imported through the XML. PDF extraction seems to work once again;
>
> > > I have yet to find one that raises an exception!
>
> > >
>
> > > Thanks for the investigation, at least we now have a fix :)
>
> > > Marc
>
> > > _________________________________________________________________
>
> > > Hotmail comes to your phone! Compatible with iPhone, Windows Phone,
>
> > > Blackberry, …
>
> > > http://www.messengersurvotremobile.com/?d=Hotmail
>
> > >
>
> >
>
> >
>
> >
>
> >
>
> >