RE: Problem with pdf, upgrading Cell
Great news, thanks :)

Marc
Re: Problem with pdf, upgrading Cell
I've integrated this into Solr's trunk: https://issues.apache.org/jira/browse/SOLR-1902

-Grant

On May 6, 2010, at 3:40 AM, Sandhya Agarwal wrote:

Praveen,

You can get the latest code, containing the fix, from here: http://lucene.apache.org/tika/source-repository.html

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Wednesday, May 05, 2010 10:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

It reports that Jukka has resolved the issue (TIKA-419), and it is now waiting for Grant to verify (SOLR-1902). But it seems the resolution will only be available in version 0.8 of Tika. If it solves the problem, is there a way to get it now? Any SVN trunk access, etc.? All I see there is the 0.7 src zip to download.

Thanks.
Praveen

On Tue, May 4, 2010 at 3:59 PM, Grant Ingersoll gsing...@apache.org wrote:

Yes, it is loading the libraries, but they are in a different classloader that apparently the new way Tika loads doesn't have access to.

-Grant

On May 4, 2010, at 3:28 AM, Sandhya Agarwal wrote:

Hello,

But I see that the libraries are being loaded:

INFO: Adding specified lib dirs to ClassLoader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/asm-3.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcmail-jdk15-1.45.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcprov-jdk15-1.45.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-compress-1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-logging-1.1.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/dom4j-1.6.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/fontbox-1.1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/geronimo-stax-api_1.0_spec-1.0.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/jempbox-1.1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/log4j-1.2.14.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/metadata-extractor-2.4.0-beta-1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/pdfbox-1.1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-schemas-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-scratchpad-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tagsoup-1.2.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader
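Grant's point above is that a jar showing up in the "Adding ... to classloader" log only proves that one particular classloader received it; code that looks classes up through a different loader may still not see them. A minimal sketch for checking this, assuming nothing about how Tika 0.7 performs its own lookup: the class `ClassVisibility` and the method `canLoad` are hypothetical names, and `org.apache.tika.parser.pdf.PDFParser` is used only as an example class to probe.

```java
// Hypothetical helper, not part of Solr or Tika: reports whether a class
// name can be resolved through a given classloader (including its parents).
class ClassVisibility {

    static boolean canLoad(ClassLoader loader, String className) {
        try {
            // initialize=false: only resolve the class, do not run static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Example class to probe; pass another name as the first argument if needed.
        String name = args.length > 0 ? args[0] : "org.apache.tika.parser.pdf.PDFParser";

        ClassLoader own = ClassVisibility.class.getClassLoader();
        ClassLoader context = Thread.currentThread().getContextClassLoader();

        // If the two answers differ, the class is visible to one loader but not the other.
        System.out.println("own loader can load " + name + ": " + canLoad(own, name));
        System.out.println("context loader can load " + name + ": " + canLoad(context, name));
    }
}
```

Run from inside the webapp (for example from a small test servlet), differing answers for the two loaders would be consistent with the classloader split Grant describes; identical answers would point elsewhere.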
RE: Problem with pdf, upgrading Cell
Praveen,

You can get the latest code, containing the fix, from here : http://lucene.apache.org/tika/source-repository.html

Thanks,
Sandhya
RE: Problem with pdf, upgrading Cell
Hey,

I have the same list, and I added the extraction library (the Apache Solr Cell jar) to it, though you might not need it specifically inside the war file.

Marc

From: sagar...@opentext.com
To: solr-user@lucene.apache.org
Date: Wed, 5 May 2010 10:21:36 +0530
Subject: RE: Problem with pdf, upgrading Cell

Looks like the highlighting may not work here. Following is the list of jars I copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
jempbox-1.1.0.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Thanks,
Sandhya

-----Original Message-----
From: Sandhya Agarwal [mailto:sagar...@opentext.com]
Sent: Wednesday, May 05, 2010 10:06 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem with pdf, upgrading Cell

Praveen,

I only have the highlighted jars copied. Not sure if we need the other jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you did.

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Hi Sandhya,

I must be missing something. I copied all the dependency jars to both the contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
jempbox-1.1.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Still the same result for me.

Marc, I'm on Windows, and I copied the above jars directly into the already extracted folder webapps/solr/WEB-INF/lib, in addition to what was already there. I didn't explicitly un-jar and re-jar solr.war; do you think that could be the issue? I think Tomcat extracts the war and uses the folder in webapps (I didn't put the war file in webapps; instead I put the extracted solr folder there directly).

If it has worked for you guys, especially with my two PDFs, then that's really great. Please let me know your exact procedure, including what you copied and where, or whether you see that I missed something obvious.

Thanks,
Praveen

On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal sagar...@opentext.com wrote:

Both the files work for me, Praveen.

Thanks,
Sandhya

From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Another one here.

On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.com wrote:

It bounced because of the attachment's size. Attaching one by one now.

On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

I noticed the following pattern/relationship between producer/creator and content extraction; not sure if it is helpful (as Grant said earlier, PDFs are notorious):

Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
Creator: PScript5.dll Version 5.2.2
Extraction: no content -- installing Solr in Tomcat.pdf (attached - I generated it)

Producer: Acrobat Distiller 7.0.5 (Windows)
Creator: PScript5.dll Version 5.2.2
Extraction: one line of content

Producer: Acrobat Distiller 8.1.0 (Windows)
Creator: Acrobat PDFMaker 8.1 for Word
Extraction: one line of content (Free_Two_way_Radio_Guide.pdf - attached - was available freely on their website)

Producer: FOP 0.20.5
Extraction: full content (/docs/features.pdf | linkmap.pdf etc)

Thanks.
Praveen

On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.com wrote:

Yes Sandhya, I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is what you were asking.

Thanks.

On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com
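Since Sandhya's setup works and Praveen's does not, it may help to see exactly how the two jar lists quoted in this message differ. The following is only a sketch that compares the names as they were posted; the class `JarListDiff` and its method `extras` are hypothetical, and it says nothing about which jars Solr Cell actually requires.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: compares the two jar lists exactly as posted in the thread.
class JarListDiff {

    // Jars Sandhya reported copying (21 entries).
    static final String[] SANDHYA = {
        "asm-3.1.jar", "bcmail-jdk15-1.45.jar", "bcprov-jdk15-1.45.jar",
        "commons-compress-1.0.jar", "commons-logging-1.1.1.jar", "dom4j-1.6.1.jar",
        "fontbox-1.1.0.jar", "geronimo-stax-api_1.0_spec-1.0.1.jar", "jempbox-1.1.0.jar",
        "log4j-1.2.14.jar", "metadata-extractor-2.4.0-beta-1.jar", "pdfbox-1.1.0.jar",
        "poi-3.6.jar", "poi-ooxml-3.6.jar", "poi-ooxml-schemas-3.6.jar",
        "poi-scratchpad-3.6.jar", "tagsoup-1.2.jar", "tika-core-0.7.jar",
        "tika-parsers-0.7.jar", "xml-apis-1.0.b2.jar", "xmlbeans-2.3.0.jar"
    };

    // Jars Praveen reported copying (27 entries).
    static final String[] PRAVEEN = {
        "asm-3.1.jar", "bcmail-jdk15-1.45.jar", "bcprov-jdk15-1.45.jar",
        "commons-compress-1.0.jar", "commons-logging-1.1.1.jar", "dom4j-1.6.1.jar",
        "fontbox-1.1.0.jar", "geronimo-stax-api_1.0_spec-1.0.1.jar", "hamcrest-core-1.1.jar",
        "jempbox-1.1.0.jar", "junit-3.8.1.jar", "log4j-1.2.14.jar",
        "metadata-extractor-2.4.0-beta-1.jar", "mockito-core-1.7.jar", "nekohtml-1.9.9.jar",
        "objenesis-1.0.jar", "ooxml-schemas-1.0.jar", "pdfbox-1.1.0.jar",
        "poi-3.6.jar", "poi-ooxml-3.6.jar", "poi-ooxml-schemas-3.6.jar",
        "poi-scratchpad-3.6.jar", "tagsoup-1.2.jar", "tika-core-0.7.jar",
        "tika-parsers-0.7.jar", "xml-apis-1.0.b2.jar", "xmlbeans-2.3.0.jar"
    };

    // Returns the entries of 'first' that do not appear in 'second', in original order.
    static Set<String> extras(String[] first, String[] second) {
        Set<String> result = new LinkedHashSet<>(Arrays.asList(first));
        result.removeAll(Arrays.asList(second));
        return result;
    }

    public static void main(String[] args) {
        System.out.println("Only in Praveen's list: " + extras(PRAVEEN, SANDHYA));
        System.out.println("Only in Sandhya's list: " + extras(SANDHYA, PRAVEEN));
    }
}
```

Praveen's list is a superset of Sandhya's: the only differences are six extra jars (hamcrest-core, junit, mockito-core, nekohtml, objenesis, ooxml-schemas). Several of those look like test-time libraries, so the difference in jar lists by itself probably does not explain the different results.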
Re: Problem with pdf, upgrading Cell
Marc, Sandhya,

Did you use Solr from trunk? I used Solr 1.4 distn, and even after copying all the jars, i still get the same results for the pdfs i posted here.

Thanks.
RE: Problem with pdf, upgrading Cell
Praveen,

I am indeed using a trunk version from last week's svn i think. You could always try a version from the hudson builds. I did not try this procedure with Solr's 1.4 release though.

Marc
RE: Problem with pdf, upgrading Cell
Praveen,

I got the solr 1.4 release from here, http://download.filehat.com/apache/lucene/solr/1.4.0/

Thanks,
Sandhya
Re: Problem with pdf, upgrading Cell
' to classloader
May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader

Thanks,
Sandhya

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, May 04, 2010 6:13 AM
Cc: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

A little more info... It seems to be a classloading issue. The tests pass, but they aren't loading the Tika libraries via the Solr ResourceLoader, whereas the example is.

Marc, one thing to try is to unjar the Solr WAR file and put the Tika libs in there, as I bet it will then work. Note, however, I haven't tried this.

On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:

I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this. It is indeed a bug somewhere (still investigating). It seems that Tika is now picking an EmptyParser implementation when trying to determine which parser to use, despite the fact that it properly identifies the MIME type.

-Grant

On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:

I'm investigating.

On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:

Hi,

Grant, I confirm what Praveen has said: any PDF I try does not work with the new Tika and SVN versions. :(

Marc

From: sagar...@opentext.com
To: solr-user@lucene.apache.org
Date: Mon, 3 May 2010 13:05:24 +0530
Subject: RE: Problem with pdf, upgrading Cell

Hello,

Please let me know if anybody has figured a way out of this issue.

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Friday, April 30, 2010 11:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

Grant,

You can try any of the sample PDFs that come in the /docs folder of the Solr 1.4 distribution. I tried 'Installing Solr in Tomcat.pdf', 'index.pdf', etc. Only metadata (i.e. stream_size, content_type), apart from my own literals, is indexed, and the content is missing.

On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:

Praveen and Marc,

Can you share the PDF (feel free to email my private email) that fails in Solr?

Thanks,
Grant

On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:

Hi,

Nope, I didn't get it to work... Just like you, the command-line version of Tika extracts the content correctly, but once included in Solr, no content is extracted. What I have tried so far:

- Updating the Tika libraries inside the Solr 1.4 public version: no luck there.
- Downloading the latest SVN version, compiling it, and starting from a simple schema: still no luck.
- Getting other versions compiled on Hudson (nightly builds) and testing them as well: still no extraction.

I sent a mail to the developers' mailing list, but they told me I should just mail here. I hope some developer reads this, because it's quite an important feature of Solr and somehow it got broken between the 1.4 release and the latest version in SVN.

Marc
RE: Problem with pdf, upgrading Cell
Yes, Grant. You are right. Copying the tika libraries to solr webapp, solved the issue and the content extraction works fine now. Thanks, Sandhya -Original Message- From: Sandhya Agarwal [mailto:sagar...@opentext.com] Sent: Tuesday, May 04, 2010 12:58 PM To: solr-user@lucene.apache.org Subject: RE: Problem with pdf, upgrading Cell Hello, But I see that the libraries are being loaded : INFO: Adding specified lib dirs to ClassLoader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/asm-3.1.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcmail-jdk15-1.45.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcprov-jdk15-1.45.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-compress-1.0.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-logging-1.1.1.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/dom4j-1.6.1.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/fontbox-1.1.0.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/geronimo-stax-api_1.0_spec-1.0.1.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader 
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/jempbox-1.1.0.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/log4j-1.2.14.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/metadata-extractor-2.4.0-beta-1.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/pdfbox-1.1.0.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-3.6.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-3.6.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-schemas-3.6.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-scratchpad-3.6.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tagsoup-1.2.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xml-apis-1.0.b2.jar' to classloader May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xmlbeans-2.3.0.jar' to classloader May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader May 4, 2010 12:51:52 PM
RE: Problem with pdf, upgrading Cell
Sandhya, how did you proceed? I did this:
- jar -xf solr.war
- added all of the libs I had into the web-inf/lib folder
- recreated the archive with jar -cvf solr.war *
- replaced the war files
- deleted the libs in the shared lib folder
- started Tomcat
I'm now getting this error:
SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:418)
    at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:454)
    at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:152)
Thanks Grant for investigating the problem! Marc

From: sagar...@opentext.com To: solr-user@lucene.apache.org Date: Tue, 4 May 2010 13:10:25 +0530 Subject: RE: Problem with pdf, upgrading Cell
Yes, Grant. You are right. Copying the tika libraries to solr webapp, solved the issue and the content extraction works fine now. Thanks, Sandhya
RE: Problem with pdf, upgrading Cell
I think this is most likely because tika-core-0.7.jar no longer contains tika-config.xml, so the default Tika config is loaded instead. This can be seen in the ExtractingRequestHandler.inform() method. Hence, the parsers list is empty. I am still investigating. Thanks, Sandhya

-Original Message- From: Sandhya Agarwal [mailto:sagar...@opentext.com] Sent: Tuesday, May 04, 2010 1:10 PM To: solr-user@lucene.apache.org Subject: RE: Problem with pdf, upgrading Cell
Yes, Grant. You are right. Copying the tika libraries to solr webapp, solved the issue and the content extraction works fine now. Thanks, Sandhya
Re: Problem with pdf, upgrading Cell
Maybe, as Sandhya indicated, Solr was loading the libs from contrib earlier, so it may still be trying to load them from there after you deleted them, and they are somehow not being 'seen' by Solr. Maybe keep them there and also put them in solr/lib in the Tomcat webapps. I have yet to try this, though.
Re: Problem with pdf, upgrading Cell
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader
Thanks, Sandhya

-Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Tuesday, May 04, 2010 6:13 AM Cc: solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell
Little more info... Seems to be a classloading issue. The tests pass, but they aren't loading the Tika libraries via the Solr ResourceLoader, whereas the example is. Marc, one thing to try is to unjar the Solr WAR file and put the Tika libs in there, as I bet it will then work. Note, however, I haven't tried this.

On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote: I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this. It is indeed a bug somewhere (still investigating). It seems that Tika is now picking an EmptyParser implementation when trying to determine which parser to use, despite the fact that it properly identifies the MIME Type. -Grant

On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote: I'm investigating.
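Grant's classloading diagnosis above can be illustrated with a small standalone sketch. It is not Solr or Tika code: the class name LoaderVisibilityDemo and the resource name are made up for illustration, and the temporary directory merely stands in for a lib folder such as contrib/extraction/lib. It only shows the general Java behaviour that a resource added to a child class loader is not visible from its parent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LoaderVisibilityDemo {
    // Hypothetical provider-registration file, named only for this sketch.
    static final String RESOURCE = "META-INF/services/demo.SketchParser";

    // Returns {visibleFromParent, visibleFromChild}.
    static boolean[] check() {
        try {
            // The temp directory plays the role of an extra lib directory.
            Path dir = Files.createTempDirectory("extraction-lib");
            Path file = dir.resolve(RESOURCE);
            Files.createDirectories(file.getParent());
            Files.write(file, "demo.PdfSketchParser\n".getBytes(StandardCharsets.UTF_8));

            ClassLoader parent = LoaderVisibilityDemo.class.getClassLoader();
            // The child loader delegates to the parent, but not the other way round.
            try (URLClassLoader child =
                     new URLClassLoader(new URL[] { dir.toUri().toURL() }, parent)) {
                return new boolean[] {
                    parent.getResource(RESOURCE) != null,
                    child.getResource(RESOURCE) != null
                };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] seen = check();
        System.out.println("parent sees registration: " + seen[0]);
        System.out.println("child sees registration:  " + seen[1]);
    }
}
```

If a library performs its provider lookup through the parent loader, it would find nothing even though the log reports the jars as added to the child loader, which is consistent with what was observed in this thread.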
RE: Problem with pdf, upgrading Cell
Ok. In Tika 0.4 and 0.5, I see that this is how the Tika config is loaded:

    public static TikaConfig getDefaultConfig() {
        InputStream stream;
        try {
            // Read the bundled default configuration from the classpath.
            stream = TikaConfig.class.getResourceAsStream("/org/apache/tika/tika-config.xml");
            return new TikaConfig(stream);
        } catch (IOException e) {
            throw new RuntimeException("Unable to read default configuration", e);
        } catch (SAXException e) {
            throw new RuntimeException("Unable to parse default configuration", e);
        } catch (TikaException e) {
            throw new RuntimeException("Unable to access default configuration", e);
        }
    }

And this has changed in Tika 0.7 to:

    public TikaConfig() throws MimeTypeException, IOException {
        this.parsers = new HashMap();
        ParseContext context = new ParseContext();
        // Discover Parser implementations through a service-provider lookup.
        Iterator iterator = ServiceRegistry.lookupProviders(Parser.class);
        while (iterator.hasNext()) {
            Parser parser = (Parser) iterator.next();
            // Register the parser under every media type it supports.
            for (Iterator i$ = parser.getSupportedTypes(context).iterator(); i$.hasNext(); ) {
                MediaType type = (MediaType) i$.next();
                this.parsers.put(type.toString(), parser);
            }
        }
        this.mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
    }

This is why tika-config.xml is no longer bundled. Thanks, Sandhya

-Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Tuesday, May 04, 2010 4:00 PM To: solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell
Yes, it is loading the libraries, but they are in a different classloader that apparently the new way Tika loads doesn't have access to. -Grant
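To make the consequence of the provider-based constructor concrete, here is a minimal sketch of a registry that is filled from whatever providers the lookup returns. All names (SketchParser, EmptySketchParser, PdfSketchParser, SketchRegistry) are hypothetical and are not Tika classes; the fallback to an empty parser is an assumption modelled on Grant's earlier remark that an EmptyParser was being picked. When discovery returns nothing, the MIME type is still known but no text comes out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser interface; not the real Tika API.
interface SketchParser {
    List<String> supportedTypes();
    String parse(String input);
}

// Fallback used when no parser is registered for a type: extracts nothing.
class EmptySketchParser implements SketchParser {
    public List<String> supportedTypes() { return List.of(); }
    public String parse(String input) { return ""; }
}

class PdfSketchParser implements SketchParser {
    public List<String> supportedTypes() { return List.of("application/pdf"); }
    public String parse(String input) { return "text of " + input; }
}

// Registry built from whatever providers discovery returned.
class SketchRegistry {
    private final Map<String, SketchParser> parsers = new HashMap<>();

    SketchRegistry(Iterable<SketchParser> discovered) {
        for (SketchParser p : discovered) {
            for (String type : p.supportedTypes()) {
                parsers.put(type, p);
            }
        }
    }

    // Falls back to the empty parser when the type has no registered parser.
    String extract(String mimeType, String input) {
        return parsers.getOrDefault(mimeType, new EmptySketchParser()).parse(input);
    }
}

class RegistryDemo {
    public static void main(String[] args) {
        SketchRegistry visible =
            new SketchRegistry(List.<SketchParser>of(new PdfSketchParser()));
        SketchRegistry hidden = new SketchRegistry(List.<SketchParser>of());
        System.out.println("[" + visible.extract("application/pdf", "index.pdf") + "]");
        System.out.println("[" + hidden.extract("application/pdf", "index.pdf") + "]");
    }
}
```

The second registry models the situation described in this thread: the lookup finds no parsers, so every document falls through to the empty parser and only metadata would be indexed.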
RE: Problem with pdf, upgrading Cell
Hey, I got it to work. I just redid my steps, i had forgotten several libraries that were imported through the xml. PDF extraction seems to work once again, i have yet to find one that raises an exception! Thanks for the investigation, at least we now have a fix :) Marc
Re: Problem with pdf, upgrading Cell
I seem to have mixed results. Here is what I did: I copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into contrib/extraction/lib (after removing the old ones, of course), as well as into WEB-INF/lib of the Solr web app in Tomcat. Now it extracts content from some PDFs, but for others it extracts either no content or only one line. For example, /docs/Installing Solr in Tomcat.pdf still shows no content. I have two other PDFs for which it extracts only one line of content.

Also, I'm now getting a single value for the 'title' field for some PDFs, and two values for others. Where it can extract the full content, the title is the one I gave as a literal when submitting the PDF. For PDFs where no content was extracted, it shows one empty title plus mine. For PDFs where only one line of content was extracted, it shows that line as a title, plus mine. The 'title' field is defined as multivalued in the schema. Any idea what's going on? Or am I missing something?
RE: Problem with pdf, upgrading Cell
Praveen, Along with the Tika core and parser jars, did you run mvn dependency:copy-dependencies to generate all the dependencies too? Thanks, Sandhya
Re: Problem with pdf, upgrading Cell
Yes Sandhya, I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is what you were asking.

Thanks.
RE: Problem with pdf, upgrading Cell
OK. So I am assuming you copied all the dependencies from tika-app\target\dependency? I tried with a number of files and don't see this issue yet.

Thanks,
Sandhya
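The step Sandhya describes (run mvn dependency:copy-dependencies in the Tika build, then copy the generated jars into Solr's lib folder) could be scripted roughly as below. This is a sketch under assumptions: the function name copy_jars and the directory arguments are hypothetical, and it only copies files. It does not remove older versions of the same libraries, which still has to be done by hand.

```python
import shutil
from pathlib import Path


def copy_jars(dependency_dir, lib_dir):
    """Copy every .jar file from dependency_dir into lib_dir.

    Returns the sorted list of jar file names that were copied.
    Existing files with the same name are overwritten; differently
    named older versions are left untouched.
    """
    src = Path(dependency_dir)
    dst = Path(lib_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(src.glob("*.jar")):
        shutil.copy2(jar, dst / jar.name)
        copied.append(jar.name)
    return copied
```

A call might look like copy_jars(r"tika-app\target\dependency", r"solr\WEB-INF\lib"), with both paths adjusted to the local layout.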
Re: Problem with pdf, upgrading Cell
This email contained a .zip file attachment. Raytheon does not allow email attachments that are considered likely to contain malicious code. For your protection this attachment has been removed. If this email is from an unknown source, please simply delete this email. If this email was expected, and it is from a known sender, you may follow the below suggested instructions to obtain these types of attachments.

+ Instruct the sender to enclose the file(s) in a .zip compressed file, and rename the .zip compressed file with a different extension, such as, .rtnzip. Password protecting the renamed .zip compressed file adds an additional layer of protection. When you receive the file, please rename it with the extension .zip.

Additional instructions and options on how to receive these attachments can be found at:
http://security.it.ray.com/antivirus/extensions.html
http://security.it.ray.com/news/2007/zipfiles.html

Should you have any questions or difficulty with these instructions, please contact the Help Desk at 877.844.4712

---

It bounced because of the attachment's size; attaching them one by one now.

On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

I noticed the following pattern between the PDF producer/creator and content extraction. Not sure if it is helpful (as Grant said earlier, PDFs are notorious):

Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
Creator: PScript5.dll Version 5.2.2
Extraction: no content -- Installing Solr in Tomcat.pdf (attached; I generated it)

Producer: Acrobat Distiller 7.0.5 (Windows)
Creator: PScript5.dll Version 5.2.2
Extraction: one line of content

Producer: Acrobat Distiller 8.1.0 (Windows)
Creator: Acrobat PDFMaker 8.1 for Word
Extraction: one line of content (Free_Two_way_Radio_Guide.pdf, attached; it was freely available on their website)

Producer: FOP 0.20.5
Extraction: full content -- /docs/features.pdf, linkmap.pdf etc.

Thanks.
Praveen
RE: Problem with pdf, upgrading Cell
Both the files work for me, Praveen. Thanks, Sandhya
Re: Problem with pdf, upgrading Cell
another one here..
RE: Problem with pdf, upgrading Cell
Praveen,

Did you try the technique I described a little earlier? Take your solr.war and put it in a directory of its own. Run jar -xf solr.war, which should extract its contents. Next, copy all of your libraries into the WEB-INF/lib folder. This means all the extraction/lib files and the lib files from Solr's root. Once this is done, recreate solr.war by running jar -cvf solr.war * (the * means all the files in the current directory, so be sure to be in the root directory where you extracted the war). Then put the new solr.war in the Tomcat webapps folder and recreate the solr folder from scratch (so as not to leave any overlapping libraries). Hopefully this should work.

As for the multivalued fields (title, for example), this is a known feature/issue of the Tika integration. In my case, I always provide a literal.title along with my PDFs, but if Tika successfully extracts a title from the PDF's metadata, it creates the Solr index entry with an array containing both the supplied literal and the extracted value. There is no way to force the literals to override the extracted data; they just get appended. Someone correct me if I am wrong here :)

Marc
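Marc's repackaging steps use the jar tool (jar -xf, then jar -cvf). Since a .war file is a zip archive, the same idea can be sketched with Python's zipfile module instead. This is a swapped-in technique, not what Marc ran, and the function name add_jars_to_war and its arguments are hypothetical. It only skips jars whose exact file name is already present; it does not detect two different versions of the same library.

```python
import zipfile
from pathlib import Path


def add_jars_to_war(war_path, jar_paths, out_path):
    """Write a copy of war_path to out_path with extra jars in WEB-INF/lib/.

    Returns the archive names of the jars that were added.
    """
    added = []
    with zipfile.ZipFile(war_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        existing = set(src.namelist())
        # Copy every entry of the original archive unchanged.
        for info in src.infolist():
            dst.writestr(info, src.read(info.filename))
        # Add each new jar unless an entry of that name already exists.
        for jar in map(Path, jar_paths):
            arcname = "WEB-INF/lib/" + jar.name
            if arcname not in existing:
                dst.write(jar, arcname)
                existing.add(arcname)
                added.append(arcname)
    return added
```

As in Marc's description, the rebuilt solr.war would then go into Tomcat's webapps folder, with the previously exploded solr folder removed so that no stale libraries remain.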
Re: Problem with pdf, upgrading Cell
Hi Sandhya,

I must be missing something. I copied all the dependency jars to both the contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
jempbox-1.1.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Still the same result for me.

Marc, I'm on Windows, and I copied the above jars directly into the already extracted folder webapps/solr/WEB-INF/lib, in addition to what was already there. I didn't explicitly un-jar and re-jar solr.war; do you think that could be the issue? I think Tomcat extracts the war and uses the folder in webapps (I didn't put the war file in webapps, but put the extracted solr folder there directly).

If it has worked for you guys, especially with my two PDFs, then that's really great. Please let me know your exact procedure, including everything you copied and where, or whether you see that I missed something obvious.

Thanks,
Praveen
RE: Problem with pdf, upgrading Cell
Praveen,

I only have the highlighted jars copied. I'm not sure whether we need the other jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you did.

Thanks,
Sandhya
RE: Problem with pdf, upgrading Cell
Looks like the highlighting may not work here. Following is the list of jars I copied : asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar Thanks, Sandhya -Original Message- From: Sandhya Agarwal [mailto:sagar...@opentext.com] Sent: Wednesday, May 05, 2010 10:06 AM To: solr-user@lucene.apache.org Subject: RE: Problem with pdf, upgrading Cell Praveen, I only have the highlighted jars copied. Not sure, if we need the other jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you did. Thanks, Sandhya -Original Message- From: Praveen Agrawal [mailto:pkal...@gmail.com] Sent: Tuesday, May 04, 2010 8:10 PM To: solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell Hi Sandhya.. I must be missing something. I copied all dependencies jars to both contrib/extraction/lib and web-in/lib folders. Here is the list of jars copied: asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar hamcrest-core-1.1.jar jempbox-1.1.0.jar junit-3.8.1.jar log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar mockito-core-1.7.jar nekohtml-1.9.9.jar objenesis-1.0.jar ooxml-schemas-1.0.jar pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar Still same result for me.. Marc, i'm on windows, and i copied above jars directly into already extracted folder webapps/solr/web-in/lib, in addition to what were already there. 
I didn't explicitly un-jar and re-jar the solr.war, but do you think that could be the issue? I think Tomcat extracts the war and uses the folder in webapps (I didn't put the war file in webapps; instead I put the extracted solr folder there directly). If it has worked for you guys, especially with my two pdfs, then that's really great. Please let me know your exact procedure, including everything you copied and where, or if you see that I missed something obvious. Thanks, Praveen
On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal sagar...@opentext.com wrote: Both the files work for me, Praveen. Thanks, Sandhya
From: Praveen Agrawal [mailto:pkal...@gmail.com] Sent: Tuesday, May 04, 2010 5:22 PM To: solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell another one here..
On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.com wrote: It bounced because of the attachment's size.. attaching one by one now..
On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote: I noticed the following pattern/relationship between producer/creator and content extraction, not sure if it is helpful (as Grant said earlier, pdfs are notorious):
Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
Creator: PScript5.dll Version 5.2.2
Extraction: no content
File: installing Solr in Tomcat.pdf (attached - I generated it)
Producer: Acrobat Distiller 7.0.5 (Windows)
Creator: PScript5.dll Version 5.2.2
Extraction: one line of content
Producer: Acrobat Distiller 8.1.0 (Windows)
Creator: Acrobat PDFMaker 8.1 for Word
Extraction: one line of content
File: Free_Two_way_Radio_Guide.pdf (attached - was freely available on their website)
Producer: FOP 0.20.5
Extraction: full content
Files: /docs/features.pdf | linkmap.pdf etc
Thanks. Praveen
On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.com wrote: Yes Sandhya, I copied the new poi/jempbox/pdfbox/fontbox etc. jars too.
I believe this is what you were asking. Thanks. On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com wrote: Praveen, Along with the tika core and parser jars, did you run mvn dependency:copy-dependencies to generate all the dependencies too? Thanks, Sandhya -Original Message- From: Praveen Agrawal [mailto:pkal...@gmail.com] Sent: Tuesday, May 04, 2010 4:52 PM To: solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell I seem to have mixed results: Here is what I did: copied new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc in contrib
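Since the message above contains two jar lists (the set Sandhya copied, which worked for her, and the larger set Praveen copied), a small comparison may make the difference easier to see. This is only a sketch: the jar names are copied from the two lists in this message, and the helper name extra_jars is made up for illustration. It says nothing about which jars are actually required.

```python
# Jar names copied from the two lists quoted in this thread.
SANDHYA_JARS = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar commons-compress-1.0.jar
commons-logging-1.1.1.jar dom4j-1.6.1.jar fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()

PRAVEEN_JARS = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar commons-compress-1.0.jar
commons-logging-1.1.1.jar dom4j-1.6.1.jar fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar hamcrest-core-1.1.jar jempbox-1.1.0.jar
junit-3.8.1.jar log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar nekohtml-1.9.9.jar objenesis-1.0.jar ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar tika-parsers-0.7.jar
xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()


def extra_jars(mine, reference):
    """Return the jars that are in `mine` but not in `reference`, sorted by name."""
    return sorted(set(mine) - set(reference))


if __name__ == "__main__":
    for jar in extra_jars(PRAVEEN_JARS, SANDHYA_JARS):
        print("only in Praveen's list:", jar)
    for jar in extra_jars(SANDHYA_JARS, PRAVEEN_JARS):
        print("only in Sandhya's list:", jar)
```

Going by the lists as posted, everything Sandhya copied is also in Praveen's set, so a missing jar on Praveen's side does not look like the explanation.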
RE: Problem with pdf, upgrading Cell
Hello, Please let me know if anybody figured out a way out of this issue. Thanks, Sandhya -Original Message- From: Praveen Agrawal [mailto:pkal...@gmail.com] Sent: Friday, April 30, 2010 11:14 PM To: solr-user@lucene.apache.org Subject: Re: Problem with pdf, upgrading Cell Grant, You can try any of the sample pdfs that come in /docs folder of Solr 1.4 dist'n. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only metadata i.e. stream_size, content_type apart from my own literals are indexed, and content is missing..
RE: Problem with pdf, upgrading Cell
Hi, Grant, i confirm what Praveen has said, any PDF i try does not work with the new Tika and SVN versions. :( Marc From: sagar...@opentext.com To: solr-user@lucene.apache.org Date: Mon, 3 May 2010 13:05:24 +0530 Subject: RE: Problem with pdf, upgrading Cell Hello, Please let me know if anybody figured out a way out of this issue. Thanks, Sandhya
Re: Problem with pdf, upgrading Cell
I'm investigating. On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote: Hi, Grant, i confirm what Praveen has said, any PDF i try does not work with the new Tika and SVN versions. :( Marc -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Problem with pdf, upgrading Cell
I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this. It is indeed a bug somewhere (still investigating). It seems that Tika is now picking an EmptyParser implementation when trying to determine which parser to use, despite the fact that it properly identifies the MIME Type. -Grant On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote: I'm investigating. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
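Grant's description (the MIME type is detected correctly, yet an empty parser ends up being chosen) can be pictured with a toy model. The sketch below is not Tika's code and does not use any Tika API. Every name in it (ToyEmptyParser, ToyPdfParser, build_registry, pick_parser) is hypothetical. It only assumes that parsers are looked up by MIME type and that a do-nothing parser is the fallback when nothing is registered.

```python
class ToyEmptyParser:
    """Hypothetical stand-in for a fallback parser: accepts any input, extracts nothing."""

    def parse(self, data):
        return ""


class ToyPdfParser:
    """Hypothetical PDF parser: pretends to extract text from the given bytes."""

    def parse(self, data):
        return "extracted text"


def build_registry(visible_parsers):
    """Map each MIME type to an instance of the parser class that could be found for it."""
    return {mime: parser_cls() for mime, parser_cls in visible_parsers.items()}


def pick_parser(registry, mime_type):
    """Return the parser registered for mime_type, or the empty fallback if there is none."""
    return registry.get(mime_type, ToyEmptyParser())


if __name__ == "__main__":
    mime = "application/pdf"  # detection is assumed to have succeeded in both cases

    # Case 1: the parser implementations are visible, so the PDF parser is chosen.
    working = build_registry({mime: ToyPdfParser})
    print(repr(pick_parser(working, mime).parse(b"%PDF-1.4")))

    # Case 2: nothing is visible, so the lookup silently falls back to the empty parser.
    broken = build_registry({})
    print(repr(pick_parser(broken, mime).parse(b"%PDF-1.4")))
```

In the second case no exception is raised and no text comes back, which would be consistent with the symptom reported in this thread: metadata and literals are indexed, but the content is missing.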
Re: Problem with pdf, upgrading Cell
Little more info... Seems to be a classloading issue. The tests pass, but they aren't loading the Tika libraries via the Solr ResourceLoader, whereas the example is. Marc, one thing to try is to unjar the Solr WAR file and put the Tika libs in there, as I bet it will then work. Note, however, I haven't tried this. On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote: I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this. It is indeed a bug somewhere (still investigating). It seems that Tika is now picking an EmptyParser implementation when trying to determine which parser to use, despite the fact that it properly identifies the MIME Type. -Grant -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
RE: Problem with pdf, upgrading Cell
Marc, did you manage to get it to work? I tried the latest Tika (0.7) from the command line and it successfully parsed the earlier problematic pdf. Then I replaced the Tika-related jars in the Solr 1.4 contrib/extraction/lib folder with the new ones. Now it doesn't throw any exception, but there is no content extraction, only metadata! It now doesn't even extract content from pdfs that it was able to handle earlier (v0.4). Strange.. -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-with-pdf-upgrading-Cell-tp745557p767447.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Problem with pdf, upgrading Cell
I observed the same issue too, with the Tika 0.7 jars. It now fails to extract content from documents of any type. It works with Tika 0.5 though. Thanks, Sandhya -Original Message- From: pk [mailto:pkal...@gmail.com] Sent: Friday, April 30, 2010 3:17 PM To: solr-user@lucene.apache.org Subject: RE: Problem with pdf, upgrading Cell Marc, did you manage to get it to work? I tried the latest Tika (0.7) from the command line and it successfully parsed the earlier problematic pdf. Then I replaced the Tika-related jars in the Solr 1.4 contrib/extraction/lib folder with the new ones. Now it doesn't throw any exception, but there is no content extraction, only metadata! It now doesn't even extract content from pdfs that it was able to handle earlier (v0.4). Strange..
Re: Problem with pdf, upgrading Cell
Can you share the PDF it is failing on? FWIW, PDFs are notoriously hard to extract. They come in all shapes and flavors and I've seen many a commercial extractor fail on them too. Have you tried using either Tika standalone or PDFBox standalone? Does the file work there? On Apr 26, 2010, at 8:35 AM, Marc Ghorayeb wrote: Okay, I've been digging a little through the Java code from the SVN, and it seems the load function inside the ExtractingDocumentLoader class does not receive the ContentStream (it is set to null...). Maybe I should send this to the developer mailing list? Marc -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Problem with pdf, upgrading Cell
I did try the standalone version of Tika 0.7, and it extracted the pdf content successfully. Then I replaced the Tika-related jars in contrib/extraction/lib of the Solr 1.4 dist'n with their newer versions, and now it doesn't extract content from ANY pdf. Earlier (0.4) it was throwing an exception for a few pdfs, but now there is neither content nor an exception. On Fri, Apr 30, 2010 at 4:14 PM, Grant Ingersoll gsing...@apache.org wrote: Can you share the PDF it is failing on? FWIW, PDFs are notoriously hard to extract. They come in all shapes and flavors and I've seen many a commercial extractor fail on them too. Have you tried using either Tika standalone or PDFBox standalone? Does the file work there?
Re: Problem with pdf, upgrading Cell
Hi, Nope, I didn't get it to work... Just like you, the command-line version of Tika extracts the content correctly, but once included in Solr, no content is extracted. What I have tried so far:
- Updating the Tika libraries inside the Solr 1.4 public version: no luck there.
- Downloading the latest SVN version, compiling it, and starting from a simple schema: still no luck.
- Getting other versions compiled on Hudson (nightly builds) and testing them as well: still no extraction.
I sent a mail to the developers mailing list but they told me I should just mail here. I hope some developer reads this, because it's quite an important feature of Solr and somehow it got broken between the 1.4 release and the latest version on the SVN. Marc
Re: Problem with pdf, upgrading Cell
Praveen and Marc, Can you share the PDF (feel free to email my private email) that fails in Solr? Thanks, Grant On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote: Hi, Nope, I didn't get it to work... Just like you, the command-line version of Tika extracts the content correctly, but once included in Solr, no content is extracted. What I have tried so far:
- Updating the Tika libraries inside the Solr 1.4 public version: no luck there.
- Downloading the latest SVN version, compiling it, and starting from a simple schema: still no luck.
- Getting other versions compiled on Hudson (nightly builds) and testing them as well: still no extraction.
I sent a mail to the developers mailing list but they told me I should just mail here. I hope some developer reads this, because it's quite an important feature of Solr and somehow it got broken between the 1.4 release and the latest version on the SVN. Marc -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Problem with pdf, upgrading Cell
Grant, You can try any of the sample pdfs that come in the /docs folder of the Solr 1.4 dist'n. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only metadata (i.e. stream_size, content_type), apart from my own literals, is indexed, and the content is missing. On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote: Praveen and Marc, Can you share the PDF (feel free to email my private email) that fails in Solr? Thanks, Grant
RE: Problem with pdf, upgrading Cell
Okay, I've been digging a little through the Java code from the SVN, and it seems the load function inside the ExtractingDocumentLoader class does not receive the ContentStream (it is set to null...). Maybe I should send this to the developer mailing list? Marc From: dekay...@hotmail.com To: solr-user@lucene.apache.org Subject: RE: Problem with pdf, upgrading Cell Date: Fri, 23 Apr 2010 16:03:28 +0200 Seems like I'm not the only one with this no-extraction problem: http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html Apparently he tried the same thing, building from the trunk and indexing a pdf, and no extraction occurred... Strange. Marc G.
Re: Problem with pdf, upgrading Cell
Marc, got anything in your logs? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Marc Ghorayeb dekay...@hotmail.com To: solr-user@lucene.apache.org Sent: Fri, April 23, 2010 8:42:53 AM Subject: Problem with pdf, upgrading Cell Hello, I configured a Solr server to be able to extract data from various documents, including pdfs. Unfortunately, the data extraction fails on several pdfs. I have read around here that this may be due to the old Tika library being used? I looked around and saw that the SVN had a newer version, so I checked out the trunk and built it using ant dist and ant example. I then set up my schema in the newly built server, and inserted the library from the newly built Cell into the lib directory (in Solr's home). However, now all I get is a blank response... The indexing works, but it doesn't extract anything; only the literal values that I pass on are indexed. Any help would be greatly appreciated!! :) Thank you. Marc Ghorayeb
RE: Problem with pdf, upgrading Cell
I'm launching it with the start.jar utility, and there doesn't seem to be anything weird in the console when I upload a pdf. Is there a way to output the console to a log file? The only log file that gets updated is a log file in the logs directory, and it seems to only show the input/output of the web requests (GETs and POSTs...). For example:
127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?show=schema&wt=json HTTP/1.1 200 21690
127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?wt=json HTTP/1.1 200 780
127.0.0.1 - - [23/Apr/2010:13:06:57 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Clucidworks-solr-refguide-1.4.pdf&literal.title=lucidworks-solr-refguide-1.4.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Flucidworks-solr-refguide-1.4.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:06:58 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cmysql-proxy-en.pdf&literal.title=mysql-proxy-en.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fmysql-proxy-en.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:06:59 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cpython-cheat-sheet-v1.pdf&literal.title=python-cheat-sheet-v1.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fpython-cheat-sheet-v1.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/schema.jsp HTTP/1.1 200 26395
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/jquery-1.2.3.min.js HTTP/1.1 304 0
I don't think that's going to help much :) Date: Fri, 23 Apr 2010 06:04:34 -0700 From: otis_gospodne...@yahoo.com Subject: Re: Problem with pdf, upgrading Cell To: solr-user@lucene.apache.org Marc, got anything in your logs? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
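The extract requests in the log above pass the document's own fields as literal.* query parameters, URL-encoded, with the group field repeated once per value. A rough sketch of how such a URL can be assembled with Python's standard library follows. The function name extract_url, the host and port, and the sample values are assumptions for illustration; only the /update/extract path and the literal.* / wt / version parameter names come from the log.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def extract_url(core_url, literals, **params):
    """Build an /update/extract URL carrying literal.* fields plus extra parameters.

    A list or tuple value is sent as one repeated parameter per element,
    the way literal.group is repeated in the request log.
    """
    query = []
    for field, value in literals.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        query.extend(("literal." + field, v) for v in values)
    query.extend(params.items())
    return core_url + "/update/extract?" + urlencode(query)


if __name__ == "__main__":
    url = extract_url(
        "http://localhost:8983/solr/core0",  # assumed host and port
        {
            "id": "C:\\docs\\example.pdf",   # hypothetical file path
            "title": "example.pdf",
            "group": ["portal", "var"],
        },
        wt="javabin",
        version="1",
    )
    print(url)
    # Decode the query string again to check that every value survived encoding.
    print(parse_qs(urlsplit(url).query))
```

urlencode takes care of escaping the backslashes, colons and spaces in Windows paths, which is the same kind of encoding visible in the literal.id values in the log.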
RE: Problem with pdf, upgrading Cell
Seems like I'm not the only one with this no-extraction problem:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
Apparently he tried the same thing, building from the trunk and indexing a PDF, and no extraction occurred... Strange.

Marc G.

From: dekay...@hotmail.com
To: solr-user@lucene.apache.org
Subject: RE: Problem with pdf, upgrading Cell
Date: Fri, 23 Apr 2010 15:12:39 +0200

I'm launching it with the start.jar utility, and there doesn't seem to be anything weird in the console when I upload a PDF. Is there a way to output the console to a log file? The only log file that gets updated is one in the logs directory, and it seems to show only the input/output of the web requests (GETs and POSTs). For example:

127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?show=schema&wt=json HTTP/1.1 200 21690
127.0.0.1 - - [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?wt=json HTTP/1.1 200 780
127.0.0.1 - - [23/Apr/2010:13:06:57 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Clucidworks-solr-refguide-1.4.pdf&literal.title=lucidworks-solr-refguide-1.4.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Flucidworks-solr-refguide-1.4.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:06:58 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cmysql-proxy-en.pdf&literal.title=mysql-proxy-en.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fmysql-proxy-en.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:06:59 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cpython-cheat-sheet-v1.pdf&literal.title=python-cheat-sheet-v1.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fpython-cheat-sheet-v1.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/schema.jsp HTTP/1.1 200 26395
127.0.0.1 - - [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/jquery-1.2.3.min.js HTTP/1.1 304 0

I don't think that's going to help much :)

Date: Fri, 23 Apr 2010 06:04:34 -0700
From: otis_gospodne...@yahoo.com
Subject: Re: Problem with pdf, upgrading Cell
To: solr-user@lucene.apache.org

Marc, got anything in your logs?

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Marc Ghorayeb dekay...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Fri, April 23, 2010 8:42:53 AM
Subject: Problem with pdf, upgrading Cell

Hello,

I configured a Solr server to extract data from various documents, including PDFs. Unfortunately, the data extraction fails on several PDFs. I have read around here that this may be due to the old Tika library being used?

I looked around and saw that the svn had a newer version, so I checked out the trunk and built it using ant dist and ant example. I then set up my schema in the newly built server and put the library from the newly built Cell into the lib directory (in Solr's home). However, now all I get is a blank response... The indexing works, but it doesn't extract anything; only the literal values that I pass in are indexed.

Any help would be greatly appreciated!! :)

Thank you.
Marc Ghorayeb
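One way to narrow down a "blank response" like the one described above is to ask the ExtractingRequestHandler to return what Tika extracted without indexing it, using its extractOnly=true parameter. The Python sketch below is only an illustration under stated assumptions: the host and port (localhost:8983), the helper names, and the example file name are hypothetical; the core name core0 and the literal.* parameters are taken from the request logs in this thread.

```python
import urllib.request
from urllib.parse import urlencode


def build_extract_url(base_url, literals=None, extract_only=False):
    """Build an /update/extract URL with properly '&'-separated parameters.

    `literals` maps field names to a value or a list of values; each one is
    sent as a literal.<field> parameter, as in the request logs above.
    """
    params = []
    for name, value in (literals or {}).items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            params.append(("literal." + name, v))
    if extract_only:
        # extractOnly=true asks Solr to return the extracted content
        # instead of indexing the document.
        params.append(("extractOnly", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def post_file(url, path, content_type="application/pdf"):
    """POST the raw file bytes to `url` and return the response body as text."""
    with open(path, "rb") as f:
        request = urllib.request.Request(
            url, data=f.read(), headers={"Content-Type": content_type}
        )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical usage (assumes a Solr instance on localhost:8983):
# url = build_extract_url("http://localhost:8983/solr/core0", extract_only=True)
# print(post_file(url, "mysql-proxy-en.pdf"))
```

If the response to an extractOnly request contains no body text for a PDF that clearly has some, the problem is more likely in the Tika/Cell setup than in the schema or field mapping.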
Re: Problem with pdf, upgrading Cell
Marc,

These are your request logs. You want to look at your Solr logs.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
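On the question of getting the console output into a file: when Solr is started with start.jar, its own log messages go to the console, so one simple option is to redirect the process's stdout and stderr to a file. The Python sketch below shows that idea generically; the helper name run_with_log, the log file name, and the working directory in the usage comment are assumptions, not something taken from this thread.

```python
import subprocess


def run_with_log(command, log_path, cwd=None):
    """Run `command`, appending its stdout and stderr to `log_path`.

    Blocks until the process exits and returns its exit code.
    """
    with open(log_path, "ab") as log:
        process = subprocess.Popen(
            command,
            stdout=log,
            stderr=subprocess.STDOUT,  # merge stderr into the same file
            cwd=cwd,
        )
        return process.wait()


# Hypothetical usage (assumes start.jar lives in ./example):
# run_with_log(["java", "-jar", "start.jar"], "solr-console.log", cwd="example")
```

The same effect can be had from a shell with ordinary output redirection; the point is only that the file under logs/ is the servlet container's request log, while Solr's messages are whatever the console shows.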
RE: Problem with pdf, upgrading Cell
PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming searc...@105585dc main from searc...@2efeecca main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming searc...@105585dc main from searc...@2efeecca main documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to searc...@105585dc main
Apr 23, 2010 5:47:14 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
Apr 23, 2010 5:47:14 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher searc...@105585dc main
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing searc...@2efeecca main
    fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
    filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
    queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
    documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {optimize=} 0 46
Apr 23, 2010 5:47:14 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={optimize=true&waitSearcher=true&maxSegments=1&waitFlush=true&wt=javabin&version=1} status=0 QTime=46
Re: Problem with pdf, upgrading Cell
On Fri, Apr 23, 2010 at 5:48 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

Yes, the only log I can actually get is the one in the command console from Windows, and there are no errors there... Here are the last lines when I upload a PDF to the update/extract URL:

<snip>

I am pretty sure it is Tika itself that does not manage to convert your PDF. I'm not using Solr Cell but Tika from the command line, and PDF extraction works in most cases only with very recent Tika builds. So I suggest building Tika from svn yourself and, if the command-line extraction works, integrating it back into Solr. See http://wiki.apache.org/solr/ExtractingRequestHandler for instructions (the committer section).

hth
Paul
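The command-line check Paul suggests can be scripted. The sketch below wraps the Tika application jar's --text option, which prints the plain-text content of a document. The jar file name and the PDF name in the usage comment are assumptions (they depend on the Tika build you produce), and the helper names are hypothetical.

```python
import subprocess


def tika_text_command(tika_jar, document):
    """Return the command line that asks the Tika app jar for plain text."""
    return ["java", "-jar", tika_jar, "--text", document]


def extract_text(tika_jar, document):
    """Run Tika on `document` and return the extracted text.

    Raises subprocess.CalledProcessError if Tika exits with an error.
    """
    result = subprocess.run(
        tika_text_command(tika_jar, document),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage (jar name depends on your Tika build):
# text = extract_text("tika-app.jar", "mysql-proxy-en.pdf")
# print(len(text.strip()), "characters extracted")
```

If this returns text for a PDF that Solr Cell indexes as empty, Tika itself handles the file and the remaining suspect is how the Tika jars are loaded inside Solr.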