RE: Problem with pdf, upgrading Cell

2010-05-11 Thread Marc Ghorayeb


Great news, thanks :)
Marc  

Re: Problem with pdf, upgrading Cell

2010-05-10 Thread Grant Ingersoll

I've integrated this into Solr's trunk: 
https://issues.apache.org/jira/browse/SOLR-1902


-Grant


RE: Problem with pdf, upgrading Cell

2010-05-06 Thread Sandhya Agarwal

Praveen,

You can get the latest code, containing the fix, from here:

http://lucene.apache.org/tika/source-repository.html

Thanks,
Sandhya

-----Original Message-----
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Wednesday, May 05, 2010 10:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell

The issue tracker reports that Jukka has resolved the issue (TIKA-419), and
it is now waiting for Grant to verify (SOLR-1902). But it seems the fix will
only be available in version 0.8 of Tika.

If it solves the problem, is there a way to get it now? Any SVN trunk access,
etc.? All I see there is the 0.7 src zip to download.

Thanks.
Praveen


On Tue, May 4, 2010 at 3:59 PM, Grant Ingersoll gsing...@apache.org wrote:

 Yes, it is loading the libraries, but they are in a different classloader
 that apparently the new way Tika loads doesn't have access to.

 -Grant

 On May 4, 2010, at 3:28 AM, Sandhya Agarwal wrote:

  Hello,

  But I see that the libraries are being loaded:

  INFO: Adding specified lib dirs to ClassLoader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/asm-3.1.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcmail-jdk15-1.45.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcprov-jdk15-1.45.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-compress-1.0.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-logging-1.1.1.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/dom4j-1.6.1.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/fontbox-1.1.0.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/geronimo-stax-api_1.0_spec-1.0.1.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/jempbox-1.1.0.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/log4j-1.2.14.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/metadata-extractor-2.4.0-beta-1.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/pdfbox-1.1.0.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-3.6.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-3.6.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-schemas-3.6.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-scratchpad-3.6.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tagsoup-1.2.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader
  May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader
  May 4, 2010 12:49:59 PM
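
A log excerpt like the one above can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the original thread: it assumes the log has been saved as text and that each entry follows the `INFO: Adding 'file:/...' to classloader` pattern shown above. The function name `added_jars` and the `SAMPLE` string are illustrative only.

```python
import re

# Matches entries such as:
#   INFO: Adding 'file:/C:/.../lib/tika-core-0.7.jar' to classloader
# \s+ is used between tokens so the pattern also survives the hard line
# wraps that mail clients introduce into long log lines.
ADDING = re.compile(r"INFO:\s+Adding\s+'file:/([^']+)'\s+to\s+classloader")


def added_jars(log_text):
    """Return the jar file names the log reports as added, in log order."""
    return [path.rsplit("/", 1)[-1] for path in ADDING.findall(log_text)]


# Sample lines copied from the excerpt above, one of them hard-wrapped.
SAMPLE = """
INFO: Adding specified lib dirs to ClassLoader
INFO: Adding
'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to
classloader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/pdfbox-1.1.0.jar' to classloader
"""

if __name__ == "__main__":
    for name in added_jars(SAMPLE):
        print(name)
```

As Grant's reply notes, a jar appearing in this list only shows that Solr's resource loader picked it up; it does not show that Tika can see it.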

RE: Problem with pdf, upgrading Cell

2010-05-05 Thread Marc Ghorayeb


Hey,
I have the same list, and I added the extraction library (the Apache Solr
Cell jar) to it, though you might not need it inside the WAR file
specifically.
Marc
 From: sagar...@opentext.com
 To: solr-user@lucene.apache.org
 Date: Wed, 5 May 2010 10:21:36 +0530
 Subject: RE: Problem with pdf, upgrading Cell
 
 Looks like the highlighting may not work here. The following is the list of
 jars I copied:
 
 asm-3.1.jar
 bcmail-jdk15-1.45.jar
 bcprov-jdk15-1.45.jar
 commons-compress-1.0.jar
 commons-logging-1.1.1.jar
 dom4j-1.6.1.jar
 fontbox-1.1.0.jar
 geronimo-stax-api_1.0_spec-1.0.1.jar
 jempbox-1.1.0.jar
 log4j-1.2.14.jar
 metadata-extractor-2.4.0-beta-1.jar
 pdfbox-1.1.0.jar
 poi-3.6.jar
 poi-ooxml-3.6.jar
 poi-ooxml-schemas-3.6.jar
 poi-scratchpad-3.6.jar
 tagsoup-1.2.jar
 tika-core-0.7.jar
 tika-parsers-0.7.jar
 xml-apis-1.0.b2.jar
 xmlbeans-2.3.0.jar
 
 Thanks,
 Sandhya
 
 
 
 -----Original Message-----
 From: Sandhya Agarwal [mailto:sagar...@opentext.com]
 Sent: Wednesday, May 05, 2010 10:06 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Problem with pdf, upgrading Cell

 Praveen,

 I only have the highlighted jars copied. I am not sure whether we need the
 other jars. Also, I copied the jars directly into solr\WEB-INF\lib, like you
 did.

 Thanks,
 Sandhya

 -----Original Message-----
 From: Praveen Agrawal [mailto:pkal...@gmail.com]
 Sent: Tuesday, May 04, 2010 8:10 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with pdf, upgrading Cell

 Hi Sandhya,

 I must be missing something. I copied all the dependency jars to both the
 contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
 copied:

 asm-3.1.jar
 bcmail-jdk15-1.45.jar
 bcprov-jdk15-1.45.jar
 commons-compress-1.0.jar
 commons-logging-1.1.1.jar
 dom4j-1.6.1.jar
 fontbox-1.1.0.jar
 geronimo-stax-api_1.0_spec-1.0.1.jar
 hamcrest-core-1.1.jar
 jempbox-1.1.0.jar
 junit-3.8.1.jar
 log4j-1.2.14.jar
 metadata-extractor-2.4.0-beta-1.jar
 mockito-core-1.7.jar
 nekohtml-1.9.9.jar
 objenesis-1.0.jar
 ooxml-schemas-1.0.jar
 pdfbox-1.1.0.jar
 poi-3.6.jar
 poi-ooxml-3.6.jar
 poi-ooxml-schemas-3.6.jar
 poi-scratchpad-3.6.jar
 tagsoup-1.2.jar
 tika-core-0.7.jar
 tika-parsers-0.7.jar
 xml-apis-1.0.b2.jar
 xmlbeans-2.3.0.jar

 Still the same result for me.

 Marc,

 I'm on Windows, and I copied the above jars directly into the already
 extracted folder webapps/solr/WEB-INF/lib, in addition to what was already
 there. I didn't explicitly un-jar and re-jar solr.war; do you think that
 could be the issue? I think Tomcat extracts the WAR and uses the folder in
 webapps (I didn't put the WAR file in webapps; I put the extracted solr
 folder there directly).

 If it has worked for you guys, especially with my two PDFs, then that's
 really great. Please let me know your exact procedure, including everything
 you copied and where, or if you see that I missed something obvious.

 Thanks,
 Praveen

 On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal sagar...@opentext.com wrote:

  Both the files work for me, Praveen.

  Thanks,
  Sandhya

  From: Praveen Agrawal [mailto:pkal...@gmail.com]
  Sent: Tuesday, May 04, 2010 5:22 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Problem with pdf, upgrading Cell

  Another one here.

  On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.com wrote:

  It bounced because of the attachment's size.
  Attaching them one by one now.

  On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

  I noticed the following relationship between producer/creator and content
  extraction; not sure if it is helpful (as Grant said earlier, PDFs are
  notorious):

  Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not
  registered)
  Creator: PScript5.dll Version 5.2.2
  Extraction: no content -- installing Solr in Tomcat.pdf (attached; I
  generated it)

  Producer: Acrobat Distiller 7.0.5 (Windows)
  Creator: PScript5.dll Version 5.2.2
  Extraction: one line of content

  Producer: Acrobat Distiller 8.1.0 (Windows)
  Creator: Acrobat PDFMaker 8.1 for Word
  Extraction: one line of content (Free_Two_way_Radio_Guide.pdf, attached;
  it was freely available on their website)

  Producer: FOP 0.20.5
  Extraction: full content (/docs/features.pdf, linkmap.pdf, etc.)

  Thanks.
  Praveen

  On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.com wrote:

  Yes Sandhya,
  I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this
  is what you were asking.
  Thanks.

  On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal sagar...@opentext.com
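
The two jar lists quoted in this message differ, which is easy to miss when reading them one after the other. The following is a small sketch, added for illustration and not part of the original thread, that computes the difference with set arithmetic. The names `jar_names` and `compare` are hypothetical; the file names are copied from the two lists above.

```python
# Sandhya's list (the shorter one quoted above).
SANDHYA = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar
log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar
poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
"""

# Praveen's list (the longer one quoted above).
PRAVEEN = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar jempbox-1.1.0.jar junit-3.8.1.jar log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar mockito-core-1.7.jar
nekohtml-1.9.9.jar objenesis-1.0.jar ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
"""


def jar_names(text):
    """Split a whitespace-separated listing into a set of jar file names."""
    return set(text.split())


def compare(first, second):
    """Return (only_in_first, only_in_second) as sorted lists."""
    return sorted(first - second), sorted(second - first)


if __name__ == "__main__":
    only_praveen, only_sandhya = compare(jar_names(PRAVEEN), jar_names(SANDHYA))
    print("only in Praveen's list:", only_praveen)
    print("only in Sandhya's list:", only_sandhya)
```

Six jars appear only in Praveen's list, and several of them (junit, mockito-core, hamcrest-core) are testing libraries. Nobody in the thread reports that the extra jars change the outcome; Sandhya only says she is not sure they are needed.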

Re: Problem with pdf, upgrading Cell

2010-05-05 Thread Praveen Agrawal

Marc & Sandhya,
Did you use Solr from trunk?
I used the Solr 1.4 distribution, and even after copying all the jars, I
still get the same results for the PDFs I posted here.
Thanks.


RE: Problem with pdf, upgrading Cell

2010-05-05 Thread Marc Ghorayeb


Praveen,
I am indeed using a trunk version, from last week's SVN I think. You could
always try a version from the Hudson builds. I did not try this procedure with
the Solr 1.4 release, though.

Marc  

RE: Problem with pdf, upgrading Cell

2010-05-05 Thread Sandhya Agarwal

Praveen,

I got the Solr 1.4 release from here:
http://download.filehat.com/apache/lucene/solr/1.4.0/

Thanks,
Sandhya


Re: Problem with pdf, upgrading Cell

2010-05-05 Thread Praveen Agrawal

' to classloader
  May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
  May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
  May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
  INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader

  Thanks,
  Sandhya

  -----Original Message-----
  From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
  Sent: Tuesday, May 04, 2010 6:13 AM
  Cc: solr-user@lucene.apache.org
  Subject: Re: Problem with pdf, upgrading Cell

  Little more info... Seems to be a classloading issue.  The tests pass,
  but they aren't loading the Tika libraries via the Solr ResourceLoader,
  whereas the example is.  Marc, one thing to try is to unjar the Solr WAR
  file and put the Tika libs in there, as I bet it will then work.  Note,
  however, I haven't tried this.
 
 
 
  On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:

  I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track
  this.  It is indeed a bug somewhere (still investigating).  It seems that
  Tika is now picking an EmptyParser implementation when trying to determine
  which parser to use, despite the fact that it properly identifies the MIME
  type.

  -Grant

  On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:

  I'm investigating.

  On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:

  Hi,
  Grant, I confirm what Praveen has said: any PDF I try does not work with
  the new Tika and SVN versions. :(
  Marc

  From: sagar...@opentext.com
  To: solr-user@lucene.apache.org
  Date: Mon, 3 May 2010 13:05:24 +0530
  Subject: RE: Problem with pdf, upgrading Cell

  Hello,

  Please let me know if anybody has figured out a way around this issue.

  Thanks,
  Sandhya

  -----Original Message-----
  From: Praveen Agrawal [mailto:pkal...@gmail.com]
  Sent: Friday, April 30, 2010 11:14 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Problem with pdf, upgrading Cell

  Grant,
  You can try any of the sample PDFs that come in the /docs folder of the
  Solr 1.4 distribution. I tried 'Installing Solr in Tomcat.pdf',
  'index.pdf', etc. Only metadata (i.e. stream_size and content_type), apart
  from my own literals, is indexed, and the content is missing.

  On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:

  Praveen and Marc,

  Can you share the PDF (feel free to email my private email) that fails in
  Solr?

  Thanks,
  Grant

  On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:

  Hi,
  Nope, I didn't get it to work... Just like you, the command-line version
  of Tika extracts the content correctly, but once included in Solr, no
  content is extracted.
  What I have tried so far:
  - Updating the Tika libraries inside the public Solr 1.4 release: no luck
    there.
  - Downloading the latest SVN version, compiling it, and starting from a
    simple schema: still no luck.
  - Getting other versions compiled on Hudson (nightly builds) and testing
    them too: still no extraction.
  I sent a mail to the developers' mailing list, but they told me I should
  just mail here. I hope some developer reads this, because it's quite an
  important feature of Solr, and somehow it got broken between the 1.4
  release and the latest version in SVN.
  Marc
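
The symptom described here (metadata indexed, body missing) can be checked without touching the index by asking the extracting handler to return what it extracted. The following is a hedged sketch, not taken from the thread: it assumes the handler is mapped at `/update/extract` as in Solr's example configuration, that Solr listens on `http://localhost:8983/solr`, and that the `extractOnly` parameter is available in the version in use. The function names `extract_url` and `post_pdf` are made up for illustration.

```python
import urllib.parse
import urllib.request


def extract_url(base="http://localhost:8983/solr", extract_only=True, params=None):
    """Build a request URL for the extracting handler.

    `base` and the `/update/extract` path are assumptions about the local
    setup; adjust them to match solrconfig.xml.
    """
    query = dict(params or {})
    if extract_only:
        # Ask Solr to return the extracted content instead of indexing it.
        query["extractOnly"] = "true"
    encoded = urllib.parse.urlencode(sorted(query.items()))
    return base.rstrip("/") + "/update/extract?" + encoded


def post_pdf(path, url):
    """POST a PDF as the raw request body and return the response text."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/pdf"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds and prints the URL; no request is sent here.
    print(extract_url())
```

If the response for a PDF contains metadata but an empty body, while the Tika command-line tool extracts text from the same file, that matches the behaviour reported in this thread.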
 

RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

' to 
classloader

May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader

INFO: Adding 
'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to 
classloader

May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader

INFO: Adding 
'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar'
 to classloader

May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader

INFO: Adding 
'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar'
 to classloader

May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader

INFO: Adding 
'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar'
 to classloader

May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader 
replaceClassLoader

INFO: Adding 
'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to 
classloader



Thanks,

Sandhya



-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, May 04, 2010 6:13 AM
Cc: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell



Little more info... Seems to be a classloading issue.  The tests pass, but they 
aren't loading the Tika libraries via the Solr ResourceLoader, whereas the 
example is.  Marc, one thing to try is to unjar the Solr WAR file and put the 
Tika libs in there, as I bet it will then work.  Note, however, I haven't tried 
this.



On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:



 I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this.  
 It is indeed a bug somewhere (still investigating).  It seems that Tika is 
 now picking an EmptyParser implementation when trying to determine which 
 parser to use, despite the fact that it properly identifies the MIME Type.
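The symptom described above (the MIME type is detected, yet an empty parser is chosen) is what a type-keyed lookup produces when nothing has been registered for that type. A minimal sketch in Java, assuming a plain map-based registry; `DemoParser`, `ParserRegistry` and `EMPTY` are hypothetical names for illustration and are not Tika's actual classes:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a parser: turns raw bytes into extracted text. */
interface DemoParser {
    String parse(byte[] data);
}

/** Hypothetical registry keyed by MIME type; not Tika's real implementation. */
final class ParserRegistry {
    /** Fallback that extracts nothing, mirroring the "no content" symptom. */
    static final DemoParser EMPTY = data -> "";

    private final Map<String, DemoParser> parsers = new HashMap<>();

    /** Associates a parser with a MIME type. */
    void register(String mimeType, DemoParser parser) {
        parsers.put(mimeType, parser);
    }

    /** Returns the parser registered for the type, or EMPTY if none is known. */
    DemoParser lookup(String mimeType) {
        return parsers.getOrDefault(mimeType, EMPTY);
    }
}
```

With an empty registry, `lookup("application/pdf")` returns the fallback even though the type string itself is correct, so only metadata would ever be produced.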



 -Grant



 On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:



 I'm investigating.



 On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:





 Hi,

 Grant, I confirm what Praveen has said: any PDF I try does not work with
 the new Tika and SVN versions. :(

 Marc



 From: sagar...@opentext.com
 To: solr-user@lucene.apache.org
 Date: Mon, 3 May 2010 13:05:24 +0530
 Subject: RE: Problem with pdf, upgrading Cell

 Hello,

 Please let me know if anybody has figured out a way around this issue.

 Thanks,
 Sandhya

 -Original Message-
 From: Praveen Agrawal [mailto:pkal...@gmail.com]
 Sent: Friday, April 30, 2010 11:14 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with pdf, upgrading Cell

 Grant,

 You can try any of the sample PDFs that come in the /docs folder of the
 Solr 1.4 distribution. I tried 'Installing Solr in Tomcat.pdf', 'index.pdf',
 etc. Only metadata (i.e. stream_size and content_type, apart from my own
 literals) is indexed; the content is missing.





 On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll gsing...@apache.org wrote:



 Praveen and Marc,

 Can you share the PDF (feel free to email my private email) that fails in Solr?

 Thanks,
 Grant





 On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:





 Hi

 Nope, I didn't get it to work... Just like you, the command-line version of
 Tika extracts the content correctly, but once included in Solr, no content
 is extracted.

 What I have tried so far:
 - Updating the Tika libraries inside the Solr 1.4 public version: no luck there.
 - Downloading the latest SVN version, compiling it, and starting from a simple schema: still no luck.
 - Getting other versions compiled on Hudson (nightly builds) and testing them too: still no extraction.

 I sent a mail to the developers' mailing list, but they told me I should
 just mail here. I hope some developer reads this, because it's quite an
 important feature of Solr and somehow it got broken between the 1.4 release
 and the latest version in SVN.

 Marc




 --

 Grant Ingersoll

 http://www.lucidimagination.com/



 Search the Lucene ecosystem using Solr/Lucene:

 http://www.lucidimagination.com/search








RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

Yes, Grant, you are right. Copying the Tika libraries into the Solr webapp solved 
the issue, and content extraction works fine now.

Thanks,
Sandhya

-Original Message-
From: Sandhya Agarwal [mailto:sagar...@opentext.com] 
Sent: Tuesday, May 04, 2010 12:58 PM
To: solr-user@lucene.apache.org
Subject: RE: Problem with pdf, upgrading Cell

Hello,

But I see that the libraries are being loaded:

INFO: Adding specified lib dirs to ClassLoader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/asm-3.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcmail-jdk15-1.45.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcprov-jdk15-1.45.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-compress-1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-logging-1.1.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/dom4j-1.6.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/fontbox-1.1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/geronimo-stax-api_1.0_spec-1.0.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/jempbox-1.1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/log4j-1.2.14.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/metadata-extractor-2.4.0-beta-1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/pdfbox-1.1.0.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-schemas-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-scratchpad-3.6.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tagsoup-1.2.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xml-apis-1.0.b2.jar' to classloader
May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xmlbeans-2.3.0.jar' to classloader
May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
May 4, 2010 12:51:52 PM

RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Marc Ghorayeb


Sandhya,
How did you proceed? I did this:
- jar -xf solr.war
- I then added all of the libs I had into the WEB-INF/lib folder
- I then recreated the archive with jar -cvf solr.war *
- replaced the war files
- deleted the libs in the shared lib folder
- started Tomcat
I'm now getting this error:
SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:418)
    at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:454)
    at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:152)
Thanks Grant for investigating the problem!
Marc
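
One way to narrow down an "Error loading class" failure like the one above is to ask a specific class loader whether it can see the class and, if so, where the class came from. The following is only a diagnostic sketch; `ClassLocator` and `locate` are hypothetical helper names and are not part of Solr or Tika:

```java
import java.security.CodeSource;
import java.util.Optional;

/** Hypothetical diagnostic helper; not part of Solr or Tika. */
final class ClassLocator {
    /**
     * Tries to load the class through the given loader without initializing it.
     * Returns the location it was loaded from, or empty if the loader cannot
     * find (or link) the class.
     */
    static Optional<String> locate(String className, ClassLoader loader) {
        try {
            Class<?> type = Class.forName(className, false, loader);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK typically report no code source.
            if (source == null || source.getLocation() == null) {
                return Optional.of("(no code source reported)");
            }
            return Optional.of(source.getLocation().toString());
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }
}
```

Inside the webapp this could be called as `ClassLocator.locate("org.apache.solr.handler.extraction.ExtractingRequestHandler", Thread.currentThread().getContextClassLoader())`; an empty result would suggest that the jar containing the handler is not visible to that loader.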

 From: sagar...@opentext.com
 To: solr-user@lucene.apache.org
 Date: Tue, 4 May 2010 13:10:25 +0530
 Subject: RE: Problem with pdf, upgrading Cell
 
 Yes, Grant, you are right. Copying the Tika libraries into the Solr webapp solved 
 the issue, and content extraction works fine now.
 
 Thanks,
 Sandhya
 

RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

I think this is most likely because tika-core-0.7.jar no longer contains 
tika-config.xml, due to which the default Tika config is being loaded. 
This can be seen in the ExtractingRequestHandler.inform() method. Hence, the 
parsers list is empty. I am still investigating.

Thanks,
Sandhya

-Original Message-
From: Sandhya Agarwal [mailto:sagar...@opentext.com] 
Sent: Tuesday, May 04, 2010 1:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Problem with pdf, upgrading Cell

Yes, Grant, you are right. Copying the Tika libraries into the Solr webapp solved 
the issue, and content extraction works fine now.

Thanks,
Sandhya


Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal

Maybe, as Sandhya indicated, it was loading the libs from contrib earlier, so
it might still be trying to load them from there after you deleted them, and
they are somehow not being 'seen' by Solr.

Maybe keep them there, and also put them in solr/lib in the Tomcat webapps.

I have yet to try this, though.


On Tue, May 4, 2010 at 2:16 PM, Marc Ghorayeb dekay...@hotmail.com wrote:


 Sandhya,
 How did you proceed? I did this:
 - jar -xf solr.war
 - I then added all of the libs I had into the WEB-INF/lib folder
 - I then recreated the archive with jar -cvf solr.war *
 - replaced the war files
 - deleted the libs in the shared lib folder
 - started Tomcat
 I'm now getting this error:
 SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
     at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:418)
     at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:454)
     at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:152)
 Thanks Grant for investigating the problem!
 Marc


Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Grant Ingersoll

 
 May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
 INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
 May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
 INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
 

RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

OK. In Tika 0.4 and 0.5, I see that this is how the Tika config is loaded:



    public static TikaConfig getDefaultConfig() {
        InputStream stream;
        try {
            // The parser configuration is read from an XML resource bundled in the jar.
            stream = TikaConfig.class.getResourceAsStream("/org/apache/tika/tika-config.xml");
            return new TikaConfig(stream);
        } catch (IOException e) {
            throw new RuntimeException("Unable to read default configuration", e);
        } catch (SAXException e) {
            throw new RuntimeException("Unable to parse default configuration", e);
        } catch (TikaException e) {
            throw new RuntimeException("Unable to access default configuration", e);
        }
    }



And this has changed in Tika 0.7 to:



    public TikaConfig() throws MimeTypeException, IOException {
        this.parsers = new HashMap();

        ParseContext context = new ParseContext();
        // Parsers are now discovered through a service-provider lookup
        // instead of being listed in tika-config.xml.
        Iterator iterator = ServiceRegistry.lookupProviders(Parser.class);

        while (iterator.hasNext()) {
            Parser parser = (Parser) iterator.next();
            // Register the parser under every media type it supports.
            for (Iterator i$ = parser.getSupportedTypes(context).iterator(); i$.hasNext(); ) {
                MediaType type = (MediaType) i$.next();
                this.parsers.put(type.toString(), parser);
            }
        }
        this.mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
    }



Hence the reason why tika-config.xml is no longer bundled.
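
To illustrate why a provider lookup can come back empty even though the classes themselves are loadable, here is a small self-contained sketch. It uses `java.util.ServiceLoader` as a stand-in for the `ServiceRegistry` call quoted above (an assumption made for illustration, not Tika's code), and `ProviderVisibilityDemo`, `DemoParser` and `PdfDemoParser` are hypothetical names. The registration file under META-INF/services is visible only to a child class loader, so a lookup through the parent loader is expected to find nothing:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ServiceLoader;

final class ProviderVisibilityDemo {
    /** Hypothetical service interface standing in for a parser SPI. */
    public interface DemoParser {
        String name();
    }

    /** Hypothetical provider; a public no-arg constructor is required. */
    public static final class PdfDemoParser implements DemoParser {
        public PdfDemoParser() {}
        @Override public String name() { return "pdf"; }
    }

    /** Counts the providers that the given class loader can discover. */
    static int countProviders(ClassLoader loader) {
        int count = 0;
        for (DemoParser ignored : ServiceLoader.load(DemoParser.class, loader)) {
            count++;
        }
        return count;
    }

    /** Builds a child loader that can see a registration file in a temp directory. */
    static URLClassLoader loaderWithRegistration(ClassLoader parent) throws IOException {
        Path root = Files.createTempDirectory("providers");
        Path services = Files.createDirectories(root.resolve("META-INF/services"));
        // The file is named after the service type and lists the provider class.
        Files.write(services.resolve(DemoParser.class.getName()),
                PdfDemoParser.class.getName().getBytes(StandardCharsets.UTF_8));
        return new URLClassLoader(new URL[] { root.toUri().toURL() }, parent);
    }

    public static void main(String[] args) throws IOException {
        ClassLoader parent = ProviderVisibilityDemo.class.getClassLoader();
        try (URLClassLoader child = loaderWithRegistration(parent)) {
            System.out.println("providers via parent loader: " + countProviders(parent));
            System.out.println("providers via child loader:  " + countProviders(child));
        }
    }
}
```

The same provider class is loadable in both cases; only the visibility of the registration differs, which is one plausible way to end up with an empty parsers map.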



Thanks,

Sandhya



-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, May 04, 2010 4:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell



Yes, it is loading the libraries, but they are in a different classloader that 
apparently the new way Tika loads doesn't have access to.



-Grant
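
When a library discovers its providers through the thread context class loader, a common general-purpose workaround is to switch that loader temporarily around the call. The sketch below shows only this generic pattern; `ContextClassLoaderSwap` and `callWith` are hypothetical names, and nothing in this thread says the eventual SOLR-1902 fix works this way:

```java
import java.util.function.Supplier;

/** Hypothetical helper illustrating the context-class-loader swap pattern. */
final class ContextClassLoaderSwap {
    /**
     * Runs the action with the given loader installed as the current thread's
     * context class loader, and always restores the previous loader afterwards.
     */
    static <T> T callWith(ClassLoader loader, Supplier<T> action) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return action.get();
        } finally {
            // Restore even if the action throws, so other code is unaffected.
            thread.setContextClassLoader(previous);
        }
    }
}
```

Code that builds a provider-backed configuration would be passed in as the `action`, with `loader` being whichever class loader actually holds the extraction libraries.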




RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Marc Ghorayeb


Hey,
I got it to work. I just redid my steps; I had forgotten several libraries that
were imported through the XML. PDF extraction seems to work once again, and I
have yet to find a PDF that raises an exception!

Thanks for the investigation, at least we now have a fix :)
Marc  

Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal

I seem to have mixed results.

Here is what I did:
copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j jars etc. into
contrib/extraction/lib (after removing the old ones, of course), as well as
into WEB-INF/lib of the Solr web app in Tomcat.

Now it extracts content from some PDFs, but from others it extracts either no
content or only one line. For example, /docs/Installing Solr in Tomcat.pdf
still shows no content. I have two other PDFs for which it extracts only one
line of content.

Also, I'm now getting a single value in the 'title' field for some PDFs, and
two for others. Where it can extract the full content, it shows the title that
I gave as a literal while submitting the PDF. For the PDF where no content was
extracted, it shows one empty title plus mine. For the PDFs where it extracted
only one line of content, it shows that line as a title, plus mine.
The 'title' field is defined as multivalued in the schema.

Any idea what's going on? Or am I missing something?




RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

Praveen,

Along with the Tika core and parser jars, did you run mvn
dependency:copy-dependencies to generate all the dependencies too?

Thanks,
Sandhya


Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal

Yes Sandhya,
I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
what you were asking.
Thanks.



RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

OK. So I am assuming you copied all the dependencies from
tika-app\target\dependency? I tried with a number of files and don't see this
issue yet.

Thanks,
Sandhya


Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal

This email contained a .zip file attachment. Raytheon does not allow email 
attachments that are considered likely to contain malicious code. For your 
protection this attachment has been removed.

If this email is from an unknown source, please simply delete this email.

If this email was expected, and it is from a known sender, you may follow the 
below suggested instructions to obtain these types of attachments.

+ Instruct the sender to enclose the file(s) in a .zip compressed file, and 
rename the .zip compressed file with a different extension, such as, 
.rtnzip.  Password protecting the renamed .zip compressed file adds an 
additional layer of protection. When you receive the file, please rename it 
with the extension .zip.

Additional instructions and options on how to receive these attachments can be 
found at:

http://security.it.ray.com/antivirus/extensions.html
http://security.it.ray.com/news/2007/zipfiles.html

Should you have any questions or difficulty with these instructions, please 
contact the Help Desk at 877.844.4712

---

It bounced because of the attachment's size.
Attaching the files one by one now.


On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

 I noticed the following relationship between the PDF producer/creator and
 content extraction. Not sure if it helps (as Grant said earlier, PDFs are
 notorious):

 Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not
 registered)
 Creator: PScript5.dll Version 5.2.2
 Extraction: no content -- Installing Solr in Tomcat.pdf (attached; I
 generated it)
 --

 Producer: Acrobat Distiller 7.0.5 (Windows)
 Creator: PScript5.dll Version 5.2.2
 Extraction: one line of content
 --

 Producer: Acrobat Distiller 8.1.0 (Windows)
 Creator: Acrobat PDFMaker 8.1 for Word
 Extraction: one line of content -- Free_Two_way_Radio_Guide.pdf (attached;
 it was freely available on their website)
 --

 Producer: FOP 0.20.5
 Extraction: full content -- /docs/features.pdf, linkmap.pdf, etc.
 --
 Thanks.
 Praveen




RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

Both the files work for me, Praveen.

Thanks,
Sandhya


Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal


another one here..


RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Marc Ghorayeb


Praveen,
Did you try the technique I wrote about a little earlier? Take your solr.war
and put it in a directory of its own. Run jar -xf solr.war to extract its
content. Next, copy all of your libraries into the WEB-INF/lib folder. This
means all the extraction/lib files, and the lib files from the Solr root.
Once this is done, recreate the solr.war by running jar -cvf solr.war * (the *
means all the files in the current directory, so be sure to be in the root
directory where you extracted the war previously).
Once this is done, put the new solr.war inside the Tomcat webapps folder, and
recreate the solr folder from scratch (so as not to leave any overlapping
libraries). Hopefully this should work.
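
The repackaging steps above can also be scripted. The following is only a minimal sketch in Python, not something posted in the thread: the function name repack_war and its arguments are made up for illustration, and it assumes the war is an ordinary zip archive (which is what the jar tool produces). It writes a new archive instead of extracting and re-jarring in place.

```python
import zipfile
from pathlib import Path


def repack_war(war_path, jar_paths, out_path):
    """Copy war_path to out_path, adding each jar under WEB-INF/lib/.

    An existing WEB-INF/lib entry with the same file name is replaced.
    Returns the sorted list of jar entries in the new archive.
    (Hypothetical helper, for illustration only.)
    """
    new_entries = {"WEB-INF/lib/" + Path(p).name: Path(p) for p in jar_paths}
    with zipfile.ZipFile(war_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename in new_entries:
                continue  # superseded by a jar we are about to add
            dst.writestr(item, src.read(item.filename))
        for entry_name, jar_file in new_entries.items():
            dst.write(jar_file, entry_name)
        return sorted(n for n in dst.namelist()
                      if n.startswith("WEB-INF/lib/") and n.endswith(".jar"))
```

Note that this only replaces jars whose file names match exactly. An older version of a library with a different file name stays in the archive and would still have to be removed by hand.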
For the multivalued fields (title, for example), this is a known feature/issue
of Tika's integration. In my case, I always provide a literal.title along with
my PDFs, but if Tika successfully extracts a title from the PDF's metadata,
then the Solr index entry gets an array containing both the supplied literal
and the extracted value. There is no way to force the literals to override the
extracted data; they just get appended. Someone correct me if I am wrong
here :)
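
To make the appending behaviour concrete, here is a small client-side sketch in Python. It is an illustration under assumptions, not part of Solr: the helper names extract_url and preferred_title are invented, the handler path /update/extract and the unique key field id are assumed to match the usual example configuration, and the title values would come from whatever the index returns for the multivalued field.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, title):
    """Build an extract request URL that supplies literal field values.

    Assumes the extracting handler is mapped to /update/extract and that
    the schema's unique key field is called 'id'.
    """
    params = {"literal.id": doc_id, "literal.title": title, "commit": "true"}
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def preferred_title(values, supplied):
    """Pick one title from a multivalued 'title' field on the client side.

    Empty or blank values are dropped; the supplied literal wins if it is
    present, otherwise the first remaining value is used.
    """
    cleaned = [v.strip() for v in values if v and v.strip()]
    if supplied in cleaned:
        return supplied
    return cleaned[0] if cleaned else None
```

With this, a document indexed with both an extracted line and the supplied literal can still be displayed with a single title, without changing anything on the Solr side.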
Marc


Re: Problem with pdf, upgrading Cell

2010-05-04 Thread Praveen Agrawal

Hi Sandhya,
I must be missing something. I copied all the dependency jars to both the
contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
jempbox-1.1.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Still the same result for me.

Marc,
I'm on Windows, and I copied the above jars directly into the already extracted
folder webapps/solr/WEB-INF/lib, in addition to what was already there. I
didn't explicitly un-jar and re-jar the solr.war, but do you think that
could be the issue? I think Tomcat extracts the war and uses the folder in
webapps (I didn't put the war file in webapps; instead I put the extracted
solr folder there directly).
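
Since the new jars were copied in addition to what was already in the folder, one thing worth checking is whether two versions of the same library now sit side by side. The sketch below is a hypothetical Python helper (the name find_version_conflicts and the file-name pattern are assumptions, not anything from Solr or Tika). It only catches jars that share a base name and differ in version; a library that was renamed between releases would not be detected.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <name>-<version>.jar, where the version starts with a digit.
JAR_NAME = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_file_names):
    """Return {library name: sorted versions} for libraries present more than once."""
    versions = defaultdict(set)
    for file_name in jar_file_names:
        match = JAR_NAME.match(file_name)
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found)
            for name, found in versions.items() if len(found) > 1}
```

It could be fed with os.listdir() of the WEB-INF/lib folder; an empty result means no duplicate base names were found.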

If it has worked for you guys, especially with my two PDFs, then that's
really great. Please let me know your exact procedure, including everything
you copied and where, or if you see that I missed something obvious.

Thanks,
Praveen



RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

Praveen,

I only have the highlighted jars copied. Not sure if we need the other jars.
Also, I copied the jars directly into solr\WEB-INF\lib, like you did.

Thanks,
Sandhya



-Original Message-
From: Praveen Agrawal [mailto:pkal...@gmail.com]
Sent: Tuesday, May 04, 2010 8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell



Hi Sandhya,

I must be missing something. I copied all dependency jars to both the
contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
jempbox-1.1.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Still the same result for me.



Marc,

I'm on Windows, and I copied the above jars directly into the already
extracted folder webapps/solr/WEB-INF/lib, in addition to what was already
there. I didn't explicitly un-jar and re-jar the solr.war; do you think that
could be the issue? I think Tomcat extracts the war and uses the folder in
webapps (I didn't put the war file in webapps; instead I put the extracted
solr folder there directly).

If it has worked for you guys, especially with my two pdfs, then that's
really great. Please let me know your exact procedure, including what
you copied and where, or if you see that I missed something obvious.

Thanks,
Praveen





On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal sagar...@opentext.com wrote:

 Both the files work for me, Praveen.

 Thanks,
 Sandhya

 From: Praveen Agrawal [mailto:pkal...@gmail.com]
 Sent: Tuesday, May 04, 2010 5:22 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with pdf, upgrading Cell

 another one here..

 On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal pkal...@gmail.com wrote:

 It bounced because of attachment's size..
 attaching one by one now..

 On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal pkal...@gmail.com wrote:

 I noticed the following relationship between producer/creator and content
 extraction; not sure if it is helpful (as Grant said earlier, pdfs are
 notorious):

 Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
 Creator: PScript5.dll Version 5.2.2
 Extraction: no content -- Installing Solr in Tomcat.pdf (attached - I generated)
 -
 Producer: Acrobat Distiller 7.0.5 (Windows)
 Creator: PScript5.dll Version 5.2.2
 Extraction: one line of content
 -
 Producer: Acrobat Distiller 8.1.0 (Windows)
 Creator: Acrobat PDFMaker 8.1 for Word
 Extraction: one line of content (Free_Two_way_Radio_Guide.pdf - attached - was available freely on their website)
 -
 Producer: FOP 0.20.5
 Extraction: full content -- /docs/features.pdf | linkmap.pdf etc
 -

 Thanks.
 Praveen

 On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal pkal...@gmail.com wrote:

 Yes Sandhya,
 I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
 what you were asking.
 Thanks.






RE: Problem with pdf, upgrading Cell

2010-05-04 Thread Sandhya Agarwal

Looks like the highlighting may not work here. The following is the list of jars I
copied:

asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
jempbox-1.1.0.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

Thanks,
Sandhya
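
For anyone comparing the two jar lists posted in this thread, a set difference shows quickly which jars appear only in Praveen's list (the one produced with mvn dependency:copy-dependencies) and not in the shorter list above. This is only an illustrative sketch: the variable and function names are made up here, and the jar names are simply copied from the two messages.

```python
# Jar names copied from the two lists posted in this thread.
praveen_jars = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar hamcrest-core-1.1.jar
jempbox-1.1.0.jar junit-3.8.1.jar log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar mockito-core-1.7.jar nekohtml-1.9.9.jar
objenesis-1.0.jar ooxml-schemas-1.0.jar pdfbox-1.1.0.jar poi-3.6.jar
poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar poi-scratchpad-3.6.jar
tagsoup-1.2.jar tika-core-0.7.jar tika-parsers-0.7.jar xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar
""".split()

sandhya_jars = """
asm-3.1.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar
commons-compress-1.0.jar commons-logging-1.1.1.jar dom4j-1.6.1.jar
fontbox-1.1.0.jar geronimo-stax-api_1.0_spec-1.0.1.jar jempbox-1.1.0.jar
log4j-1.2.14.jar metadata-extractor-2.4.0-beta-1.jar pdfbox-1.1.0.jar
poi-3.6.jar poi-ooxml-3.6.jar poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar tagsoup-1.2.jar tika-core-0.7.jar
tika-parsers-0.7.jar xml-apis-1.0.b2.jar xmlbeans-2.3.0.jar
""".split()


def only_in_first(first, second):
    """Return the sorted names that are in `first` but not in `second`."""
    return sorted(set(first) - set(second))


# Jars present in Praveen's copy but absent from Sandhya's.
extra = only_in_first(praveen_jars, sandhya_jars)
for name in extra:
    print(name)
```

Four of the six extra jars (hamcrest, junit, mockito, objenesis) are test libraries. Whether the other two (nekohtml-1.9.9.jar, ooxml-schemas-1.0.jar) have anything to do with the differing results is not established anywhere in this thread. The maven-dependency-plugin has an includeScope parameter (e.g. -DincludeScope=runtime) that should leave out test-scoped jars, though that has not been checked against the Tika 0.7 build here.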




RE: Problem with pdf, upgrading Cell

2010-05-03 Thread Sandhya Agarwal

Hello,

Please let me know if anybody figured out a way out of this issue. 

Thanks,
Sandhya


RE: Problem with pdf, upgrading Cell

2010-05-03 Thread Marc Ghorayeb

Hi,
Grant, I confirm what Praveen has said: no PDF I try works with the
new Tika and SVN versions. :(
Marc



Re: Problem with pdf, upgrading Cell

2010-05-03 Thread Grant Ingersoll

I'm investigating.


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search

Re: Problem with pdf, upgrading Cell

2010-05-03 Thread Grant Ingersoll

I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this.  It 
is indeed a bug somewhere (still investigating).  It seems that Tika is now 
picking an EmptyParser implementation when trying to determine which parser to 
use, despite the fact that it properly identifies the MIME Type.

-Grant

 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search

Re: Problem with pdf, upgrading Cell

2010-05-03 Thread Grant Ingersoll

Little more info... Seems to be a classloading issue.  The tests pass, but they 
aren't loading the Tika libraries via the Solr ResourceLoader, whereas the 
example is.  Marc, one thing to try is to unjar the Solr WAR file and put the 
Tika libs in there, as I bet it will then work.  Note, however, I haven't tried 
this.
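
The suggestion above amounts to getting the Tika jars into WEB-INF/lib of the web app, so that they are loaded by the same classloader as Solr itself. Below is a minimal sketch of doing that on a copy of the WAR, assuming Python is at hand. The function name and every path are hypothetical, and it has not been run against a real Solr WAR.

```python
import os
import zipfile


def add_jars_to_war(war_path, jar_paths, out_path):
    """Copy war_path to out_path, adding each jar under WEB-INF/lib/.

    Entries already in the WAR with exactly the same name are replaced.
    Returns the sorted list of archive names that were added.
    """
    new_names = {"WEB-INF/lib/" + os.path.basename(p): p for p in jar_paths}
    with zipfile.ZipFile(war_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        # Copy every existing entry except the ones about to be replaced.
        for item in src.infolist():
            if item.filename in new_names:
                continue
            dst.writestr(item, src.read(item.filename))
        # Add the new jars from disk.
        for arcname, path in sorted(new_names.items()):
            dst.write(path, arcname)
    return sorted(new_names)


# Hypothetical usage (paths are placeholders):
# add_jars_to_war("solr.war",
#                 ["tika-core-0.7.jar", "tika-parsers-0.7.jar"],
#                 "solr-with-tika.war")
```

Note that this only replaces entries with an identical file name; older Tika jars with a different version in their name would still have to be removed by hand. With an already exploded webapp folder, copying the jars into webapps/solr/WEB-INF/lib should have the same effect.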

 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search

RE: Problem with pdf, upgrading Cell

2010-04-30 Thread pk


Marc,
did you manage to get it to work?

I did try the latest Tika (0.7) from the command line and successfully parsed
the earlier problematic pdf. Then I replaced the Tika-related jars in the
Solr 1.4 contrib/extraction/lib folder with the new ones. Now it doesn't throw
any exception, but there is no content extraction, only metadata! It now
doesn't even extract content from pdfs which it was able to handle earlier
(v0.4). Strange..

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-pdf-upgrading-Cell-tp745557p767447.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Problem with pdf, upgrading Cell

2010-04-30 Thread Sandhya Agarwal

I observed the same issue too, with tika 0.7 jars. It now fails to extract 
content from documents of any type. Works with tika 0.5 though.

Thanks,
Sandhya


Re: Problem with pdf, upgrading Cell

2010-04-30 Thread Grant Ingersoll

Can you share the PDF it is failing on?  FWIW, PDFs are notoriously hard to 
extract.  They come in all shapes and flavors and I've seen many a commercial 
extractor fail on them too.  Have you tried using either Tika standalone or 
PDFBox standalone?  Does the file work there?


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search

Re: Problem with pdf, upgrading Cell

2010-04-30 Thread Praveen Agrawal

I did try the standalone version of Tika 0.7, and it extracted the pdf content
successfully. Then I replaced the Tika-related jars in contrib/extraction/lib
of the Solr 1.4 distribution with their newer versions, and now it doesn't
extract content from ANY pdf.
Earlier (0.4) it was throwing an exception for a few pdfs, but now there is
neither content nor an exception.



Re: Problem with pdf, upgrading Cell

2010-04-30 Thread Marc Ghorayeb


Hi,
Nope, I didn't get it to work... Just like you, the command line version of
Tika extracts the content correctly, but once included in Solr, no content is
extracted.
What I have tried so far:
- Updating the Tika libraries inside the Solr 1.4 public version: no luck there.
- Downloading the latest SVN version, compiling it, and starting from a simple
  schema: still no luck.
- Getting other versions compiled on Hudson (nightly builds) and testing them
  too: still no extraction.
I sent a mail to the developers mailing list but they told me I should just
mail here. I hope some developer reads this, because it's quite an important
feature of Solr and somehow it got broken between the 1.4 release and the
latest version on the svn.
Marc

Re: Problem with pdf, upgrading Cell

2010-04-30 Thread Grant Ingersoll

Praveen and Marc,

Can you share the PDF (feel free to email my private email) that fails in Solr?

Thanks,
Grant



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search

Re: Problem with pdf, upgrading Cell

2010-04-30 Thread Praveen Agrawal

Grant,
You can try any of the sample pdfs that come in the /docs folder of the
Solr 1.4 distribution. I tried 'Installing Solr in Tomcat.pdf', 'index.pdf'
etc. Only metadata (i.e. stream_size, content_type), apart from my own
literals, is indexed; the content is missing.
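
One way to see what the extraction itself returns, separately from what ends up in the index, is to call the extract handler with extractOnly=true. The sketch below only builds the request URL. The host, port, the /update/extract handler path (as mapped in the Solr 1.4 example config) and the document id are assumptions, and the helper name is made up for illustration.

```python
from urllib.parse import urlencode


def extract_url(base, doc_id, title=None, extract_only=False):
    """Build a Solr Cell request URL with literal fields.

    extract_only=True asks Solr to return the extracted content instead of
    indexing it; otherwise the document is indexed and committed.
    """
    params = [("literal.id", doc_id)]
    if title is not None:
        params.append(("literal.title", title))
    if extract_only:
        params.append(("extractOnly", "true"))
    else:
        params.append(("commit", "true"))
    return base.rstrip("/") + "/update/extract?" + urlencode(params)


# Base URL is an assumption (Jetty example port); adjust for Tomcat.
print(extract_url("http://localhost:8983/solr", "doc1",
                  title="Installing Solr in Tomcat", extract_only=True))
```

The pdf itself would then be posted to that URL as a multipart file upload, for example with curl -F. If the response contains metadata but an empty body, the parser was never really invoked, which would be consistent with what is reported above.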



RE: Problem with pdf, upgrading Cell

2010-04-26 Thread Marc Ghorayeb

Okay, I've been digging a little through the Java code from SVN, and it 
seems the load function inside the ExtractingDocumentLoader class does not 
receive the ContentStream (it is set to null...). Maybe I should send this to 
the developer mailing list?
Marc

 From: dekay...@hotmail.com
 To: solr-user@lucene.apache.org
 Subject: RE: Problem with pdf, upgrading Cell
 Date: Fri, 23 Apr 2010 16:03:28 +0200

 Seems like I'm not the only one with this no-extraction problem:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
 Apparently he tried the same thing, building from the trunk and indexing a
 PDF, and no extraction occurred... Strange.
 Marc G.


Re: Problem with pdf, upgrading Cell

2010-04-23 Thread Otis Gospodnetic

Marc, got anything in your logs?

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Marc Ghorayeb dekay...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Fri, April 23, 2010 8:42:53 AM
 Subject: Problem with pdf, upgrading Cell
 
 
 Hello,
 I configured a Solr server to be able to extract data from various
 documents, including PDFs. Unfortunately, the data extraction fails on
 several PDFs. I have read around here that this may be due to the old Tika
 library being used? I looked around and saw that SVN had a newer version,
 so I checked out the trunk and built it using ant dist and ant example. I
 then set up my schema in the newly built server and inserted the library
 from the newly built Cell into the lib directory (in Solr's home). However,
 now all I get is a blank response... The indexing works, but it doesn't
 extract anything; only the literal values that I pass on are indexed.
 Any help would be greatly appreciated!! :)
 Thank you.
 Marc Ghorayeb
 
   
 

RE: Problem with pdf, upgrading Cell

2010-04-23 Thread Marc Ghorayeb


I'm launching it with the start.jar utility, and there doesn't seem to be
anything weird in the console when I upload a PDF. Is there a way to output
the console to a log file? The only log file that gets updated is one in the
logs directory, and it seems to show only the input/output of the web
requests (GETs and POSTs...).
For example:
127.0.0.1 -  -  [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?show=schema&wt=json HTTP/1.1 200 21690
127.0.0.1 -  -  [23/Apr/2010:13:06:47 +] GET /solr/core0/admin/luke?wt=json HTTP/1.1 200 780
127.0.0.1 -  -  [23/Apr/2010:13:06:57 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Clucidworks-solr-refguide-1.4.pdf&literal.title=lucidworks-solr-refguide-1.4.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Flucidworks-solr-refguide-1.4.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 41
127.0.0.1 -  -  [23/Apr/2010:13:06:58 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cmysql-proxy-en.pdf&literal.title=mysql-proxy-en.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fmysql-proxy-en.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 -  -  [23/Apr/2010:13:06:59 +] POST /solr/core0/update/extract?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace%5C3DS_FileIndexer%5Ctest%5Cpython-cheat-sheet-v1.pdf&literal.title=python-cheat-sheet-v1.pdf&literal.url=http%3A%2F%2Fwww.3ds.com%2Fpython-cheat-sheet-v1.pdf&literal.appKey=media&literal.type=document&literal.siteHash=53e446a6b81860dcfa1cc2fef4ef976b&literal.group=portal&literal.group=var&literal.group=0&literal.group=caa_gold&literal.group=caa_partner&literal.group=ag12&literal.group=ag17&wt=javabin&version=1 HTTP/1.1 200 44
127.0.0.1 -  -  [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 -  -  [23/Apr/2010:13:07:00 +] POST /solr/core0/update HTTP/1.1 200 41
127.0.0.1 -  -  [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/schema.jsp HTTP/1.1 200 26395
127.0.0.1 -  -  [23/Apr/2010:13:07:05 +] GET /solr/core0/admin/jquery-1.2.3.min.js HTTP/1.1 304 0
I don't think that's going to help much :)
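
To check what was actually sent in one of these extract requests, the query string can be decoded. This is a minimal sketch, assuming the `&` separators between parameters are present (the archived log appears to have lost them). The shortened sample request and the helper name `literals_from_request` are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def literals_from_request(path_and_query):
    """Return the decoded literal.* parameters of an extract request."""
    query = urlsplit(path_and_query).query
    params = parse_qs(query)
    # Keep only the literal.* fields; wt, version, etc. are dropped.
    return {k: v for k, v in params.items() if k.startswith("literal.")}


# Shortened sample modelled on the logged requests (an assumption, not a
# verbatim log line).
sample = (
    "/solr/core0/update/extract"
    "?literal.id=C%3A%5CDocuments+and+Settings%5CM1B%5Cworkspace"
    "%5C3DS_FileIndexer%5Ctest%5Cmysql-proxy-en.pdf"
    "&literal.title=mysql-proxy-en.pdf"
    "&literal.group=portal&literal.group=var"
    "&wt=javabin&version=1"
)

decoded = literals_from_request(sample)
for name, values in decoded.items():
    print(name, values)
```

This only confirms which literal values the client passed. It says nothing about whether Tika extracted any text, which is why the request log alone does not help here.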

RE: Problem with pdf, upgrading Cell

2010-04-23 Thread Marc Ghorayeb


Seems like I'm not the only one with this no-extraction problem:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
Apparently he tried the same thing, building from the trunk and indexing a
PDF, and no extraction occurred... Strange.
Marc G.


RE: Problem with pdf, upgrading Cell

2010-04-23 Thread Marc Ghorayeb


Seems like I'm not the only one with this no-extraction problem:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33609.html
Apparently he tried the same thing, building from the trunk and indexing a
PDF, and no extraction occurred... Strange.
Marc G.
  

Re: Problem with pdf, upgrading Cell

2010-04-23 Thread Otis Gospodnetic

Marc,

These are your request logs.  You want to look at your Solr logs.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




RE: Problem with pdf, upgrading Cell

2010-04-23 Thread Marc Ghorayeb

 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming searc...@105585dc main from searc...@2efeecca main
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming searc...@105585dc main from searc...@2efeecca main
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for searc...@105585dc main
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to searc...@105585dc main
Apr 23, 2010 5:47:14 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
Apr 23, 2010 5:47:14 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher searc...@105585dc main
Apr 23, 2010 5:47:14 PM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing searc...@2efeecca main
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
Apr 23, 2010 5:47:14 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {optimize=} 0 46
Apr 23, 2010 5:47:14 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={optimize=true&waitSearcher=true&maxSegments=1&waitFlush=true&wt=javabin&version=1} status=0 QTime=46

Re: Problem with pdf, upgrading Cell

2010-04-23 Thread Paul Borgermans

On Fri, Apr 23, 2010 at 5:48 PM, Marc Ghorayeb dekay...@hotmail.com wrote:

 Yes, the only log i can actually get is the one in the command console from 
 windows and there are no errors there ...
 Here are the last lines when i upload a pdf to the update/extract url:

snip

I am pretty sure it is Tika itself that does not manage to convert
your PDF. I'm not using Solr Cell but Tika from the command line, and it
is only with very recent Tika builds that PDF extraction works in most
cases.

So I suggest building Tika from SVN yourself and, if the command-line
extraction works, integrating it back into Solr. See

http://wiki.apache.org/solr/ExtractingRequestHandler

for instructions (the committer section).

hth
Paul
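
The command-line check suggested above could be scripted roughly as follows. This is a minimal sketch, not a tested recipe: the jar file name passed in is whatever your own Tika build produced, the helper names are made up, and it assumes the Tika app jar accepts the `--text` option for plain-text output and that `java` is on the PATH.

```python
import subprocess


def tika_text_command(tika_jar, document):
    """Return the argv for a plain-text extraction with the Tika app jar."""
    return ["java", "-jar", tika_jar, "--text", document]


def has_content(extracted_text):
    """True if the extractor returned anything other than whitespace."""
    return bool(extracted_text.strip())


def extract_text(tika_jar, document):
    """Run Tika and return its standard output (needs Java and the jar)."""
    result = subprocess.run(
        tika_text_command(tika_jar, document),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

If `has_content(extract_text(jar, pdf))` is true on the command line but the same PDF yields no content through Solr, the problem is more likely in how Solr loads or calls Tika than in the PDF parsing itself.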
