I had the same feeling :-) so I went with trunk to get things working quickly. But I also have to consider which version is best, since I am going to deploy it in an enterprise environment in the near future, and choosing the right version is an important step. I am quite new to Solr, but I agree with Alessandro that a slightly patched release should, in theory, be more stable than trunk, which gets many updates weekly (and even daily).

Cheers,
Tommaso
2010/7/28 David Thibault <dthiba...@esperion.com>

> Thanks, I'll try that then. I kind of figured that'd be the answer, but
> after fighting with Solr & ExtractingRequestHandler for 2 days I also just
> wanted to be done with it once it started working with 4.0...=) However,
> stability would be better in the long run.
>
> Best,
> Dave
>
> -----Original Message-----
> From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
> Sent: Wednesday, July 28, 2010 9:33 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
>
> In my opinion, the 1.4.1 version with the patch is more stable,
> at least until 4.0 is released.
>
> 2010/7/28 David Thibault <dthiba...@esperion.com>
>
> > Yesterday I did get this working with version 4.0 from trunk. I haven't
> > fully tested it yet, but the content doesn't come through blank anymore, so
> > that's good. Would it be more stable to stick with 1.4.1 and your patch to
> > get to Tika 0.8, or to stick with the 4.0 trunk version?
> >
> > Best,
> > Dave
> >
> > -----Original Message-----
> > From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com]
> > Sent: Wednesday, July 28, 2010 3:31 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
> >
> > I attached a patch for the Solr 1.4.1 release on
> > https://issues.apache.org/jira/browse/SOLR-1902 that made things work for me.
> > This strange behaviour for me was due to the fact that I copied the patched
> > jars and war inside the dist directory but forgot to update the war inside
> > the example/webapps directory (that is inside Jetty).
> > Hope this helps.
> > Tommaso
> >
> > 2010/7/27 David Thibault <dthiba...@esperion.com>
> >
> > > Alessandro & all,
> > >
> > > I was having the same issue with Tika crashing on certain PDFs. I also
> > > noticed the bug where no content was extracted after upgrading Tika.
> > >
> > > When I went to the SOLR issue you link to below, I applied all the patches,
> > > downloaded the Tika 0.8 jars, restarted Tomcat, posted a file via curl, and
> > > got the following error:
> > >
> > > SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
> > >     at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
> > >     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
> > >     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
> > >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> > >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> > >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> > >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> > >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> > >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> > >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> > >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> > >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> > >     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
> > >     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
> > >     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
> > >     at java.lang.Thread.run(Thread.java:619)
> > >
> > > This is really weird because I DID apply the SolrResourceLoader patch that
> > > adds the getClassLoader method. I even verified it by opening up the
> > > JARs and looking at the class file in Eclipse...I can see the
> > > SolrResourceLoader.getClassLoader() method.
> > >
> > > Does anyone know why it can't find the method? After patching the source I
> > > did ant clean dist in the base directory of the Solr source tree and
> > > everything looked like it compiled (BUILD SUCCESSFUL). Then I copied all
> > > the jars from dist/ and all the library dependencies from
> > > contrib/extraction/lib/ into my SOLR_HOME. After restarting Tomcat, everything
> > > in the logs looked good.
> > >
> > > I'm stumped. It would be very nice to have a Solr implementation using the
> > > newest versions of PDFBox & Tika and actually have content being
> > > extracted...=)
> > >
> > > Best,
> > > Dave
> > >
> > > -----Original Message-----
> > > From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
> > > Sent: Tuesday, July 27, 2010 6:09 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
> > >
> > > Hi Jon,
> > > During the last few days we faced the same problem.
> > > Using classic Solr 1.4.1 (Tika 0.4), we can't extract content from some
> > > PDF files, and for others Solr throws an exception during the indexing
> > > process.
> > > You must:
> > > Update the Tika libraries (in /contrib/extraction/lib) with the tika-core 0.8
> > > snapshot and tika-parsers 0.8.
> > > Update PDFBox and all related libraries.
> > > After that you have to patch Solr 1.4.1 following this patch:
> > >
> > > https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
> > >
> > > This is the first way to solve the problem.
> > >
> > > Using Solr 1.4.1 (with the Tika 0.8 snapshot and PDFBox updated), no exception
> > > is thrown during the indexing process, but no content is extracted.
> > > Using the latest Solr trunk (with the Tika 0.8 snapshot and PDFBox updated), all
> > > looks good, but we don't know how stable it is!
> > > I hope you now have a clear view of this issue.
> > > Best Regards
> > >
> > > 2010/7/26 Sharp, Jonathan <jsh...@coh.org>
> > >
> > > > Every so often I need to index new batches of scanned PDFs and occasionally
> > > > Adobe's OCR can't recognize the text in a couple of these documents. In
> > > > these situations I would like to type in a small amount of text onto the
> > > > document and have it be extracted by Solr CELL.
> > > >
> > > > Adobe Pro 9 has a number of different ways to add text directly to a PDF
> > > > file:
> > > >
> > > > *Typewriter
> > > > *Sticky Note
> > > > *Callout boxes
> > > > *Text boxes
> > > >
> > > > I tried indexing documents with each of these text additions with Solr
> > > > 1.4.1 + Solr CELL but can't extract the text in any of these boxes.
> > > >
> > > > If someone has modified their Solr CELL installation to use more recent
> > > > versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can
> > > > comment on whether newer versions can pull the text out of any of these
> > > > various text boxes I'd appreciate that very much.
> > > >
> > > > -Jon
> > >
> > > --
> > > Benedetti Alessandro
> > > Personal Page: http://tigerbolt.altervista.org
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience - 1794 England
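
A note on the NoSuchMethodError quoted above: in Tommaso's case it came from a stale war still being deployed, so the container loaded an unpatched SolrResourceLoader even though the patched jars existed in dist. One way to check for this kind of mismatch is to ask the JVM where it actually loaded the class from and whether the method is there. The following is only a minimal sketch, not part of Solr or of the SOLR-1902 patch: the class name ClassOriginCheck and its helper methods are made up for illustration, it assumes getClassLoader() is a public no-argument method, and it only tells you something useful if it is run with the same jars on the classpath that the webapp really uses (for example, those unpacked from the deployed war).

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Solr): reports which jar or
// directory a class was loaded from and whether it exposes a given method.
public class ClassOriginCheck {

    /** Returns the location the class was loaded from, or a marker if the JVM does not report one. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            // Typical for classes loaded by the bootstrap class loader.
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    /** True if the class has a public no-argument method with this name (declared or inherited). */
    public static boolean hasNoArgMethod(Class<?> cls, String methodName) {
        try {
            cls.getMethod(methodName);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults are taken from the stack trace in this thread; override via arguments.
        String className = args.length > 0 ? args[0] : "org.apache.solr.core.SolrResourceLoader";
        String methodName = args.length > 1 ? args[1] : "getClassLoader";

        // Load without running static initializers; we only want to inspect the class.
        Class<?> cls = Class.forName(className, false, ClassOriginCheck.class.getClassLoader());

        System.out.println("class:       " + cls.getName());
        System.out.println("loaded from: " + locationOf(cls));
        System.out.println(methodName + "() present: " + hasNoArgMethod(cls, methodName));
    }
}
```

If the reported location is an old jar, or the method shows up as missing, the deployed war (or a leftover copy of the Solr core jar elsewhere on the classpath) is most likely still the unpatched one.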