RE: Integrating grobid with Tika in solr
Y, integrating Tika is non-trivial. I think Uwe adds the dependencies by hand with great care, looking closely at the dependency tree in Maven and making sure there aren't any conflicts.

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, May 4, 2016 2:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

> I seem to remember hearing about people doing a Tika upgrade manually,
> but I've got no idea how they did it.
Re: Integrating grobid with Tika in solr
On 5/4/2016 9:21 AM, Betsey Benagh wrote:
> I’m feeling particularly dense, because I don’t see any Tika jars in
> WEB-INF/lib:

Oops. Sorry about that, I forgot that it's all contrib. That's my mistake, not yours.

The Tika jars are in contrib/extraction/lib, along with a very large number of dependencies.

It turns out that I probably have no idea what I'm talking about. I cannot find any version 1.12 downloads on Tika's website that are structured the same way as what's in our contrib directory, so I have no idea how to actually do the manual upgrade. I seem to remember hearing about people doing a Tika upgrade manually, but I've got no idea how they did it.

Thanks,
Shawn
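To see which Tika version an install actually bundles, one option is to list the tika-*.jar files under contrib/extraction/lib and compare their versions numerically, since a plain string comparison would wrongly rank 1.7 above 1.11. The following is only a sketch: it assumes the jars follow the usual tika-<module>-<version>.jar naming, and the function names and the (1, 11) default are illustrative rather than anything provided by Solr or Tika.

```python
import re
from pathlib import Path

# Assumed naming pattern, e.g. tika-core-1.7.jar or tika-parsers-1.11.jar.
JAR_PATTERN = re.compile(r"^(tika-[a-z-]+?)-(\d+(?:\.\d+)*)\.jar$")


def parse_version(text):
    """Turn '1.11' into (1, 11) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def bundled_tika_versions(lib_dir):
    """Map each Tika module found directly in lib_dir to its version tuple."""
    versions = {}
    for jar in sorted(Path(lib_dir).glob("tika-*.jar")):
        match = JAR_PATTERN.match(jar.name)
        if match:  # jars that do not fit the assumed pattern are skipped
            versions[match.group(1)] = parse_version(match.group(2))
    return versions


def meets_minimum(versions, minimum=(1, 11)):
    """True if tika-core is present and at least the given version."""
    core = versions.get("tika-core")
    return core is not None and core >= minimum
```

Calling bundled_tika_versions("contrib/extraction/lib") from the Solr install directory and passing the result to meets_minimum would show whether the bundled Tika is at least the 1.11 release mentioned in this thread.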
Re: Integrating grobid with Tika in solr
As a workaround, I’m trying to run Grobid on my files, and then import the corresponding XML into Solr.

I don’t see any errors on the post:

bba0124$ bin/post -c lrdtest ~/software/grobid/out/021002_1.tei.xml
/Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/java -classpath /Users/bba0124/software/solr-5.5.0/dist/solr-core-5.5.0.jar -Dauto=yes -Dc=lrdtest -Ddata=files org.apache.solr.util.SimplePostTool /Users/bba0124/software/grobid/out/021002_1.tei.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/lrdtest/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file 021002_1.tei.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/lrdtest/update...
Time spent: 0:00:00.027

But the documents don’t seem to show up in the index, either.

Additionally, if I try uploading the documents using the web UI, they appear to upload successfully,

Response:{
  "responseHeader": {
    "status": 0,
    "QTime": 7
  }
}

But aren’t in the index.

What am I missing?

On 5/4/16, 10:55 AM, "Shawn Heisey" wrote:
>Just upgrading to 6.0.0 isn't enough. As Tim said, Solr 6 currently
>uses Tika 1.7, but 1.11 is required.
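One possible explanation, not confirmed anywhere in this thread, is that Solr's XML update handler expects documents in Solr's own <add><doc><field name="..."> format and does not interpret arbitrary XML such as Grobid's TEI output. Under that assumption, the TEI could be converted before posting. The sketch below is illustrative only: the field names id, title and author are hypothetical and would have to match the collection's schema, and the TEI paths assume Grobid's usual header layout.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid output (assumed layout: teiHeader/fileDesc/...).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_solr_add(tei_xml, doc_id):
    """Build a Solr <add><doc> update message from a TEI header.

    The field names 'id', 'title' and 'author' are hypothetical.
    """
    root = ET.fromstring(tei_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def add_field(name, value):
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value

    add_field("id", doc_id)

    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if title is not None and title.text:
        add_field("title", title.text.strip())

    # Only look at authors of the article itself, not of cited references.
    source = root.find(".//tei:sourceDesc", TEI_NS)
    if source is not None:
        for author in source.iterfind(".//tei:author", TEI_NS):
            pers = author.find("tei:persName", TEI_NS)
            if pers is None:
                continue
            parts = [t.strip() for t in pers.itertext() if t.strip()]
            if parts:
                add_field("author", " ".join(parts))

    return ET.tostring(add, encoding="unicode")
```

The returned string could be written to a file and posted with bin/post in the same way as the TEI file above.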
Re: Integrating grobid with Tika in solr
I’m feeling particularly dense, because I don’t see any Tika jars in WEB-INF/lib:

antlr4-runtime-4.5.1-1.jar
asm-5.0.4.jar
asm-commons-5.0.4.jar
commons-cli-1.2.jar
commons-codec-1.10.jar
commons-collections-3.2.2.jar
commons-configuration-1.6.jar
commons-exec-1.3.jar
commons-fileupload-1.2.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
concurrentlinkedhashmap-lru-1.2.jar
dom4j-1.6.1.jar
guava-14.0.1.jar
hadoop-annotations-2.6.0.jar
hadoop-auth-2.6.0.jar
hadoop-common-2.6.0.jar
hadoop-hdfs-2.6.0.jar
hppc-0.7.1.jar
htrace-core-3.0.4.jar
httpclient-4.4.1.jar
httpcore-4.4.1.jar
httpmime-4.4.1.jar
jackson-core-2.5.4.jar
jackson-dataformat-smile-2.5.4.jar
joda-time-2.2.jar
listing.txt
lucene-analyzers-common-5.5.0.jar
lucene-analyzers-kuromoji-5.5.0.jar
lucene-analyzers-phonetic-5.5.0.jar
lucene-backward-codecs-5.5.0.jar
lucene-codecs-5.5.0.jar
lucene-core-5.5.0.jar
lucene-expressions-5.5.0.jar
lucene-grouping-5.5.0.jar
lucene-highlighter-5.5.0.jar
lucene-join-5.5.0.jar
lucene-memory-5.5.0.jar
lucene-misc-5.5.0.jar
lucene-queries-5.5.0.jar
lucene-queryparser-5.5.0.jar
lucene-sandbox-5.5.0.jar
lucene-spatial-5.5.0.jar
lucene-suggest-5.5.0.jar
noggit-0.6.jar
org.restlet-2.3.0.jar
org.restlet.ext.servlet-2.3.0.jar
protobuf-java-2.5.0.jar
solr-core-5.5.0.jar
solr-solrj-5.5.0.jar
spatial4j-0.5.jar
stax2-api-3.1.4.jar
t-digest-3.1.jar
woodstox-core-asl-4.4.1.jar
zookeeper-3.4.6.jar

On 5/4/16, 10:55 AM, "Shawn Heisey" wrote:
>You might be able to upgrade Tika in your Solr install to 1.12 yourself
>by simply replacing the jar in WEB-INF/lib ... but I do not know whether
>this will cause any other problems.
Re: Integrating grobid with Tika in solr
On 5/4/2016 8:38 AM, Betsey Benagh wrote:
> Thanks, I’m currently using 5.5, and will try upgrading to 6.0.
>
> On 5/4/16, 10:37 AM, "Allison, Timothy B." wrote:
>> Y. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11.

Just upgrading to 6.0.0 isn't enough. As Tim said, Solr 6 currently uses Tika 1.7, but 1.11 is required. That's four minor versions behind the minimum.

Tim has filed an issue for upgrading Tika to 1.13 in Solr, which he did mention in a previous reply, but I do not know when it will be available. Tim might have a better idea.

https://issues.apache.org/jira/browse/SOLR-8981

You might be able to upgrade Tika in your Solr install to 1.12 yourself by simply replacing the jar in WEB-INF/lib ... but I do not know whether this will cause any other problems. Historically, replacing the jar has been a safe option ... but I can't guarantee that this will always be the case.

Thanks,
Shawn
Re: Integrating grobid with Tika in solr
Thanks, I’m currently using 5.5, and will try upgrading to 6.0.

On 5/4/16, 10:37 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>Y. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11.
RE: Integrating grobid with Tika in solr
Y. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr

I think Solr is using a version of Tika that predates the addition of the Grobid parser. You'll have to add that manually somehow until Solr upgrades to Tika 1.13 (soon to be released...I think). SOLR-8981.
RE: Integrating grobid with Tika in solr
I think Solr is using a version of Tika that predates the addition of the Grobid parser. You'll have to add that manually somehow until Solr upgrades to Tika 1.13 (soon to be released...I think). SOLR-8981.

-----Original Message-----
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]
Sent: Wednesday, May 4, 2016 10:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

Here's the stack trace.

org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
Re: Integrating grobid with Tika in solr
Grobid runs as a service, and I’m (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with Tika application:

First we need to create the GrobidExtractor.properties file that points to the Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=$HOME/src/grobidparser-resources/tika-config.xml -J $HOME/src/grobid/papers/ICSE06.pdf

Here’s the stack trace.

error-class: org.apache.solr.common.SolrException
root-error-class: java.lang.ClassNotFoundException
msg: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
trace:
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
	at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
	at org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
	at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
	at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
	at org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
	at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
	at org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
	at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:499)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
	at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
	at org.apache.tika.config.TikaConfig.init(TikaConfig.java:127)
	at org.apache.tika.config.TikaConfig.init(TikaConfig.java:115)
	at org.apache.tika.config.TikaConfig.init(TikaConfig.java:111)
	at org.apache.tika.config.TikaConfig.init(TikaConfig.java:92)
	at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
	... 30 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.journal.JournalParser
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:189)
	at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:338)
	... 35 more
500

On 5/4/16, 10:00 AM, "Shawn Heisey" wrote:
> We'll need to see the exception -- the entire multi-line stacktrace,
> including any "caused by" sections.
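When reading a trace like the one above, the last "Caused by:" entry is usually the most specific one; here it is the ClassNotFoundException for JournalParser. As a rough illustration, a small Python helper (the name cause_chain is hypothetical) can pull the chain of causes out of a Java stack trace pasted as text:

```python
def cause_chain(trace):
    """Return exception headers from a Java stack trace.

    The outermost exception comes first and the root cause last.
    Only the first non-empty line and 'Caused by:' lines are kept;
    'at ...' frames and '... N more' markers are ignored.
    """
    prefix = "Caused by: "
    chain = []
    for line in trace.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            chain.append(stripped[len(prefix):])
        elif not chain and stripped:
            chain.append(stripped)
    return chain
```

The last element of the returned list names the class to look for in the jars on Solr's classpath.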
Re: Integrating grobid with Tika in solr
On 5/4/2016 7:15 AM, Betsey Benagh wrote:
> I'm getting a ClassNotFound exception when I try to import a document, but
> can't figure out where to set the classpath to fix it.

I do not know anything about grobid. We'll need to see the exception -- the entire multi-line stacktrace, including any "caused by" sections.

In general, you should create a lib directory in the solr home and place all extra jars in that directory. Otherwise you need <lib> elements in solrconfig.xml to load jars -- and they will be loaded once for every core that uses that element. ${solr.solr.home}/lib loads jars *once* when Solr starts and makes them available to all cores.

Thanks,
Shawn
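Before copying jars into ${solr.solr.home}/lib, it can help to confirm which jar, if any, actually contains the class named in a ClassNotFoundException. Jar files are ordinary zip archives, so this can be checked without Java tooling. The helper below is a sketch with a hypothetical name, not part of Solr:

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, lib_dir):
    """List the jar file names under lib_dir that contain class_name.

    class_name is a fully qualified Java class name, for example
    'org.apache.tika.parser.journal.JournalParser'.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Not a readable archive; skip it rather than abort the scan.
            continue
    return hits
```

An empty result for a given directory means none of the jars there provide the class, so adding that directory to the classpath would not resolve the exception.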
Integrating grobid with Tika in solr
(X-posted from Stack Overflow)

This feels like a basic, dumb question, but my reading of the documentation has not led me to an answer.

I'm using Solr to index journal articles. Using the out-of-the-box configuration, it indexed the text of the documents, but I'm looking to use Grobid to pull out the authors, title, affiliations, etc. I got Grobid up and running as a service.

I added

/path/to/tika-config.xml

to the requestHandler for /update/extract in solrconfig.xml

The tika-config looks like:

application/pdf

I'm getting a ClassNotFound exception when I try to import a document, but can't figure out where to set the classpath to fix it.
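The XML markup in the message above did not survive in this archive, so only the values remain. A tika-config.xml that routes a MIME type to a specific parser generally has the shape shown in the string below; that shape, and the JournalParser class name (taken from the stack trace posted later in the thread), are assumptions about what the original message contained. The Python sketch lists which parser classes such a config refers to, which are the classes that must be loadable by Solr:

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction, not the poster's verbatim file.
ASSUMED_TIKA_CONFIG = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.journal.JournalParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>"""


def configured_parsers(config_xml):
    """Return (parser class, [mime types]) pairs named in a Tika config."""
    root = ET.fromstring(config_xml)
    return [
        (parser.get("class"), [mime.text for mime in parser.findall("mime")])
        for parser in root.iter("parser")
    ]
```

Each class name returned by configured_parsers has to be present in a jar that Solr loads; otherwise the handler fails at startup with the ClassNotFoundException described above.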