RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
Y, integrating Tika is non-trivial.  I think Uwe adds the dependencies by hand, 
carefully checking the dependency tree in Maven to make sure there aren't any 
conflicts.


-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, May 4, 2016 2:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

On 5/4/2016 9:21 AM, Betsey Benagh wrote:
> I’m feeling particularly dense, because I don’t see any Tika jars in
> WEB-INF/lib:

Oops. Sorry about that, I forgot that it's all contrib.  That's my mistake, not 
yours.

The Tika jars are in contrib/extraction/lib, along with a very large number of 
dependencies.

It turns out that I probably have no idea what I'm talking about.  I cannot 
find any version 1.12 downloads on Tika's website that are structured the same 
way as what's in our contrib directory, so I have no idea how to actually do 
the manual upgrade.

I seem to remember hearing about people doing a Tika upgrade manually, but I've 
got no idea how they did it.

Thanks,
Shawn



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Shawn Heisey
On 5/4/2016 9:21 AM, Betsey Benagh wrote:
> I’m feeling particularly dense, because I don’t see any Tika jars in
> WEB-INF/lib:

Oops. Sorry about that, I forgot that it's all contrib.  That's my
mistake, not yours.

The Tika jars are in contrib/extraction/lib, along with a very large
number of dependencies.
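
To see which Tika version a given install ships, one option is to list the
Tika jars under contrib/extraction/lib. This is only a minimal sketch: the
function name list_tika_jars is made up for illustration, and the Solr
install directory is assumed to be passed as the first argument.

```shell
# List the Tika jars bundled with a Solr install.
# $1 = Solr install directory (hypothetical example: ~/software/solr-5.5.0)
list_tika_jars() {
  find "$1/contrib/extraction/lib" -name 'tika-*.jar' 2>/dev/null | sort
}

# Only run when a directory is given on the command line.
if [ "$#" -ge 1 ]; then
  list_tika_jars "$1"
fi
```

The jar file names normally carry the version number, so the listing should
show at a glance which Tika release is in place.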

It turns out that I probably have no idea what I'm talking about.  I
cannot find any version 1.12 downloads on Tika's website that are
structured the same way as what's in our contrib directory, so I have no
idea how to actually do the manual upgrade.

I seem to remember hearing about people doing a Tika upgrade manually,
but I've got no idea how they did it.

Thanks,
Shawn



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
As a workaround, I’m trying to run Grobid on my files, and then import the
corresponding XML into Solr.

I don’t see any errors on the post:

bba0124$ bin/post -c lrdtest ~/software/grobid/out/021002_1.tei.xml
/Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/java -classpath /Users/bba0124/software/solr-5.5.0/dist/solr-core-5.5.0.jar -Dauto=yes -Dc=lrdtest -Ddata=files org.apache.solr.util.SimplePostTool /Users/bba0124/software/grobid/out/021002_1.tei.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/lrdtest/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file 021002_1.tei.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/lrdtest/update...
Time spent: 0:00:00.027

But the documents don’t seem to show up in the index, either.


Additionally, if I try uploading the documents using the web UI, they
appear to upload successfully,

Response:{
  "responseHeader": {
    "status": 0,
    "QTime": 7
  }
}


But aren’t in the index.

What am I missing?
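
One way to check whether anything actually reached the index is to ask Solr
for the document count. A minimal sketch, assuming the base URL and
collection name from the post above; the helper name solr_count_url is
hypothetical.

```shell
# Build the URL of a match-all query that returns only the hit count.
# $1 = Solr base URL, $2 = collection name
solr_count_url() {
  printf '%s/%s/select?q=*:*&rows=0&wt=json\n' "$1" "$2"
}

# Example (needs a running Solr and curl):
#   curl -s "$(solr_count_url http://localhost:8983/solr lrdtest)"
```

The numFound value in the response is the number of documents in the index.
If it stays at 0, one possibility (a guess, not confirmed in this thread) is
that the /update handler only acts on Solr's own XML update format,
<add><doc>...</doc></add>, and finds no documents in a TEI file.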

On 5/4/16, 10:55 AM, "Shawn Heisey"  wrote:

>On 5/4/2016 8:38 AM, Betsey Benagh wrote:
>> Thanks, I’m currently using 5.5, and will try upgrading to 6.0.
>>
>>
>> On 5/4/16, 10:37 AM, "Allison, Timothy B."  wrote:
>>> Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika
>>>1.11.
>
>Just upgrading to 6.0.0 isn't enough.  As Tim said, Solr 6 currently
>uses Tika 1.7, but 1.11 is required.  That's four minor versions behind
>the minimum.
>
>Tim has filed an issue for upgrading Tika to 1.13 in Solr, which he did
>mention in a previous reply, but I do not know when it will be
>available.  Tim might have a better idea.
>
>https://issues.apache.org/jira/browse/SOLR-8981
>
>You might be able to upgrade Tika in your Solr install to 1.12 yourself
>by simply replacing the jar in WEB-INF/lib ... but I do not know whether
>this will cause any other problems.  Historically, replacing the jar has
>been a safe option ... but I can't guarantee that this will always be
>the case.
>
>Thanks,
>Shawn
>



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
I’m feeling particularly dense, because I don’t see any Tika jars in
WEB-INF/lib:
antlr4-runtime-4.5.1-1.jar
 asm-5.0.4.jar
 asm-commons-5.0.4.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
 commons-configuration-1.6.jar
 commons-exec-1.3.jar
 commons-fileupload-1.2.1.jar
 commons-io-2.4.jar
 commons-lang-2.6.jar
 concurrentlinkedhashmap-lru-1.2.jar
 dom4j-1.6.1.jar
 guava-14.0.1.jar
 hadoop-annotations-2.6.0.jar
 hadoop-auth-2.6.0.jar
 hadoop-common-2.6.0.jar
 hadoop-hdfs-2.6.0.jar
 hppc-0.7.1.jar
 htrace-core-3.0.4.jar
 httpclient-4.4.1.jar
 httpcore-4.4.1.jar
 httpmime-4.4.1.jar
 jackson-core-2.5.4.jar
 jackson-dataformat-smile-2.5.4.jar
 joda-time-2.2.jar
 listing.txt
 lucene-analyzers-common-5.5.0.jar
 lucene-analyzers-kuromoji-5.5.0.jar
 lucene-analyzers-phonetic-5.5.0.jar
 lucene-backward-codecs-5.5.0.jar
 lucene-codecs-5.5.0.jar
 lucene-core-5.5.0.jar
 lucene-expressions-5.5.0.jar
 lucene-grouping-5.5.0.jar
 lucene-highlighter-5.5.0.jar
 lucene-join-5.5.0.jar
 lucene-memory-5.5.0.jar
 lucene-misc-5.5.0.jar
 lucene-queries-5.5.0.jar
 lucene-queryparser-5.5.0.jar
 lucene-sandbox-5.5.0.jar
 lucene-spatial-5.5.0.jar
 lucene-suggest-5.5.0.jar
 noggit-0.6.jar
 org.restlet-2.3.0.jar
 org.restlet.ext.servlet-2.3.0.jar
 protobuf-java-2.5.0.jar
 solr-core-5.5.0.jar
 solr-solrj-5.5.0.jar
 spatial4j-0.5.jar
 stax2-api-3.1.4.jar
 t-digest-3.1.jar
 woodstox-core-asl-4.4.1.jar
 zookeeper-3.4.6.jar










On 5/4/16, 10:55 AM, "Shawn Heisey"  wrote:

>On 5/4/2016 8:38 AM, Betsey Benagh wrote:
>> Thanks, I’m currently using 5.5, and will try upgrading to 6.0.
>>
>>
>> On 5/4/16, 10:37 AM, "Allison, Timothy B."  wrote:
>>> Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika
>>>1.11.
>
>Just upgrading to 6.0.0 isn't enough.  As Tim said, Solr 6 currently
>uses Tika 1.7, but 1.11 is required.  That's four minor versions behind
>the minimum.
>
>Tim has filed an issue for upgrading Tika to 1.13 in Solr, which he did
>mention in a previous reply, but I do not know when it will be
>available.  Tim might have a better idea.
>
>https://issues.apache.org/jira/browse/SOLR-8981
>
>You might be able to upgrade Tika in your Solr install to 1.12 yourself
>by simply replacing the jar in WEB-INF/lib ... but I do not know whether
>this will cause any other problems.  Historically, replacing the jar has
>been a safe option ... but I can't guarantee that this will always be
>the case.
>
>Thanks,
>Shawn
>



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Shawn Heisey
On 5/4/2016 8:38 AM, Betsey Benagh wrote:
> Thanks, I’m currently using 5.5, and will try upgrading to 6.0.
>
>
> On 5/4/16, 10:37 AM, "Allison, Timothy B."  wrote:
>> Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika 1.11.

Just upgrading to 6.0.0 isn't enough.  As Tim said, Solr 6 currently
uses Tika 1.7, but 1.11 is required.  That's four minor versions behind
the minimum.
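
Note that these version numbers compare component by component, not as
decimals, so 1.7 is older than 1.11. A small sketch of the arithmetic,
assuming two-part "major.minor" versions with the same major number; the
function name minor_gap is made up for illustration.

```shell
# Difference between the minor components of two "major.minor" versions.
# Assumes both versions share the same major number.
# $1 = shipped version, $2 = required version
minor_gap() {
  shipped_minor=${1#*.}
  required_minor=${2#*.}
  echo $((required_minor - shipped_minor))
}

minor_gap 1.7 1.11   # prints 4
```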

Tim has filed an issue for upgrading Tika to 1.13 in Solr, which he did
mention in a previous reply, but I do not know when it will be
available.  Tim might have a better idea.

https://issues.apache.org/jira/browse/SOLR-8981

You might be able to upgrade Tika in your Solr install to 1.12 yourself
by simply replacing the jar in WEB-INF/lib ... but I do not know whether
this will cause any other problems.  Historically, replacing the jar has
been a safe option ... but I can't guarantee that this will always be
the case.

Thanks,
Shawn



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
Thanks, I’m currently using 5.5, and will try upgrading to 6.0.


On 5/4/16, 10:37 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

>Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika 1.11.
>
>-----Original Message-----
>From: Allison, Timothy B. [mailto:talli...@mitre.org]
>Sent: Wednesday, May 4, 2016 10:29 AM
>To: solr-user@lucene.apache.org
>Subject: RE: Integrating grobid with Tika in solr
>
>I think Solr is using a version of Tika that predates the addition of
>the Grobid parser.  You'll have to add that manually somehow until Solr
>upgrades to Tika 1.13 (soon to be released...I think).  SOLR-8981.
>
>-----Original Message-----
>From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]
>Sent: Wednesday, May 4, 2016 10:07 AM
>To: solr-user@lucene.apache.org
>Subject: Re: Integrating grobid with Tika in solr
>
>Grobid runs as a service, and I'm (theoretically) configuring Tika to
>call it.
>
>From the Grobid wiki, here are instructions for integrating with Tika
>application:
>
>First we need to create the GrobidExtractor.properties file that points
>to the Grobid REST Service. My file looks like the following:
>
>grobid.server.url=http://localhost:[port]
>
>Now you can run GROBID via Tika-app with the following command on a
>sample PDF file.
>
>java -classpath 
>$HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar
>org.apache.tika.cli.TikaCLI
>--config=$HOME/src/grobidparser-resources/tika-config.xml -J
>$HOME/src/grobid/papers/ICSE06.pdf
>

RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika 1.11.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr

I think Solr is using a version of Tika that predates the addition of the 
Grobid parser.  You'll have to add that manually somehow until Solr upgrades to 
Tika 1.13 (soon to be released...I think).  SOLR-8981.

-----Original Message-----
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com] 
Sent: Wednesday, May 4, 2016 10:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with Tika 
application:

First we need to create the GrobidExtractor.properties file that points to the 
Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF 
file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI 
--config=$HOME/src/grobidparser-resources/tika-config.xml -J 
$HOME/src/grobid/papers/ICSE06.pdf


RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
I think Solr is using a version of Tika that predates the addition of the 
Grobid parser.  You'll have to add that manually somehow until Solr upgrades to 
Tika 1.13 (soon to be released...I think).  SOLR-8981.

-----Original Message-----
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com] 
Sent: Wednesday, May 4, 2016 10:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with Tika 
application:

First we need to create the GrobidExtractor.properties file that points to the 
Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF 
file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI 
--config=$HOME/src/grobidparser-resources/tika-config.xml -J 
$HOME/src/grobid/papers/ICSE06.pdf


Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
Grobid runs as a service, and I’m (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with Tika 
application:

First we need to create the GrobidExtractor.properties file that points to the 
Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF 
file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI 
--config=$HOME/src/grobidparser-resources/tika-config.xml -J 
$HOME/src/grobid/papers/ICSE06.pdf

Here’s the stack trace.

error-class: org.apache.solr.common.SolrException
root-error-class: java.lang.ClassNotFoundException
msg: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
trace: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
    at org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
    at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
    at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
    at org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
    at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
    at org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
    at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
    at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
    at org.apache.tika.config.TikaConfig.init(TikaConfig.java:127)
    at org.apache.tika.config.TikaConfig.init(TikaConfig.java:115)
    at org.apache.tika.config.TikaConfig.init(TikaConfig.java:111)
    at org.apache.tika.config.TikaConfig.init(TikaConfig.java:92)
    at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
    ... 30 more
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.journal.JournalParser
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:189)
    at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:338)
    ... 35 more
code: 500



On 5/4/16, 10:00 AM, "Shawn Heisey" wrote:

On 5/4/2016 7:15 AM, Betsey Benagh wrote:
(X-posted from stack overflow)
This feels like a basic, dumb question, but my reading of the documentation has 
not led me to an answer.
i'm using Solr to index journal articles. Using the 

Re: Integrating grobid with Tika in solr

2016-05-04 Thread Shawn Heisey
On 5/4/2016 7:15 AM, Betsey Benagh wrote:
> (X-posted from stack overflow)
> 
> This feels like a basic, dumb question, but my reading of the documentation 
> has not led me to an answer.
> 
> 
> I'm using Solr to index journal articles. With the out-of-the-box 
> configuration, it indexed the text of the documents, but I'm looking to use 
> Grobid to pull out the authors, title, affiliations, etc. I got Grobid up and 
> running as a service.
> 
> I added
> 
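> <str name="tika.config">/path/to/tika-config.xml</str>
> 
> to the requestHandler for /update/extract in solrconfig.xml
> 
> The tika-config looks like:
> 
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.journal.JournalParser">
>       <mime>application/pdf</mime>
>     </parser>
>   </parsers>
> </properties>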
> 
> I'm getting a ClassNotFound exception when I try to import a document, but 
> can't figure out where to set the classpath to fix it.

I do not know anything about grobid.

We'll need to see the exception -- the entire multi-line stacktrace,
including any "caused by" sections.

In general, you should create a lib directory in the solr home and place
all extra jars in that directory.  Otherwise you need <lib> elements in
solrconfig.xml to load jars -- and they will be loaded once for every
core that uses that <lib> element.  ${solr.solr.home}/lib loads jars
*once* when Solr starts and makes them available to all cores.
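
The shared lib directory approach can be sketched roughly as follows. This
is an illustration under assumptions, not a tested recipe for Grobid: the
function name install_shared_jars is made up, and both the solr home path
and the directory holding the extra jars have to be supplied by whoever
runs it.

```shell
# Copy extra jars into the lib directory of the solr home.
# $1 = solr home, $2 = directory holding the extra jars
install_shared_jars() {
  mkdir -p "$1/lib" || return 1
  for jar in "$2"/*.jar; do
    # Skip the unexpanded pattern when the directory has no jars.
    [ -e "$jar" ] || continue
    cp "$jar" "$1/lib/" || return 1
  done
}

# Only run when both directories are given on the command line.
if [ "$#" -eq 2 ]; then
  install_shared_jars "$1" "$2"
fi
```

Solr has to be restarted afterwards, since the jars in that directory are
loaded when Solr starts.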

Thanks,
Shawn



Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
(X-posted from stack overflow)

This feels like a basic, dumb question, but my reading of the documentation has 
not led me to an answer.


I'm using Solr to index journal articles. With the out-of-the-box 
configuration, it indexed the text of the documents, but I'm looking to use 
Grobid to pull out the authors, title, affiliations, etc. I got Grobid up and 
running as a service.

I added

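<str name="tika.config">/path/to/tika-config.xml</str>

to the requestHandler for /update/extract in solrconfig.xml

The tika-config looks like:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.journal.JournalParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>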

I'm getting a ClassNotFound exception when I try to import a document, but 
can't figure out where to set the classpath to fix it.