Re: Question about indexing PDFs

2016-08-26 Thread Betsey Benagh
Erick,

I’m not sure of anything.  I’m new to Solr and find the documentation
extremely confusing.  I’ve searched the web and found tutorials/advice,
but they generally refer to older versions of Solr, and refer to
methods/settings/whatever that no longer exist. That’s why I’m asking for
help here.

I looked at the list of fields in the schema browser, and 'content' is not
there.  If that is not enough to 'assume' that the content is not being
indexed, then please enlighten me as to what is.

I inserted the docs in batches by posting them, following the ‘Quick
Start’ tutorial.  It seemed like a safe assumption that the tutorial on
the Solr site would be correct and produce desirable results.

What I really want to do is index the XML versions of the documents which
have been run through another system, but I cannot for the life of me
figure out how to do that.  I’ve tried, but the documentation about XML
makes no sense to me.  I thought indexing the PDF versions would be easier
and more straightforward, but perhaps that is not the case.
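One likely source of the confusion with XML: Solr's /update handler treats a posted .xml file as commands in Solr's own update format, not as an arbitrary document whose text should be extracted. A minimal sketch of that format (the field names here are illustrative, not from any particular schema):

```xml
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">An example title</field>
    <field name="content">Body text to be indexed goes here.</field>
  </doc>
</add>
```

Arbitrary XML such as Grobid's TEI output usually has to be transformed (e.g. with XSLT or a small script) into this shape before bin/post will index its contents as fields.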

Thanks,

betsey

On 8/25/16, 5:39 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>That is always a dangerous assumption. Are you sure
>you're searching on the proper field? Are you sure it's indexed? Are
>you sure it's
>
>The schema browser I indicated above will give you some
>idea what's actually in the field. You can not only see the
>fields Solr (actually Lucene) sees in your index, but you can
>also see what some of the terms are.
>
>Adding debug=query and looking at the parsed query
>will show you what fields are being searched against. The
>most common causes of what you're describing are:
>
>> not searching against the field you think you are. This
>is very easy to do without knowing it.
>
>> not actually having 'indexed="true" set in your schema
>
>> not committing after inserting the doc
>
>Best,
>Erick
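A concrete way to apply the debug advice above; the collection name and query term here are placeholders, and this assumes a default Solr install on localhost:

```shell
# Build a select URL with debug=query; the "parsedquery" entry in the
# debug section of the response shows exactly which fields are searched.
SOLR="http://localhost:8983/solr/gettingstarted"   # hypothetical collection
URL="${SOLR}/select?q=laser&debug=query"
echo "$URL"
# Against a running Solr instance:
#   curl "$URL"
```

If the parsed query shows a different field than expected (often the default field), that explains empty results for terms that are in the documents.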
>
>On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <
>betsey.ben...@stresearch.com> wrote:
>
>> It looks like the metadata of the PDFs was indexed, but not the content
>> (which is what I was interested in).  Searches on terms I know exist in
>> the content come up empty.
>>
>> On 8/25/16, 2:16 PM, "Betsey Benagh" <betsey.ben...@stresearch.com>
>>wrote:
>>
>> >Right, that's where I looked.  No 'content'.  Which is what confused
>>me.
>> >
>> >
>> >On 8/25/16, 1:56 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>> >
>> >>when you say "I don't see it in the schema for that collection" are
>>you
>> >>talking schema.xml? managed_schema? Or actual documents in the index?
>> >>Often
>> >>these are defined by dynamic fields and the like in the schema files.
>> >>
>> >>Take a look at the admin UI>>schema browser>>drop down and you'll see
>>all
>> >>the actual fields in your index...
>> >>
>> >>Best,
>> >>Erick
>> >>
>> >>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
>> >><betsey.ben...@stresearch.com
>> >>> wrote:
>> >>
>> >>> Following the instructions in the quick start guide, I imported a
>>bunch
>> >>>of
>> >>> PDF documents into my Solr 6.0 instance.  As far as I can tell from
>>the
>> >>> documentation, there should be a 'content' field indexing, well, the
>> >>> content, but I don't see it in the schema for that collection.  Is
>> >>>there
>> >>> something obvious I might have missed?
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >
>>
>>



Re: Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
Right, that's where I looked.  No 'content'.  Which is what confused me.


On 8/25/16, 1:56 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>when you say "I don't see it in the schema for that collection" are you
>talking schema.xml? managed_schema? Or actual documents in the index?
>Often
>these are defined by dynamic fields and the like in the schema files.
>
>Take a look at the admin UI>>schema browser>>drop down and you'll see all
>the actual fields in your index...
>
>Best,
>Erick
>
>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
><betsey.ben...@stresearch.com
>> wrote:
>
>> Following the instructions in the quick start guide, I imported a bunch
>>of
>> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
>> documentation, there should be a 'content' field indexing, well, the
>> content, but I don't see it in the schema for that collection.  Is there
>> something obvious I might have missed?
>>
>> Thanks!
>>
>>



Re: Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
It looks like the metadata of the PDFs was indexed, but not the content
(which is what I was interested in).  Searches on terms I know exist in
the content come up empty.

On 8/25/16, 2:16 PM, "Betsey Benagh" <betsey.ben...@stresearch.com> wrote:

>Right, that's where I looked.  No 'content'.  Which is what confused me.
>
>
>On 8/25/16, 1:56 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>>when you say "I don't see it in the schema for that collection" are you
>>talking schema.xml? managed_schema? Or actual documents in the index?
>>Often
>>these are defined by dynamic fields and the like in the schema files.
>>
>>Take a look at the admin UI>>schema browser>>drop down and you'll see all
>>the actual fields in your index...
>>
>>Best,
>>Erick
>>
>>On Thu, Aug 25, 2016 at 8:39 AM, Betsey Benagh
>><betsey.ben...@stresearch.com
>>> wrote:
>>
>>> Following the instructions in the quick start guide, I imported a bunch
>>>of
>>> PDF documents into my Solr 6.0 instance.  As far as I can tell from the
>>> documentation, there should be a 'content' field indexing, well, the
>>> content, but I don't see it in the schema for that collection.  Is
>>>there
>>> something obvious I might have missed?
>>>
>>> Thanks!
>>>
>>>
>



Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
Following the instructions in the quick start guide, I imported a bunch of PDF 
documents into my Solr 6.0 instance.  As far as I can tell from the 
documentation, there should be a 'content' field indexing, well, the content, 
but I don't see it in the schema for that collection.  Is there something 
obvious I might have missed?

Thanks!



Oddity with importing documents...

2016-05-06 Thread Betsey Benagh
Since it appears that using a recent version of Tika with Solr is not really 
feasible, I'm trying to run Grobid on my files, and then import the
corresponding XML into Solr.

I don't see any errors on the post:

bba0124$ bin/post -c lrdtest ~/software/grobid/out/021002_1.tei.xml
/Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/java
-classpath /Users/bba0124/software/solr-5.5.0/dist/solr-core-5.5.0.jar
-Dauto=yes -Dc=lrdtest -Ddata=files org.apache.solr.util.SimplePostTool
/Users/bba0124/software/grobid/out/021002_1.tei.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/lrdtest/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file 021002_1.tei.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to
http://localhost:8983/solr/lrdtest/update...
Time spent: 0:00:00.027

But the documents don't seem to show up in the index, either.


Additionally, if I try uploading the documents using the web UI, they
appear to upload successfully,

Response:{
  "responseHeader": {
"status": 0,
"QTime": 7
  }
}


But aren't in the index.

What am I missing?
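When a post reports "1 files indexed" but nothing shows up, two quick checks help: confirm the document count, and force an explicit commit. A hedged sketch (collection name taken from the post command above; the curl calls require a running Solr):

```shell
SOLR="http://localhost:8983/solr/lrdtest"
COUNT_URL="${SOLR}/select?q=*:*&rows=0"    # numFound in the response is the doc count
COMMIT_URL="${SOLR}/update?commit=true"    # forces a hard commit
echo "$COUNT_URL"
echo "$COMMIT_URL"
# curl "$COUNT_URL"
# curl "$COMMIT_URL"
```

If numFound is still 0 after a commit, the posted XML most likely was not in Solr's update format, so no documents were ever created; Solr's log should show what the update handler made of the file.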



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
As a workaround, I’m trying to run Grobid on my files, and then import the
corresponding XML into Solr.

I don’t see any errors on the post:

bba0124$ bin/post -c lrdtest ~/software/grobid/out/021002_1.tei.xml
/Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/java
-classpath /Users/bba0124/software/solr-5.5.0/dist/solr-core-5.5.0.jar
-Dauto=yes -Dc=lrdtest -Ddata=files org.apache.solr.util.SimplePostTool
/Users/bba0124/software/grobid/out/021002_1.tei.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/lrdtest/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file 021002_1.tei.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to
http://localhost:8983/solr/lrdtest/update...
Time spent: 0:00:00.027

But the documents don’t seem to show up in the index, either.


Additionally, if I try uploading the documents using the web UI, they
appear to upload successfully,

Response:{
  "responseHeader": {
"status": 0,
"QTime": 7
  }
}


But aren’t in the index.

What am I missing?

On 5/4/16, 10:55 AM, "Shawn Heisey" <apa...@elyograg.org> wrote:

>On 5/4/2016 8:38 AM, Betsey Benagh wrote:
>> Thanks, I'm currently using 5.5, and will try upgrading to 6.0.
>>
>>
>> On 5/4/16, 10:37 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>>> Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika
>>>1.11.
>
>Just upgrading to 6.0.0 isn't enough.  As Tim said, Solr 6 currently
>uses Tika 1.7, but 1.11 is required.  That's four minor versions behind
>the minimum.
>
>Tim has filed an issue for upgrading Tika to 1.13 in Solr, which he did
>mention in a previous reply, but I do not know when it will be
>available.  Tim might have a better idea.
>
>https://issues.apache.org/jira/browse/SOLR-8981
>
>You might be able to upgrade Tika in your Solr install to 1.12 yourself
>by simply replacing the jar in WEB-INF/lib ... but I do not know whether
>this will cause any other problems.  Historically, replacing the jar has
>been a safe option ... but I can't guarantee that this will always be
>the case.
>
>Thanks,
>Shawn
>



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
I’m feeling particularly dense, because I don’t see any Tika jars in
WEB-INF/lib:
antlr4-runtime-4.5.1-1.jar
 asm-5.0.4.jar
 asm-commons-5.0.4.jar
 commons-cli-1.2.jar
 commons-codec-1.10.jar
 commons-collections-3.2.2.jar
 commons-configuration-1.6.jar
 commons-exec-1.3.jar
 commons-fileupload-1.2.1.jar
 commons-io-2.4.jar
 commons-lang-2.6.jar
 concurrentlinkedhashmap-lru-1.2.jar
 dom4j-1.6.1.jar
 guava-14.0.1.jar
 hadoop-annotations-2.6.0.jar
 hadoop-auth-2.6.0.jar
 hadoop-common-2.6.0.jar
 hadoop-hdfs-2.6.0.jar
 hppc-0.7.1.jar
 htrace-core-3.0.4.jar
 httpclient-4.4.1.jar
 httpcore-4.4.1.jar
 httpmime-4.4.1.jar
 jackson-core-2.5.4.jar
 jackson-dataformat-smile-2.5.4.jar
 joda-time-2.2.jar
 listing.txt
 lucene-analyzers-common-5.5.0.jar
 lucene-analyzers-kuromoji-5.5.0.jar
 lucene-analyzers-phonetic-5.5.0.jar
 lucene-backward-codecs-5.5.0.jar
 lucene-codecs-5.5.0.jar
 lucene-core-5.5.0.jar
 lucene-expressions-5.5.0.jar
 lucene-grouping-5.5.0.jar
 lucene-highlighter-5.5.0.jar
 lucene-join-5.5.0.jar
 lucene-memory-5.5.0.jar
 lucene-misc-5.5.0.jar
 lucene-queries-5.5.0.jar
 lucene-queryparser-5.5.0.jar
 lucene-sandbox-5.5.0.jar
 lucene-spatial-5.5.0.jar
 lucene-suggest-5.5.0.jar
 noggit-0.6.jar
 org.restlet-2.3.0.jar
 org.restlet.ext.servlet-2.3.0.jar
 protobuf-java-2.5.0.jar
 solr-core-5.5.0.jar
 solr-solrj-5.5.0.jar
 spatial4j-0.5.jar
 stax2-api-3.1.4.jar
 t-digest-3.1.jar
 woodstox-core-asl-4.4.1.jar
 zookeeper-3.4.6.jar
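For what it's worth, in a stock Solr 5.x install the Tika jars normally live under the extraction contrib rather than WEB-INF/lib, which would explain the listing above. A sketch for locating them (the install path is copied from the post output earlier in the thread; adjust to your machine):

```shell
SOLR_HOME="/Users/bba0124/software/solr-5.5.0"
TIKA_DIR="${SOLR_HOME}/contrib/extraction/lib"   # where Solr Cell's Tika jars ship
echo "Tika jars are usually under: ${TIKA_DIR}"
# To verify against a real install:
#   find "$SOLR_HOME" -name 'tika-*.jar'
```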










On 5/4/16, 10:55 AM, "Shawn Heisey" <apa...@elyograg.org> wrote:

>On 5/4/2016 8:38 AM, Betsey Benagh wrote:
>> Thanks, I'm currently using 5.5, and will try upgrading to 6.0.
>>
>>
>> On 5/4/16, 10:37 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>>> Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika
>>>1.11.
>
>Just upgrading to 6.0.0 isn't enough.  As Tim said, Solr 6 currently
>uses Tika 1.7, but 1.11 is required.  That's four minor versions behind
>the minimum.
>
>Tim has filed an issue for upgrading Tika to 1.13 in Solr, which he did
>mention in a previous reply, but I do not know when it will be
>available.  Tim might have a better idea.
>
>https://issues.apache.org/jira/browse/SOLR-8981
>
>You might be able to upgrade Tika in your Solr install to 1.12 yourself
>by simply replacing the jar in WEB-INF/lib ... but I do not know whether
>this will cause any other problems.  Historically, replacing the jar has
>been a safe option ... but I can't guarantee that this will always be
>the case.
>
>Thanks,
>Shawn
>



Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
Thanks, I'm currently using 5.5, and will try upgrading to 6.0.


On 5/4/16, 10:37 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

>Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika 1.11.
>
>-Original Message-
>From: Allison, Timothy B. [mailto:talli...@mitre.org]
>Sent: Wednesday, May 4, 2016 10:29 AM
>To: solr-user@lucene.apache.org
>Subject: RE: Integrating grobid with Tika in solr
>
>I think Solr is using a version of Tika that predates that addition of
>the Grobid parser.  You'll have to add that manually somehow until Solr
>upgrades to Tika 1.13 (soon to be released...I think).  SOLR-8981.
>
>-Original Message-
>From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]
>Sent: Wednesday, May 4, 2016 10:07 AM
>To: solr-user@lucene.apache.org
>Subject: Re: Integrating grobid with Tika in solr
>
>Grobid runs as a service, and I'm (theoretically) configuring Tika to
>call it.
>
>From the Grobid wiki, here are instructions for integrating with the Tika
>application:
>
>First we need to create the GrobidExtractor.properties file that points
>to the Grobid REST Service. My file looks like the following:
>
>grobid.server.url=http://localhost:[port]
>
>Now you can run GROBID via Tika-app with the following command on a
>sample PDF file.
>
>java -classpath 
>$HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar
>org.apache.tika.cli.TikaCLI
>--config=$HOME/src/grobidparser-resources/tika-config.xml -J
>$HOME/src/grobid/papers/ICSE06.pdf
>
>Here's the stack trace.
>
>error-class: org.apache.solr.common.SolrException
>root-error-class: java.lang.ClassNotFoundException
>msg: org.apache.tika.exception.TikaException: Unable to find a
>parser class: org.apache.tika.parser.journal.JournalParser
>trace: org.apache.solr.common.SolrException:
>org.apache.tika.exception.TikaException: Unable to find a parser class:
>org.apache.tika.parser.journal.JournalParser
>at 
>org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(Extract
>ingRequestHandler.java:82)
>at 
>org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:
>367)
>at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
>at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
>at 
>org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandle
>rBase.java:231)
>at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
>at 
>org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCal
>l.java:326)
>at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
>at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
>at 
>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
>a:225)
>at 
>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
>a:183)
>at 
>org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandl
>er.java:1652)
>at 
>org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>at 
>org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:1
>43)
>at 
>org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577
>)
>at 
>org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.ja
>va:223)
>at 
>org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.ja
>va:1127)
>at 
>org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>at 
>org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.jav
>a:185)
>at 
>org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.jav
>a:1061)
>at 
>org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:1
>41)
>at 
>org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHa
>ndlerCollection.java:215)
>at 
>org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollectio
>n.java:110)
>at 
>org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java
>:97)
>at org.eclipse.jetty.server.Server.handle(Server.java:499)
>at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>at 
>org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257
>)
>at 
>org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>at 
>org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.jav
>a:635)
>at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java
>:555)
>at java.lang.Thread.run(Thread.java:745)
>Caused by: org.apache.tika.exception.Tika

Re: Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
Grobid runs as a service, and I’m (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with the Tika
application:

First we need to create the GrobidExtractor.properties file that points to the 
Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF 
file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI 
--config=$HOME/src/grobidparser-resources/tika-config.xml -J 
$HOME/src/grobid/papers/ICSE06.pdf

Here’s the stack trace.

error-class: org.apache.solr.common.SolrException
root-error-class: java.lang.ClassNotFoundException
msg: org.apache.tika.exception.TikaException: Unable to find a parser
class: org.apache.tika.parser.journal.JournalParser
trace: org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unable to find a parser class:
org.apache.tika.parser.journal.JournalParser
at 
org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
at 
org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
at 
org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
at 
org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser 
class: org.apache.tika.parser.journal.JournalParser
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
at org.apache.tika.config.TikaConfig.init(TikaConfig.java:127)
at org.apache.tika.config.TikaConfig.init(TikaConfig.java:115)
at org.apache.tika.config.TikaConfig.init(TikaConfig.java:111)
at org.apache.tika.config.TikaConfig.init(TikaConfig.java:92)
at 
org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
... 30 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.tika.parser.journal.JournalParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:189)
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:338)
... 35 more
code: 500



On 5/4/16, 10:00 AM, "Shawn Heisey" <apa...@elyograg.org> wrote:

On 5/4/2016 7:15 AM, Betsey Benagh wrote:
(X-posted from stack overflow)
This feels like a basic, dumb question, but my reading of the documentation has 
not led me to an answer.
i'm using Solr to index jou

Integrating grobid with Tika in solr

2016-05-04 Thread Betsey Benagh
(X-posted from stack overflow)

This feels like a basic, dumb question, but my reading of the documentation has 
not led me to an answer.


I'm using Solr to index journal articles. Using the out-of-the-box 
configuration, it indexed the text of the documents, but I'm looking to use 
Grobid to pull out the authors, title, affiliations, etc. I got Grobid up and 
running as a service.

I added

<str name="tika.config">/path/to/tika-config.xml</str>

to the requestHandler for /update/extract in solrconfig.xml

The tika-config looks like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.journal.JournalParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>

I'm getting a ClassNotFound exception when I try to import a document, but 
can't figure out where to set the classpath to fix it.
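For reference, the tika-config path above is typically wired in as a default on the extract handler. A sketch of what that might look like in solrconfig.xml (the path is a placeholder, and this assumes the extraction contrib jars are on the handler's classpath):

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <str name="tika.config">/path/to/tika-config.xml</str>
  </lst>
</requestHandler>
```

The ClassNotFoundException suggests the handler's classloader cannot see the jar containing org.apache.tika.parser.journal.JournalParser, so the jar's location matters as much as this config.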


Re: Growing memory?

2016-04-14 Thread Betsey Benagh
bin/solr status shows the memory usage increasing, as does the admin ui.

I'm running this on a shared machine that is supporting several other
applications, so I can't be particularly greedy with memory usage.  Is
there anything out there that gives guidelines on what an appropriate
amount of heap is based on number of documents or whatever?  We're just
playing around with it right now, but it sounds like we may need a
different machine in order to load in all of the data we want to have
available.

Thanks,
betsey

On 4/14/16, 3:08 PM, "Shawn Heisey" <apa...@elyograg.org> wrote:

>On 4/14/2016 12:45 PM, Betsey Benagh wrote:
>> I'm running solr 6.0.0 in server mode. I have one core. I loaded about
>>2000 documents in, and it was using about 54 MB of memory. No problem.
>>Nobody was issuing queries or doing anything else, but over the course
>>of about 4 hours, the memory usage had tripled to 152 MB. I shut solr
>>down and restarted it, and saw the memory usage back at 54 MB. Again,
>>with no queries or anything being executed against the core, the memory
>>usage is creeping up - after 17 minutes, it was up to 60 MB. I've looked
>>at the documentation for how to limit memory usage, but I want to
>>understand why it's creeping up when nothing is happening, lest it run
>>out of memory when I limit the usage. The machine is running CentOS 6.6,
>>if that matters, with Java 1.8.0_65.
>
>When you start Solr 5.0 or later directly from the download or directly
>after installing it with the service installer script (on *NIX
>platforms), Solr starts with a 512MB Java heap.  You can change this if
>you need to -- most Solr users do need to increase the heap size to a
>few gigabytes.
>
>Java uses a garbage collection memory model.  It's perfectly normal
>during the operation of a Java program, even one that is not doing
>anything you can see, for the memory utilization to rise up to the
>configured heap size.  This is simply how things work in systems using a
>garbage collection memory model.
>
>Where exactly are you looking to find the memory utilization?  In the
>admin UI, that number will go up over time, until one of the memory
>pools gets full and Java does a garbage collection, and then it will
>likely go down again.  From the operating system point of view, the
>resident memory usage will increase up to a point (when the entire heap
>has been allocated) and probably never go back down -- but it also
>shouldn't go up either.
>
>Thanks,
>Shawn
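On the heap-sizing question: there is no fixed formula, but the heap can be capped at startup and the effect observed under a realistic load. A sketch using bin/solr's -m flag, which sets both -Xms and -Xmx (the 256m figure is just an example, not a recommendation):

```shell
HEAP="256m"                            # Solr's out-of-the-box default is 512m
START_CMD="bin/solr start -m ${HEAP}"
echo "$START_CMD"
# To apply on a running install:
#   bin/solr stop -all && bin/solr start -m 256m
```

As described above, usage will still climb toward whatever cap is set and then drop after a collection; the number to watch is where it settles after GCs, not the peak.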
>



Re: Growing memory?

2016-04-14 Thread Betsey Benagh
Thanks for the quick response.  Forgive the naïve question, but shouldn't
it be doing garbage collection automatically? Having to manually force GC
via jconsole isn't a sustainable solution.

Thanks again,
betsey

On 4/14/16, 2:54 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>well, things _are_ running, specifically the communications channels
>are looking for incoming messages and the like, generating garbage
>etc.
>
>Try attaching jconsole to the process and hitting the GC button to
>force a garbage collection. As long as your memory gets to some level
>and drops back to that level after forcing GCs, you'll be fine.
>
>Best,
>Erick
>
>On Thu, Apr 14, 2016 at 11:45 AM, Betsey Benagh
><betsey.ben...@stresearch.com> wrote:
>> X-posted from stack overflow...
>>
>> I'm running solr 6.0.0 in server mode. I have one core. I loaded about
>>2000 documents in, and it was using about 54 MB of memory. No problem.
>>Nobody was issuing queries or doing anything else, but over the course
>>of about 4 hours, the memory usage had tripled to 152 MB. I shut solr
>>down and restarted it, and saw the memory usage back at 54 MB. Again,
>>with no queries or anything being executed against the core, the memory
>>usage is creeping up - after 17 minutes, it was up to 60 MB. I've looked
>>at the documentation for how to limit memory usage, but I want to
>>understand why it's creeping up when nothing is happening, lest it run
>>out of memory when I limit the usage. The machine is running CentOS 6.6,
>>if that matters, with Java 1.8.0_65.
>>
>> Thanks!
>>



Growing memory?

2016-04-14 Thread Betsey Benagh
X-posted from stack overflow...

I'm running solr 6.0.0 in server mode. I have one core. I loaded about 2000 
documents in, and it was using about 54 MB of memory. No problem. Nobody was 
issuing queries or doing anything else, but over the course of about 4 hours, 
the memory usage had tripled to 152 MB. I shut solr down and restarted it, and 
saw the memory usage back at 54 MB. Again, with no queries or anything being 
executed against the core, the memory usage is creeping up - after 17 minutes, 
it was up to 60 MB. I've looked at the documentation for how to limit memory 
usage, but I want to understand why it's creeping up when nothing is happening, 
lest it run out of memory when I limit the usage. The machine is running CentOS 
6.6, if that matters, with Java 1.8.0_65.

Thanks!