Re: Text Only Extraction Using Solr and Tika

Emyr James Thu, 05 May 2011 07:27:31 -0700

Hi,

I'm not really sure how these can help with my problem. Can you give a bit more info on this?


I think what I'm after is a fairly common request:

http://lucene.472066.n3.nabble.com/Controlling-Tika-s-metadata-td2378677.html
http://lucene.472066.n3.nabble.com/Select-tika-output-for-extract-only-td499059.html#a499062

Did the change that Yonik Seeley mentions, to allow more control over the output, ever make it into 1.4?


Regards,
Emyr

On 05/05/11 15:01, Anuj Kumar wrote:

Hi Emyr,

You can try the XPath-based approach and see if that works. Also, see if
dynamic fields can help you with the metadata fields.

References-
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput

Regards,
Anuj

On Thu, May 5, 2011 at 7:28 PM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:

Thanks for the suggestion, but surely there must be a better way to do it
than that?
I don't want to post the whole file, have it extracted on the server,
send the extracted text back to the client, and then send it all back up
to the server again as plain text.


On 05/05/11 14:55, Jay Luker wrote:

Hi Emyr,

You could try using the "extractOnly=true" parameter [1]. Of course,
you'll need to repost the extracted text manually.

--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
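
A minimal sketch of the first half of that round trip, in Python. It only builds the request URL for an extract-only call and does not send anything. The host, port, and handler path are assumptions copied from the URL quoted further down this thread, `build_extract_only_url` is a hypothetical helper name, and `extractFormat=text` is the option the wiki page describes for getting plain text rather than XHTML back, which may not be available in every Solr version.

```python
from urllib.parse import urlencode

# Assumed base URL, matching the handler path used elsewhere in this thread.
SOLR_EXTRACT_URL = "http://localhost:8080/solr/update/extract"


def build_extract_only_url(base_url=SOLR_EXTRACT_URL, plain_text=True):
    """Return a URL asking the extract handler to return content without indexing it."""
    params = [("extractOnly", "true")]
    if plain_text:
        # Ask for plain text instead of the default XHTML serialization.
        params.append(("extractFormat", "text"))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_only_url())
```

The file itself would still have to be posted to this URL, and the returned text reposted in a second request, which is the manual step mentioned above.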


On Thu, May 5, 2011 at 9:36 AM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:

Hi All,

I have Solr and Tika installed and am happily extracting and indexing
various files.
Unfortunately, on some Word documents it blows up because it tries to
auto-generate a 'title' field, but my title field in the schema is
single-valued.

Here is my config for the extract handler...

<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

Is there a config option to make it extract only text, or ideally to
allow me to specify which metadata fields to accept?

E.g. I'd like to use any author metadata it finds but not any title
metadata, as I want title to be single-valued and set explicitly using
a literal.title in the post request.

I did look around for some docs, but all I can find are very basic
examples. There's no comprehensive configuration documentation out there
as far as I can tell.


ALSO...

I get some other bad responses coming back such as...

<html><head><title>Apache Tomcat/6.0.28 - Error report</title><style><!--
H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}
A {color : black;}
A.name {color : black;}
HR {color : #525D76;}
--></style>
</head><body><h1>HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:636)
</h1><HR size="1" noshade="noshade"><p><b>type</b>   Status
report</p><p><b>message</b>

<u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

For the above my url was...


http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
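
For reference, a URL of this shape can be assembled programmatically rather than encoded by hand. The sketch below is only an illustration in Python: `build_extract_url` is a hypothetical helper, and the parameter names and values are simply the ones that appear in the URL above.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, literals, content_type=None, commit=True):
    """Assemble an /update/extract URL with the same parameters as the one quoted above."""
    params = [
        ("literal.id", str(doc_id)),
        ("defaultField", "content"),
        ("fmap.content", "content"),
        ("uprefix", "ignored_"),
    ]
    if content_type:
        params.append(("stream.contentType", content_type))
    if commit:
        params.append(("commit", "true"))
    # Each explicit field value is sent as a literal.<fieldname> parameter.
    for name, value in literals.items():
        params.append(("literal." + name, value))
    # urlencode percent-encodes "/" and turns spaces into "+".
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url(
        "http://localhost:8080/solr/update/extract",
        3922,
        {
            "title": "Reactor cycle 141",
            "notes": "",
            "tag": "UCN_production",
            "author": "Maurits van der Grinten",
        },
        content_type="application/vnd.ms-powerpoint",
    ))
```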

I guess there's something special I need in order to process PowerPoint
files? Maybe I need to get the latest Apache POI? Any suggestions
welcome...
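
A java.lang.NoSuchMethodError generally means the class loaded at runtime is a different version from the one the caller was compiled against, so one thing that might be worth checking is whether more than one version of a jar such as POI is visible to the webapp. The Python sketch below is only a diagnostic aid under that assumption: `find_duplicate_jars` is a hypothetical helper, and the directories passed to it in the usage example are placeholders, not paths taken from this thread.

```python
import re
from collections import defaultdict
from pathlib import Path

# Splits e.g. "poi-3.7.jar" into artifact "poi" and version "3.7";
# jars without a numeric version suffix are skipped.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(roots):
    """Return {artifact: [(version, path), ...]} for artifacts present in more than one version."""
    found = defaultdict(set)
    for root in roots:
        for jar in Path(root).rglob("*.jar"):
            match = JAR_NAME.match(jar.name)
            if match:
                found[match.group("artifact")].add((match.group("version"), str(jar)))
    return {
        artifact: sorted(entries)
        for artifact, entries in found.items()
        if len({version for version, _ in entries}) > 1
    }


if __name__ == "__main__":
    # Placeholder directories; substitute the real webapp and Solr lib locations.
    for artifact, entries in find_duplicate_jars(["./webapps", "./solr/lib"]).items():
        print(artifact)
        for version, path in entries:
            print("   ", version, path)
```

If two POI versions do show up, removing the older one so that only the version shipped with the Tika jars remains would be the first thing to try.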


Regards,

Emyr
