Thanks for the suggestion, but surely there must be a better way to do it? I don't want to post the whole file, have it extracted on the server, receive the extracted text back on the client, and then send it all back to the server again as plain text.

On 05/05/11 14:55, Jay Luker wrote:
Hi Emyr,

You could try using the "extractOnly=true" parameter [1]. Of course,
you'll need to repost the extracted text manually.

--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
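
A minimal sketch of the client side of that two-step approach, in Python. It only builds the extractOnly request URL and assembles the document to repost; the HTTP calls and the parsing of the handler's response are left out. The base URL and the id/title/content/author field names are taken from elsewhere in this thread. The helper names and the shape of the metadata mapping (field name to a list of values) are assumptions made for illustration, not part of Solr's API.

```python
from urllib.parse import urlencode

# Base URL as used in the request quoted later in this thread.
SOLR_EXTRACT_URL = "http://localhost:8080/solr/update/extract"


def build_extract_only_url(base_url=SOLR_EXTRACT_URL):
    """Return a URL asking the extract handler to extract without indexing."""
    return base_url + "?" + urlencode({"extractOnly": "true"})


def select_metadata(metadata, wanted):
    """Keep only the wanted metadata fields, collapsing each to one value.

    `metadata` is assumed to map a field name to a list of values
    (hypothetical shape; adapt it to however the response is parsed).
    """
    selected = {}
    for name in wanted:
        values = metadata.get(name)
        if not values:
            continue
        selected[name] = values[0] if isinstance(values, (list, tuple)) else values
    return selected


def build_document(doc_id, title, text, metadata, wanted=("author",)):
    """Assemble the document to repost, with an explicit single title."""
    doc = {"id": doc_id, "title": title, "content": text}
    # Explicitly supplied fields win over anything found in the metadata.
    for name, value in select_metadata(metadata, wanted).items():
        doc.setdefault(name, value)
    return doc
```

Because the title is passed in explicitly and is never copied from the extracted metadata, it stays single-valued no matter what the file contains.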


On Thu, May 5, 2011 at 9:36 AM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:
Hi All,

I have Solr and Tika installed and am happily extracting and indexing
various files.
Unfortunately, on some Word documents it blows up because it tries to
auto-generate a 'title' field, but my title field in the schema is
single-valued.

Here is my config for the extract handler...

<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

Is there a config option to make it extract only the text, or ideally to
let me specify which metadata fields to accept?

E.g. I'd like to use any author metadata it finds but not any title
metadata, as I want title to be single-valued and set explicitly using a
literal.title in the post request.

I did look around for some docs, but all I can find are very basic examples.
There's no comprehensive configuration documentation out there as far as I
can tell.


ALSO...

I get some other bad responses coming back such as...

<html><head><title>Apache Tomcat/6.0.28 - Error report</title><style><!--H1
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}
A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
</head><body><h1>HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:636)
</h1><HR size="1" noshade="noshade"><p><b>type</b>  Status
report</p><p><b>message</b>
<u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

For the above my url was...

  
http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten

I guess there's something special I need in order to process PowerPoint
files? Maybe I need to get the latest Apache POI? Any suggestions
welcome...
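
A java.lang.NoSuchMethodError generally means that the class loaded at runtime does not have a method the calling code was compiled against, so one thing that may be worth ruling out is a stale or duplicate POI jar on the classpath. The Python sketch below lists the POI jar versions found under a directory. The jar naming pattern and the helper names are assumptions for illustration, and the directory to scan (for example the webapp's lib directory) depends on the installation.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern, e.g. poi-3.6.jar or poi-scratchpad-3.7-beta2.jar.
POI_JAR = re.compile(r"^(poi(?:-[a-z]+)*)-(\d[\w.\-]*)\.jar$")


def find_poi_jars(root):
    """Map each POI artifact name to the sorted versions found under root."""
    found = defaultdict(set)
    for jar in Path(root).rglob("*.jar"):
        match = POI_JAR.match(jar.name)
        if match:
            found[match.group(1)].add(match.group(2))
    return {name: sorted(versions) for name, versions in found.items()}


def conflicting(found):
    """Return the artifacts that are present in more than one version."""
    return {name: versions for name, versions in found.items() if len(versions) > 1}
```

If conflicting(find_poi_jars(path)) is non-empty, or the POI artifacts carry different versions from one another, that would be consistent with the error above, though it does not prove it is the cause.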


Regards,

Emyr

