Hello!

I am sending a PDF document to Tika Server and it is being detected as a plain 
text file (see full stack trace at bottom). If I specify 'Content-Type: 
application/pdf' in the header of the request, then Tika is able to extract 
content. In the tests below, mydocument.pdf is simply a text file I printed to 
PDF using Google Chrome.

Am I wrong to expect Tika to determine the document type without any 
additional help?

Sent:
  curl -X PUT http://localhost:9998/tika --data-binary "@mydocument.pdf"
  curl -X PUT http://localhost:9998/tika -F "[email protected]"
Received:
  HTTP 415 Unsupported Media Type exception

Sent:
  curl -X PUT http://localhost:9998/tika --data-binary "@mydocument.pdf" -H "Content-Type: application/pdf"
  curl -X PUT http://localhost:9998/meta -F "[email protected]" -H "Content-Type: application/pdf"
Received:
  Text for the PDF
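
For reference, the working request above can be wrapped in a small shell 
function. This is only a sketch: the function name tika_extract and the 
TIKA_URL variable are made up for illustration and are not part of Tika or 
curl, and the application/pdf default is an assumption that fits this test 
file only.

```shell
#!/bin/sh
# Hypothetical helper: the function name and TIKA_URL are illustrative only.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

tika_extract() {
  tika_file="$1"
  tika_type="${2:-application/pdf}"   # assumed default; pass another type as $2
  # --data-binary sends the file bytes unchanged. Without an explicit -H,
  # curl labels such a body application/x-www-form-urlencoded.
  curl -X PUT "$TIKA_URL" \
       --data-binary "@$tika_file" \
       -H "Content-Type: $tika_type"
}

# Usage (requires a running Tika Server):
#   tika_extract mydocument.pdf
```

The default content type that curl attaches to a --data-binary body matches 
the first INFO line of the server log further down, which may be relevant to 
the detection result.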


INFO  tika (application/x-www-form-urlencoded)
WARN  tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@1469bc28
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:390)
        at org.apache.tika.server.resource.TikaResource$5.write(TikaResource.java:489)
        at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
        at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1414)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:243)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:119)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:82)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
        at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:647)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
        at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:125)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 32 more
ERROR Problem with writing the data, class org.apache.tika.server.resource.TikaResource$5, ContentType: text/plain


Thanks!
Harinder
