Why not feed the original PDF files in instead? Just curious if pdftotext is doing a better job than Tika's PDFBox stuff.

        Erik

On Mar 22, 2010, at 9:30 AM, Ross wrote:

Thanks Georg

I don't think it's that, because it crashes on a one-word test file I
created using the nano editor. I don't think nano is adding anything
extra.

My real files are created by a Windows utility called pdftotext. I
solved the problem by getting pdftotext to generate html files rather
than plain text. It just adds an html header and wraps everything in a
<pre> tag. That seems to keep Solr happy.
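
For anyone who wants to try the same workaround without pdftotext, here is a minimal sketch of the idea in Python. It assumes the only requirement is an HTML header plus a <pre> wrapper; the function name wrap_text_as_html and the exact markup are illustrative, not what pdftotext actually emits.

```python
import html


def wrap_text_as_html(text: str, title: str = "") -> str:
    """Wrap plain text in a minimal HTML page with a <pre> body.

    The markup is an assumption made for illustration; pdftotext's real
    output may differ in its header and attributes.
    """
    # Escape &, < and > so the original text cannot be read as markup.
    body = html.escape(text, quote=False)
    return (
        f"<html><head><title>{html.escape(title)}</title></head>"
        f"<body><pre>{body}</pre></body></html>"
    )


if __name__ == "__main__":
    print(wrap_text_as_html("XXBLE"))
```

The resulting file would then be posted to the extract handler as HTML rather than plain text, so the plain-text parser is not involved.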

Ross

On Mon, Mar 22, 2010 at 9:08 AM, György Frivolt
<gyorgy.friv...@gmail.com> wrote:
Hi,

I had a problem indexing documents some months ago as well. I found that there were XML control characters in the documents and that Solr did not
handle them. Maybe that is the case for you as well.
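
If control characters do turn out to be the cause, one possible pre-processing step is to strip everything that XML 1.0 does not allow before sending the text to Solr. The sketch below is only an assumption about what such a filter could look like; the function name strip_invalid_xml_chars is made up for this example and is not part of Solr or Tika.

```python
import re

# Code points outside the XML 1.0 "Char" production: C0 controls other
# than tab, line feed and carriage return, lone surrogates, and the
# noncharacters U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # A form feed (\x0c) is a typical leftover from PDF-to-text conversion.
    print(repr(strip_invalid_xml_chars("XX\x0cBLE\n")))
```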

Regards,

   Georg


On Sun, Mar 21, 2010 at 5:58 PM, Ross <tetr...@gmail.com> wrote:

Hi all

I'm trying to import some text files. I'm mostly following Avi
Rappoport's tutorial.  Some of my files cause Solr to crash while
indexing. I've narrowed it down to a very simple example.

I have a file named test.txt with one line. That line is the word
XXBLE and nothing else.

This is the command I'm using.

curl "http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true" -F "myfi...@test.txt"

The result is pasted below. Other files work just fine. The problem
seems to be related to the letters B and E. If I change them to
something else or make them lower case then it works. In my real
files, the XX is something else but the result is the same. It's a
common word in the files. I guess for this "quick and dirty" job I'm
doing I could do a bulk replace in the files to make it lower case.
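
A rough sketch of that bulk replace, assuming the files are plain text and the problem word is known in advance. The helper name lowercase_word is hypothetical, and this only works around the crash; it does not explain it.

```python
import re


def lowercase_word(text: str, word: str) -> str:
    """Replace whole-word occurrences of `word` with its lower-case form."""
    # \b keeps longer words that merely contain `word` untouched.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return pattern.sub(word.lower(), text)


if __name__ == "__main__":
    print(lowercase_word("XXBLE and XXBLEX", "XXBLE"))
```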

Is there any workaround for this?

Thanks
Ross

<html><head><title>Apache Tomcat/6.0.20 - Error report</title><style><!--
H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}
A {color : black;}
A.name {color : black;}
HR {color : #525D76;}
--></style> </head><body><h1>HTTP Status 500 -
org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.txt.txtpar...@19ccba

org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.txt.txtpar...@19ccba
       at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
       at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
       at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
       at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
       at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
       at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
       at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
       at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
       at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
       at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
       at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
       at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
       at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
       at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
       at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
       at java.lang.Thread.run(Thread.java:636)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
       at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
       ... 18 more
Caused by: java.lang.NullPointerException
       at java.io.Reader.&lt;init&gt;(Reader.java:78)
       at java.io.BufferedReader.&lt;init&gt;(BufferedReader.java:93)
       at java.io.BufferedReader.&lt;init&gt;(BufferedReader.java:108)
       at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:59)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
       ... 20 more
</h1><HR size="1" noshade="noshade"><p><b>type</b> Status
report</p><p><b>message</b>
<u>org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba

</u></p><p><b>description</b> <u>The server encountered an internal
error (org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba

) that prevented it from fulfilling this request.</u></p><HR size="1"
noshade="noshade"><h3>Apache Tomcat/6.0.20</h3></body></html>


