Re: Not all parsed docs is indexed & inconsistent parsed docs.

Lewis John Mcgibbney Sat, 12 Jan 2013 09:02:57 -0800

Hi,

On Fri, Jan 11, 2013 at 3:12 PM, Bayu Widyasanyata
<[email protected]> wrote:


>
> We can see that some of parse processes were not completed successfully.
>

Yes, I see this. I also see that you have http.proxy.port = 8080 set but no
proxy host, and that the protocol-httpclient plugin is not activated.
I also see some strange fetcher behaviour: it seems to fetch the server
instance itself, e.g. 2013-01-12 05:37:41,987 INFO  fetcher.FetcherJob - fetching
http://localhost/, although I assume there is no document at this location on
the server...

That being said, as we've established, fetching does not seem to be the
problem.

Unless you wish to skip parsing for truncated documents, you will need
to increase http.content.limit to something over ~400K (the PDF in the log
below is 395125 bytes, and the current limit is 65536). This should then
remove the following log output (meaning that the document should be fully
parsed):
2013-01-12 05:38:27,508 WARN  parse.ParserJob -
http://localhost/sapi/Solr-install-v2.pdf skipped. Content of size 395125
was truncated to 65536
You may also wish to consider the parser.skip.truncated property in
nutch-site.xml.
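
For illustration, the two properties could be overridden in nutch-site.xml
roughly as follows. This is only a sketch: 524288 is an example value I picked
because it is above the 395125 bytes reported in the log, not a recommended
setting, and you would normally need only one of the two overrides. As far as
I know a negative http.content.limit disables truncation altogether, and
parser.skip.truncated defaults to true.

```xml
<!-- nutch-site.xml (sketch; values are examples, not recommendations) -->
<property>
  <name>http.content.limit</name>
  <!-- example: 512 KB, large enough for the 395125-byte PDF in the log -->
  <value>524288</value>
</property>

<property>
  <name>parser.skip.truncated</name>
  <!-- false = still attempt to parse documents that were truncated -->
  <value>false</value>
</property>
```

Raising the limit is usually the better choice, since parsing a truncated PDF
is likely to fail or give incomplete text anyway.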

I don't suppose these PDFs are password protected or something like that?

I would also explicitly map the content type
application/vnd.oasis.opendocument.text to parse-tika in parse-plugins.xml.

2013-01-12 05:39:07,594 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content type
application/vnd.oasis.opendocument.text, but they are not mapped to it  in
the parse-plugins.xml file
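
A mapping along these lines inside the <parse-plugins> element of
parse-plugins.xml should make that message go away. This is a sketch that
assumes the stock file, where the parse-tika alias already points at
org.apache.nutch.parse.tika.TikaParser; check the aliases section of your
copy if it has been modified.

```xml
<!-- parse-plugins.xml (sketch): map the ODT content type to parse-tika -->
<mimeType name="application/vnd.oasis.opendocument.text">
  <plugin id="parse-tika" />
</mimeType>
```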
