Hi

>
> Anyway, from the logs, I see that it ended up trying to parse Thumbs.db
> files and some .pyd files which looked to it like plain text.
> By ignoring .db and .pyd files in crawl-urlfilter I managed to reduce the
> number of hanging threads. This helps in understanding what's happening,
> but I can't predict all the extensions the crawler is going to meet during
> the filesystem crawl.
>
> My parsing configuration is:
>
>
>  <name>plugin.includes</name>
>
> <value>protocol-httpclient|parse-(rss|text|html|tika)|language-identifier|urlfilter-regex|index-(basic|anchor)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> Do you have any suggestions? Should I file an issue?
>

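(As an aside, the rules in crawl-urlfilter.txt are plain regular expressions, so, assuming the default regex filter syntax, a single line such as

-\.(db|pyd)$

covers both extensions at once; but as you say, you can't list every extension in advance.)
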
First, I'd do the parsing separately from the fetching. If the parsing fails,
you can always retry with different parameters without having to refetch.
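
As a rough sketch, assuming a recent 1.x build: setting

<property>
  <name>fetcher.parse</name>
  <value>false</value>
</property>

in nutch-site.xml turns off parsing during the fetch; you can then run the
parse step on its own against the fetched segment (e.g. bin/nutch parse
<segment_dir>) and re-run just that step when it fails.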

I think there are some issues with protocol-httpclient, so if you don't need
authentication, use protocol-http instead.
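
In your plugin.includes that just means swapping the first entry, e.g. (based
on the config you posted above):

<value>protocol-http|parse-(rss|text|html|tika)|language-identifier|urlfilter-regex|index-(basic|anchor)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>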

See https://issues.apache.org/jira/browse/NUTCH-696 - Andrzej has provided a
patch there which should be the solution to your problem.

Finally, if you are able to identify the problematic URLs in the logs, try
using Tika directly to see whether the problem comes from there (e.g. a
parsing loop, wrong identification of the mime-type, etc.) and, if so, file
an issue on the Tika JIRA.
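
Something along these lines (just a sketch, assuming the standalone Tika jars
are on the classpath; the class name and file path are placeholders) prints
the detected mime-type and shows whether a full parse completes for one of
the offending files:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCheck {
  public static void main(String[] args) throws Exception {
    String path = args[0]; // one of the problematic files from the logs

    // What mime-type does Tika detect for this file?
    Tika tika = new Tika();
    System.out.println("Detected type: " + tika.detect(Paths.get(path).toFile()));

    // Try a full parse: if this hangs or throws, the problem is on the Tika side
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
    try (InputStream in = Files.newInputStream(Paths.get(path))) {
      new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
    }
    System.out.println("Parse OK, extracted " + handler.toString().length() + " chars");
  }
}

If the hang or misdetection is reproducible there, that is exactly the kind of
test case worth attaching to the Tika issue.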

HTH

Julien
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
