Julien Nioche wrote:
> Hi
>
>> Anyway, from the logs, i see that it ended up trying to parse Thumb.db
>> files and some .pyd files which looked to it like plain-text.
>> By ignoring .db and .pyd files in crawl-urlfilter i managed to get the
>> number of hanging threads down to a lower number. This is good to
>> understand what's happening but i can't predict all the extensions the
>> crawler is going to meet during the filesystem crawl.
>>
>> my parsing configuration is :
>>
>> <name>plugin.includes</name>
>>
>> <value>protocol-httpclient|parse-(rss|text|html|tika)|language-identifier|urlfilter-regex|index-(basic|anchor)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>
>> Do you have any suggestions? Should i file an issue?
>
> First I'd do the parsing separately from the fetching. If the parsing fails
> you can always retry with different params without having to refetch.
>
> I think there are some issues with protocol-httpclient so if you don't need
> identification use protocol-http instead.

I'm starting to believe the problem is here. I'm trying to reproduce it outside of the "crawl" command, using a multi-step script. The problem happens again, before the parse command is even run, so I guess it is indeed in protocol-httpclient, though I can't understand what's happening. Why is the Tika parser called for TXT during the fetching phase? The parser is called on the content in the Fetcher output() method.
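For reference, Julien's two suggestions could be written as overrides in conf/nutch-site.xml. This is only a sketch: it assumes the Nutch version in use reads the fetcher.parse property (as listed in nutch-default.xml), and the plugin.includes value is simply the one quoted above with protocol-httpclient swapped for protocol-http.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: with fetcher.parse set to false the Fetcher only stores
       the raw content and does not call the parsers from output(). -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>

  <!-- Same plugin list as in the quoted configuration, with
       protocol-httpclient replaced by protocol-http. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-(rss|text|html|tika)|language-identifier|urlfilter-regex|index-(basic|anchor)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

With parsing disabled in the fetcher, the parse command is run as its own step on the fetched segment, so a failing parse can be retried with different parameters without refetching.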
I'll try switching back to protocol-http and ask the sysadmins to put my IP on the web server's whitelist.

> See https://issues.apache.org/jira/browse/NUTCH-696 - Andrzej has provided a
> patch which would be the solution to your problem

That could be a nice idea, actually.

> Finally if you are able to identify the URLs which are problematic in the
> logs try using Tika directly to see whether the problem comes from there
> (e.g. parsing loop, wrong identification of mime-type etc...) and if so file
> an issue on the Tika JIRA.
>
> HTH
>
> Julien

--
Claudio Martella
Digital Technologies Unit
Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax +39 0471 068 129
http://www.tis.bz.it

