Julien Nioche wrote:
> Hi
>
>
>   
>> Anyway, from the logs, i see that it ended up trying to parse Thumb.db
>> files and some .pyd files which looked to it like plain-text.
>> By ignoring .db and .pyd files in crawl-urlfilter i managed to get the
>> number of hanging threads down to a lower number. This is good to
>> understand what's happening but i can't predict all the extensions the
>> crawler is going to meet during the filesystem crawl.
>>
>> my parsing configuration is :
>>
>>
>>  <name>plugin.includes</name>
>>
>> <value>protocol-httpclient|parse-(rss|text|html|tika)|language-identifier|urlfilter-regex|index-(basic|anchor)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>
>>
>> Do you have any suggestions? Should i file an issue?
>>
>>     
>
> First I'd do the parsing separately from the fetching. If the parsing fails
> you can always retry with different params without having to refetch.
>
> I think there are some issues with protocol-httpclient so if you don't need
> identification use protocol-http instead.
>
>   
I'm starting to believe that the problem is here.
I'm trying to reproduce it outside of the "crawl" command, using a
multi-step script.
The problem shows up again before the parse command, so I guess it is
indeed in protocol-httpclient, though I can't understand what's
happening. Why is the Tika parser called for TXT during the fetching
phase? The parser is called on the content in the Fetcher output() method.
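
For reference, the multi-step approach can be sketched as below. This is
only an illustration: the function name and the paths are made up, the
commands are built but not executed, and it assumes a Nutch 1.x layout
where bin/nutch offers inject, generate, fetch, parse and updatedb. As far
as I can tell, whether the fetcher parses at all is controlled by the
fetcher.parse property, so the sketch assumes it is set to false in
nutch-site.xml.

```python
import os
from typing import List


def crawl_round(crawldb: str, segment: str, seed_dir: str,
                nutch: str = "bin/nutch") -> List[List[str]]:
    """Build the commands for one crawl round, step by step.

    Nothing is executed here; the caller can run each command with
    subprocess and inspect the logs between steps.
    Assumption: fetcher.parse=false, so "fetch" only fetches and the
    separate "parse" step does the parsing.
    """
    # Segments live side by side in one directory; "generate" writes there.
    segments_dir = os.path.dirname(segment)
    return [
        [nutch, "inject", crawldb, seed_dir],
        [nutch, "generate", crawldb, segments_dir],
        [nutch, "fetch", segment],
        [nutch, "parse", segment],
        [nutch, "updatedb", crawldb, segment],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in crawl_round("crawl/crawldb",
                           "crawl/segments/20100101000000", "urls"):
        print(" ".join(cmd))
```

Note that in a real run the segment name is only known after the generate
step, since it is the newest directory created under the segments directory.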

I'll try to switch back to protocol-http and ask the sysadmins to put my
IP in the webserver's whitelist.
> See https://issues.apache.org/jira/browse/NUTCH-696 - Andrzej has provided a
> patch which would be the solution to your problem
>
>   

That could be a nice idea, actually.

> Finally if you are able to identify the URLs which are problematic in the
> logs try using Tika directly to see whether the problem comes from there
> (e.g. parsing loop, wrong identification of mime-type etc...) and if so file
> an issue on the Tika JIRA.
>
> HTH
>
> Julien
>   
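
Once the problematic URLs have been pulled out of the logs, a small script
can show which extensions are involved and produce an exclusion line for
the URL filter. A minimal sketch, assuming the URLs are already available
as a list; the function names and the sample URLs are made up, and the
output follows the "-regex" exclusion syntax used in crawl-urlfilter.txt
and regex-urlfilter.txt.

```python
import posixpath
import re
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse


def extension_counts(urls: Iterable[str]) -> Counter:
    """Count lower-cased file extensions over the given URLs."""
    counts: Counter = Counter()
    for url in urls:
        path = urlparse(url).path
        ext = posixpath.splitext(path)[1].lower().lstrip(".")
        if ext:  # skip URLs without an extension
            counts[ext] += 1
    return counts


def exclusion_rule(extensions: Iterable[str]) -> str:
    """Build one urlfilter line that rejects URLs ending in these extensions."""
    alternatives = "|".join(re.escape(e) for e in sorted(extensions))
    return "-\\.(" + alternatives + ")$"


if __name__ == "__main__":
    # Hypothetical URLs, standing in for the ones found in the logs.
    problematic = [
        "file:///data/pics/Thumbs.db",
        "file:///data/lib/_socket.pyd",
    ]
    print(extension_counts(problematic))
    print(exclusion_rule(extension_counts(problematic)))
```

The rule is case-sensitive as written, so upper-case variants of an
extension would have to be listed as well.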


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
[email protected] http://www.tis.bz.it


