Well spotted! I should have checked the name indeed. Anyway, Claudio, let us
know how you are getting on.

Julien

On 6 July 2010 17:23, Andrzej Bialecki <[email protected]> wrote:

> On 2010-07-06 16:22, Julien Nioche wrote:
>
>>> I'm trying to reproduce the problem outside of the "crawl" command,
>>> through a multi-step script approach.
>>> The problem happens again before the parse command.
>>>
>>
>>
>> you specified -noparse on the fetch command line, didn't you?
>>
>>
>>> I guess the problem
>>> is indeed in protocol-httpclient, though I can't understand what's
>>> happening. Why is the Tika parser called for TXT in the fetching
>>> phase? The parser is called on the content in the Fetcher output() method.
>>>
>>>
>> the parser should not be called at all if you specify -noparse for the
>> fetch
>>
>> as for TXT the parser is used to find outlinks
>>
>>
> Careful here - the option is called -noParsing. -noparse won't work; in
> that case Fetcher will default to whatever was set in
> nutch-site/nutch-default.xml (which often is set to parsing).
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
