On 2010-07-06 16:22, Julien Nioche wrote:
>> I'm trying to reproduce the problem outside of the "crawl" command,
>> through a multi-step script approach.
>> The problem happens again before the parse command.
>
> You specified -noparse on the fetch command line, didn't you?
>
>> I guess the problem
>> is indeed in protocol-httpclient, though I can't understand what's
>> happening. Why is the Tika parser called for TXT in the fetching phase?
>> The parser is called on the content in the Fetcher output() method.
>
> The parser should not be called at all if you specify -noparse for the fetch.
> As for TXT, the parser is used to find outlinks.

Careful here - the option is called -noParsing. -noparse won't work; in
that case Fetcher will default to whatever is set in
nutch-site.xml / nutch-default.xml (which is often set to parse).
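
To make the pitfall concrete, here is a minimal sketch of the behaviour described above. It is an illustrative model in Python, not Nutch's actual code: the function name resolve_parsing and the config_default parameter are made up for this example, and the only assumption taken from the thread is that an unrecognized flag is ignored and the configured default wins.

```python
def resolve_parsing(args, config_default=True):
    """Return True if content would be parsed during the fetch.

    Hypothetical model: config_default stands in for whatever the
    nutch-site.xml / nutch-default.xml setting resolves to.
    """
    parsing = config_default
    for arg in args:
        # Only the exact spelling is recognized (case-sensitive match).
        if arg == "-noParsing":
            parsing = False
        # Any other token, including "-noparse", is silently ignored,
        # so the configured default stays in effect.
    return parsing


if __name__ == "__main__":
    # Misspelled flag: falls back to the configured default (parsing on).
    print(resolve_parsing(["segment_dir", "-noparse"]))
    # Correct flag: parsing is switched off for the fetch.
    print(resolve_parsing(["segment_dir", "-noParsing"]))
```

With the default set to parse, the first call reports that parsing stays on and the second that it is off, which would explain why the parser still runs during the fetch.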
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com