Re: Nutch Fetch Stuck

2010-03-13 Thread Andrzej Bialecki

On 2010-03-13 00:12, Abhi Yerra wrote:

I had -noParsing set, so parsing was not part of the fetch. The
pages have been crawled, but the reducers crashed. If I restart
the fetch, will it try to crawl all those pages again?


Yes. It would be good to first investigate why it crashed; otherwise
it's likely to happen again. Are you running this on a cluster? Check
the logs of the crashed tasks (in logs/userlogs/ on the respective
tasktracker nodes).
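
The log check suggested above could be scripted roughly as follows. This is a minimal Python sketch, not part of Nutch or Hadoop: the function name and the marker strings are illustrative assumptions, and it simply walks whatever files it finds under the given directory.

```python
import os

# Illustrative marker strings (an assumption); adjust them to match
# whatever your failed tasks actually print.
MARKERS = ("Exception", "Error", "FATAL")

def find_task_errors(userlogs_dir, markers=MARKERS):
    """Return (path, line_number, line) for every log line containing a marker.

    Walks userlogs_dir recursively, so it does not depend on the exact
    per-task directory layout. A missing directory yields an empty list.
    """
    hits = []
    for root, _dirs, files in os.walk(userlogs_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, errors="replace") as fh:
                for number, line in enumerate(fh, start=1):
                    if any(marker in line for marker in markers):
                        hits.append((path, number, line.rstrip("\n")))
    return hits
```

Calling find_task_errors("logs/userlogs") on each tasktracker node and printing the result gives a quick first list of places to look.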



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Proxy Authentication

2010-03-13 Thread Susam Pal
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal wrote:
> On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti wrote:
>> On 11/03/2010 16:20, Susam Pal wrote:
>>>
>>> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti wrote:
>>>

>>>> Hi everyone,
>>>>
>>>> I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I
>>>> try to fetch my website list, the log file shows that authentication
>>>> failed.
>>>>
>>>> I've configured my nutch-site.xml file with all the properties needed
>>>> for proxy authentication, but I get this error:
>>>> "httpclient.HttpMethodDirector - No credentials available for BASIC
>>>> 'Squid proxy-caching web server'@proxy.my.host:my.port"


>>>
>>> Did you replace 'protocol-http' with 'protocol-httpclient' in the
>>> value of the 'plugin.includes' property in 'conf/nutch-site.xml'?
>>>
>>> Regards,
>>> Susam Pal
>>>
>>>
>>>
>>
>> Hi Susam,
>>
>> Yes, of course! :) Here is my configuration file:
>>
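>> <configuration>
>>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>my.agent.name</value>
>> </property>
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>> <property>
>>   <name>http.auth.file</name>
>>   <value>my_file.xml</value>
>>   <description>Authentication configuration file for
>>   'protocol-httpclient' plugin.</description>
>> </property>
>>
>> <property>
>>   <name>http.proxy.host</name>
>>   <value>ip.my.proxy</value>
>>   <description>The proxy hostname.  If empty, no proxy is used.</description>
>> </property>
>>
>> <property>
>>   <name>http.proxy.port</name>
>>   <value>my.port</value>
>>   <description>The proxy port.</description>
>> </property>
>>
>> <property>
>>   <name>http.proxy.username</name>
>>   <value>my.user</value>
>> </property>
>>
>> <property>
>>   <name>http.proxy.password</name>
>>   <value>my.pwd</value>
>> </property>
>>
>> <property>
>>   <name>http.proxy.realm</name>
>>   <value>my_realm</value>
>> </property>
>>
>> <property>
>>   <name>http.agent.host</name>
>>   <value>my.local.pc</value>
>>   <description>The agent host.</description>
>> </property>
>>
>> <property>
>>   <name>http.useHttp11</name>
>>   <value>true</value>
>> </property>
>>
>> </configuration>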
>>
>> One more question: where should I put the user authentication parameters
>> (user, password)? In the nutch-site.xml file, or in the my_file.xml that
>> I use for authentication?
>>
>> Thank you for your attention,
>>
>>
>> --
>> ---
>>
>> Graziano Aliberti
>>
>> Engineering Ingegneria Informatica S.p.A
>>
>> Via S. Martino della Battaglia, 56 - 00185 ROMA
>>
>> *Tel.:* 06.49.201.387
>>
>> *E-Mail:* graziano.alibe...@eng.it
>>
>>
>>
>
> The configuration looks okay to me. Yes, the proxy authentication
> details are set in 'conf/nutch-site.xml'. The file mentioned in
> 'http.auth.file' property is used for configuring authentication
> details for authenticating to a web server.
>
> Unfortunately, there aren't any log statements in the part of the code
> that reads the proxy authentication details, so turning on debug logs
> won't give you any clues about the issue. However, in case you want to
> troubleshoot it yourself by building Nutch from source, I can point
> you to the code that deals with this.
>
> The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
>
> The line number is: 200.
>
> If I get time this weekend, I will try to insert some log statements
> into this code and send a modified JAR file to you which might help
> you to troubleshoot what is going on. But I can't promise this since
> it depends on my weekend plans.
>
> Two questions before I end this mail. Did you set the value of
> 'http.proxy.realm' property as: Squid proxy-caching web server ?
>
> Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
> file? I'm not sure whether this line should appear for proxy
> authentication but it does appear for web server authentication.
>
> Regards,
> Susam Pal
>
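
To rule out a typo in the property names, the proxy settings discussed above can be checked mechanically. The following is a rough Python sketch, assuming the usual layout of nutch-site.xml (a configuration root containing property elements with name and value children). The helper names are hypothetical. It only checks that the http.proxy.* properties quoted in this thread are present and non-empty and that plugin.includes mentions protocol-httpclient; it does not check that the proxy accepts the credentials or that the realm matches.

```python
import xml.etree.ElementTree as ET

# Property names taken from the configuration quoted in this thread.
PROXY_PROPERTIES = (
    "http.proxy.host",
    "http.proxy.port",
    "http.proxy.username",
    "http.proxy.password",
    "http.proxy.realm",
)

def load_properties(path):
    """Parse a configuration file of <property> elements into a {name: value} dict."""
    root = ET.parse(path).getroot()
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props

def check_proxy_config(props):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for name in PROXY_PROPERTIES:
        if not props.get(name):
            problems.append(f"{name} is missing or empty")
    if "protocol-httpclient" not in props.get("plugin.includes", ""):
        problems.append("plugin.includes does not enable protocol-httpclient")
    return problems
```

For example, check_proxy_config(load_properties("conf/nutch-site.xml")) returns an empty list when all of these properties are set.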

I managed to find some time to insert more log statements into
protocol-httpclient and build a JAR. I have attached it to this
email.

Please replace your
'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
one that I have attached. Also, edit your 'conf/log4j.properties' file
to add these two lines:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

When you run a crawl now, you should see more output in
'logs/hadoop.log' than before. I hope it gives you some clues. In case
you want to compare the logs with the control flow in the source code,
I have attached the Java file as well.
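
Since 'logs/hadoop.log' also contains output from the rest of the crawl, it may help to pull out only the authentication-related lines. Here is a small Python sketch. The logger-name fragments are assumptions based on the log excerpts quoted in this thread ("httpclient.HttpMethodDirector", "auth.AuthChallengeProcessor"), and the function name is hypothetical.

```python
# Logger-name fragments seen in this thread's log excerpts; treat them as
# assumptions about how your log4j pattern abbreviates logger names.
DEFAULT_FRAGMENTS = ("httpclient.", "auth.")

def filter_log_lines(lines, fragments=DEFAULT_FRAGMENTS):
    """Yield only the lines that mention one of the logger-name fragments."""
    for line in lines:
        if any(fragment in line for fragment in fragments):
            yield line.rstrip("\n")
```

Passing an open file handle for 'logs/hadoop.log' as the lines argument and printing each result leaves only the lines relevant to this problem.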

Regards,
Susam Pal


protocol-httpclient.jar
Description: application/java-archive
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0