On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal wrote:
> On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
> wrote:
>> Il 11/03/2010 16.20, Susam Pal ha scritto:
>>>
>>> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I
>>>> try to fetch my website list, the log file shows that authentication
>>>> failed. I've configured my nutch-site.xml file with all the
>>>> properties needed for proxy authentication, but I get this error:
>>>>
>>>> "httpclient.HttpMethodDirector - No credentials available for BASIC
>>>> 'Squid proxy-caching web server'@proxy.my.host:my.port"
>>>
>>> Did you replace 'protocol-http' with 'protocol-httpclient' in the
>>> value for 'plugin.includes' property in 'conf/nutch-site.xml'?
>>>
>>> Regards,
>>> Susam Pal
>>>
>>>
>>>
>>
>> Hi Susam,
>>
>> Yes, of course! :) Here is my configuration file:
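>>
>> <configuration>
>>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>my.agent.name</value>
>> </property>
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>> <property>
>>   <name>http.auth.file</name>
>>   <value>my_file.xml</value>
>>   <description>Authentication configuration file for
>>   'protocol-httpclient' plugin.</description>
>> </property>
>>
>> <property>
>>   <name>http.proxy.host</name>
>>   <value>ip.my.proxy</value>
>>   <description>The proxy hostname. If empty, no proxy is used.</description>
>> </property>
>>
>> <property>
>>   <name>http.proxy.port</name>
>>   <value>my.port</value>
>>   <description>The proxy port.</description>
>> </property>
>>
>> <property>
>>   <name>http.proxy.username</name>
>>   <value>my.user</value>
>> </property>
>>
>> <property>
>>   <name>http.proxy.password</name>
>>   <value>my.pwd</value>
>> </property>
>>
>> <property>
>>   <name>http.proxy.realm</name>
>>   <value>my_realm</value>
>> </property>
>>
>> <property>
>>   <name>http.agent.host</name>
>>   <value>my.local.pc</value>
>>   <description>The agent host.</description>
>> </property>
>>
>> <property>
>>   <name>http.useHttp11</name>
>>   <value>true</value>
>> </property>
>>
>> </configuration>
>>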
>> One more question: where must I put the user authentication parameters
>> (user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I use
>> for authentication?
>>
>> Thank you for your attention,
>>
>>
>> --
>> ---
>>
>> Graziano Aliberti
>>
>> Engineering Ingegneria Informatica S.p.A
>>
>> Via S. Martino della Battaglia, 56 - 00185 ROMA
>>
>> *Tel.:* 06.49.201.387
>>
>> *E-Mail:* graziano.alibe...@eng.it
>>
>>
>>
>
> The configuration looks okay to me. Yes, the proxy authentication
> details are set in 'conf/nutch-site.xml'. The file mentioned in
> 'http.auth.file' property is used for configuring authentication
> details for authenticating to a web server.
>
> Unfortunately, there aren't any log statements in the part of the code
> that reads the proxy authentication details, so I can't suggest
> turning on debug logs to get some clues about the issue. However, in
> case you want to troubleshoot it yourself by building Nutch from
> source, I can point you to the code that deals with this.
>
> The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
>
> The line number is: 200.
>
> If I get time this weekend, I will try to insert some log statements
> into this code and send a modified JAR file to you which might help
> you to troubleshoot what is going on. But I can't promise this since
> it depends on my weekend plans.
>
> Two questions before I end this mail. Did you set the value of
> 'http.proxy.realm' property as: Squid proxy-caching web server ?
>
> Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
> file? I'm not sure whether this line should appear for proxy
> authentication but it does appear for web server authentication.
>
> Regards,
> Susam Pal
>
I managed to find some time to insert more log statements into
protocol-httpclient and build a JAR. I have attached it to this
email.

Please replace your
'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
one that I have attached. Also, edit your 'conf/log4j.properties' file
to add these two lines:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

When you run a crawl now, you should see more logs in
'logs/hadoop.log' than before. I hope they give you some
clues. In case you want to compare the logs with the control flow
in the source code, I have attached the Java file as well.

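Before digging through the debug logs, it may help to rule out a plain
mismatch between the configuration and the proxy's challenge. The checks
discussed in this thread can be sketched roughly as below. This is only an
illustration and not part of Nutch: the class 'ProxyConfigCheck' and its
methods are hypothetical names, it uses nothing but the JDK's XML parser,
and it assumes that an empty 'http.proxy.realm' is meant to match any realm.
It reads the name/value pairs from a nutch-site.xml style document and
compares 'http.proxy.realm' with the realm quoted in the "No credentials
available" error.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of Nutch): sanity-checks proxy settings. */
class ProxyConfigCheck {

    /** Collects every property name/value pair from the configuration XML. */
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    String value = childText(property, "value");
                    props.put(name, value == null ? "" : value);
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    /** Returns the trimmed text of the first matching child, or null if absent. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** Lists likely problems; challengeRealm is the realm shown in the error log. */
    static List<String> problems(Map<String, String> props, String challengeRealm) {
        List<String> found = new ArrayList<>();
        if (!props.getOrDefault("plugin.includes", "").contains("protocol-httpclient")) {
            found.add("plugin.includes does not enable protocol-httpclient");
        }
        String[] required = {"http.proxy.host", "http.proxy.port",
                             "http.proxy.username", "http.proxy.password"};
        for (String key : required) {
            if (props.getOrDefault(key, "").isEmpty()) {
                found.add(key + " is missing or empty");
            }
        }
        // Assumption: an empty realm is treated as "any realm", so only a
        // non-empty value that differs from the challenge is reported.
        String realm = props.getOrDefault("http.proxy.realm", "");
        if (!realm.isEmpty() && !realm.equals(challengeRealm)) {
            found.add("http.proxy.realm is '" + realm
                    + "' but the proxy announced '" + challengeRealm + "'");
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java ProxyConfigCheck <nutch-site.xml> <realm>");
            return;
        }
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String problem : problems(parse(xml), args[1])) {
            System.out.println(problem);
        }
    }
}
```

After compiling it with javac, it could be run as:
java ProxyConfigCheck conf/nutch-site.xml "Squid proxy-caching web server"
With the placeholder values posted earlier in this thread, the only line it
would print is the one about 'http.proxy.realm', since 'my_realm' differs
from the realm in the error message.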
Regards,
Susam Pal
protocol-httpclient.jar
Description: application/java-archive
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0