Re: Proxy Authentication
On 13/03/2010 22.55, Susam Pal wrote:

On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal <susam@gmail.com> wrote:

On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti <graziano.alibe...@eng.it> wrote:

On 11/03/2010 16.20, Susam Pal wrote:

On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti <graziano.alibe...@eng.it> wrote:

Hi everyone,

I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I try to fetch my website list, the log file shows that authentication failed. I have configured my nutch-site.xml file with all the properties needed for proxy auth, but my error is:

httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port

Did you replace 'protocol-http' with 'protocol-httpclient' in the value of the 'plugin.includes' property in 'conf/nutch-site.xml'?

Regards,
Susam Pal

Hi Susam,

yes, of course!! :) Here is the configuration file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my.agent.name</value>
    <description></description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description></description>
  </property>
  <property>
    <name>http.auth.file</name>
    <value>my_file.xml</value>
    <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>ip.my.proxy</value>
    <description>The proxy hostname. If empty, no proxy is used.</description>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>my.port</value>
    <description>The proxy port.</description>
  </property>
  <property>
    <name>http.proxy.username</name>
    <value>my.user</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.password</name>
    <value>my.pwd</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.realm</name>
    <value>my_realm</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.host</name>
    <value>my.local.pc</value>
    <description>The agent host.</description>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
    <description></description>
  </property>
</configuration>

Only one other question: where must I put the user authentication parameters (user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I use for authentication?

Thank you for your attention,

--
Graziano Aliberti
Engineering Ingegneria Informatica S.p.A
Via S. Martino della Battaglia, 56 - 00185 ROMA
Tel.: 06.49.201.387
E-Mail: graziano.alibe...@eng.it

The configuration looks okay to me. Yes, the proxy authentication details are set in 'conf/nutch-site.xml'. The file mentioned in the 'http.auth.file' property is used for configuring authentication details for authenticating to a web server. Unfortunately, there aren't any log statements in the part of the code that reads the proxy authentication details, so I can't suggest turning on debug logs to get clues about the issue. However, in case you want to troubleshoot it yourself by building Nutch from source, the code that deals with this is in src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java, around line 200: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup If I get time this weekend, I will try to insert some log statements into this code and send you a modified JAR file which might help you troubleshoot what is going on.
But I can't promise this since it depends on my weekend plans. Two questions before I end this mail. Did you set the value of the 'http.proxy.realm' property to:

Squid proxy-caching web server

? Also, do you see any 'auth.AuthChallengeProcessor' lines in the log file? I'm not sure whether this line should appear for proxy authentication, but it does appear for web server authentication.

Regards,
Susam Pal

I managed to find some time to insert more logs into protocol-httpclient and create a JAR. I have attached it to this email. Please replace your 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the one I have attached. Also, edit your 'conf/log4j.properties' file to add these two lines:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

When you run a crawl now, you should see more logs in 'logs/hadoop.log' than before. I hope it gives you some clues. In case you want to compare the logs with how control flows through the source code, I have attached the Java file as well.

Regards,
Susam Pal

Hi Susam, first of all I want to thank you for your support :).
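For reference, the file named by 'http.auth.file' (my_file.xml above) holds per-server credentials for the 'protocol-httpclient' plugin, separate from the proxy credentials in nutch-site.xml. A sketch of its shape follows; all host, realm, and user values are placeholders, and the exact element names should be checked against the conf/httpclient-auth.xml template shipped with your release:

```xml
<!-- my_file.xml: web-server (not proxy) credentials for protocol-httpclient.
     Every value below is a placeholder. -->
<auth-configuration>
  <credentials username="my.user" password="my.pwd">
    <!-- restrict these credentials to one server, port, and realm -->
    <authscope host="protected.server.example" port="80" realm="my_realm"/>
  </credentials>
</auth-configuration>
```

The proxy credentials themselves stay in nutch-site.xml (http.proxy.username / http.proxy.password), as Susam says above.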
Re: Content of redirected urls empty
Adam,

Could you please tell us what the http and https entries look like in the crawlDB (using readdb -url)?

J.

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 13 March 2010 04:29, BELLINI ADAM <mbel...@msn.com> wrote:

No one has an answer!?

From: mbel...@msn.com
To: nutch-user@lucene.apache.org; mille...@gmail.com
Subject: RE: Content of redirected urls empty
Date: Wed, 10 Mar 2010 21:01:54 +0000

I read lots of posts regarding redirected URLs but didn't find a solution!

From: mbel...@msn.com
To: nutch-user@lucene.apache.org; mille...@gmail.com
Subject: RE: Content of redirected urls empty
Date: Tue, 9 Mar 2010 16:59:05 +0000

Hi, I don't know if you found a few minutes to look at my problem :) but I want to explain it again, maybe it wasn't clear. I have HTTP pages redirected to HTTPS (but it's the same URL): HTTP://page1.com redirected to HTTPS://page1.com. The content of my HTTP page is empty; the content of my HTTPS page is not empty. In my segment I found both URLs (HTTP and HTTPS); the content of the HTTPS page is not empty, but in my index I found the HTTP one with the empty content. Is there a way to tell Nutch to index the URL with the non-empty content? Or why doesn't Nutch index the target URL rather than indexing the empty (origin) one?

Thx a lot

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:08:06 +0000

I'm sorry... I just checked twice... and in my index I have the original URL, which is the HTTP one with the empty content... but it doesn't index the HTTPS one, and I'm using the Solr index. Thx

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:01:34 +0000

Hi, I've just dumped my segments and found that I have both URLs: the original one (HTTP) with an empty content, and the REDIRECTED-TO or DESTINATION URL (HTTPS) with NON-EMPTY content! But in my search I find only the HTTPS URL with an empty content!! Logically the content of the HTTPS URL is not empty! It's just mixing the HTTPS URL with the content of the HTTP one. Our redirect is done by Java code, response.sendRedirect(…), so it seems to be an HTTP redirect, right? Thx for helping me :)

Date: Mon, 8 Mar 2010 15:51:34 +0100
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Content of redirected urls empty

On 2010-03-08 14:55, BELLINI ADAM wrote:

Is there any idea, guys??

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: Content of redirected urls empty
Date: Fri, 5 Mar 2010 22:01:05 +0000

Hi, the content of my redirected URLs is empty... but I still have the other metadata... I have an HTTP URL that is redirected to HTTPS. In my index I find the HTTP URL but with an empty content... could you explain it please?

There are two ways to redirect - one is with protocol, and the other is with content (either meta refresh, or javascript). When you dump the segment, is there really no content for the redirected url?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
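Andrzej's two redirect flavours can be told apart from the response itself. A small sketch follows; the 302 response is hard-coded for illustration (page1.example.com is a placeholder), whereas against a live server you would capture real headers with curl -sI <url>:

```shell
# A protocol redirect shows up in the headers: a 3xx status plus a
# Location header. A content redirect returns 200 and buries the jump
# in the body (meta refresh or javascript), so the headers look normal.
response='HTTP/1.1 302 Moved Temporarily
Location: https://page1.example.com/'

if printf '%s\n' "$response" | grep -qi '^location:'; then
  echo "protocol redirect"
else
  echo "content redirect: inspect the body for meta refresh / javascript"
fi
```

A servlet's response.sendRedirect(...) sends exactly this kind of 302 with a Location header, so Adam's case is indeed a protocol redirect.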
RE: Content of redirected urls empty
Hi,

Thx for your help. This is a fresh crawl from today:

1- HTTP: bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html

URL: http://myDNS/index.html
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Mon Mar 15 12:15:52 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.018119827
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html

2- HTTPS: bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html

URL: https://myDNS/index.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Mar 15 12:32:34 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.00511379
Signature: 5f84dcec905c24e3e2af902ad9ad7398
Metadata: _pst_: success(1), lastModified=0
_repr_: http://myDNS/index.html

And as I said the other day, in my segment the https has an empty content.

Thx

Date: Mon, 15 Mar 2010 11:39:46 +0000
Subject: Re: Content of redirected urls empty
From: lists.digitalpeb...@gmail.com
To: nutch-user@lucene.apache.org

Adam,

Could you please tell us what the http and https entries look like in the crawlDB (using readdb -url)?

J.

--
DigitalPebble Ltd
http://www.digitalpebble.com
Problem with ANT in building new Plugin for Nutch 1.0 ----- error in finding classes in packages
Hello everyone,

I'm trying to add a new plugin to Nutch/Solr to add new fields and finally search on them in the terminal interface. For that, I have read the HowTo WritingPluginExample 0.9 from the Apache wiki and I'm trying to follow it. I have a problem building the new plugin with ant: the problem seems to come from packages, as ant can't import classes from other packages. Can anyone please tell me what I must do to resolve this? Here is a copy of the ant build output:

[javac] /home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedParser.java:34: cannot find symbol
[javac] symbol  : class Parse
[javac] location: class org.apache.nutch.parse.recommended.RecommendedParser
[javac] public Parse filter(Content content, Parse parse,
[javac]        ^
[javac] /home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedParser.java:35: cannot find symbol
[javac] symbol  : class HTMLMetaTags
[javac] location: class org.apache.nutch.parse.recommended.RecommendedParser
[javac] HTMLMetaTags metaTags, DocumentFragment doc) {
[javac] ^
[javac] /home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedParser.java:34: cannot find symbol
[javac] symbol  : class Parse
[javac] location: class org.apache.nutch.parse.recommended.RecommendedParser
[javac] public Parse filter(Content content, Parse parse,
[javac] ^
[javac] /home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedQueryFilter.java:3: package org.apache.nutch.searcher does not exist
[javac] import org.apache.nutch.searcher.FieldQueryFilter;
[javac] ^
[javac] /home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedQueryFilter.java:12: cannot find symbol
[javac] symbol: class FieldQueryFilter
[javac] public class RecommendedQueryFilter extends FieldQueryFilter {
[javac] ^
[javac] Note: /home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedIndexer.java uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] 23 errors

BUILD FAILED
/home/Arnaud/nutch/src/plugin/build-plugin.xml:111: Compile failed; see the compiler error output for details.

Total time: 3 seconds

Note: I'm using Nutch 1.0, but the HowTo WritingPluginExample on the wiki is for version 0.9. I didn't think the problem could come from that, but maybe it does? Can any of the experts give me some help? Thanks
Re: Content of redirected urls empty
> and as i said the last day, on my segment the https has an empty content.

Hmm, that's not what you said in your previous message. Plus, I can see it has a signature in the crawlDB, so it must have a content. I expect that the content would be indexed under the http:// URL thanks to _repr_: http://myDNS/index.html. See BasicIndexingFilter for details.

> it's just mixing the HTTPS url with the content of the HTTP one.

It should be the other way round: the HTTPS content *with* the HTTP URL. Actually the http:// document is not sent to the index at all (see around line 86 in IndexerMapReduce), so what you are seeing in the index must be the https doc with _repr_ used as the URL.

Can you please confirm that:
1/ the segment has a content for the https:// doc
2/ you can find the http:// URL in the index and it has no content

HTH

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: Problem with ANT in building new Plugin for Nutch 1.0 ----- error in finding classes in packages
Hello,

Isn't there anyone who knows where this problem comes from? Please, some help?

2010/3/15 Arnaud Garcia <arnaud1...@gmail.com>:

> Hello everyone, I'm trying to add a new plugin to Nutch/Solr to add new fields and finally search on them in the terminal interface. [build-error report quoted above]
Re: Problem with ANT in building new Plugin for Nutch 1.0 ----- error in finding classes in packages
Hi,

Obviously you didn't include the necessary references to other JARs or source directories. This is configured in plugin.xml. Check the file and add all the necessary references there. Remember that each plugin is compiled separately and doesn't know about other plugins. Compare your ant files with the files from other plugins; maybe that will give you a hint.

Best Regards,
Alexander Aristov

On 15 March 2010 17:26, Arnaud Garcia <arnaud1...@gmail.com> wrote:

> Hello, isn't there anyone who knows where this problem comes from? Please, some help?
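One note on where the compile classpath comes from: in Nutch 1.0, each plugin under src/plugin/<name>/ normally has its own ant build.xml that imports the shared build-plugin.xml, and that import is what puts the Nutch core classes (Parse, HTMLMetaTags, and friends) on the javac classpath. A minimal sketch follows, using the 'recommended' plugin name from the WritingPluginExample; check it against the build.xml files of the plugins that ship with your release:

```xml
<?xml version="1.0"?>
<!-- src/plugin/recommended/build.xml : a minimal plugin build file.
     Importing the shared build-plugin.xml provides the compile
     classpath (the Nutch core classes plus the project's libraries). -->
<project name="recommended" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>
```

Runtime dependencies between plugins are declared separately in the plugin's plugin.xml (the <requires>/<import> elements), which may be what Alexander is referring to above.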
RE: Content of redirected urls empty
Oh sorry, I was mistaken again, and yes, you are completely right:

1- The HTTPS URL has content in my segment.
2- The HTTP one has an empty content.

In my index I have the HTTPS URL with the empty content (it's exactly what you said: it's just mixing the HTTPS URL with the content of the HTTP one), and I expected the other way round: the HTTPS content *with* the HTTP URL. I don't know if I have the HTTP URL in my index; I don't know how to see all the indexed URLs in Solr. But I'm sure that when I perform a search using RMS I obtain only the HTTPS URL with an empty content (I guess it's the empty content of the HTTP one). But again, in the segment the content of the HTTPS is not empty.

Date: Mon, 15 Mar 2010 13:44:33 +0000
Subject: Re: Content of redirected urls empty
From: lists.digitalpeb...@gmail.com
To: nutch-user@lucene.apache.org
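On "I don't know how to see all the indexed URLs in Solr": the standard select handler can list them. A sketch that only builds the query string follows; the localhost:8983 address and the 'url' field name are assumptions, so adjust them to your Solr install and the schema.xml you deployed from Nutch:

```shell
# Match every document (q=*:*) but return only the url field,
# 100 rows at a time; fetch the result with: curl "$query"
solr='http://localhost:8983/solr'
query="${solr}/select?q=*:*&fl=url&rows=100&start=0"
echo "$query"
```

Paging through start=0, 100, 200, ... enumerates the whole index, which makes it easy to check whether the http:// or the https:// URL is the one actually indexed.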
RE: Content of redirected urls empty
hi again, i forgot to ask what does mean _repr_ ? From: mbel...@msn.com To: nutch-user@lucene.apache.org Subject: RE: Content of redirected urls empty Date: Mon, 15 Mar 2010 15:29:48 + Oh sorry i mistook again, and yes you are complitely right 1- The HTTPS has a content in my segment. 2- the HTTP has an empty content. in my index i have the HTTPS url with the empty content (...it's exactely what you said : it's just mixing the HTTPS url with the content of the HTTP one,) and i expected the other way round : the HTTPS content *with* the HTTP URL. i dont know if i have the HTTP url in my index, i dont know how to see all the indexed URLS in SOLR. but i'm sure that when a perform a search using RMS i obtain only the HTTPS url with an empty content (i guess it's the empty content of the HTTP one). but again in the segment the content of the https is not empty. Date: Mon, 15 Mar 2010 13:44:33 + Subject: Re: Content of redirected urls empty From: lists.digitalpeb...@gmail.com To: nutch-user@lucene.apache.org and as i said the last day, on my segment the https has an empty content. hmm it's not what you said in your previous message + I can see it has a signature in the crawlDB so it must have a content. I expect that the content would be indexed under the http:// URL thanks to *_repr_: **http://myDNS/index.html* See BasicIndexingFilter for details. it's just mixing the HTTPS url with the content of the HTTP one. it should be the other way round : the HTTPS content *with* the HTTP URL. Actually the http:// document is not sent to the index at all (see around line 86 in IndexerMapReduce 86) so what you are seeing in the index must be the https doc with _repr_ used as a URL. 
Can you please confirm that: 1/ the segment has content for the https:// doc, and 2/ you can find the http:// URL in the index and it has no content? HTH Julien -- DigitalPebble Ltd http://www.digitalpebble.com

On 15 March 2010 13:00, BELLINI ADAM mbel...@msn.com wrote:

Hi, thanks for your help. This is a fresh crawl from today.

1- HTTP: bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
URL: http://myDNS/index.html
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Mon Mar 15 12:15:52 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.018119827
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html

2- HTTPS: bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
URL: https://myDNS/index.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Mar 15 12:32:34 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.00511379
Signature: 5f84dcec905c24e3e2af902ad9ad7398
Metadata: _pst_: success(1), lastModified=0
_repr_: http://myDNS/index.html

And as I said the other day, in my segment the HTTPS has an empty content. Thanks.

Date: Mon, 15 Mar 2010 11:39:46 +0000 Subject: Re: Content of redirected urls empty From: lists.digitalpeb...@gmail.com To: nutch-user@lucene.apache.org

Adam, could you please tell us what the http and https entries look like in the crawlDB (using readdb -url)? J. -- DigitalPebble Ltd http://www.digitalpebble.com

On 13 March 2010 04:29, BELLINI ADAM mbel...@msn.com wrote: No one has an answer!?

From: mbel...@msn.com To: nutch-user@lucene.apache.org; mille...@gmail.com Subject: RE: Content of redirected urls empty Date: Wed, 10 Mar 2010 21:01:54 +0000

I read lots of posts about redirected URLs but didn't find a solution!
Re: Content of redirected urls empty
my index i have the HTTPS url with the empty content (...it's exactly what you said: it's just mixing the HTTPS URL with the content of the HTTP one), and i expected the other way round: the HTTPS content *with* the HTTP URL.

Strange.

i dont know if i have the HTTP url in my index, i dont know how to see all the indexed URLS in SOLR.

Well, you could query on the hostname or the whole URL, I suppose. You could also index with Lucene and use Luke to debug the content of the index.

but i'm sure that when i perform a search using RMS i obtain only the HTTPS url with an empty content (i guess it's the empty content of the HTTP one). but again in the segment the content of the https is not empty.

_repr_ : representative - see the ReprUrlFixer class.
problem crawling entire internal website
Hi, I'm a new Nutch user. My company wants me to look into using this technology to index our internal wiki website as well as SharePoint docs (using Tika). Right now, I just want Nutch to index the entire wiki site, but I'm having problems. I've read about other people's problems with this, but I haven't found a solution that worked for me. I have Nutch 1.0 installed. The wiki site is MoinMoin, if that helps. The pages don't have extensions like .html; they're in the form http://wiki:8000/Engineering, for example, so all pages have only 1-level-deep paths. I'm running Nutch with the following command:

bin/nutch crawl urls -dir crawl -depth 100 -topN 100 crawl.log

I have a urls folder with a file called wiki that points to the top-level page of the site. I set crawl-urlfilter.txt to accept everything except the default exclusions:

-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

And I set the db.ignore.external.links property in nutch-default.xml to true so it doesn't go outside of the site (db.ignore.internal.links is set to false). After the crawl command completes, the search returns some pages, but there are still some pages that are maybe 2 or 3 levels from the starting page that don't show up in search. Any help would be appreciated. Thanks, Kane -- View this message in context: http://old.nabble.com/problem-crawling-entire-internal-website-tp27908943p27908943.html Sent from the Nutch - User mailing list archive at Nabble.com.
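[Editor's note: one common way to keep a crawl like the one above on a single internal host is to whitelist that host explicitly in crawl-urlfilter.txt instead of ending with a catch-all `+.`. A sketch, assuming the wiki host and port from the message above:

```
# Skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# Skip URLs containing characters used in session IDs and queries (Nutch default)
-[?*!@=]
# Accept anything on the wiki host only
+^http://wiki:8000/
# Reject everything else
-.
```

With this, the final `-.` rule rejects any URL that did not match an earlier pattern, so the crawl cannot leave the listed host regardless of the db.ignore.external.links setting.]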
Re: Proxy Authentication
On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti graziano.alibe...@eng.it wrote:

On 13/03/2010 22.55, Susam Pal wrote:

On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal susam@gmail.com wrote:

On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti graziano.alibe...@eng.it wrote:

On 11/03/2010 16.20, Susam Pal wrote:

On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti graziano.alibe...@eng.it wrote:

Hi everyone, I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I try to fetch my website list, I see in the log file that authentication failed. I've configured my nutch-site.xml file with all the properties needed for proxy authentication, but my error is:

httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port

Did you replace 'protocol-http' with 'protocol-httpclient' in the value of the 'plugin.includes' property in 'conf/nutch-site.xml'? Regards, Susam Pal

Hi Susam, yes of course!! :) Here is the configuration file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my.agent.name</value>
    <description></description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description></description>
  </property>
  <property>
    <name>http.auth.file</name>
    <value>my_file.xml</value>
    <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>ip.my.proxy</value>
    <description>The proxy hostname. If empty, no proxy is used.</description>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>my.port</value>
    <description>The proxy port.</description>
  </property>
  <property>
    <name>http.proxy.username</name>
    <value>my.user</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.password</name>
    <value>my.pwd</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.realm</name>
    <value>my_realm</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.host</name>
    <value>my.local.pc</value>
    <description>The agent host.</description>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
    <description></description>
  </property>
</configuration>

One other question: where must I put the user authentication parameters (user, pwd)? In the nutch-site.xml file or in the my_file.xml that I use for authentication? Thank you for your attention, -- --- Graziano Aliberti Engineering Ingegneria Informatica S.p.A Via S. Martino della Battaglia, 56 - 00185 ROMA *Tel.:* 06.49.201.387 *E-Mail:* graziano.alibe...@eng.it

The configuration looks okay to me. Yes, the proxy authentication details are set in 'conf/nutch-site.xml'. The file mentioned in the 'http.auth.file' property is used for configuring the details for authenticating to a web server. Unfortunately, there aren't any log statements in the part of the code that reads the proxy authentication details, so I can't suggest turning on debug logs to get clues about the issue. However, in case you want to troubleshoot it yourself by building Nutch from source, the relevant code is in src/java/org/apache/nutch/protocol/httpclient/Http.java, around line 200: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup If I get time this weekend, I will try to insert some log statements into this code and send you a modified JAR file which might help you troubleshoot what is going on.
But I can't promise this since it depends on my weekend plans. Two questions before I end this mail. Did you set the value of the 'http.proxy.realm' property to: Squid proxy-caching web server ? Also, do you see any 'auth.AuthChallengeProcessor' lines in the log file? I'm not sure whether this line should appear for proxy authentication, but it does appear for web server authentication. Regards, Susam Pal

I managed to find some time to insert more logs into protocol-httpclient and create a JAR. I have attached it to this email. Please replace your 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the one I have attached. Also, edit your 'conf/log4j.properties' file to add these two lines:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

When you run a crawl now, you should see more logs in 'logs/hadoop.log' than before. I hope it helps you in providing some clues. In case you want to compare the logs with how the
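[Editor's note: for reference, the web-server credentials referenced by 'http.auth.file' live in a separate XML file, not in nutch-site.xml. A minimal sketch of such a file follows; the host, port, realm, and credentials are placeholders, and the exact element names should be checked against the protocol-httpclient documentation for your Nutch version:

```xml
<auth-configuration>
  <!-- Credentials tried for the matching scope(s); all values below are placeholders. -->
  <credentials username="my.user" password="my.pwd">
    <authscope host="www.example.com" port="80" realm="my_realm"/>
  </credentials>
</auth-configuration>
```

Proxy credentials, by contrast, stay in 'conf/nutch-site.xml' as the http.proxy.* properties shown in this thread.]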
RE: Content of redirected urls empty
Hi, finally I learned how to display only the indexed URLs in the Solr index. The URL is:

http://localhost:8080/solr/select/?q=*:*&fl=url,content

q=*:* selects all entries in the index; fl=url,content displays only the URLs and their content. Now I'm 100% sure that I don't have the source HTTP URLs in my index; I have only the target ones (HTTPS), with an empty content. I don't know if someone could explain why Nutch misses the content of redirected URLs when indexing!

Date: Mon, 15 Mar 2010 16:28:03 +0000 Subject: Re: Content of redirected urls empty From: lists.digitalpeb...@gmail.com To: nutch-user@lucene.apache.org
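[Editor's note: the ampersand in that select URL is easy to lose when pasting into mail. A small sketch (Python; the localhost:8080 host and the q/fl parameters are taken from the message above) that builds the same query with the parameters properly encoded:

```python
from urllib.parse import urlencode

# Rebuild the Solr "list every indexed document" query from the message above.
# urlencode percent-escapes the '*' and ':' characters and joins with '&'.
base = "http://localhost:8080/solr/select/"
params = urlencode({"q": "*:*", "fl": "url,content"})
query_url = base + "?" + params
print(query_url)
# -> http://localhost:8080/solr/select/?q=%2A%3A%2A&fl=url%2Ccontent
```

Solr decodes the escaped form back to q=*:*&fl=url,content, so either spelling returns the same result.]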
Re: Proxy Authentication
On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal susam@gmail.com wrote: