Re: Proxy Authentication
On 11/03/2010 16.20, Susam Pal wrote:
> On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti graziano.alibe...@eng.it wrote:
>> Hi everyone,
>> I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I try to fetch my website list, the log file shows that authentication failed. I've configured my nutch-site.xml file with all the properties needed for proxy authentication, but my error is:
>>
>> httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port
>
> Did you replace 'protocol-http' with 'protocol-httpclient' in the value of the 'plugin.includes' property in 'conf/nutch-site.xml'?
>
> Regards,
> Susam Pal

Hi Susam,

yes of course!! :) Maybe I can post you the configuration file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my.agent.name</value>
    <description></description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description></description>
  </property>
  <property>
    <name>http.auth.file</name>
    <value>my_file.xml</value>
    <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>ip.my.proxy</value>
    <description>The proxy hostname. If empty, no proxy is used.</description>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>my.port</value>
    <description>The proxy port.</description>
  </property>
  <property>
    <name>http.proxy.username</name>
    <value>my.user</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.password</name>
    <value>my.pwd</value>
    <description></description>
  </property>
  <property>
    <name>http.proxy.realm</name>
    <value>my_realm</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.host</name>
    <value>my.local.pc</value>
    <description>The agent host.</description>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
    <description></description>
  </property>
</configuration>

Just one more question: where should I put the user authentication parameters (user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I use for authentication?

Thank you for your attention,

--
Graziano Aliberti
Engineering Ingegneria Informatica S.p.A
Via S. Martino della Battaglia, 56 - 00185 ROMA
Tel.: 06.49.201.387
E-Mail: graziano.alibe...@eng.it
Re: Proxy Authentication
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti graziano.alibe...@eng.it wrote:
> Hi Susam,
> yes of course!! :) Maybe I can post you the configuration file:
> [full nutch-site.xml configuration snipped; see the previous message]
> Just one more question: where should I put the user authentication parameters (user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I use for authentication?

The configuration looks okay to me. Yes, the proxy authentication details are set in 'conf/nutch-site.xml'. The file named in the 'http.auth.file' property is used for configuring the details needed to authenticate to a web server, not to the proxy.

Unfortunately, there aren't any log statements in the part of the code that reads the proxy authentication details, so I can't suggest turning on debug logs to get clues about the issue. However, in case you want to troubleshoot it yourself by building Nutch from source, the code that deals with this is in src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java, around line 200:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
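For reference, a minimal sketch of what such a web-server authentication file can look like, assuming the documented 'protocol-httpclient' auth-file format (the hosts, ports, realm, and credentials below are placeholders; the conf/httpclient-auth.xml template bundled with Nutch shows the authoritative format for a given release):

<auth-configuration>
  <credentials username="web.user" password="web.pwd">
    <authscope host="www.example.com" port="80"/>
    <authscope host="members.example.com" port="443" realm="members-only"/>
  </credentials>
</auth-configuration>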
If I get time this weekend, I will try to insert some log statements into this code and send you a modified JAR file, which might help you troubleshoot what is going on. But I can't promise this, since it depends on my weekend plans.

Two questions before I end this mail. Did you set the value of the 'http.proxy.realm' property to:

Squid proxy-caching web server

? Also, do you see any 'auth.AuthChallengeProcessor' lines in the log file? I'm not sure whether this line should appear for proxy authentication, but it does appear for web server authentication.

Regards,
Susam Pal
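For anyone following along in the source, here is a minimal self-contained sketch of how proxy credentials are registered with the Commons HttpClient 3.x API that 'protocol-httpclient' builds on. This is an illustrative reconstruction, not the verbatim Http.java code: the class name and placeholder values are assumptions, and the println stands in for the configuration reads and the kind of log statement one would add around line 200.

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;

public class ProxyCredentialsSketch {
  public static void main(String[] args) {
    // Placeholder values standing in for the http.proxy.* properties
    // read from nutch-site.xml.
    String proxyHost = "proxy.my.host";
    int proxyPort = 3128;
    String proxyUsername = "my.user";
    String proxyPassword = "my.pwd";
    String proxyRealm = "Squid proxy-caching web server";

    HttpClient client = new HttpClient();
    if (proxyUsername.length() > 0) {
      // Credentials are scoped to a host/port/realm triple; a mismatched
      // realm is one way to end up with "No credentials available for BASIC".
      AuthScope proxyScope = new AuthScope(proxyHost, proxyPort, proxyRealm);
      client.getState().setProxyCredentials(proxyScope,
          new UsernamePasswordCredentials(proxyUsername, proxyPassword));
      // The kind of statement worth adding for troubleshooting:
      System.out.println("Proxy credentials set for scope: " + proxyScope);
    }
  }
}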
Avoid indexing common html to all pages, promoting page titles.
Hi,

I'm developing a site that shows its dynamic content in a <div id="content">; the rest of the page doesn't really change. I'd like to store and index only the contents of this div, basically to avoid re-indexing the same content (header, footer, menu) over and over. I've checked the WritingPluginExample-0.9 howto, but I couldn't figure out a couple of things:

1. Should I extend the parse-html plugin, or should I just replace it?
2. The example talks about finding a meta tag, extracting some information from it, and adding a field to the index. I think I just need to get rid of all HTML except the <div id="content"> tag, and index its content. Can someone point me in the right direction?

And just one more thing: I'd like to give a higher score to pages where the search terms appear in the title. Right now, pages that contain the terms in the body rank higher than those that contain the search terms in the title. How can I modify this behaviour?

Thanks for your help,

Pedro.
Can nutch index file-exchanger such as depositfiles.com
Is it possible to do this with Nutch?
Re: Avoid indexing common html to all pages, promoting page titles.
On 2010-03-12 12:52, Pedro Bezunartea López wrote:
> Hi, I'm developing a site that shows its dynamic content in a <div id="content">; the rest of the page doesn't really change. I'd like to store and index only the contents of this div [...]
> 1. Should I extend the parse-html plugin, or should I just replace it?

You should write an HtmlParseFilter, extract only the portions that you care about, and then replace the output ParseText with your extracted text.

> 2. The example talks about finding a meta tag, extracting some information from it, and adding a field to the index. I think I just need to get rid of all HTML except the <div id="content"> tag, and index its content. Can someone point me in the right direction?

See above.

> And just one more thing: I'd like to give a higher score to pages where the search terms appear in the title. [...] How can I modify this behaviour?

You can define these weights in the configuration; look for the query boost properties.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
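To make the HtmlParseFilter suggestion concrete, here is a minimal sketch against the Nutch 1.0 interface. The class and package names are hypothetical, the filter() signature differs slightly in 0.9 (where it takes and returns a single Parse rather than a ParseResult), and the plugin still needs a plugin.xml descriptor registering it at the org.apache.nutch.parse.HtmlParseFilter extension point plus an entry in plugin.includes:

package org.example.nutch;  // hypothetical plugin package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Keeps only the text under <div id="content"> as the page's parse text. */
public class ContentDivFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    Node div = findElementById(doc, "content");
    if (div != null) {
      StringBuilder text = new StringBuilder();
      collectText(div, text);
      String url = content.getUrl();
      Parse parse = parseResult.get(url);
      if (parse != null) {
        // Replace the parse text, keeping outlinks and metadata intact.
        parseResult.put(url, new ParseText(text.toString().trim()),
                        parse.getData());
      }
    }
    return parseResult;
  }

  /** Depth-first search for the first element with the given id attribute. */
  private Node findElementById(Node node, String id) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && id.equals(((Element) node).getAttribute("id"))) {
      return node;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node found = findElementById(children.item(i), id);
      if (found != null) return found;
    }
    return null;
  }

  /** Concatenates all text nodes under the given node. */
  private void collectText(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), sb);
    }
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

For the title-boost question, the query boost properties live in nutch-default.xml (for example query.title.boost under the query-basic plugin); overriding them in nutch-site.xml changes the relative weight of title matches against body matches.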
Re: Abt: Detect slow and timeout servers and drop their URLs
Thanks, Julien! Deployed it yesterday and it worked like a charm. I was still using the official 1.0 release, so obviously I've been missing out on quite a few nice improvements ;-)

-yves

Julien Nioche wrote:
> Hello Yves,
> Did you see https://issues.apache.org/jira/browse/NUTCH-770? It was committed to the trunk back in December.
> HTH,
> Julien
setting search dir for nutch web app
Just sharing my experience with setting the search directory for the Nutch webapp. This is a leading cause of the disappointing "Hits 0-0 (out of about 0 total matching pages)" message. I had a situation like Noah Silverman's:

On Thu, 2009-12-17 at 16:32 -0800, Noah Silverman wrote:
> Hello, just to summarize:
> 1) The Nutch crawl completes without error.
> 2) I can search from the command line and see results (I assume this means the index is created): bin/nutch org.apache.nutch.searcher.NutchBean foobar
> 3) Tomcat is configured through the nutch-site file to point to the nutch/crawl directory.
> 4) The catalina.out logfile indicates that Tomcat is opening nutch/crawl: 2009-12-16 22:00:39,740 INFO SearchBean - opening indexes in /home/noah/Documents/nutch/crawl/indexes
> 5) No results when searching in the web front end.
> 6) No errors in the logs.
> Is there some way to debug this? Perhaps more verbose logging? Thanks!!! -N

The log message in (4) is only somewhat helpful, since if anything goes wrong, nothing will be said. Noah's problem was that he needed to point to the top-level directory. My case was that I needed to set the permissions correctly: I had crawled as root, so the crawl directory was owned root:root with permissions 544 (at least readable). I moved it to $TOMCAT/work and gave it ownership $TOMCAT_USER:$TOMCAT_GROUP with permissions 755. Now it works.

In any case, the Nutch web app will simply log at INFO that it's opening indexes at $DIR. If the permissions are wrong, or the directory doesn't exist, it will say nothing, not even at DEBUG logging. No exceptions will be thrown.
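For reference, the property the webapp reads is searcher.dir; a sketch of the override in the webapp's nutch-site.xml (the path is an example, and it must name the top-level crawl directory, readable by the Tomcat user):

<property>
  <name>searcher.dir</name>
  <value>/opt/tomcat/work/crawl</value>
  <description>Top-level crawl directory containing the crawldb,
  linkdb, segments, and indexes.</description>
</property>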
Recrawl and crawl-urlfilter.txt
I'm having multiple problems recrawling with Nutch 0.9, so here are two questions. :-)

Right now, using the script I found here ( http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html ), I think I'm close to a workable solution, but the recrawl doesn't respect crawl-urlfilter.txt. Is there a way to specify this configuration for the recrawl?

Our final implementation will be a single-site crawl with close-to-realtime search results (ideally, we'll crawl about every 30 minutes to 1 hour). In that regard, is there any way to have Nutch respect cache-validation response codes (304 Not Modified) instead of the fetch interval set in the configuration file?

Thanks!

-Josh Pavel
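One hedged note on the first question: in Nutch 0.9 the one-shot crawl command loads conf/crawl-tool.xml, which points urlfilter.regex.file at crawl-urlfilter.txt, while the step-by-step tools used by recrawl scripts read regex-urlfilter.txt by default. A sketch of the kind of override a recrawl setup could add to nutch-site.xml (the property name is real; pointing it at crawl-urlfilter.txt is just one option, copying the rules into regex-urlfilter.txt is another):

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
  <description>Use the same URL filter rules for step-by-step
  recrawls that the one-shot crawl command uses.</description>
</property>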
Nutch Fetch Stuck
Hi,

We did a fetch and the maps are 100% done, but the reducers have crashed. Since it was a large fetch, is there a way to restart the reducers without restarting the whole fetch?

-Abhi
Re: Nutch Fetch Stuck
On 2010-03-12 23:39, Abhi Yerra wrote:
> Hi, we did a fetch and the maps are 100% done, but the reducers have crashed. Since it was a large fetch, is there a way to restart the reducers without restarting the whole fetch?

Unfortunately, no. Was the fetcher in parsing mode? If so, I strongly recommend that you first fetch, and then run the parsing as a separate step.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
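For reference, a sketch of that two-step sequence with the Nutch command-line tools (the segment path is an example):

# Fetch without parsing, then parse the same segment as a separate step.
bin/nutch fetch crawl/segments/20100312150500 -noParsing
bin/nutch parse crawl/segments/20100312150500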
Re: Nutch Fetch Stuck
So I had -noParsing set; parsing was not part of the fetch. The pages have been crawled, but the reducers have crashed. So if I restart the fetch, will it try to crawl all those pages again?

-Abhi

----- Original Message -----
From: Andrzej Bialecki a...@getopt.org
To: nutch-user@lucene.apache.org
Sent: Friday, March 12, 2010 3:05:00 PM
Subject: Re: Nutch Fetch Stuck

> [quoted reply snipped; see the previous message]
RE: Content of redirected urls empty
No one has an answer!?

From: mbel...@msn.com
To: nutch-user@lucene.apache.org; mille...@gmail.com
Subject: RE: Content of redirected urls empty
Date: Wed, 10 Mar 2010 21:01:54 +0000

I read lots of posts regarding redirected URLs but didn't find a solution!

From: mbel...@msn.com
To: nutch-user@lucene.apache.org; mille...@gmail.com
Subject: RE: Content of redirected urls empty
Date: Tue, 9 Mar 2010 16:59:05 +0000

Hi, I don't know if you found a few minutes to look at my problem :) but I want to explain it again, maybe it wasn't clear. I have HTTP pages redirected to HTTPS (but it's the same URL): HTTP://page1.com is redirected to HTTPS://page1.com. The content of my HTTP page is empty; the content of my HTTPS page is not empty. In my segment I found both URLs (HTTP and HTTPS), and the content of the HTTPS page is not empty, but in my index I found the HTTP one with the empty content. Is there a way to tell Nutch to index the URL with the non-empty content? Or why doesn't Nutch index the target URL rather than the empty (origin) one?? Thanks a lot.

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:08:06 +0000

I'm sorry... I just checked twice, and in my index I have the original URL, which is the HTTP one with the empty content... but it doesn't index the HTTPS one, and I'm using a Solr index. Thanks.

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:01:34 +0000

Hi, I've just dumped my segments and found that I have both URLs: the original one (HTTP) with an empty content, and the "redirected to" (destination) URL (HTTPS) with non-empty content! But in my search I found only the HTTPS URL with an empty content!! Logically the content of the HTTPS URL is not empty; it's just mixing the HTTPS URL with the content of the HTTP one. Our redirect is done by Java code, response.sendRedirect(…), so it seems to be an HTTP redirect, right?? Thanks for helping me :)

Date: Mon, 8 Mar 2010 15:51:34 +0100
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Content of redirected urls empty

On 2010-03-08 14:55, BELLINI ADAM wrote:
> Is there any idea, guys??
>
> From: mbel...@msn.com
> Subject: Content of redirected urls empty
> Date: Fri, 5 Mar 2010 22:01:05 +0000
>
> Hi, the content of my redirected URLs is empty... but they still have the other metadata... I have an HTTP URL that is redirected to HTTPS. In my index I find the HTTP URL, but with an empty content... could you explain it, please?

There are two ways to redirect: one is with the protocol, and the other is with content (either a meta refresh, or JavaScript). When you dump the segment, is there really no content for the redirected URL?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
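For reference, a segment dump of the kind Andrzej asks about can be produced with Nutch 1.0's readseg command (the segment path and output directory are examples):

# Dump a segment to a readable text file under dump_dir/.
bin/nutch readseg -dump crawl/segments/20100305120000 dump_dir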