Re: Http 407 error
This information is not enough to understand the problem. The log you have sent seems to be the messages that appear on the console, whereas I had asked for the 'logs/hadoop.log' file. The entries in this file are usually in this format:-

2008-01-03 00:00:16,652 INFO fetcher.Fetcher - fetching http://www.example.com/
2008-01-03 00:00:17,029 INFO fetcher.Fetcher - fetching http://www.example.net/

Please send the following information:-

1. The Nutch version you are using. (NUTCH-559v0.5 was generated against the trunk. If you are using Nutch-0.9, the patch might not apply smoothly. You might have to manually check whether the patch was applied correctly.)

2. It would be better if you also send the output of your patch command.

3. The relevant logs from 'logs/hadoop.log' with DEBUG enabled. Before sending, please make sure that the log file has the DEBUG lines.

4. The output of a sample HTTP query to your proxy server with netcat or telnet. For example:-

$ nc -v 192.168.101.1 80
intproxy [192.168.101.1] 80 (www) open
GET http://www.google.com/ HTTP/1.0
Host: www.google.com

HTTP/1.1 407 Proxy Authentication Required ( The Server requires authorization to fulfill the request. Access to the Web Proxy filter is denied. )
Via: 1.1 INTPROXY
Proxy-Authenticate: Negotiate
Proxy-Authenticate: Kerberos
Proxy-Authenticate: NTLM
Proxy-Authenticate: Basic realm=INTPROXY
Connection: Keep-Alive
Proxy-Connection: Keep-Alive
Pragma: no-cache
Cache-Control: no-cache
Content-Type: text/html
Content-Length: 4119

Only the response headers, as shown above, are enough. There is no need to send the complete response.

5. The value of the 'http.proxy.realm' property you have used in your 'conf/nutch-site.xml'. (I assume you have provided the correct host, port, username and password in the other http.proxy.* properties. Ideally, you should also set the http.agent.host property properly, though I have never found this to cause a problem.)
Regards,
Susam Pal

On Jan 3, 2008 12:47 PM, Nidhi malik [EMAIL PROTECTED] wrote:

I am sending my Hadoop file, and I also applied patch559V0.5. At the time of fetching I am getting these messages:

Fetcher: starting
Fetcher: segment: crawl/segments/20080103125023
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 10
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com; [EMAIL PROTECTED])
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url= http://www.w3schools.com/
Fetcher: done
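The header check requested in item 4 of the reply above can also be read off programmatically. The following is only a minimal sketch in Python, not part of Nutch: the function name `proxy_auth_schemes` is hypothetical, and the sample text is a shortened copy of the netcat response quoted in the reply (the long parenthesised status text is omitted). It lists the authentication schemes the proxy offers and the realm, which is the value to compare against 'http.proxy.realm'.

```python
def proxy_auth_schemes(raw_response):
    """Return (status_line, [(scheme, realm_or_None), ...]) from raw response text.

    Hypothetical helper for illustration; it only reads the header section.
    """
    # The header section ends at the first blank line; any body is ignored.
    head = raw_response.replace("\r\n", "\n").split("\n\n", 1)[0]
    lines = head.split("\n")
    status_line = lines[0].strip()
    schemes = []
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "proxy-authenticate":
            continue
        parts = value.strip().split(None, 1)
        if not parts:
            continue
        scheme, realm = parts[0], None
        if len(parts) > 1:
            # Look for a realm=... parameter after the scheme name.
            for param in parts[1].split(","):
                key, _, val = param.strip().partition("=")
                if key.lower() == "realm":
                    realm = val.strip('"')
        schemes.append((scheme, realm))
    return status_line, schemes


# Shortened copy of the sample proxy response from the reply above.
sample = (
    "HTTP/1.1 407 Proxy Authentication Required\r\n"
    "Via: 1.1 INTPROXY\r\n"
    "Proxy-Authenticate: Negotiate\r\n"
    "Proxy-Authenticate: Kerberos\r\n"
    "Proxy-Authenticate: NTLM\r\n"
    "Proxy-Authenticate: Basic realm=INTPROXY\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
)

status, schemes = proxy_auth_schemes(sample)
print(status)
for scheme, realm in schemes:
    print(scheme, realm)
```

For the sample above, only the Basic scheme carries a realm (INTPROXY); the other schemes are listed without one.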
Http 407 error
I am sending my Hadoop file, and I also applied patch559V0.5. At the time of fetching I am getting these messages:

Fetcher: starting
Fetcher: segment: crawl/segments/20080103125023
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 10
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com; [EMAIL PROTECTED])
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url= http://www.w3schools.com/
Fetcher: done

2008-01-03 12:50:04,275 INFO crawl.Injector - Injector: starting
2008-01-03 12:50:04,347 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2008-01-03 12:50:04,347 INFO crawl.Injector - Injector: urlDir: urls
2008-01-03 12:50:04,895 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-01-03 12:50:11,140 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Registered Plugins:
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Log4j (lib-log4j)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - SWF Parse Plug-in (parse-swf)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Registered Extension-Points:
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Indexing Filter
hadoop file and nutch-407 error
At the time of fetching I am getting the message below, and I have attached the hadoop.log file:

Fetcher: starting
Fetcher: segment: crawl/segments/20080104002039
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 10
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com; [EMAIL PROTECTED])
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url= http://www.w3schools.com/
Fetcher: done

2008-01-03 12:50:04,275 INFO crawl.Injector - Injector: starting
2008-01-03 12:50:04,347 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2008-01-03 12:50:04,347 INFO crawl.Injector - Injector: urlDir: urls
2008-01-03 12:50:04,895 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-01-03 12:50:11,140 INFO plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Registered Plugins:
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-01-03 12:50:12,171 INFO plugin.PluginRepository - XML Libraries (lib-xml)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - RSS Parse Plug-in (parse-rss)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - OpenOffice/OpenDocument Parse Plug-in (parse-oo)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - Log4j (lib-log4j)
2008-01-03 12:50:12,172 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - SWF Parse Plug-in (parse-swf)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Registered Extension-Points:
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-01-03 12:50:12,173 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-01-03 12:50:12,174 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin
Re: hadoop file and nutch-407 error
Hi,

I have replied to this once, and since you have provided no additional information, my reply is going to remain almost the same.

Please send the following information:-

1. The Nutch version you are using. (NUTCH-559v0.5 was generated against the trunk. If you are using Nutch-0.9, the patch might not apply smoothly. You might have to manually check whether the patch was applied correctly.)

2. How did the ant build go? Were there any errors in the build, or did the build complete with the following message:-

   BUILD SUCCESSFUL

3. It would be better if you also send the output of your patch command.

4. The relevant logs from 'logs/hadoop.log' with DEBUG enabled for protocol-httpclient. To enable DEBUG for protocol-httpclient, please do the following:-

   1. Open 'conf/log4j.properties'.

   2. Add the following line and save the file:-

      log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout

   3. Delete 'logs/hadoop.log', run a new crawl and send the 'logs/hadoop.log' file.

   Before sending, please make sure that the log file has the DEBUG lines. They look like this:-

   2008-01-02 21:55:30,177 DEBUG httpclient.Http - url: https://mail.yahoo.com/robots.txt; status code: 404; bytes received: 2337
   2008-01-02 21:55:32,900 DEBUG httpclient.Http - url: https://mail.yahoo.com/; status code: 200; bytes received: 26291

If the DEBUG lines are missing, it means that you have either not enabled DEBUG properly or not successfully patched and built Nutch.
Regards,
Susam Pal

On Jan 4, 2008 12:08 AM, Nidhi malik [EMAIL PROTECTED] wrote:

At the time of fetching I am getting the message below, and I have attached the hadoop.log file:

Fetcher: starting
Fetcher: segment: crawl/segments/20080104002039
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 10
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com; [EMAIL PROTECTED])
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url=http://www.w3schools.com/
Fetcher: done
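Before sending the log, the check for DEBUG lines described in the reply above can be automated. This is only a rough sketch in Python, not a Nutch tool: the function name `count_httpclient_debug` is hypothetical, and the pattern assumes the timestamp and logger layout shown in the sample lines of this thread. The sample list mixes one INFO line from the attached log with the two DEBUG lines quoted in the reply.

```python
import re

# Matches lines such as:
# 2008-01-02 21:55:30,177 DEBUG httpclient.Http - url: ...
# The layout is assumed from the sample lines quoted in this thread.
DEBUG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} DEBUG httpclient\."
)


def count_httpclient_debug(lines):
    """Count log lines that look like DEBUG output from protocol-httpclient."""
    return sum(1 for line in lines if DEBUG_LINE.match(line))


sample_log = [
    "2008-01-03 12:50:04,275 INFO crawl.Injector - Injector: starting",
    "2008-01-02 21:55:30,177 DEBUG httpclient.Http - url: "
    "https://mail.yahoo.com/robots.txt; status code: 404; bytes received: 2337",
    "2008-01-02 21:55:32,900 DEBUG httpclient.Http - url: "
    "https://mail.yahoo.com/; status code: 200; bytes received: 26291",
]

debug_count = count_httpclient_debug(sample_log)
print(debug_count)

# For a real file, an open file object can be passed instead of a list:
# with open("logs/hadoop.log") as f:
#     print(count_httpclient_debug(f))
```

A count of zero on a fresh crawl would suggest that either DEBUG is not enabled in 'conf/log4j.properties' or the patched build is not the one being run.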
Prefix Query in Nutch and Wildcard support.
Hello Friends,

Is there any way to do a prefix query in Nutch? E.g., query the content field for occurrences of abc*? I could do it in Lucene, but I want to do it in Nutch. Going through the mailing list, it appears that Nutch does not support such queries. Is that true?

Thanks!