Re: Http 407 error

2008-01-03 Thread Susam Pal
This information is not enough to understand the problem. The log you
have sent seems to be the messages that appear on the console, whereas
I had asked for the 'logs/hadoop.log' file.

The log in this file is usually in this format:-

2008-01-03 00:00:16,652 INFO  fetcher.Fetcher - fetching http://www.example.com/
2008-01-03 00:00:17,029 INFO  fetcher.Fetcher - fetching http://www.example.net/
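
To check the file before sending it, a minimal Python sketch like the
one below may help. It splits lines of this format into timestamp,
level, logger and message, and counts the lines per level. The regular
expression and the function names are assumptions inferred from the two
sample lines above; they are not part of Nutch.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample lines above:
# "YYYY-MM-DD HH:MM:SS,mmm LEVEL  logger - message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)\s+-\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with the four fields, or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def count_levels(lines):
    """Count how many parsable lines were logged at each level."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            counts[fields["level"]] += 1
    return counts
```

Calling count_levels(open('logs/hadoop.log')) and looking for a
non-zero 'DEBUG' count is one way to confirm that DEBUG logging was
actually active during the crawl.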

Please send the following information:-

1. The Nutch version you are using. (NUTCH-559v0.5 was generated
against the trunk. If you are using Nutch-0.9, the patch might not
apply cleanly. You might have to check manually whether every part of
the patch was applied.)

2. It would be better if you also send the output of your patch command.

3. The relevant logs from 'logs/hadoop.log' with DEBUG enabled. Before
sending, please make sure that the log file contains the DEBUG lines.

4. The output of a sample HTTP query to your proxy server with netcat
or telnet. For example:-

$ nc -v 192.168.101.1 80
intproxy [192.168.101.1] 80 (www) open
GET http://www.google.com/ HTTP/1.0
Host: www.google.com

HTTP/1.1 407 Proxy Authentication Required ( The Server requires
authorization to fulfill the request. Access to the Web Proxy filter
is denied.  )
Via: 1.1 INTPROXY
Proxy-Authenticate: Negotiate
Proxy-Authenticate: Kerberos
Proxy-Authenticate: NTLM
Proxy-Authenticate: Basic realm=INTPROXY
Connection: Keep-Alive
Proxy-Connection: Keep-Alive
Pragma: no-cache
Cache-Control: no-cache
Content-Type: text/html
Content-Length: 4119

Only the response headers are needed, as shown above. There is no need
to send the complete response.

5. The value of the 'http.proxy.realm' property you have used in your
'conf/nutch-site.xml'. (I assume you have provided the correct host,
port, username and password in the other http.proxy.* properties.
Ideally, you should also set the http.agent.host property properly,
though I have never found this to cause a problem.)
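
For item 4, the interesting part of the proxy's answer is the status
code and the list of Proxy-Authenticate schemes it offers. The
following Python sketch pulls both out of a saved response header. The
function name is hypothetical, and it assumes the status line is the
first line of the text and that the header section ends at the first
blank line.

```python
def parse_proxy_response(raw):
    """Return (status_code, [auth schemes]) from raw HTTP response headers."""
    lines = raw.replace("\r\n", "\n").split("\n")
    # The status line looks like: "HTTP/1.1 407 Proxy Authentication Required ..."
    status = int(lines[0].split(None, 2)[1])
    schemes = []
    for line in lines[1:]:
        if line == "":
            break  # a blank line ends the header section
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "proxy-authenticate":
            tokens = value.split()
            if tokens:
                # The scheme is the first token; parameters such as realm follow it.
                schemes.append(tokens[0])
    return status, schemes
```

For the sample response above this would report status 407 with the
schemes Negotiate, Kerberos, NTLM and Basic, in that order.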

Regards,
Susam Pal

On Jan 3, 2008 12:47 PM, Nidhi malik [EMAIL PROTECTED] wrote:
 I am sending my Hadoop log file. I have also applied the NUTCH-559v0.5 patch.

 At the time of fetching I am getting these messages:
 -
 Fetcher: starting
 Fetcher: segment: crawl/segments/20080103125023
 Fetcher: threads: 10
 fetching http://www.w3schools.com/
 http.proxy.host = netmon.iitb.ac.in
 http.proxy.port = 80
 http.timeout = 10
 http.content.limit = 65536
 http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com;
 [EMAIL PROTECTED])
 protocol.plugin.check.blocking = true
 protocol.plugin.check.robots = true
 fetcher.server.delay = 5000
 http.max.delays = 100
 Configured Client
 fetch of http://www.w3schools.com/ failed with: Http code=407, url=
 http://www.w3schools.com/
 Fetcher: done

 


Http 407 error

2008-01-03 Thread Nidhi malik
I am sending my Hadoop log file. I have also applied the NUTCH-559v0.5 patch.

At the time of fetching I am getting these messages:
-
Fetcher: starting
Fetcher: segment: crawl/segments/20080103125023
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 10
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com;
[EMAIL PROTECTED])
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url=
http://www.w3schools.com/
Fetcher: done


2008-01-03 12:50:04,275 INFO  crawl.Injector - Injector: starting
2008-01-03 12:50:04,347 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2008-01-03 12:50:04,347 INFO  crawl.Injector - Injector: urlDir: urls
2008-01-03 12:50:04,895 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-01-03 12:50:11,140 INFO  plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - Registered Plugins:
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in (parse-pdf)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Http / Https Protocol Plug-in (protocol-httpclient)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	HTTP Framework (lib-http)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Regex URL Filter (urlfilter-regex)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	MSWord Parse Plug-in (parse-msword)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	XML Libraries (lib-xml)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	MSExcel Parse Plug-in (parse-msexcel)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	OPIC Scoring Plug-in (scoring-opic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Zip Parse Plug-in (parse-zip)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	URL Query Filter (query-url)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Parse MS Documents Framework (lib-parsems)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Regex URL Filter Framework (lib-regex-filter)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	the nutch core extension points (nutch-extensionpoints)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic Query Filter (query-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic URL Normalizer (urlnormalizer-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Html Parse Plug-in (parse-html)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	RSS Parse Plug-in (parse-rss)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic Indexing Filter (index-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic Summarizer Plug-in (summary-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Site Query Filter (query-site)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Text Parse Plug-in (parse-text)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Pass-through URL Normalizer (urlnormalizer-pass)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Regex URL Normalizer (urlnormalizer-regex)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	CyberNeko HTML Parser (lib-nekohtml)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	OpenOffice/OpenDocument Parse Plug-in (parse-oo)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	JavaScript Parser (parse-js)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	SWF Parse Plug-in (parse-swf)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Indexing Filter 

hadoop file and nutch-407 error

2008-01-03 Thread Nidhi malik
At the time of fetching I am getting the message below, and I have attached the
hadoop.log file.

Fetcher: starting
Fetcher: segment: crawl/segments/20080104002039
Fetcher: threads: 10
fetching http://www.w3schools.com/
http.proxy.host = netmon.iitb.ac.in
http.proxy.port = 80
http.timeout = 10
http.content.limit = 65536
http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com;
[EMAIL PROTECTED])
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 5000
http.max.delays = 100
Configured Client
fetch of http://www.w3schools.com/ failed with: Http code=407, url=
http://www.w3schools.com/
Fetcher: done
2008-01-03 12:50:04,275 INFO  crawl.Injector - Injector: starting
2008-01-03 12:50:04,347 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2008-01-03 12:50:04,347 INFO  crawl.Injector - Injector: urlDir: urls
2008-01-03 12:50:04,895 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-01-03 12:50:11,140 INFO  plugin.PluginRepository - Plugins: looking in: /home/digvijay/Nutch/nutch-0.9/plugins
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - Registered Plugins:
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in (parse-pdf)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Http / Https Protocol Plug-in (protocol-httpclient)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	HTTP Framework (lib-http)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	Regex URL Filter (urlfilter-regex)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	MSWord Parse Plug-in (parse-msword)
2008-01-03 12:50:12,171 INFO  plugin.PluginRepository - 	XML Libraries (lib-xml)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	MSExcel Parse Plug-in (parse-msexcel)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	OPIC Scoring Plug-in (scoring-opic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Zip Parse Plug-in (parse-zip)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	URL Query Filter (query-url)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Parse MS Documents Framework (lib-parsems)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Regex URL Filter Framework (lib-regex-filter)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	the nutch core extension points (nutch-extensionpoints)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic Query Filter (query-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic URL Normalizer (urlnormalizer-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Html Parse Plug-in (parse-html)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	RSS Parse Plug-in (parse-rss)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic Indexing Filter (index-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Basic Summarizer Plug-in (summary-basic)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Site Query Filter (query-site)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Text Parse Plug-in (parse-text)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Pass-through URL Normalizer (urlnormalizer-pass)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Regex URL Normalizer (urlnormalizer-regex)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	CyberNeko HTML Parser (lib-nekohtml)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	OpenOffice/OpenDocument Parse Plug-in (parse-oo)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
2008-01-03 12:50:12,172 INFO  plugin.PluginRepository - 	JavaScript Parser (parse-js)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	SWF Parse Plug-in (parse-swf)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-01-03 12:50:12,173 INFO  plugin.PluginRepository - 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-01-03 12:50:12,174 INFO  plugin.PluginRepository - 	Nutch Online Search Results Clustering Plugin 

Re: hadoop file and nutch-407 error

2008-01-03 Thread Susam Pal
Hi,

I have already replied to this once, and since you have provided no
additional information, my reply remains almost the same.

Please send the following information:-

1. The Nutch version you are using. (NUTCH-559v0.5 was generated
against the trunk. If you are using Nutch-0.9, the patch might not
apply cleanly. You might have to check manually whether every part of
the patch was applied.)

2. How did the ant build go? Were there any errors in the build, or did
the build complete with the message 'BUILD SUCCESSFUL'?

3. It would be better if you also send the output of your patch command.

4. The relevant logs from 'logs/hadoop.log' with DEBUG enabled for
protocol-httpclient.

To enable DEBUG for protocol-httpclient, please do the following:-

1. Open 'conf/log4j.properties'.

2. Add the following line and save the file:-
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout

3. Delete 'logs/hadoop.log', run a new crawl and send the new 'logs/hadoop.log' file.

Please make sure before sending, that the log file has the DEBUG
lines. They look like this:-

2008-01-02 21:55:30,177 DEBUG httpclient.Http - url: https://mail.yahoo.com/robots.txt; status code: 404; bytes received: 2337
2008-01-02 21:55:32,900 DEBUG httpclient.Http - url: https://mail.yahoo.com/; status code: 200; bytes received: 26291

If DEBUG lines are missing, it means you have either not enabled DEBUG
properly or you have not successfully patched and built Nutch.
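
A quick way to verify this before sending the file is to search it for
the httpclient DEBUG entries. The Python sketch below does that and
returns the URL, status code and byte count of each entry. The regular
expression is an assumption based only on the two sample lines above,
and the function name is hypothetical. Whitespace in the pattern also
matches line breaks, so entries that were wrapped by a mail client are
still found.

```python
import re

# Assumed shape of the DEBUG message, taken from the sample lines above.
DEBUG_RE = re.compile(
    r"DEBUG\s+httpclient\.Http\s+-\s+url:\s*(?P<url>\S+);\s*"
    r"status code:\s*(?P<status>\d+);\s*bytes received:\s*(?P<size>\d+)"
)


def http_debug_entries(log_text):
    """Return a (url, status, bytes_received) tuple for every DEBUG entry found."""
    return [
        (m.group("url"), int(m.group("status")), int(m.group("size")))
        for m in DEBUG_RE.finditer(log_text)
    ]
```

An empty result for a freshly generated log suggests one of the two
causes named above. Entries with status 407 show that the proxy is
still rejecting the request.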

Regards,
Susam Pal

On Jan 4, 2008 12:08 AM, Nidhi malik [EMAIL PROTECTED] wrote:
 At the time of fetching I am getting the message below, and I have attached the
 hadoop.log file.

 Fetcher: starting
 Fetcher: segment: crawl/segments/20080104002039
 Fetcher: threads: 10
 fetching http://www.w3schools.com/
 http.proxy.host = netmon.iitb.ac.in
 http.proxy.port = 80
 http.timeout = 10
 http.content.limit = 65536
 http.agent = digi/Nutch-0.9 (digvijay; http://www.google.com;
 [EMAIL PROTECTED])
 protocol.plugin.check.blocking = true
 protocol.plugin.check.robots = true
 fetcher.server.delay = 5000
 http.max.delays = 100
 Configured Client
 fetch of http://www.w3schools.com/ failed with: Http code=407,
 url=http://www.w3schools.com/
 Fetcher: done




Prefix Query in Nutch and Wildcard support.

2008-01-03 Thread Developer Developer
Hello friends,

Is there any way to do a prefix query in Nutch? E.g., query the content field for
occurrences of abc*? I could do it in Lucene, but I want to do it in
Nutch. Going through the mailing list, it appears that Nutch does not
support such queries. Is that true?

Thanks!