[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2007-11-04 Thread Apache Wiki

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
removed conf/nutch-site.xml conf

--
  == Download ==
  Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk.
  
+ == Configuration ==
+ This is an advanced feature that lets the user specify different credentials 
for different authentication scopes. This section does not describe the default 
configuration. Some parts of this section might be outdated. It is better to 
read the guidelines in 'conf/httpclient-auth.xml' because they are correct. 
This section will be improved later when time permits.
- == Common Credentials Configuration ==
- This is the simplest possible configuration which involves setting just one 
set of credentials. It is useful in trusted Intranets where all sites are 
trusted and require the same username/password for authentication.
- 
- === Quick Guide ===
-  1. Include 'protocol-httpclient' in 'plugin.includes'.
-  1. For basic or digest authentication in proxy server, set 
'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' 
if you want to specify a realm  as the authentication scope.
-  1. For NTLM authentication in proxy server, set 'http.proxy.username', 
'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 
'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where 
the crawler is running.
-  1. For basic or digest authentication in web servers, set 
'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if 
you want to specify a realm as the authentication scope.
-  1. For NTLM authentication in web servers, set 'http.auth.username', 
'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' 
is the NTLM domain name. 'http.auth.host' is the host where the crawler is 
running.
- 
- This is explained in detail in the following section.
- 
- === Details ===
- To use 'protocol-httpclient', edit 'conf/nutch-site.xml' to include the 
properties explained in this section. First and foremost, to enable the plugin, 
add it to the 'plugin.includes' property of 'nutch-site.xml'. This property 
would typically look like:
- 
- {{{<property>
-   <name>plugin.includes</name>
-   <value>protocol-httpclient|urlfilter-regex|...</value>
-   <description>...</description>
- </property>}}}
- 
- (... indicates a long line truncated)
- 
- Next, if authentication is required for proxy server, the following 
properties need to be set in 'conf/nutch-site.xml'.
- 
-  * http.proxy.username
-  * http.proxy.password
-  * http.proxy.realm (If a realm needs to be provided. In case of NTLM 
authentication, the domain name should be provided as its value.)
-  * http.auth.host (This is required in case of NTLM authentication only. This 
is the host where the crawler would be running.)
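
A minimal sketch of how the proxy properties listed above might appear in 'conf/nutch-site.xml' for NTLM authentication. The values (crawler, secret, EXAMPLE, crawlhost.example.com) are placeholders assumed for illustration, not values taken from the patch:

{{{<!-- Hypothetical values; substitute your own. -->
<property>
  <name>http.proxy.username</name>
  <value>crawler</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>secret</value>
</property>
<property>
  <!-- For NTLM, the realm property carries the NTLM domain name. -->
  <name>http.proxy.realm</name>
  <value>EXAMPLE</value>
</property>
<property>
  <!-- Host where the crawler runs; needed for NTLM only. -->
  <name>http.auth.host</name>
  <value>crawlhost.example.com</value>
</property>}}}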
- 
- If the web servers of the intranet are in a particular domain or realm and 
require authentication, these properties should be set in 
'conf/nutch-site.xml'.
- 
-  * http.auth.username
-  * http.auth.password
-  * http.auth.realm
-  * http.auth.host
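
For web server authentication, the four properties above might be set in 'conf/nutch-site.xml' as in the following sketch. The username, password, realm and host values are placeholders assumed for illustration:

{{{<!-- Hypothetical values; substitute your own. -->
<property>
  <name>http.auth.username</name>
  <value>crawler</value>
</property>
<property>
  <name>http.auth.password</name>
  <value>secret</value>
</property>
<property>
  <!-- Realm used as the authentication scope; the NTLM domain name for NTLM. -->
  <name>http.auth.realm</name>
  <value>intranet</value>
</property>
<property>
  <!-- Host where the crawler runs. -->
  <name>http.auth.host</name>
  <value>crawlhost.example.com</value>
</property>}}}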
- 
- The explanation for these properties is similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
for proxy NTLM authentication as well as web server NTLM authentication. The 
host from which the HTTP requests originate is the same in both cases, so one 
property serves both and no separate property was created.
- 
- Even though the 'http.auth.host' property is required only for NTLM 
authentication, it is advisable to set it in all cases. If the crawler comes 
across a server that requires NTLM authentication (which you might not have 
anticipated), it can then still fetch the page.
- 
- == Authentication Scope Specific Credentials ==
- This is an advanced feature that lets the user specify different credentials 
for different authentication scopes.
  
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
@@ -98, +57 @@

  
  The 'realm' attribute is optional in the authscope tag and can be omitted if 
you want the credentials to be used for all realms on a particular web server 
(or all remaining realms, as shown in the Quick Guide section above). An 
authentication scope should not be defined twice, in different authscope tags 
under different credentials tags. However, if this is done by mistake, the 
credentials for the last defined authscope tag are used. This is because the 
XML parsing code reads the file from top to bottom and sets the 
credentials for 

svn commit: r591791 - in /lucene/nutch/trunk: ./ lib/ lib/native/Linux-i386-32/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/searcher/ src/java/org/apa

2007-11-04 Thread kubes
Author: kubes
Date: Sun Nov  4 07:38:35 2007
New Revision: 591791

URL: http://svn.apache.org/viewvc?rev=591791&view=rev
Log:
NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x.

Added:
lucene/nutch/trunk/lib/hadoop-0.15.0-core.jar   (with props)
Removed:
lucene/nutch/trunk/lib/hadoop-0.12.3-core.jar
Modified:
lucene/nutch/trunk/CHANGES.txt
lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.a
lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so
lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so.1
lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so.1.0.0
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
lucene/nutch/trunk/src/java/org/apache/nutch/indexer/FsDirectory.java
lucene/nutch/trunk/src/java/org/apache/nutch/searcher/IndexSearcher.java
lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java

Modified: lucene/nutch/trunk/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?rev=591791&r1=591790&r2=591791&view=diff
==
--- lucene/nutch/trunk/CHANGES.txt (original)
+++ lucene/nutch/trunk/CHANGES.txt Sun Nov  4 07:38:35 2007
@@ -154,6 +154,8 @@
 52. NUTCH-501 -  Implement a different caching mechanism for objects cached in
 configuration. (dogacan)
 
+53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
+
 Release 0.9 - 2007-04-02
 
  1. Changed log4j confiquration to log to stdout on commandline

Added: lucene/nutch/trunk/lib/hadoop-0.15.0-core.jar
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/lib/hadoop-0.15.0-core.jar?rev=591791&view=auto
==
Binary file - no diff available.

Propchange: lucene/nutch/trunk/lib/hadoop-0.15.0-core.jar
--
svn:mime-type = application/octet-stream

Modified: lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.a
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.a?rev=591791&r1=591790&r2=591791&view=diff
==
Binary files - no diff available.

Modified: lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so?rev=591791&r1=591790&r2=591791&view=diff
==
Binary files - no diff available.

Modified: lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so.1
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so.1?rev=591791&r1=591790&r2=591791&view=diff
==
Binary files - no diff available.

Modified: lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so.1.0.0
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/lib/native/Linux-i386-32/libhadoop.so.1.0.0?rev=591791&r1=591790&r2=591791&view=diff
==
Binary files - no diff available.

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java?rev=591791&r1=591790&r2=591791&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java Sun Nov  4 
07:38:35 2007
@@ -172,7 +172,7 @@
 CrawlDb.install(mergeJob, crawlDb);
 
 // clean up
-FileSystem fs = new JobClient(getConf()).getFs();
+FileSystem fs = FileSystem.get(getConf());
 fs.delete(tempDir);
 if (LOG.isInfoEnabled()) { LOG.info("Injector: done"); }
 

Modified: 
lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java?rev=591791&r1=591790&r2=591791&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java 
(original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java 
Sun Nov  4 07:38:35 2007
@@ -174,7 +174,7 @@
 this.index = index;
   }
 
-  public boolean next(Writable key, Writable value)
+  public boolean next(WritableComparable key, Writable value)
 throws IOException {
 
 // skip empty indexes

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/indexer/FsDirectory.java
URL: