svn commit: r578703 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/util/NodeWalker.java src/test/org/apache/nutch/util/TestNodeWalker.java

2007-09-24 Thread dogacan
Author: dogacan
Date: Mon Sep 24 01:27:34 2007
New Revision: 578703

URL: http://svn.apache.org/viewvc?rev=578703&view=rev
Log:
NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child. 
Contributed by Emmanuel Joke.

Added:
lucene/nutch/trunk/src/test/org/apache/nutch/util/TestNodeWalker.java
Modified:
lucene/nutch/trunk/CHANGES.txt
lucene/nutch/trunk/src/java/org/apache/nutch/util/NodeWalker.java

Modified: lucene/nutch/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?rev=578703&r1=578702&r2=578703&view=diff
==
--- lucene/nutch/trunk/CHANGES.txt (original)
+++ lucene/nutch/trunk/CHANGES.txt Mon Sep 24 01:27:34 2007
@@ -136,6 +136,9 @@
 46. NUTCH-554 - Generator throws IOException on invalid urls.
 (Brian Whitman via ab)
 
+47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
+(Emmanuel Joke via dogacan)
+
 Release 0.9 - 2007-04-02
 
  1. Changed log4j confiquration to log to stdout on commandline

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/util/NodeWalker.java
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/util/NodeWalker.java?rev=578703&r1=578702&r2=578703&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/util/NodeWalker.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/util/NodeWalker.java Mon Sep 24 01:27:34 2007
@@ -77,8 +77,7 @@
 
 int childLen = (currentChildren != null) ? currentChildren.getLength() : 0;
 
-// put the children node on the stack in first to last order
-for (int i = childLen - 1; i >= 0; i--) {
+for (int i = 0 ; i < childLen ; i++) {
   Node child = nodes.peek();
   if (child.equals(currentChildren.item(i))) {
 nodes.pop();
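
The hunk above only swaps the loop direction in skipChildren, so a small standalone sketch may help show why that matters. It assumes, based on the removed comment, that the walker pushes children onto its stack last-to-first so that the first child sits on top. The class SkipChildrenSketch, its walk helper and the sample XML are hypothetical stand-ins written for illustration; this is not the Nutch NodeWalker source.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for a stack-based DOM walker; not the Nutch class. */
final class SkipChildrenSketch {
  private final Deque<Node> nodes = new ArrayDeque<>();
  private NodeList currentChildren;

  SkipChildrenSketch(Node root) {
    nodes.push(root);
  }

  boolean hasNext() {
    return !nodes.isEmpty();
  }

  Node nextNode() {
    Node current = nodes.pop();
    currentChildren = current.getChildNodes();
    // Assumption: children are pushed last-to-first, so the first child ends up on top.
    for (int i = currentChildren.getLength() - 1; i >= 0; i--) {
      nodes.push(currentChildren.item(i));
    }
    return current;
  }

  /** Drops the current node's children from the stack; 'forward' selects the loop order. */
  void skipChildren(boolean forward) {
    int childLen = (currentChildren != null) ? currentChildren.getLength() : 0;
    for (int k = 0; k < childLen; k++) {
      int i = forward ? k : childLen - 1 - k;
      Node top = nodes.peek();
      // Only pop when the top of the stack is the child being compared.
      if (top != null && top.isSameNode(currentChildren.item(i))) {
        nodes.pop();
      }
    }
  }

  /** Walks the XML, skips the children of every ul element, returns the visited node names. */
  static String walk(String xml, boolean forward) throws Exception {
    Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    SkipChildrenSketch walker = new SkipChildrenSketch(root);
    StringBuilder visited = new StringBuilder();
    while (walker.hasNext()) {
      Node node = walker.nextNode();
      visited.append(node.getNodeName()).append(' ');
      if ("ul".equals(node.getNodeName())) {
        walker.skipChildren(forward);
      }
    }
    return visited.toString().trim();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<root><ul><li>a</li><li>b</li><li>c</li></ul><p>d</p></root>";
    System.out.println("forward loop: " + walk(xml, true));
    System.out.println("reverse loop: " + walk(xml, false));
  }
}
```

With the forward loop the sketch visits root, ul, p and the text under p, so all three li elements are skipped. With the reverse loop only the first li is popped, because the top of the stack (the first child) is compared against the last child first; the remaining two li elements are still visited.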

Added: lucene/nutch/trunk/src/test/org/apache/nutch/util/TestNodeWalker.java
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/test/org/apache/nutch/util/TestNodeWalker.java?rev=578703&view=auto
==
--- lucene/nutch/trunk/src/test/org/apache/nutch/util/TestNodeWalker.java (added)
+++ lucene/nutch/trunk/src/test/org/apache/nutch/util/TestNodeWalker.java Mon Sep 24 01:27:34 2007
@@ -0,0 +1,105 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.ByteArrayInputStream;
+import junit.framework.TestCase;
+
+import org.apache.xerces.parsers.DOMParser;
+import org.w3c.dom.Node;
+import org.xml.sax.InputSource;
+
+
+
+
+/** Unit tests for NodeWalker methods. */
+public class TestNodeWalker extends TestCase {
+  public TestNodeWalker(String name) { 
+    super(name); 
+  }
+
+  /* a snapshot of the nutch webpage */
+  private final static String WEBPAGE= 
+  "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
+  + "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\"><head><title>Nutch</title></head>"
+  + "<body>"
+  + "<ul>"
+  + "<li>crawl several billion pages per month</li>"
+  + "<li>maintain an index of these pages</li>"
+  + "<li>search that index up to 1000 times per second</li>"
+  + "<li>provide very high quality search results</li>"
+  + "<li>operate at minimal cost</li>"
+  + "</ul>"
+  + "</body>"
+  + "</html>";
+
+  private final static String[] ULCONTENT = new String[4];
+  
+  protected void setUp() throws Exception{
+    ULCONTENT[0]="crawl several billion pages per month" ;
+    ULCONTENT[1]="maintain an index of these pages" ;
+    ULCONTENT[2]="search that index up to 1000 times per second"  ;
+    ULCONTENT[3]="operate at minimal cost" ;
+  }
+
+  public void testSkipChildren() {
+    DOMParser parser= new DOMParser();
+    try {
+      parser.parse(new InputSource(new ByteArrayInputStream(WEBPAGE.getBytes())));
+    } catch (Exception e) {
+      e.printStackTrace();
+    }
+ 
+    StringBuffer sb = new StringBuffer();
+    NodeWalker walker = new NodeWalker(parser.getDocument());
+    while (walker.hasNext()) {
+      Node currentNode = walker.nextNode();
+      short nodeType = currentNode.getNodeType();
+  

[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2007-09-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
Initial draft copied from protocol-http11

New page:
== Introduction ==
'protocol-httpclient' is a protocol plugin which supports retrieving documents 
via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with the Basic, Digest 
and NTLM authentication schemes for web servers as well as proxy servers.

== Author of Authentication Features ==
Susam Pal, Infosys Technologies Limited

== Necessity ==
Two protocol plugins were already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication, but the NTLM 
authentication did not work due to a bug. Some portions of 'protocol-httpclient' 
were written to solve these problems and to provide additional features such as 
authentication support for proxy servers and better inline documentation for the 
properties used in 'nutch-site.xml' to enable 'protocol-httpclient' and use 
its authentication features. This is an improvement on the previous two 
plugins. The author of the authentication features has tested them at Infosys 
Technologies Limited by crawling a corporate intranet that requires NTLM 
authentication, and they were found to work well.

== Download ==
Currently, this plugin is available as a patch in JIRA. Download the patch from 
[https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it 
to trunk.

== Quick Guide ==
This section is a quick guide to configuring the authentication-related 
properties for 'protocol-httpclient'.

 1. Include 'protocol-httpclient' in 'plugin.includes'.
 1. For Basic or Digest authentication with a proxy server, set 
'http.proxy.username' and 'http.proxy.password'. Also set 'http.proxy.realm' 
if you want to specify a realm as the authentication scope.
 1. For NTLM authentication with a proxy server, set 'http.proxy.username', 
'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 
'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where 
the crawler is running.
 1. For Basic or Digest authentication with web servers, set 'http.auth.username' 
and 'http.auth.password'. Also set 'http.auth.realm' if you want to specify a 
realm as the authentication scope.
 1. For NTLM authentication with web servers, set 'http.auth.username', 
'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' 
is the NTLM domain name. 'http.auth.host' is the host where the crawler is 
running.
 1. It is recommended that 'http.useHttp11' be set to true.

This is explained in a little more detail in the next section.

== Nutch Configuration ==
To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to include 
some properties, which are explained in this section. First and foremost, to 
enable the plugin, it must be added to the 'plugin.includes' property of 
'nutch-site.xml'. This property would typically look like:

{{{<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|...</value>
  <description>...</description>
</property>}}}

(... indicates a long line truncated)

Next, if authentication is required for the proxy server, the following 
properties need to be set in 'conf/nutch-site.xml'.

 * http.proxy.username
 * http.proxy.password
 * http.proxy.realm (If a realm needs to be provided. In case of NTLM 
authentication, the domain name should be provided as its value.)
 * http.auth.host (This is required in case of NTLM authentication only. This 
is the host where the crawler would be running.)

If the web servers of the intranet are in a particular domain or realm and 
require authentication, these properties should be set in 
'conf/nutch-site.xml'.

 * http.auth.username
 * http.auth.password
 * http.auth.realm
 * http.auth.host

The explanation for these properties is similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
for both proxy NTLM authentication and web server NTLM authentication. Since 
the host from which the HTTP requests originate is the same in both cases, a 
single property is used rather than two separate ones.

Even though the 'http.auth.host' property is required only for NTLM 
authentication, it is advisable to set it in all cases: if the crawler comes 
across a server which requires NTLM authentication (which you might not have 
anticipated), it can still fetch the page.
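
As a rough sketch of the settings described in this section, a 'conf/nutch-site.xml' for NTLM authentication against both a proxy server and the intranet web servers might contain entries like the following. The property names are the ones listed above; every value shown (user names, passwords, domain and host names) is a made-up placeholder, and the description elements are omitted for brevity.

{{{<!-- Illustrative placeholder values only; replace each one with your own. -->
<property>
  <name>http.proxy.username</name>
  <value>proxyuser</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>proxypassword</value>
</property>
<property>
  <name>http.proxy.realm</name>
  <value>EXAMPLEDOMAIN</value> <!-- NTLM domain name for the proxy -->
</property>
<property>
  <name>http.auth.username</name>
  <value>webuser</value>
</property>
<property>
  <name>http.auth.password</name>
  <value>webpassword</value>
</property>
<property>
  <name>http.auth.realm</name>
  <value>EXAMPLEDOMAIN</value> <!-- NTLM domain name for the web servers -->
</property>
<property>
  <name>http.auth.host</name>
  <value>crawler.example.com</value> <!-- host where the crawler runs -->
</property>}}}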

== Underlying HttpClient Library ==
'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for authenticating users. 
Given that only one 

[Nutch Wiki] Update of protocol-http11 by susam

2007-09-24 Thread Apache Wiki
The following page has been changed by susam:
http://wiki.apache.org/nutch/protocol-http11

The comment on the change is:
content moved to HttpAuthenticationSchemes

--
+ protocol-http11 has been converted to a patch for protocol-httpclient as per 
the discussion held at [https://issues.apache.org/jira/browse/NUTCH-557 JIRA 
NUTCH-557].
- == Introduction ==
- 'protocol-http11' is a protocol plugin which supports retrieving documents 
via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest 
and NTLM authentication schemes for web server as well as proxy server.
  
+ Therefore, the content of this page has been moved to 
HttpAuthenticationSchemes.
- == Author ==
- Susam Pal, Infosys Technologies Limited
  
- == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. 'protocol-http11' was written to solve 
these problems, provide additional features like authentication support for 
proxy server and better inline documentation for the properties to be used in 
'nutch-site.xml' to enable 'protocol-http11' and use its authentication 
features. This is an improvement on the previous two plugins. The author of 
this plugin has tested it in Infosys Technologies Limited by crawling the 
corporate intranet requiring NTLM authentication and this has been found to 
work well. The name, 'protocol-http11' was chosen because, 'HTTP 1.1' is a 
valid protocol name.
- 
- == Download ==
- Currently, this plugin is in the form of patch in JIRA. Download the patch 
from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply 
it to trunk.
- 
- == Quick Guide ==
- This section is a quick guide to configure authentication related properties 
for 'protocol-http11'.
- 
-  1. Include 'protocol-http11' in 'plugin.includes'.
-  1. For basic or digest authentication in proxy server, set 
'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' 
if you want to specify a realm  as the authentication scope.
-  1. For NTLM authentication in proxy server, set 'http.proxy.username', 
'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 
'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where 
the crawler is running.
-  1. For basic or digest authentication in web servers, set 
'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if 
you want to specify a realm as the authentication scope.
-  1. For NTLM authentication in proxy server, set 'http.auth.username', 
'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' 
is the NTLM domain name. 'http.auth.host' is the host where the crawler is 
running.
-  1. It is recommended that 'http.useHttp11' be set to true.
- 
- This is explained in a little more detail in the next section.
- 
- == Nutch Configuration ==
- To use 'protocol-http11', 'conf/nutch-site.xml has to be edited to include 
some properties which is explained in this section. First and foremost, to 
enable the plugin, this plugin must be added in the 'plugin.includes' of 
'nutch-site.xml'. So, this property would typically look like:-
- 
- {{{<property>
-   <name>plugin.includes</name>
-   <value>protocol-http11|urlfilter-regex|...</value>
-   <description>...</description>
- </property>}}}
- 
- (... indicates truncation)
- 
- It is recommended that HTTP 1.1 should be enabled.
- 
- {{{<property>
-   <name>http.useHttp11</name>
-   <value>true</value>
-   <description>...</description>
- </property>}}}
- 
- Next, if authentication is required for proxy server, the following 
properties need to be set in 'conf/nutch-site.xml'.
- 
-  * http.proxy.username
-  * http.proxy.password
-  * http.proxy.realm (If a realm needs to be provided. In case of NTLM 
authentication, the domain name should be provided as its value.)
-  * http.auth.host (This is required in case of NTLM authentication only. This 
is the host where the crawler would be running.)
- 
- If the web servers of the intranet are in a particular domain or realm and 
requires authentication, these properties should be set in 
'conf/nutch-site.xml'.
- 
-  * http.auth.username
-  * http.auth.password
-  * http.auth.realm
-  * http.auth.host
- 
- The explanation for these properties are similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
both for proxy NTLM authentication and web server NTLM authentication. Since, 
the host at which the HTTP requests are originating are same for both, so the 
same property is used for both and two different 

[Nutch Wiki] Update of FrontPage by susam

2007-09-24 Thread Apache Wiki
The following page has been changed by susam:
http://wiki.apache.org/nutch/FrontPage

The comment on the change is:
Http Authentication Schemes

--
   * CrossPlatformNutchScripts
   * MonitoringNutchCrawls - techniques for keeping an eye on a nutch crawl's 
progress.
   * [Nutch 0.9 Crawl Script Tutorial]
+  * HttpAuthenticationSchemes - How to enable Nutch to authenticate itself 
using NTLM, Basic or Digest authentication schemes.
  
  == Nutch Development ==
   * [:Becoming_A_Nutch_Developer:Becoming a Nutch Developer] - Start 
developing and contributing to Nutch.