svn commit: r367251 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java

2006-01-09 Thread ab
Author: ab
Date: Mon Jan  9 00:58:58 2006
New Revision: 367251

URL: http://svn.apache.org/viewcvs?rev=367251&view=rev
Log:
Replace the custom metadata serialization with the one provided by the
ContentProperties class. This fixes the breakage if multiple property
values per key are in use.

Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java?rev=367251&r1=367250&r2=367251&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java Mon Jan  9 00:58:58 2006
@@ -31,7 +31,7 @@
 public final class ParseData extends VersionedWritable {
   public static final String DIR_NAME = "parse_data";
 
-  private final static byte VERSION = 2;
+  private final static byte VERSION = 3;
 
   private String title;
   private Outlink[] outlinks;
@@ -96,10 +96,15 @@
   Outlink.skip(in);
 }
 
-int propertyCount = in.readInt(); // read metadata
-metadata = new ContentProperties();
-for (int i = 0; i < propertyCount; i++) {
-  metadata.put(UTF8.readString(in), UTF8.readString(in));
+if (version < 3) {
+  int propertyCount = in.readInt(); // read metadata
+  metadata = new ContentProperties();
+  for (int i = 0; i < propertyCount; i++) {
+metadata.put(UTF8.readString(in), UTF8.readString(in));
+  }
+} else {
+  metadata = new ContentProperties();
+  metadata.readFields(in);
 }
 
   }
@@ -113,14 +118,7 @@
 for (int i = 0; i < outlinks.length; i++) {
   outlinks[i].write(out);
 }
-
-out.writeInt(metadata.size());// write metadata
-Iterator i = metadata.entrySet().iterator();
-while (i.hasNext()) {
-  Map.Entry e = (Map.Entry)i.next();
-  UTF8.writeString(out, (String)e.getKey());
-  UTF8.writeString(out, (String)e.getValue());
-}
+metadata.write(out);
   }
 
   public static ParseData read(DataInput in) throws IOException {
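
The pattern in this commit (bump the version byte, keep a read path for the old layout, always write the new layout) can be sketched in isolation. The class below is a hypothetical stand-in, not Nutch's ParseData or ContentProperties: the name VersionedRecordSketch, the byte layouts, and the legacyBytes helper are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a versioned record; not Nutch's ParseData. */
class VersionedRecordSketch {
  static final byte VERSION = 3;

  /** Multi-valued metadata: each key maps to a list of values. */
  final Map<String, List<String>> metadata = new LinkedHashMap<>();

  void add(String key, String value) {
    metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }

  /** Always writes the current layout, prefixed with the version byte. */
  byte[] toBytes() {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeByte(VERSION);
      out.writeInt(metadata.size());
      for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
        out.writeUTF(e.getKey());
        out.writeInt(e.getValue().size());
        for (String v : e.getValue()) out.writeUTF(v);
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads either layout, branching on the stored version byte. */
  static VersionedRecordSketch fromBytes(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      byte version = in.readByte();
      if (version > VERSION) throw new IOException("unsupported version " + version);
      VersionedRecordSketch r = new VersionedRecordSketch();
      int keys = in.readInt();
      for (int i = 0; i < keys; i++) {
        String key = in.readUTF();
        if (version < 3) {
          r.add(key, in.readUTF());            // assumed legacy layout: one value per key
        } else {
          int n = in.readInt();                // current layout: value count, then values
          List<String> values = new ArrayList<>();
          for (int j = 0; j < n; j++) values.add(in.readUTF());
          r.metadata.put(key, values);
        }
      }
      return r;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds bytes in the assumed legacy (version 2) layout, for demonstration only. */
  static byte[] legacyBytes(String key, String value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeByte(2);
      out.writeInt(1);
      out.writeUTF(key);
      out.writeUTF(value);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the writer only ever emits the newest layout, old data is upgraded the first time it is read and written back, and the version check is confined to the read path.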




[Nutch Wiki] Update of FAQ by MichaelStack

2006-01-09 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by MichaelStack:
http://wiki.apache.org/nutch/FAQ

--
  Anchor text makes a large contribution to document score (You can see the 
anchor text for a page by browsing to explain then editing the URL to put in 
place anchors.jsp in place of explain.jsp).
  
   What is the RSS symbol in search results all about? 
- Clicking on the RSS symbol sends the current query back to Nutch to a servlet 
named OpenSearchServlet that redoes the search returning the results instead 
formatted as RSS (XML).  The RSS format is based on 
[http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from 
[http://www.a9.com a9.com] (Also see [href=http://opensearch.a9.com/ 
OpenSearch]). Nutch extensions add to the OpenSearch RSS the original query, 
navigation information, and any extra fields that available in the search 
result such as Nutch boost, segment name, etc. 
+ Clicking on the RSS symbol sends the current query back to Nutch to a servlet 
named OpenSearchServlet.  OpenSearchServlet reruns the query and returns the 
results formatted instead as RSS (XML).  The RSS format is based on 
[http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from 
[http://www.a9.com a9.com]: OpenSearch RSS 1.0 is an extension to the RSS 2.0 
standard, conforming to the guidelines for RSS extensibility as outlined by the 
RSS 2.0 specification (See also [http://opensearch.a9.com/ OpenSearch]). Nutch 
in turn  makes extension to OpenSearch.  The Nutch extensions are identified by 
the 'nutch' namespace prefix and add to OpenSearch navigation information, the 
original query, and all fields that are available at search result time 
including the Nutch page boost, the name of the segment the page resides in, 
etc. 
  
- Results as RSS (XML) rather than HTML are easier for programmatic clients to 
parse: such clients will query against OpenSearchServlet rather than 
search.jsp.  Results as XML can also be transformed using XSL stylesheets, the 
likely direction of UI development in nutch going by mailing list posts.
+ Results as RSS (XML) rather than HTML are easier for programmatic clients to 
parse: such clients will query against OpenSearchServlet rather than 
search.jsp.  Results as XML can also be transformed using XSL stylesheets, the 
likely direction of UI development going forward.
  
  === Crawling ===
  


[Nutch Wiki] Update of FAQ by MichaelStack

2006-01-09 Thread Apache Wiki

The following page has been changed by MichaelStack:
http://wiki.apache.org/nutch/FAQ

The comment on the change is:
Edit.  Add mention of RSS 2.0.

--
  Anchor text makes a large contribution to document score (You can see the 
anchor text for a page by browsing to explain then editing the URL to put in 
place anchors.jsp in place of explain.jsp).
  
   What is the RSS symbol in search results all about? 
- Clicking on the RSS symbol sends the current query back to Nutch to a servlet 
named OpenSearchServlet.  OpenSearchServlet reruns the query and returns the 
results formatted instead as RSS (XML).  The RSS format is based on 
[http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from 
[http://www.a9.com a9.com]: OpenSearch RSS 1.0 is an extension to the RSS 2.0 
standard, conforming to the guidelines for RSS extensibility as outlined by the 
RSS 2.0 specification (See also [http://opensearch.a9.com/ OpenSearch]). Nutch 
in turn  makes extension to OpenSearch.  The Nutch extensions are identified by 
the 'nutch' namespace prefix and add to OpenSearch navigation information, the 
original query, and all fields that are available at search result time 
including the Nutch page boost, the name of the segment the page resides in, 
etc. 
+ Clicking on the RSS symbol sends the current query back to Nutch to a servlet 
named 
[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html
 OpenSearchServlet].  
[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html
 OpenSearchServlet] reruns the query and returns the results formatted instead 
as RSS (XML).  The RSS format is based on 
[http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from 
[http://www.a9.com a9.com]: [http://a9.com/-/spec/opensearchrss/1.0/ 
OpenSearch] RSS 1.0 is an extension to the RSS 2.0 standard, conforming to the 
guidelines for RSS extensibility as outlined by the RSS 2.0 specification (See 
also [http://opensearch.a9.com/ opensearch]). Nutch in turn  makes extension to 
[http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch].  The Nutch extensions are 
identified by the 'nutch' namespace prefix and add to 
[http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch] navigation information,
  the original query, and all fields that are available at search result time 
including the Nutch page boost, the name of the segment the page resides in, 
etc. 
  
- Results as RSS (XML) rather than HTML are easier for programmatic clients to 
parse: such clients will query against OpenSearchServlet rather than 
search.jsp.  Results as XML can also be transformed using XSL stylesheets, the 
likely direction of UI development going forward.
+ Results as RSS (XML) rather than HTML are easier for programmatic clients to 
parse: such clients will query against 
[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html
 OpenSearchServlet] rather than search.jsp.  Results as XML can also be 
transformed using XSL stylesheets, the likely direction of UI development going 
forward.
  
  === Crawling ===
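
For a programmatic client, the FAQ text above implies reading namespaced extension elements out of the RSS response. The following is a minimal sketch using the JDK's DOM parser. The namespace URI, the element names (segment, boost), and the sample document are all hypothetical and chosen only for illustration; the real values would have to be taken from an actual OpenSearchServlet response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class NutchRssSketch {
  // Hypothetical namespace URI; check a real response for the actual one.
  static final String NUTCH_NS = "http://example.org/nutch-rss-extension";

  // Hand-written sample, not captured output from Nutch.
  static final String SAMPLE =
      "<rss version=\"2.0\" xmlns:nutch=\"" + NUTCH_NS + "\">"
      + "<channel><title>sample</title>"
      + "<item><title>Hit</title>"
      + "<nutch:segment>20060109</nutch:segment>"
      + "<nutch:boost>1.5</nutch:boost>"
      + "</item></channel></rss>";

  /** Returns the text of the first extension element with this local name, or null. */
  static String firstExtension(String xml, String localName) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // required for namespace-based lookups
      Document doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList nodes = doc.getElementsByTagNameNS(NUTCH_NS, localName);
      return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

Looking elements up by namespace URI rather than by the literal 'nutch' prefix keeps the client working even if the prefix in the feed changes.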
  


[Nutch Wiki] Update of bin/nutch segread by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_segread

The comment on the change is:
fix classpath from apache.org migration

--
- segread is an alias for net.nutch.segment.SegmentReader
+ segread is an alias for org.apache.nutch.segment.SegmentReader
  
  This class holds together all data readers for an existing segment. Some 
convenience methods are also provided, to read from the segment and to 
reposition the current pointer.
  


[Nutch Wiki] Update of bin/nutch segread by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_segread

The comment on the change is:
fixed usage 

--
  
  This class holds together all data readers for an existing segment. Some 
convenience methods are also provided, to read from the segment and to 
reposition the current pointer.
  
- Usage: SegmentReader [-fix] [-dump] [-dumpsort] [-list] [-nocontent] 
[-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
+ Usage: bin/nutch segread [-fix] [-dump] [-dumpsort] [-list] [-nocontent] 
[-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
  NOTE: at least one segment dir name is required, or '-dir' option.[[BR]]
  -fix[[BR]]
automatically fix corrupted segments[[BR]]


[Nutch Wiki] Update of bin/nutch admin by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_admin

The comment on the change is:
Fixed classpath change

--
- admin is an alias for net.nutch.tools.WebDB!AdminTool
+ admin is an alias for org.apache.nutch.tools.WebDB!AdminTool
  
  The WebDB!AdminTool is for Nutch administrators who need special access to 
the webdb. It allows for finer editing of the stored values.
  
- Usage: bin/nutch net.nutch.tools.WebDB!AdminTool (-local | -ndfs 
namenode:port) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
+ Usage: bin/nutch org.apache.nutch.tools.WebDB!AdminTool (-local | -ndfs 
namenode:port) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
  
  CommandLineOptions
  


svn commit: r367405 - /lucene/nutch/trunk/src/plugin/build.xml

2006-01-09 Thread jerome
Author: jerome
Date: Mon Jan  9 13:45:25 2006
New Revision: 367405

URL: http://svn.apache.org/viewcvs?rev=367405&view=rev
Log:
Remove deployment of analysis plugins (under dev)
Remove protocol-http unit tests (moved to lib-http)

Modified:
lucene/nutch/trunk/src/plugin/build.xml

Modified: lucene/nutch/trunk/src/plugin/build.xml
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/plugin/build.xml?rev=367405&r1=367404&r2=367405&view=diff
==
--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Mon Jan  9 13:45:25 2006
@@ -6,8 +6,6 @@
   <!-- Build & deploy all the plugin jars.-->
   <!-- == -->
   <target name="deploy">
- <ant dir="analysis-de" target="deploy"/>
- <ant dir="analysis-fr" target="deploy"/>
  <ant dir="clustering-carrot2" target="deploy"/>
  <ant dir="creativecommons" target="deploy"/>
  <ant dir="index-basic" target="deploy"/>
@@ -47,8 +45,8 @@
   <target name="test">
  <ant dir="creativecommons" target="test"/>
  <ant dir="languageidentifier" target="test"/>
+ <ant dir="lib-http" target="test"/>
  <ant dir="ontology" target="test"/>
- <ant dir="protocol-http" target="test"/>
  <ant dir="parse-ext" target="test"/>
  <ant dir="parse-html" target="test"/>
  <!-- <ant dir="parse-mp3" target="test"/> -->
@@ -71,6 +69,7 @@
 <ant dir="index-basic" target="clean"/>
 <ant dir="index-more" target="clean"/>
 <ant dir="languageidentifier" target="clean"/>
+<ant dir="lib-http" target="clean"/>
 <ant dir="lib-jakarta-poi" target="clean"/>
 <ant dir="lib-lucene-analyzers" target="clean"/>
 <ant dir="nutch-extensionpoints" target="clean"/>




svn commit: r367406 - in /lucene/nutch/trunk/src: java/org/apache/nutch/ipc/RPC.java test/org/apache/nutch/ipc/TestRPC.java

2006-01-09 Thread cutting
Author: cutting
Date: Mon Jan  9 13:50:48 2006
New Revision: 367406

URL: http://svn.apache.org/viewcvs?rev=367406&view=rev
Log:
Fix parallel RPC calls to work correctly with methods that return void.

Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java
lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java?rev=367406&r1=367405&r2=367406&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java Mon Jan  9 13:50:48 2006
@@ -149,6 +149,10 @@
 
 Writable[] wrappedValues = CLIENT.call(invocations, addrs);
 
+if (method.getReturnType() == Void.TYPE) {
+  return null;
+}
+
 Object[] values =
   (Object[])Array.newInstance(method.getReturnType(),wrappedValues.length);
 for (int i = 0; i < values.length; i++)

Modified: lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java?rev=367406&r1=367405&r2=367406&view=diff
==
--- lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java (original)
+++ lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java Mon Jan  9 13:50:48 2006
@@ -110,13 +110,17 @@
 }
 assertTrue(caught);
 
-// try a multi-call
-Method method =
+// try some multi-calls
+Method echo =
   TestProtocol.class.getMethod("echo", new Class[] { String.class });
-String[] values = (String[])RPC.call(method, new String[][]{{"a"},{"b"}},
+String[] strings = (String[])RPC.call(echo, new String[][]{{"a"},{"b"}},
  new InetSocketAddress[] {addr, addr});
-assertTrue(Arrays.equals(values, new String[]{"a","b"}));
+assertTrue(Arrays.equals(strings, new String[]{"a","b"}));
 
+Method ping = TestProtocol.class.getMethod("ping", new Class[] {});
+Object[] voids = (Object[])RPC.call(ping, new Object[][]{{},{}},
+new InetSocketAddress[] {addr, addr});
+assertEquals(voids, null);
 
 server.stop();
   }
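
The guard added to RPC.java matters because java.lang.reflect.Array.newInstance rejects Void.TYPE as a component type with an IllegalArgumentException, so a parallel call on a void method has no result array to build. The sketch below isolates that step. VoidReturnSketch, its nested TestProtocol interface, and the collect and find helpers are hypothetical names for illustration; none of this is the Nutch RPC class itself.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Method;

class VoidReturnSketch {
  /** Hypothetical protocol with one value-returning and one void method. */
  interface TestProtocol {
    String echo(String value);
    void ping();
  }

  /** Looks up a protocol method, converting the checked exception for brevity. */
  static Method find(String name, Class<?>... parameterTypes) {
    try {
      return TestProtocol.class.getMethod(name, parameterTypes);
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /**
   * Packs per-server results into an array typed after the method's return
   * type. Void methods have nothing to pack, so the result is null.
   * Like the diff above, this casts to Object[] and therefore assumes a
   * non-primitive return type.
   */
  static Object[] collect(Method method, Object[] rawValues) {
    if (method.getReturnType() == Void.TYPE) {
      return null; // Array.newInstance(Void.TYPE, n) would throw
    }
    Object[] values =
        (Object[]) Array.newInstance(method.getReturnType(), rawValues.length);
    for (int i = 0; i < values.length; i++) {
      values[i] = rawValues[i];
    }
    return values;
  }
}
```

Callers of a void multi-call therefore have to tolerate a null result, which is what the new TestRPC assertion checks.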




svn commit: r367408 - /lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java

2006-01-09 Thread cutting
Author: cutting
Date: Mon Jan  9 13:55:31 2006
New Revision: 367408

URL: http://svn.apache.org/viewcvs?rev=367408&view=rev
Log:
NUTCH-160: Switch RegexURLFilter to use Java regex's rather than oro, since 
Java's seem to be faster & more reliable.  By Rod Taylor.

Modified:

lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java

Modified: 
lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java?rev=367408&r1=367407&r2=367408&view=diff
==
--- lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (original)
+++ lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java Mon Jan  9 13:55:31 2006
@@ -32,12 +32,7 @@
 import java.util.ArrayList;
 import java.util.Iterator;
 import java.util.logging.Logger;
-
-import org.apache.oro.text.regex.Perl5Compiler;
-import org.apache.oro.text.regex.Perl5Matcher;
-import org.apache.oro.text.regex.Perl5Pattern;
-import org.apache.oro.text.regex.PatternMatcher;
-import org.apache.oro.text.regex.MalformedPatternException;
+import java.util.regex.*;
 
 /**
  * Filters URLs based on a file of regular expressions. The file is named by
@@ -80,15 +75,14 @@
   }
 
   private static class Rule {
-public Perl5Pattern pattern;
+public Pattern pattern;
 public boolean sign;
 public String regex;
   }
 
   private List rules;
-  private PatternMatcher matcher = new Perl5Matcher();
 
-  public RegexURLFilter() throws IOException, MalformedPatternException {
+  public RegexURLFilter() throws IOException, PatternSyntaxException {
 String file = NutchConf.get().get("urlfilter.regex.file");
 // attribute file takes precedence if defined
 if (attributeFile != null)
@@ -103,7 +97,7 @@
   }
 
   public RegexURLFilter(String filename)
-throws IOException, MalformedPatternException {
+throws IOException, PatternSyntaxException {
 rules = readConfigurationFile(new FileReader(filename));
   }
 
@@ -111,7 +105,9 @@
 Iterator i=rules.iterator();
 while(i.hasNext()) {
   Rule r=(Rule) i.next();
-  if (matcher.contains(url,r.pattern)) {
+  Matcher matcher = r.pattern.matcher(url);
+
+  if (matcher.find()) {
 //System.out.println("Matched " + r.regex);
 return r.sign ? url : null;
   }
@@ -129,10 +125,9 @@
   // 
 
   private static List readConfigurationFile(Reader reader)
-throws IOException, MalformedPatternException {
+throws IOException, PatternSyntaxException {
 
 BufferedReader in=new BufferedReader(reader);
-Perl5Compiler compiler=new Perl5Compiler();
 List rules=new ArrayList();
 String line;

@@ -157,7 +152,7 @@
   String regex=line.substring(1);
 
   Rule rule=new Rule();
-  rule.pattern=(Perl5Pattern) compiler.compile(regex);
+  rule.pattern=Pattern.compile(regex);
   rule.sign=sign;
   rule.regex=regex;
   rules.add(rule);
@@ -167,7 +162,7 @@
   }
 
   public static void main(String args[])
-throws IOException, MalformedPatternException {
+throws IOException, PatternSyntaxException {
 
 RegexURLFilter filter=new RegexURLFilter();
 BufferedReader in=new BufferedReader(new InputStreamReader(System.in));
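
The core of the change is that java.util.regex needs no shared matcher object: each Pattern produces its own Matcher, and Matcher.find() gives the same "matches anywhere in the URL" behaviour the old contains() call had. The sketch below is a simplified, self-contained version of that logic, not the plugin itself. The class name RegexFilterSketch is hypothetical, and both the '#' comment handling and the choice to reject URLs that match no rule are assumptions, since those parts of the file are not shown in the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexFilterSketch {
  /** One rule: a sign ('+' accepts, '-' rejects) and a compiled pattern. */
  private static final class Rule {
    final boolean sign;
    final Pattern pattern;

    Rule(boolean sign, String regex) {
      this.sign = sign;
      // Throws the unchecked PatternSyntaxException on a malformed regex.
      this.pattern = Pattern.compile(regex);
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  /** Each line is '+regex' or '-regex'; blank and '#' lines are skipped (assumed). */
  RegexFilterSketch(List<String> lines) {
    for (String line : lines) {
      if (line.isEmpty() || line.charAt(0) == '#') continue;
      char first = line.charAt(0);
      if (first != '+' && first != '-') {
        throw new IllegalArgumentException("rule must start with + or -: " + line);
      }
      rules.add(new Rule(first == '+', line.substring(1)));
    }
  }

  /** Returns the URL if the first matching rule accepts it, otherwise null. */
  String filter(String url) {
    for (Rule r : rules) {
      if (r.pattern.matcher(url).find()) { // find(): match anywhere in the URL
        return r.sign ? url : null;
      }
    }
    return null; // assumption: no matching rule means the URL is rejected
  }
}
```

PatternSyntaxException extends IllegalArgumentException and is unchecked, so listing it in a throws clause, as the diff does, is legal but not required by the compiler.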




[Nutch Wiki] Update of bin/nutch datanode by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_datanode

The comment on the change is:
fixed classpath to org.apache

--
- datanode is an alias for net.nutch.ndfs.NDFS
+ datanode is an alias for org.apache.nutch.ndfs.NDFS
  
  The NDFS class holds the NDFS client and server.
  
@@ -8, +8 @@

  
  This info is stored on disk (the !NameNode is responsible for asking other 
machines to replicate the data). The !DataNode reports the table's contents to 
the NameNode upon startup and every so often afterwards.
  
- Usage: bin/nutch net.nutch.ndfs.NDFS dataDir localMachine namenode:port
+ Usage: bin/nutch org.apache.nutch.ndfs.NDFS dataDir localMachine 
namenode:port
  
  CommandLineOptions
  


[Nutch Wiki] Update of bin/nutch fetch by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_fetch

--
- fetch is an alias for net.nutch.fetcher.Fetcher
+ fetch is an alias for org.apache.nutch.fetcher.Fetcher
  
  The fetcher. Most of the work is done by plugins.
  
- Usage: bin/nutch net.nutch.fetcher.Fetcher (-local | -ndfs namenode:port) 
[-logLevel level] [-noParsing] [-showThreadID] [-threads n] dir
+ Usage: bin/nutch org.apache.nutch.fetcher.Fetcher (-local | -ndfs 
namenode:port) [-logLevel level] [-noParsing] [-showThreadID] [-threads n] 
dir
  
  CommandLineOptions
  


[Nutch Wiki] Update of bin/nutch inject by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_inject

The comment on the change is:
fixed classpath to org.apache

--
- inject is an alias for net.nutch.db.!WebDBInjector
+ inject is an alias for org.apache.nutch.db.!WebDBInjector
  
  This class takes a flat file of URLs and adds them as entries into a web page 
 link db. Useful for bootstrapping the system.
  
- Usage: bin/nutch net.nutch.db.!WebDBInjector (-local | -ndfs namenode:port) 
db_dir (-urlfile url_file | -dmozfile dmoz_file) [-subset 
subsetDenominator] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] 
[-topicFile topic list file] [-topic topic [-topic topic [...]]]
+ Usage: bin/nutch org.apache.nutch.db.!WebDBInjector (-local | -ndfs 
namenode:port) db_dir (-urlfile url_file | -dmozfile dmoz_file) 
[-subset subsetDenominator] [-includeAdultMaterial] [-skew skew] 
[-noDmozDesc] [-topicFile topic list file] [-topic topic [-topic topic 
[...]]]
  
  CommandLineOptions
  


[Nutch Wiki] Update of bin/nutch merge by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_merge

The comment on the change is:
fixed classpath to org.apache

--
- merge is an alias for net.nutch.indexer.!IndexMerger
+ merge is an alias for org.apache.nutch.indexer.!IndexMerger
  
  IndexMerger creates an index for the output corresponding to a single fetcher 
run.
  
- Usage: bin/nutch net.nutch.indexer.!IndexMerger (-local | -ndfs 
nameserver:port) [-workingdir workingdir] outputIndex segments...
+ Usage: bin/nutch org.apache.nutch.indexer.!IndexMerger (-local | -ndfs 
nameserver:port) [-workingdir workingdir] outputIndex segments...
  
  CommandLineOptions
  


[Nutch Wiki] Update of bin/nutch namenode by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_namenode

The comment on the change is:
fixed classpath to org.apache

--
- namenode is an alias for net.nutch.ndfs.NDFS
+ namenode is an alias for org.apache.nutch.ndfs.NDFS
  
  The NDFS class holds the NDFS client and server.
  
@@ -8, +8 @@

  
  This info is stored on disk (the !NameNode is responsible for asking other 
machines to replicate the data). The !DataNode reports the table's contents to 
the NameNode upon startup and every so often afterwards.
  
- Usage: bin/nutch net.nutch.ndfs.NDFS port namespace_dir
+ Usage: bin/nutch org.apache.nutch.ndfs.NDFS port namespace_dir
  
  CommandLineOptions
  


[Nutch Wiki] Update of bin/nutch prune by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_prune

The comment on the change is:
fixed classpath to org.apache

--
- prune is an alias for net.nutch.tools.!PruneIndexTool
+ prune is an alias for org.apache.nutch.tools.!PruneIndexTool
  
  This tool prunes existing Nutch indexes of unwanted content. The main method 
accepts a list of segment directories (containing indexes). These indexes will 
be pruned of any content that matches one or more query from a list of Lucene 
queries read from a file (defined in standard config file, or explicitly 
overridden from command-line). Segments should already be indexed, if some of 
them are missing indexes then these segments will be skipped.
  
- NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge 
of available Lucene document fields is required. This can be obtained by 
reading sources of index-basic and index-more plugins, or using tools like 
Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been 
made to minimize side effects of Analyzer on the final set of query terms. You 
can use link net.nutch.searcher.Query.main(String[]) method to translate 
queries in Nutch syntax to queries in Lucene syntax.
+ NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge 
of available Lucene document fields is required. This can be obtained by 
reading sources of index-basic and index-more plugins, or using tools like 
Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been 
made to minimize side effects of Analyzer on the final set of query terms. You 
can use link org.apache.nutch.searcher.Query.main(String[]) method to translate 
queries in Nutch syntax to queries in Lucene syntax.
  If additional level of control is required, an instance of !PruneChecker can 
be provided to check each document before it's deleted. The results of all 
checkers are logically AND-ed, which means that any checker in the chain can 
veto the deletion of the current document. Two example checker implementations 
are provided - !PrintFieldsChecker prints the values of selected index fields, 
!StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them 
can be activated by providing respective command-line options.
  
  
- Typical Useage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun 
-queries queries.txt -showfields url,title[[BR}}
+ Typical Useage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir 
-dryrun -queries queries.txt -showfields url,title[[BR}}
  This command will just print out fields of matching documents.
  
- Typical Useage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -queries 
queries.txt[[BR]]
+ Typical Useage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir 
-queries queries.txt[[BR]]
  This command will actually remove all matching entries, according to the 
queries read from queries.txt file.
  
  NOTE 2: This tool removes matching documents ONLY from segment indexes (or 
from a merged index). In particular it does NOT remove the pages and links from 
WebDB. This means that unwanted URLs may pop up again when new segments are 
created. To prevent this, use your own link net.nutch.net.URLFilter, or 
PruneDBTool (under construction...).


[Nutch Wiki] Update of bin/nutch server by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_server

The comment on the change is:
fixed classpath to org.apache

--
- server is an alias for net.nutch.searcher.!DistributedSearch
+ server is an alias for org.apache.nutch.searcher.!DistributedSearch
  
  Implements the search API over IPC connnections.
  
- Usage: bin/nutch net.nutch.searcher.!DistributedSearch port index dir
+ Usage: bin/nutch org.apache.nutch.searcher.!DistributedSearch port index 
dir
  
  CommandLineOptions
  


[Nutch Wiki] Update of bin/nutch updatedb by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_updatedb

The comment on the change is:
fixed classpath to org.apache

--
- updatedb is an alias for net.nutch.tools.!UpdateDatabaseTool
+ updatedb is an alias for org.apache.nutch.tools.!UpdateDatabaseTool
  
  This class takes the output of the fetcher and updates the page and link DBs 
accordingly. Eventually, as the database scales, this will broken into several 
phases, each consuming and emitting batch files, but, for now, we're doing it 
all here.
  
- Usage: bin/nutch net.nutch.tools.!UpdateDatabaseTool (-local | -ndfs 
namenode:port) [-max N] [-noAdditions] db seg_dir [ seg_dir ... ]
+ Usage: bin/nutch org.apache.nutch.tools.!UpdateDatabaseTool (-local | -ndfs 
namenode:port) [-max N] [-noAdditions] db seg_dir [ seg_dir ... ]
  
  CommandLineOptions
  


[Nutch Wiki] Update of bin/nutch segslice by JerryRussell

2006-01-09 Thread Apache Wiki

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_segslice

The comment on the change is:
fixed classpath to org.apache

--
- segslice is an alias for net.nutch.segment.!SegmentSlicer
+ segslice is an alias for org.apache.nutch.segment.!SegmentSlicer
  
  This class reads data from one or more input segments, and outputs it to one 
or more output segments, optionally deleting the input segments when it's 
finished.
  
@@ -12, +12 @@

  
  NOTE 3: if one or more input segments are in non-parsed format, the output 
segments will also use non-parsed format. This means that any parseData and 
parseText data from input segments will NOT be copied to the output segments.
  
- Usage: bin/nutch net.nutch.segment.!SegmentSlicer (-local | -ndfs 
namenode:port) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] 
[-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
+ Usage: bin/nutch org.apache.nutch.segment.!SegmentSlicer (-local | -ndfs 
namenode:port) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] 
[-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
  NOTE: at least one segment dir name is required, or '-dir' option.
  outputDir is always required.[[BR]]
  -o outputDir[[BR]]