svn commit: r367251 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
Author: ab
Date: Mon Jan 9 00:58:58 2006
New Revision: 367251

URL: http://svn.apache.org/viewcvs?rev=367251&view=rev
Log:
Replace the custom metadata serialization with the one provided by the ContentProperties class. This fixes the breakage if multiple property values per key are in use.

Modified:
    lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java?rev=367251&r1=367250&r2=367251&view=diff
==============================================================================
--- lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java Mon Jan 9 00:58:58 2006
@@ -31,7 +31,7 @@
 public final class ParseData extends VersionedWritable {
   public static final String DIR_NAME = "parse_data";
 
-  private final static byte VERSION = 2;
+  private final static byte VERSION = 3;
 
   private String title;
   private Outlink[] outlinks;
@@ -96,10 +96,15 @@
       Outlink.skip(in);
     }
 
-    int propertyCount = in.readInt(); // read metadata
-    metadata = new ContentProperties();
-    for (int i = 0; i < propertyCount; i++) {
-      metadata.put(UTF8.readString(in), UTF8.readString(in));
+    if (version < 3) {
+      int propertyCount = in.readInt(); // read metadata
+      metadata = new ContentProperties();
+      for (int i = 0; i < propertyCount; i++) {
+        metadata.put(UTF8.readString(in), UTF8.readString(in));
+      }
+    } else {
+      metadata = new ContentProperties();
+      metadata.readFields(in);
     }
   }
 
@@ -113,14 +118,7 @@
     for (int i = 0; i < outlinks.length; i++) {
       outlinks[i].write(out);
     }
-
-    out.writeInt(metadata.size()); // write metadata
-    Iterator i = metadata.entrySet().iterator();
-    while (i.hasNext()) {
-      Map.Entry e = (Map.Entry)i.next();
-      UTF8.writeString(out, (String)e.getKey());
-      UTF8.writeString(out, (String)e.getValue());
-    }
+    metadata.write(out);
   }
 
   public static ParseData read(DataInput in) throws IOException {
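The version check in this commit follows a common pattern: bump the format version when the on-disk layout changes, always write the newest layout, and keep a branch that can still read the old one. The sketch below illustrates that pattern only. `VersionedMetadata` and its byte layout are hypothetical stand-ins invented for this example; it uses plain `java.io` streams, not the Nutch `ParseData`, `ContentProperties` or `UTF8` classes.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type, used only to illustrate version-gated reading.
class VersionedMetadata {
  static final byte VERSION = 3;

  // Each key may carry several values (the case the old layout could not express).
  final Map<String, List<String>> props = new LinkedHashMap<>();

  void add(String key, String value) {
    props.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }

  // Always writes the current (version 3) layout: key, value count, values.
  void write(DataOutput out) throws IOException {
    out.writeByte(VERSION);
    out.writeInt(props.size());
    for (Map.Entry<String, List<String>> e : props.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeInt(e.getValue().size());
      for (String v : e.getValue()) {
        out.writeUTF(v);
      }
    }
  }

  // Reads either layout, choosing the branch from the stored version byte.
  static VersionedMetadata read(DataInput in) throws IOException {
    byte version = in.readByte();
    if (version > VERSION) {
      throw new IOException("Unsupported version: " + version);
    }
    VersionedMetadata m = new VersionedMetadata();
    int count = in.readInt();
    for (int i = 0; i < count; i++) {
      String key = in.readUTF();
      if (version < 3) {
        m.add(key, in.readUTF());          // legacy layout: exactly one value per key
      } else {
        int n = in.readInt();              // current layout: n values per key
        for (int j = 0; j < n; j++) {
          m.add(key, in.readUTF());
        }
      }
    }
    return m;
  }
}
```

Keeping the legacy branch in the reader means data written before the version bump stays readable, while everything written afterwards uses the new layout.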
[Nutch Wiki] Update of FAQ by MichaelStack
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The following page has been changed by MichaelStack:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  Anchor text makes a large contribution to document score (You can see the anchor text for a page by browsing to explain then editing the URL to put in place anchors.jsp in place of explain.jsp).
  
  What is the RSS symbol in search results all about?
  
- Clicking on the RSS symbol sends the current query back to Nutch to a servlet named OpenSearchServlet that redoes the search returning the results instead formatted as RSS (XML). The RSS format is based on [http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from [http://www.a9.com a9.com] (Also see [href=http://opensearch.a9.com/ OpenSearch]). Nutch extensions add to the OpenSearch RSS the original query, navigation information, and any extra fields that available in the search result such as Nutch boost, segment name, etc.
+ Clicking on the RSS symbol sends the current query back to Nutch to a servlet named OpenSearchServlet. OpenSearchServlet reruns the query and returns the results formatted instead as RSS (XML). The RSS format is based on [http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from [http://www.a9.com a9.com]: OpenSearch RSS 1.0 is an extension to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as outlined by the RSS 2.0 specification (See also [http://opensearch.a9.com/ OpenSearch]). Nutch in turn makes extension to OpenSearch. The Nutch extensions are identified by the 'nutch' namespace prefix and add to OpenSearch navigation information, the original query, and all fields that are available at search result time including the Nutch page boost, the name of the segment the page resides in, etc.
  
- Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients will query against OpenSearchServlet rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development in nutch going by mailing list posts.
+ Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients will query against OpenSearchServlet rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development going forward.
  
  === Crawling ===
[Nutch Wiki] Update of FAQ by MichaelStack
The following page has been changed by MichaelStack:
http://wiki.apache.org/nutch/FAQ

The comment on the change is:
Edit. Add mention of RSS 2.0.

------------------------------------------------------------------------------
  Anchor text makes a large contribution to document score (You can see the anchor text for a page by browsing to explain then editing the URL to put in place anchors.jsp in place of explain.jsp).
  
  What is the RSS symbol in search results all about?
  
- Clicking on the RSS symbol sends the current query back to Nutch to a servlet named OpenSearchServlet. OpenSearchServlet reruns the query and returns the results formatted instead as RSS (XML). The RSS format is based on [http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from [http://www.a9.com a9.com]: OpenSearch RSS 1.0 is an extension to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as outlined by the RSS 2.0 specification (See also [http://opensearch.a9.com/ OpenSearch]). Nutch in turn makes extension to OpenSearch. The Nutch extensions are identified by the 'nutch' namespace prefix and add to OpenSearch navigation information, the original query, and all fields that are available at search result time including the Nutch page boost, the name of the segment the page resides in, etc.
+ Clicking on the RSS symbol sends the current query back to Nutch to a servlet named [http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html OpenSearchServlet]. [http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html OpenSearchServlet] reruns the query and returns the results formatted instead as RSS (XML). The RSS format is based on [http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch RSS 1.0] from [http://www.a9.com a9.com]: [http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch] RSS 1.0 is an extension to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as outlined by the RSS 2.0 specification (See also [http://opensearch.a9.com/ opensearch]). Nutch in turn makes extension to [http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch]. The Nutch extensions are identified by the 'nutch' namespace prefix and add to [http://a9.com/-/spec/opensearchrss/1.0/ OpenSearch] navigation information, the original query, and all fields that are available at search result time including the Nutch page boost, the name of the segment the page resides in, etc.
  
- Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients will query against OpenSearchServlet rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development going forward.
+ Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients will query against [http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html OpenSearchServlet] rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development going forward.
  
  === Crawling ===
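As a rough illustration of why RSS output is easier for programmatic clients than HTML, the sketch below parses an RSS document with the JDK's DOM parser and pulls out item titles plus the values of a namespaced extension element. Everything specific here is an assumption made for the example: the class name `OpenSearchRssReader`, the element name `segment`, and the idea that the response has already been fetched into a string. The FAQ text does not give the servlet URL, the namespace URI or the exact extension element names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical client-side helper; not part of Nutch.
class OpenSearchRssReader {

  // Parses an RSS document that is already in memory as a string.
  static Document parse(String rssXml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // needed to see prefixed extension elements
    return factory.newDocumentBuilder()
        .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
  }

  // Returns the <title> of every <item>, in document order.
  static List<String> itemTitles(Document doc) {
    List<String> titles = new ArrayList<>();
    NodeList items = doc.getElementsByTagName("item");
    for (int i = 0; i < items.getLength(); i++) {
      NodeList t = ((Element) items.item(i)).getElementsByTagName("title");
      if (t.getLength() > 0) {
        titles.add(t.item(0).getTextContent());
      }
    }
    return titles;
  }

  // Returns the text of every element with the given local name, in any
  // namespace ("*" is the DOM wildcard), e.g. a prefixed extension element.
  static List<String> extensionValues(Document doc, String localName) {
    List<String> values = new ArrayList<>();
    NodeList nodes = doc.getElementsByTagNameNS("*", localName);
    for (int i = 0; i < nodes.getLength(); i++) {
      values.add(nodes.item(i).getTextContent());
    }
    return values;
  }
}
```

A client would fetch the servlet response for its query, pass the body to `parse`, and read the fields it needs without any HTML scraping.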
[Nutch Wiki] Update of bin/nutch segread by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_segread

The comment on the change is:
fix classpath from apache.org migration

------------------------------------------------------------------------------
- segread is an alias for net.nutch.segment.SegmentReader
+ segread is an alias for org.apache.nutch.segment.SegmentReader
  
  This class holds together all data readers for an existing segment. Some convenience methods are also provided, to read from the segment and to reposition the current pointer.
[Nutch Wiki] Update of bin/nutch segread by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_segread

The comment on the change is:
fixed usage

------------------------------------------------------------------------------
  This class holds together all data readers for an existing segment. Some convenience methods are also provided, to read from the segment and to reposition the current pointer.
  
- Usage: SegmentReader [-fix] [-dump] [-dumpsort] [-list] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
+ Usage: bin/nutch segread [-fix] [-dump] [-dumpsort] [-list] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
  NOTE: at least one segment dir name is required, or '-dir' option.[[BR]]
  -fix[[BR]]
  automatically fix corrupted segments[[BR]]
[Nutch Wiki] Update of bin/nutch admin by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_admin

The comment on the change is:
Fixed classpath change

------------------------------------------------------------------------------
- admin is an alias for net.nutch.tools.WebDB!AdminTool
+ admin is an alias for org.apache.nutch.tools.WebDB!AdminTool
  
  The WebDB!AdminTool is for Nutch administrators who need special access to the webdb. It allows for finer editing of the stored values.
  
- Usage: bin/nutch net.nutch.tools.WebDB!AdminTool (-local | -ndfs namenode:port) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
+ Usage: bin/nutch org.apache.nutch.tools.WebDB!AdminTool (-local | -ndfs namenode:port) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
  
  CommandLineOptions
svn commit: r367405 - /lucene/nutch/trunk/src/plugin/build.xml
Author: jerome
Date: Mon Jan 9 13:45:25 2006
New Revision: 367405

URL: http://svn.apache.org/viewcvs?rev=367405&view=rev
Log:
Remove deployment of analysis plugins (under dev)
Remove protocol-http unit tests (moved to lib-http)

Modified:
    lucene/nutch/trunk/src/plugin/build.xml

Modified: lucene/nutch/trunk/src/plugin/build.xml
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/plugin/build.xml?rev=367405&r1=367404&r2=367405&view=diff
==============================================================================
--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Mon Jan 9 13:45:25 2006
@@ -6,8 +6,6 @@
   <!-- Build & deploy all the plugin jars. -->
   <!-- ====================================================== -->
   <target name="deploy">
-    <ant dir="analysis-de" target="deploy"/>
-    <ant dir="analysis-fr" target="deploy"/>
     <ant dir="clustering-carrot2" target="deploy"/>
     <ant dir="creativecommons" target="deploy"/>
     <ant dir="index-basic" target="deploy"/>
@@ -47,8 +45,8 @@
   <target name="test">
     <ant dir="creativecommons" target="test"/>
     <ant dir="languageidentifier" target="test"/>
+    <ant dir="lib-http" target="test"/>
     <ant dir="ontology" target="test"/>
-    <ant dir="protocol-http" target="test"/>
     <ant dir="parse-ext" target="test"/>
     <ant dir="parse-html" target="test"/>
     <!-- <ant dir="parse-mp3" target="test"/> -->
@@ -71,6 +69,7 @@
     <ant dir="index-basic" target="clean"/>
     <ant dir="index-more" target="clean"/>
     <ant dir="languageidentifier" target="clean"/>
+    <ant dir="lib-http" target="clean"/>
     <ant dir="lib-jakarta-poi" target="clean"/>
     <ant dir="lib-lucene-analyzers" target="clean"/>
     <ant dir="nutch-extensionpoints" target="clean"/>
svn commit: r367406 - in /lucene/nutch/trunk/src: java/org/apache/nutch/ipc/RPC.java test/org/apache/nutch/ipc/TestRPC.java
Author: cutting
Date: Mon Jan 9 13:50:48 2006
New Revision: 367406

URL: http://svn.apache.org/viewcvs?rev=367406&view=rev
Log:
Fix parallel RPC calls to work correctly with methods that return void.

Modified:
    lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java
    lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java?rev=367406&r1=367405&r2=367406&view=diff
==============================================================================
--- lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/ipc/RPC.java Mon Jan 9 13:50:48 2006
@@ -149,6 +149,10 @@
 
     Writable[] wrappedValues = CLIENT.call(invocations, addrs);
 
+    if (method.getReturnType() == Void.TYPE) {
+      return null;
+    }
+
     Object[] values =
       (Object[])Array.newInstance(method.getReturnType(),wrappedValues.length);
     for (int i = 0; i < values.length; i++)

Modified: lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java?rev=367406&r1=367405&r2=367406&view=diff
==============================================================================
--- lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java (original)
+++ lucene/nutch/trunk/src/test/org/apache/nutch/ipc/TestRPC.java Mon Jan 9 13:50:48 2006
@@ -110,13 +110,17 @@
     }
     assertTrue(caught);
 
-    // try a multi-call
-    Method method =
+    // try some multi-calls
+    Method echo =
       TestProtocol.class.getMethod("echo", new Class[] { String.class });
-    String[] values = (String[])RPC.call(method, new String[][]{{"a"},{"b"}},
+    String[] strings = (String[])RPC.call(echo, new String[][]{{"a"},{"b"}},
                                          new InetSocketAddress[] {addr, addr});
-    assertTrue(Arrays.equals(values, new String[]{"a","b"}));
+    assertTrue(Arrays.equals(strings, new String[]{"a","b"}));
 
+    Method ping = TestProtocol.class.getMethod("ping", new Class[] {});
+    Object[] voids = (Object[])RPC.call(ping, new Object[][]{{},{}},
+                                        new InetSocketAddress[] {addr, addr});
+    assertEquals(voids, null);
 
     server.stop();
   }
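The reason for the early return is that `java.lang.reflect.Array.newInstance` rejects `Void.TYPE` as a component type with an `IllegalArgumentException`, so a typed result array cannot be built for a void method. The sketch below reproduces the guard with local reflection only. `EchoService` and `MultiCall` are hypothetical stand-ins for this example and perform no networking; they are not the Nutch `RPC` or `TestProtocol` classes.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Method;

// Hypothetical service with one value-returning and one void method.
class EchoService {
  public String echo(String s) { return s; }
  public void ping() { }
}

// Hypothetical local stand-in for a multi-call: invokes the same method once
// per parameter list and collects the results into a typed array.
class MultiCall {
  static Object call(Method method, Object target, Object[][] params) throws Exception {
    Object[] raw = new Object[params.length];
    for (int i = 0; i < params.length; i++) {
      raw[i] = method.invoke(target, params[i]);
    }

    // A void method has no values to collect, and Array.newInstance would
    // throw IllegalArgumentException for Void.TYPE, so return early.
    if (method.getReturnType() == Void.TYPE) {
      return null;
    }

    // Build an array whose component type matches the method's return type.
    Object values = Array.newInstance(method.getReturnType(), raw.length);
    for (int i = 0; i < raw.length; i++) {
      Array.set(values, i, raw[i]);
    }
    return values;
  }
}
```

With this guard, a caller can cast the result of an `echo` multi-call to `String[]`, while a `ping` multi-call simply yields `null`, which is what the updated test asserts.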
svn commit: r367408 - /lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java
Author: cutting
Date: Mon Jan 9 13:55:31 2006
New Revision: 367408

URL: http://svn.apache.org/viewcvs?rev=367408&view=rev
Log:
NUTCH-160: Switch RegexURLFilter to use Java regex's rather than oro, since Java's seem to be faster & more reliable. By Rod Taylor.

Modified:
    lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java

Modified: lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java?rev=367408&r1=367407&r2=367408&view=diff
==============================================================================
--- lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (original)
+++ lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java Mon Jan 9 13:55:31 2006
@@ -32,12 +32,7 @@
 import java.util.ArrayList;
 import java.util.Iterator;
 import java.util.logging.Logger;
-
-import org.apache.oro.text.regex.Perl5Compiler;
-import org.apache.oro.text.regex.Perl5Matcher;
-import org.apache.oro.text.regex.Perl5Pattern;
-import org.apache.oro.text.regex.PatternMatcher;
-import org.apache.oro.text.regex.MalformedPatternException;
+import java.util.regex.*;
 
 /**
  * Filters URLs based on a file of regular expressions. The file is named by
@@ -80,15 +75,14 @@
   }
 
   private static class Rule {
-    public Perl5Pattern pattern;
+    public Pattern pattern;
     public boolean sign;
     public String regex;
   }
 
   private List rules;
-  private PatternMatcher matcher = new Perl5Matcher();
 
-  public RegexURLFilter() throws IOException, MalformedPatternException {
+  public RegexURLFilter() throws IOException, PatternSyntaxException {
     String file = NutchConf.get().get("urlfilter.regex.file");
     // attribute file takes precedence if defined
     if (attributeFile != null)
@@ -103,7 +97,7 @@
   }
 
   public RegexURLFilter(String filename)
-    throws IOException, MalformedPatternException {
+    throws IOException, PatternSyntaxException {
     rules = readConfigurationFile(new FileReader(filename));
   }
 
@@ -111,7 +105,9 @@
     Iterator i=rules.iterator();
     while(i.hasNext()) {
       Rule r=(Rule) i.next();
-      if (matcher.contains(url,r.pattern)) {
+      Matcher matcher = r.pattern.matcher(url);
+
+      if (matcher.find()) {
         //System.out.println("Matched " + r.regex);
         return r.sign ? url : null;
       }
@@ -129,10 +125,9 @@
   //
 
   private static List readConfigurationFile(Reader reader)
-    throws IOException, MalformedPatternException {
+    throws IOException, PatternSyntaxException {
 
     BufferedReader in=new BufferedReader(reader);
-    Perl5Compiler compiler=new Perl5Compiler();
     List rules=new ArrayList();
     String line;
 
@@ -157,7 +152,7 @@
       String regex=line.substring(1);
 
       Rule rule=new Rule();
-      rule.pattern=(Perl5Pattern) compiler.compile(regex);
+      rule.pattern=Pattern.compile(regex);
       rule.sign=sign;
       rule.regex=regex;
       rules.add(rule);
@@ -167,7 +162,7 @@
   }
 
   public static void main(String args[])
-    throws IOException, MalformedPatternException {
+    throws IOException, PatternSyntaxException {
 
     RegexURLFilter filter=new RegexURLFilter();
     BufferedReader in=new BufferedReader(new InputStreamReader(System.in));
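The filtering logic visible in this diff can be sketched on its own with `java.util.regex`: each rule pairs a compiled pattern with a sign, the first rule whose pattern is found anywhere in the URL decides the outcome, and the URL is returned for an accepting rule or `null` for a rejecting one. `SignedRegexFilter` below is a hypothetical, simplified class written for this example, not the Nutch plugin. The `+`/`-` line prefix, the `#` comment handling and the reject-on-no-match default are assumptions, since the diff only shows `sign` and `line.substring(1)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified filter illustrating first-match-wins signed rules.
class SignedRegexFilter {

  private static final class Rule {
    final boolean sign;     // true = accept, false = reject
    final Pattern pattern;

    Rule(boolean sign, Pattern pattern) {
      this.sign = sign;
      this.pattern = pattern;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  // Assumed rule syntax: '+regex' accepts, '-regex' rejects, '#' starts a comment.
  SignedRegexFilter(List<String> lines) {
    for (String line : lines) {
      if (line.isEmpty() || line.charAt(0) == '#') {
        continue;
      }
      char first = line.charAt(0);
      if (first != '+' && first != '-') {
        throw new IllegalArgumentException("Rule must start with + or -: " + line);
      }
      // Pattern.compile throws the unchecked PatternSyntaxException on bad input.
      rules.add(new Rule(first == '+', Pattern.compile(line.substring(1))));
    }
  }

  // Returns the URL if the first matching rule accepts it, otherwise null.
  String filter(String url) {
    for (Rule r : rules) {
      // find() matches anywhere in the input, like the contains() call it replaced.
      if (r.pattern.matcher(url).find()) {
        return r.sign ? url : null;
      }
    }
    return null; // assumption: a URL that matches no rule is rejected
  }
}
```

Note that `PatternSyntaxException` is a `RuntimeException`, so the `throws` clauses in the commit document the failure mode rather than force callers to handle it, unlike the checked `MalformedPatternException` from oro.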
[Nutch Wiki] Update of bin/nutch datanode by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_datanode

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- datanode is an alias for net.nutch.ndfs.NDFS
+ datanode is an alias for org.apache.nutch.ndfs.NDFS
  
  The NDFS class holds the NDFS client and server.
  
@@ -8, +8 @@
  This info is stored on disk (the !NameNode is responsible for asking other machines to replicate the data). The !DataNode reports the table's contents to the NameNode upon startup and every so often afterwards.
  
- Usage: bin/nutch net.nutch.ndfs.NDFS dataDir localMachine namenode:port
+ Usage: bin/nutch org.apache.nutch.ndfs.NDFS dataDir localMachine namenode:port
  
  CommandLineOptions
[Nutch Wiki] Update of bin/nutch fetch by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_fetch

------------------------------------------------------------------------------
- fetch is an alias for net.nutch.fetcher.Fetcher
+ fetch is an alias for org.apache.nutch.fetcher.Fetcher
  
  The fetcher. Most of the work is done by plugins.
  
- Usage: bin/nutch net.nutch.fetcher.Fetcher (-local | -ndfs namenode:port) [-logLevel level] [-noParsing] [-showThreadID] [-threads n] dir
+ Usage: bin/nutch org.apache.nutch.fetcher.Fetcher (-local | -ndfs namenode:port) [-logLevel level] [-noParsing] [-showThreadID] [-threads n] dir
  
  CommandLineOptions
[Nutch Wiki] Update of bin/nutch inject by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_inject

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- inject is an alias for net.nutch.db.!WebDBInjector
+ inject is an alias for org.apache.nutch.db.!WebDBInjector
  
  This class takes a flat file of URLs and adds them as entries into a web page link db. Useful for bootstrapping the system.
  
- Usage: bin/nutch net.nutch.db.!WebDBInjector (-local | -ndfs namenode:port) db_dir (-urlfile url_file | -dmozfile dmoz_file) [-subset subsetDenominator] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile topic list file] [-topic topic [-topic topic [...]]]
+ Usage: bin/nutch org.apache.nutch.db.!WebDBInjector (-local | -ndfs namenode:port) db_dir (-urlfile url_file | -dmozfile dmoz_file) [-subset subsetDenominator] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile topic list file] [-topic topic [-topic topic [...]]]
  
  CommandLineOptions
[Nutch Wiki] Update of bin/nutch merge by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_merge

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- merge is an alias for net.nutch.indexer.!IndexMerger
+ merge is an alias for org.apache.nutch.indexer.!IndexMerger
  
  IndexMerger creates an index for the output corresponding to a single fetcher run.
  
- Usage: bin/nutch net.nutch.indexer.!IndexMerger (-local | -ndfs nameserver:port) [-workingdir workingdir] outputIndex segments...
+ Usage: bin/nutch org.apache.nutch.indexer.!IndexMerger (-local | -ndfs nameserver:port) [-workingdir workingdir] outputIndex segments...
  
  CommandLineOptions
[Nutch Wiki] Update of bin/nutch namenode by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_namenode

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- namenode is an alias for net.nutch.ndfs.NDFS
+ namenode is an alias for org.apache.nutch.ndfs.NDFS
  
  The NDFS class holds the NDFS client and server.
  
@@ -8, +8 @@
  This info is stored on disk (the !NameNode is responsible for asking other machines to replicate the data). The !DataNode reports the table's contents to the NameNode upon startup and every so often afterwards.
  
- Usage: bin/nutch net.nutch.ndfs.NDFS port namespace_dir
+ Usage: bin/nutch org.apache.nutch.ndfs.NDFS port namespace_dir
  
  CommandLineOptions
[Nutch Wiki] Update of bin/nutch prune by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_prune

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- prune is an alias for net.nutch.tools.!PruneIndexTool
+ prune is an alias for org.apache.nutch.tools.!PruneIndexTool
  
  This tool prunes existing Nutch indexes of unwanted content. The main method accepts a list of segment directories (containing indexes). These indexes will be pruned of any content that matches one or more query from a list of Lucene queries read from a file (defined in standard config file, or explicitly overridden from command-line). Segments should already be indexed, if some of them are missing indexes then these segments will be skipped.
  
- NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
+ NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link org.apache.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
  
  If additional level of control is required, an instance of !PruneChecker can be provided to check each document before it's deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided - !PrintFieldsChecker prints the values of selected index fields, !StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them can be activated by providing respective command-line options.
  
- Typical Useage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title[[BR}}
+ Typical Useage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title[[BR}}
  This command will just print out fields of matching documents.
  
- Typical Useage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -queries queries.txt[[BR]]
+ Typical Useage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir -queries queries.txt[[BR]]
  This command will actually remove all matching entries, according to the queries read from queries.txt file.
  
  NOTE 2: This tool removes matching documents ONLY from segment indexes (or from a merged index). In particular it does NOT remove the pages and links from WebDB. This means that unwanted URLs may pop up again when new segments are created. To prevent this, use your own link net.nutch.net.URLFilter, or PruneDBTool (under construction...).
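The "logically AND-ed" checker chain described on this page can be pictured as a list of predicates where a single `false` vetoes the deletion. The sketch below is only an illustration of that rule. `PruneVeto`, the use of `Predicate`, and the representation of a document as a field map are all assumptions made for this example; the real !PruneChecker interface is not shown in the notification.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration of an AND-ed veto chain; not the Nutch PruneChecker API.
class PruneVeto {

  // A document (modelled here as field name -> value) is deleted only if
  // every checker approves; the first checker returning false vetoes it.
  static boolean shouldDelete(Map<String, String> doc,
                              List<Predicate<Map<String, String>>> checkers) {
    for (Predicate<Map<String, String>> checker : checkers) {
      if (!checker.test(doc)) {
        return false; // veto: stop checking, keep the document
      }
    }
    return true; // no checker objected (an empty chain approves everything)
  }
}
```

Under this reading, a checker that only prints or logs fields would always return `true` so that it never blocks a deletion, while a protective checker returns `false` for documents that must be kept.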
[Nutch Wiki] Update of bin/nutch server by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_server

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- server is an alias for net.nutch.searcher.!DistributedSearch
+ server is an alias for org.apache.nutch.searcher.!DistributedSearch
  
  Implements the search API over IPC connnections.
  
- Usage: bin/nutch net.nutch.searcher.!DistributedSearch port index dir
+ Usage: bin/nutch org.apache.nutch.searcher.!DistributedSearch port index dir
  
  CommandLineOptions
[Nutch Wiki] Update of bin/nutch updatedb by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_updatedb

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- updatedb is an alias for net.nutch.tools.!UpdateDatabaseTool
+ updatedb is an alias for org.apache.nutch.tools.!UpdateDatabaseTool
  
  This class takes the output of the fetcher and updates the page and link DBs accordingly. Eventually, as the database scales, this will broken into several phases, each consuming and emitting batch files, but, for now, we're doing it all here.
  
- Usage: bin/nutch net.nutch.tools.!UpdateDatabaseTool (-local | -ndfs namenode:port) [-max N] [-noAdditions] db seg_dir [ seg_dir ... ]
+ Usage: bin/nutch org.apache.nutch.tools.!UpdateDatabaseTool (-local | -ndfs namenode:port) [-max N] [-noAdditions] db seg_dir [ seg_dir ... ]
  
  CommandLineOptions
[Nutch Wiki] Update of bin/nutch segslice by JerryRussell
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_segslice

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- segslice is an alias for net.nutch.segment.!SegmentSlicer
+ segslice is an alias for org.apache.nutch.segment.!SegmentSlicer
  
  This class reads data from one or more input segments, and outputs it to one or more output segments, optionally deleting the input segments when it's finished.
  
@@ -12, +12 @@
  NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.
  
- Usage: bin/nutch net.nutch.segment.!SegmentSlicer (-local | -ndfs namenode:port) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
+ Usage: bin/nutch org.apache.nutch.segment.!SegmentSlicer (-local | -ndfs namenode:port) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
  NOTE: at least one segment dir name is required, or '-dir' option. outputDir is always required.[[BR]]
  -o outputDir[[BR]]