svn commit: r264964 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexSegment.java

2005-08-31 Thread jerome
Author: jerome
Date: Wed Aug 31 01:04:52 2005
New Revision: 264964

URL: http://svn.apache.org/viewcvs?rev=264964&view=rev
Log:
No more NullPointerException while logging the doc language if none

Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexSegment.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexSegment.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexSegment.java?rev=264964&r1=264963&r2=264964&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexSegment.java 
(original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexSegment.java Wed 
Aug 31 01:04:52 2005
@@ -145,8 +145,9 @@
 
   // add the document to the index
   NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
-  LOG.info(" Indexing [" + doc.getField("url").stringValue() +
-           "] with analyzer " + analyzer + " (" + doc.getField("lang").stringValue() + ")");
+  LOG.info(" Indexing [" + doc.getField("url").stringValue() + "]" +
+           " with analyzer " + analyzer +
+           " (" + doc.get("lang") + ")");
   //LOG.info(" Doc is " + doc);
   writer.addDocument(doc, analyzer);
   if (count > 0 && count % LOG_STEP == 0) {
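
The change above swaps a chained call on a possibly missing field for a direct value lookup. The following is a minimal sketch of why that avoids the NullPointerException. SketchDocument and SketchField are hypothetical stand-ins written for this illustration, not the Lucene classes; the sketch assumes that both lookups return null when the field is absent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a document with named fields (not the Lucene API).
final class SketchField {
    private final String value;

    SketchField(String value) { this.value = value; }

    String stringValue() { return value; }
}

final class SketchDocument {
    private final Map<String, String> fields = new HashMap<>();

    void add(String name, String value) { fields.put(name, value); }

    // Assumed behaviour: returns null when no field has this name.
    SketchField getField(String name) {
        String value = fields.get(name);
        return value == null ? null : new SketchField(value);
    }

    // Assumed behaviour: returns the field value, or null when absent.
    String get(String name) { return fields.get(name); }
}

final class NullSafeLogSketch {
    // Builds the log message with direct value lookups; a missing field
    // is rendered as the text "null" by string concatenation.
    static String describe(SketchDocument doc) {
        return " Indexing [" + doc.get("url") + "] (" + doc.get("lang") + ")";
    }

    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument();
        doc.add("url", "http://example.org/");

        // No "lang" field was added, yet building the message is safe.
        System.out.println(describe(doc));

        // The chained form dereferences the null returned by getField.
        try {
            doc.getField("lang").stringValue();
        } catch (NullPointerException e) {
            System.out.println("chained call failed: field is missing");
        }
    }
}
```

In this sketch the message for a document without a language ends in "(null)" instead of aborting the indexing loop.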




svn commit: r265020 - /lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java

2005-08-31 Thread jerome
Author: jerome
Date: Wed Aug 31 04:38:28 2005
New Revision: 265020

URL: http://svn.apache.org/viewcvs?rev=265020&view=rev
Log:
Fixes some typo (analySer = analyZer)

Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java

Modified: 
lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java?rev=265020&r1=265019&r2=265020&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java 
(original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java 
Wed Aug 31 04:38:28 2005
@@ -44,7 +44,7 @@
 
   private final static Map CACHE = new HashMap();
 
-  private final static NutchAnalyzer DEFAULT_ANALYSER = 
+  private final static NutchAnalyzer DEFAULT_ANALYZER = 
 new NutchDocumentAnalyzer();
   
   
@@ -60,22 +60,22 @@
 
   
   /**
-   * Returns the appropriate {@link Analyser} implementation given a language
-   * code.
+   * Returns the appropriate {@link NutchAnalyzer analyzer} implementation
+   * given a language code.
    *
-   * <p>NutchAnalyser extensions should define the attribute lang. The first
+   * <p>NutchAnalyzer extensions should define the attribute lang. The first
    * plugin found whose lang attribute equals the specified lang parameter is
    * used. If none match, then the {@link NutchDocumentAnalyzer} is used.
    */
   public static NutchAnalyzer get(String lang) {
 
-NutchAnalyzer analyzer = DEFAULT_ANALYSER;
+NutchAnalyzer analyzer = DEFAULT_ANALYZER;
 Extension extension = getExtension(lang);
 if (extension != null) {
 try {
 analyzer = (NutchAnalyzer) extension.getExtensionInstance();
 } catch (PluginRuntimeException pre) {
-analyzer = DEFAULT_ANALYSER;
+analyzer = DEFAULT_ANALYZER;
 }
 }
 return analyzer;
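
The get(String lang) method in this diff follows a lookup-with-fallback pattern: start from the default, try the extension registered for the language, and return to the default if instantiation fails. A minimal self-contained sketch of that pattern follows. The registry, the register method and the use of plain strings in place of analyzers are assumptions made for this illustration and are not part of the Nutch plugin API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class AnalyzerLookupSketch {
    // Hypothetical registry: language code -> factory that may fail at runtime.
    private static final Map<String, Supplier<String>> EXTENSIONS = new HashMap<>();

    // Plain string standing in for the default analyzer instance.
    private static final String DEFAULT_ANALYZER = "default";

    static void register(String lang, Supplier<String> factory) {
        EXTENSIONS.put(lang, factory);
    }

    // Returns the analyzer for the language, or the default when no
    // extension is registered or the extension cannot be instantiated.
    static String get(String lang) {
        String analyzer = DEFAULT_ANALYZER;
        Supplier<String> extension = EXTENSIONS.get(lang);
        if (extension != null) {
            try {
                analyzer = extension.get();
            } catch (RuntimeException e) {
                analyzer = DEFAULT_ANALYZER;
            }
        }
        return analyzer;
    }

    public static void main(String[] args) {
        register("fr", () -> "french");
        register("xx", () -> { throw new IllegalStateException("cannot instantiate"); });

        System.out.println(get("fr"));   // registered extension
        System.out.println(get("de"));   // nothing registered, falls back
        System.out.println(get("xx"));   // instantiation fails, falls back
        System.out.println(get(null));   // no language known, falls back
    }
}
```

Because a null language code also resolves to the default in this sketch, a caller that has no language for a document still gets a usable analyzer.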




[Nutch Wiki] Update of FrontPage by AndrzejBialecki

2005-08-31 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/FrontPage

--
  ||DissectingTheNutchCrawler by MattKangas
  ||Add, View, or Do tasks from the TaskList
  ||HowToContribute||   ||
+ ||[Committer's Rules]||   ||
  ||[Release HOWTO]||   ||
  ||[Website Update HOWTO]||   ||
  


[Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/Committer's_Rules

New page:
= Commits and Release Engineering =

Committers should follow these guidelines when deciding which branch to 
commit patches to, and when to commit.

== Branches and Release Engineering ==

1. The SVN repository consists of the following areas:

 a. '''trunk''' (equivalent to CVS HEAD), where the current development code 
base is found. This area is not guaranteed to be in a usable state at all 
times: occasional breakage may occur, and some parts of the code base may not 
work properly or at all. This is the area for developers, the bleeding edge, 
and usually not suitable for stable production use. Average users are 
discouraged from using it unless they need functionality available only here 
and are prepared to face some hardships (such as missing documentation, 
setting up a development environment, bugs, etc.).

 a. '''Release-x.x''' branches, where the code from each release is put for 
further maintenance. These areas contain code that is considered stable, 
i.e. at the point of release it was known to be working well, ''within the 
limits of functionality available for that release''. The code here is also 
maintained in a well-working state for a certain period after release, but only 
minor fixes are applied, in order to provide a solid product with the 
functionality of the given release. Normally, no new functionality should be 
added to the maintenance branches. It is unacceptable to introduce changes to 
a release branch that would break compatibility with the earlier code within 
the same branch.

 a. Any other temporary branches (e.g. mapred), which serve as a temporary 
repository for work to be merged with the trunk at a later stage. 
You should not expect anything functional here, unless the developers 
explicitly ask for help in testing and integration.

2. The trunk is the area where active development occurs. New features 
and enhancements are first committed here.
 a. This requirement helps to minimize the risk of losing new features and 
enhancements somewhere on the branches, because as time goes on it becomes more 
and more difficult to forward-port them from the past branches to the trunk.

 a. If some changes are invasive and would result in prolonged periods of 
breakage, they probably need more development time before they are integrated 
with the trunk. If you want other developers to join you in the work, it's a 
good idea to put these changes on a temporary branch to be merged later with 
the trunk.

3. If there are important features or fixes that will benefit the majority of 
users, these can be back-ported to release branches after they have been 
committed to the trunk (if appropriate). The back-porting process should 
involve extensive testing to ensure that the code on the Release branch remains 
stable and production-quality. It is unacceptable to commit code that breaks 
the build process or is known to be unstable. Users will expect the 
Release branches to be stable and of production quality at all times.

== Backward compatibility ==

== Committer's checklist ==
Things to check before commit.


svn commit: r265503 - in /lucene/nutch/trunk/src: java/org/apache/nutch/clustering/ java/org/apache/nutch/fs/ java/org/apache/nutch/mapReduce/ java/org/apache/nutch/parse/ java/org/apache/nutch/protoc

2005-08-31 Thread jerome
Author: jerome
Date: Wed Aug 31 08:17:11 2005
New Revision: 265503

URL: http://svn.apache.org/viewcvs?rev=265503&view=rev
Log:
Merged 0.7 branch changes 240321:240453 into trunk

Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/clustering/OnlineClusterer.java
lucene/nutch/trunk/src/java/org/apache/nutch/fs/NutchFileSystem.java
lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/FileSplit.java
lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/MapOutputFile.java
lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/RecordReader.java
lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/package.html
lucene/nutch/trunk/src/java/org/apache/nutch/parse/Parse.java
lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java
lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ProtocolException.java
lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ResourceGone.java
lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ResourceMoved.java
lucene/nutch/trunk/src/java/org/apache/nutch/protocol/RetryLater.java
lucene/nutch/trunk/src/java/org/apache/nutch/searcher/Hits.java
lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
lucene/nutch/trunk/src/java/org/apache/nutch/util/Daemon.java

lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIdentifier.java

lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java

lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/XMLCharacterRecognizer.java

lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummySSLProtocolSocketFactory.java

Modified: 
lucene/nutch/trunk/src/java/org/apache/nutch/clustering/OnlineClusterer.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/clustering/OnlineClusterer.java?rev=265503&r1=265502&r2=265503&view=diff
==
--- 
lucene/nutch/trunk/src/java/org/apache/nutch/clustering/OnlineClusterer.java 
(original)
+++ 
lucene/nutch/trunk/src/java/org/apache/nutch/clustering/OnlineClusterer.java 
Wed Aug 31 08:17:11 2005
@@ -23,8 +23,8 @@
  * algorithms.
  *
  * <p>By the term <b>online</b> search results clustering we will understand
- * a clusterer that works on a set of {@link Hit}s retrieved for a user's query
- * and produces a set of {@link Clusters} that can be displayed to help
+ * a clusterer that works on a set of {@link HitDetails} retrieved for a user's
+ * query and produces a set of {@link HitsCluster} that can be displayed to help
  * the user gain insight in the topics found in the result.</p>
  *
  * <p>Other clustering options include predefined categories and off-line

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fs/NutchFileSystem.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/fs/NutchFileSystem.java?rev=265503&r1=265502&r2=265503&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/fs/NutchFileSystem.java 
(original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/fs/NutchFileSystem.java Wed 
Aug 31 08:17:11 2005
@@ -80,8 +80,8 @@
   return getNamed(NutchConf.get().get("fs.default.name", "local"));
 }
 
-/** Returns a name for this filesystem, suitable to pass to {@link
- * NutchFileSystem#getNamed(String).*/
+/** Returns a name for this filesystem, suitable to pass to
+ * {@link NutchFileSystem#getNamed(String)}.*/
 public abstract String getName();
   
 /** Returns a named filesystem.  Names are either the string "local" or a

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/FileSplit.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/FileSplit.java?rev=265503&r1=265502&r2=265503&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/FileSplit.java 
(original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/mapReduce/FileSplit.java Wed 
Aug 31 08:17:11 2005
@@ -25,9 +25,12 @@
 import org.apache.nutch.io.UTF8;
 import org.apache.nutch.fs.NutchFileSystem;
 
-/** A section of an input file.  Returned by {@link
- * InputFormat#getSplits(File[], int)} and passed to
- * InputFormat#getRecordReader(FileSplit). */
+/**
+ * A section of an input file.
+ * Returned by {@link InputFormat#getSplits(NutchFileSystem, JobConf, int)}
+ * and passed to
+ * {@link InputFormat#getRecordReader(NutchFileSystem, FileSplit, JobConf)}.
+ */
 public class FileSplit implements Writable {
   private File file;
   private long start;

Modified: