[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-05 Thread Shawn Gervais (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373256 ] 

Shawn Gervais commented on NUTCH-240:
-

This change seems to have caused an error to be thrown:

060405 034711 Generator: Partitioning selected urls by host, for politeness.
Exception in thread "main" java.lang.RuntimeException: class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper
        at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:262)
        at org.apache.hadoop.mapred.JobConf.setMapperClass(JobConf.java:249)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:263)
        at org.apache.nutch.crawl.Generator.main(Generator.java:317)

Just FYI.
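
For readers unfamiliar with this kind of message: "class X not Y" is what an assignability check would report when the class passed as the mapper does not implement the expected interface. The following is a minimal sketch of such a check, not Hadoop's actual source; Mapper, GoodMapper, NotAMapper and checkedSetClass are hypothetical names made up for illustration.

```java
// Illustrative only: a stand-in for the kind of assignability check that
// yields "class X not Y". None of these types are Hadoop's own.
class SetClassSketch {
    interface Mapper {}                            // hypothetical interface
    static class GoodMapper implements Mapper {}   // implements it
    static class NotAMapper {}                     // does not

    /** Returns the class if it implements the interface, otherwise throws. */
    static Class<?> checkedSetClass(Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
        return theClass;
    }

    public static void main(String[] args) {
        checkedSetClass(GoodMapper.class, Mapper.class);   // accepted
        try {
            checkedSetClass(NotAMapper.class, Mapper.class);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());            // rejected
        }
    }
}
```

Under that reading, the likely cause is a class that stopped implementing (or never implemented) the interface after a late change, which matches a check failing at job setup rather than at compile time.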

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt

 This patch refactors all places where Nutch manipulates page scores, into a 
 plugin-based API. Using this API it's possible to implement different scoring 
 algorithms. It is also much easier to understand how scoring works.
 Multiple scoring plugins can be run in sequence, in a manner similar to 
 URLFilters.
 Included is also an OPICScoringFilter plugin, which contains the current 
 implementation of the scoring algorithm. Together with the scoring API it 
 provides a fully backward-compatible scoring.
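
To illustrate what "run in sequence" can mean for scoring plugins, here is a minimal sketch in which each filter receives the score produced by the previous one. It is not the API from the attached patch: ScoringFilter, adjust and runChain are hypothetical names, and the real extension point may expose different methods.

```java
import java.util.Arrays;
import java.util.List;

class ScoringChainSketch {
    /** Hypothetical filter contract: take a score, return an adjusted one. */
    interface ScoringFilter {
        float adjust(String url, float score);
    }

    /** Runs every filter in order, feeding each result to the next. */
    static float runChain(List<ScoringFilter> filters, String url, float initial) {
        float score = initial;
        for (ScoringFilter f : filters) {
            score = f.adjust(url, score);
        }
        return score;
    }

    public static void main(String[] args) {
        // Two made-up filters, only to show the chaining.
        ScoringFilter halve = (url, s) -> s / 2f;
        ScoringFilter boostRoot = (url, s) -> url.endsWith("/") ? s + 1f : s;
        List<ScoringFilter> chain = Arrays.asList(halve, boostRoot);
        System.out.println(runChain(chain, "http://example.org/", 1.0f));
    }
}
```

The order of the list matters in this sketch, since each filter sees the output of its predecessor.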

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-05 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373264 ] 

Andrzej Bialecki  commented on NUTCH-240:
-

Oops, sorry, that was a last moment change ... I fixed it now, thanks for 
spotting this.




[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite

2006-04-05 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373393 ] 

Jerome Charron commented on NUTCH-244:
--

While taking a quick look at this, something astonished me in the code.
The db.max.outlinks.per.page property is exclusively used in ParseData.
In ParseData, the number of outlinks is limited in the readFields 
method ... 
Shouldn't the limit be applied directly in the ParseData constructor?

 Inconsistent handling of property values boundaries / unable to set 
 db.max.outlinks.per.page to infinite
 

  Key: NUTCH-244
  URL: http://issues.apache.org/jira/browse/NUTCH-244
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
 Reporter: AJ Banck


 Some properties like file.content.limit support using negative numbers (-1) 
 to 'disable' a limitation.
 Other properties do not support this. 
 I tried disabling the limit set by db.max.outlinks.per.page, but this isn't 
 possible.
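
The convention being asked for, where a negative value switches a limit off, can be sketched as follows. This is an illustration of the idea only, not the code in ParseData; effectiveCount is a hypothetical helper name.

```java
class LimitSketch {
    /**
     * Number of items to keep when 'available' items exist and the
     * configured limit is 'limit'. Any negative limit means "no limit".
     */
    static int effectiveCount(int available, int limit) {
        if (limit < 0) {
            return available;
        }
        return Math.min(available, limit);
    }

    public static void main(String[] args) {
        System.out.println(effectiveCount(250, 100));  // capped by the limit
        System.out.println(effectiveCount(250, -1));   // limit disabled
    }
}
```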




[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite

2006-04-05 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373398 ] 

Jerome Charron commented on NUTCH-244:
--

That perfectly makes sense!
Thanks Andrzej.




Re: Patch to fix Redirects

2006-04-05 Thread Andrzej Bialecki

Dennis Kubes wrote:

Attached is a patch to fix redirects.  In the current version of 0.8-dev the
redirect functionality wasn't working because it was using the original key
value (original url) to get the output instead of the refresh url.

This is the first patch that I have submitted so if this needs to be
submitted differently please let me know.
  


Fixed. Thank you!

Please note that this does NOT fix the content-level redirects (i.e. 
meta-refresh), they are still broken.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Search quality evaluation

2006-04-05 Thread Doug Cutting
FYI, Mike wrote some evaluation stuff for Nutch a long time ago.  I 
found it in the Sourceforge Attic:


http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/Attic/

This worked by querying a set of search engines, those in:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/engines/

The results of each engine are scored by how much they differ from all of 
the other engines combined.  The Kendall Tau distance is used to compare 
rankings.  Thus this is a good tool to find out how close Nutch is to 
the quality of other engines, but it may not be a good tool to make 
Nutch better than other search engines.
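
For reference, the Kendall Tau distance between two rankings of the same items is the number of item pairs that the two rankings put in opposite order. Below is a minimal sketch, assuming both lists contain exactly the same items with no duplicates; it is not the code from the Attic, and real result lists from different engines rarely overlap completely, so the original tool presumably handled that case in some way not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KendallTauSketch {
    /** Counts the pairs that rankings a and b order differently. */
    static int distance(List<String> a, List<String> b) {
        Map<String, Integer> posInB = new HashMap<>();
        for (int i = 0; i < b.size(); i++) {
            posInB.put(b.get(i), i);
        }
        // Assumption of this sketch: both rankings hold the same items.
        if (a.size() != b.size() || !posInB.keySet().containsAll(a)) {
            throw new IllegalArgumentException("rankings must contain the same items");
        }
        int discordant = 0;
        for (int i = 0; i < a.size(); i++) {
            for (int j = i + 1; j < a.size(); j++) {
                // a ranks a[i] before a[j]; count the pair if b disagrees.
                if (posInB.get(a.get(i)) > posInB.get(a.get(j))) {
                    discordant++;
                }
            }
        }
        return discordant;
    }

    public static void main(String[] args) {
        List<String> engineA = Arrays.asList("x", "y", "z");
        List<String> engineB = Arrays.asList("x", "z", "y");
        System.out.println(distance(engineA, engineB));
    }
}
```

A distance of 0 means identical rankings; the maximum, n(n-1)/2 for n items, means one ranking is the reverse of the other.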


In any case, it includes a system to scrape search results from other 
engines, based on Apple's Sherlock search-engine descriptors.  These 
descriptors are also used by Mozilla:


http://mycroft.mozdev.org/deepdocs/quickstart.html

So there's a ready supply of up-to-date descriptions for most major 
search engines.  Many engines provide a skin specifically to simplify 
parsing by these plugins.


The code that implemented Sherlock plugins in Nutch is at:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/dynamic/

Doug

Andrzej Bialecki wrote:

Hi,

I found this paper, more or less by accident:

Scaling IR-System Evaluation using Term Relevance Sets; Einat Amitay, 
David Carmel, Ronny Lempel, Aya Soffer


   http://einat.webir.org/SIGIR_2004_Trels_p10-amitay.pdf

It gives an interesting and rather simple framework for evaluating the 
quality of search results.


Anybody interested in hacking together a component for Nutch and e.g. 
for Google, to run this evaluation? ;)




Re: Search quality evaluation

2006-04-05 Thread Dawid Weiss



In any case, it includes a system to scrape search results from other 
engines, based on Apple's Sherlock search-engine descriptors.  These 
descriptors are also used by Mozilla:


Just a note: we used to have exactly the same mechanism in Carrot2. 
Unfortunately this format does not make a clear distinction between 
title/ url/ snippet parts and stays at snippet granularity, so we 
additionally parsed each snippet with regular expressions...  The 
underlying problem is the terms of use, which forbid automatic 
scraping of search results using these plugins... That's the main reason 
why we switched to public APIs, actually.


D.


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss


One can presumably disable such minor warnings in Eclipse.  Arguably the 
bug is that Eclipse warns about such things by default, rather than in a 
'pedantic' mode.


I agree -- some of them are really annoying. Plus, Eclipse has been 
having notorious problems showing warnings for unused parameters in 
overridden methods... But I still think some of the warnings can be 
valuable and your idea with PMD is a good one.


One caution: we have run into problems where includes were removed 
because a tool said they were unused, but they were required for the 
Javadoc.  So code-analysis tools are not infallible!


Eclipse deals with these properly -- I use it all the time. I believe it 
also shows warnings for classes referenced in JavaDocs and not imported.


I would not be opposed to integrating PMD or something similar into 
Nutch's build.xml.  What do others think?  Any volunteers?


I'll do it. I meant to see PMD anyway so it'll be a good exercise.

D.


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Jérôme Charron
 PMD looks like a useful such tool:
 http://pmd.sourceforge.net/ant-task.html
 I would not be opposed to integrating PMD or something similar into
 Nutch's build.xml.  What do others think?  Any volunteers?

+1 (Very configurable, very good tool!)


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss


I'm a fan of automated testing and code analysis utilities, but I must 
say they only make sense if people actually use them and look at their 
results. So it's not really just about integration -- it's about looking 
at the results of these tools. PMD is neat because it can simply 
interrupt your build process so you'll have to either fix the warning or 
explicitly mark it as ignored. With code coverage... I don't know. It's 
up to you guys -- you spend much more time on Nutch code than I do and 
you know best what is needed and what isn't.


Let me know about PMD. I'll create the patch tomorrow if there's a 
consensus on if and how we should use it. For those impatient, the patch 
is in the attachment. Place the required PMD JARs in lib/pmd-ext/ and 
run 'ant pmd'.


D.

Jérôme Charron wrote:

I would not be opposed to integrating PMD or something similar into
Nutch's build.xml.  What do others think?  Any volunteers?

I'll do it. I meant to see PMD anyway so it'll be a good exercise.


Dawid, what about integrating a Code Coverage Tool like EMMA
(http://emma.sourceforge.net/) while integrating PMD?

Jérôme

Index: build.xml
===================================================================
--- build.xml   (revision 391739)
+++ build.xml   (working copy)
@@ -198,6 +198,34 @@
   </target>
 
   <!-- ================================================================== -->
+  <!-- Run code checks (PMD)                                              -->
+  <!-- ================================================================== -->
+  <target name="pmd">
+    <property name="pmd.report" location="${build.dir}/pmd-report.html" />
+    <taskdef name="pmd" classname="net.sourceforge.pmd.ant.PMDTask">
+      <classpath>
+        <fileset dir="${lib.dir}" includes="pmd-ext/*.jar" />
+      </classpath>
+    </taskdef>
+    <pmd shortFilenames="true" failonerror="true" failOnRuleViolation="false"
+         encoding="${build.encoding}" failuresPropertyName="pmd.failures">
+      <ruleset>unusedcode</ruleset>
+      <formatter type="html" toFile="${pmd.report}" />
+      <!-- <formatter type="xml" toFile="${tempbuild}/$report_pmd.xml"/> -->
+      <fileset dir="${src.dir}">
+        <include name="**/*.java"/>
+        <!-- Exclude generated sources -->
+        <exclude name="**/NutchAnalysis.java" />
+        <exclude name="**/NutchAnalysisTokenManager.java" />
+      </fileset>
+    </pmd>
+    <condition property="pmd.stop" value="true">
+      <equals arg1="0" arg2="${pmd.failures}" trim="true" />
+    </condition>
+    <fail unless="pmd.stop">FAILURE: PMD shows ${pmd.failures} rule violations. See ${pmd.report} for details.</fail>
+  </target>
+
+  <!-- ================================================================== -->
   <!-- Run unit tests                                                     -->
   <!-- ================================================================== -->
   <target name="test" depends="test-core, test-plugins"/>


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Doug Cutting

Other options (raised on the Hadoop list) are Checkstyle:

http://checkstyle.sourceforge.net/

and FindBugs:

http://findbugs.sourceforge.net/

Although these are both under LGPL and thus harder to include in Apache 
projects.


Anything that generates a lot of false positives is bad: it either 
causes us to skip analysis of lots of files, or ignore the warnings. 
Skipping the JavaCC-generated classes is reasonable, but I'm wary of 
skipping much else.


Sigh.

Doug

Dawid Weiss wrote:


Ok, PMD seems like a good idea. I've added it to the build file. Unused 
code detection shows a few catches (javacc-generated classes need to be 
ignored because they contain a lot of junk), but unfortunately it also 
displays false positives such as in:


MapWritable.java   429   {Avoid unused private fields such as 'fKeyClassId'}


This field is private but is used in an outside class (through a 
synthetic accessor I presume, so a simple syntax tree analysis PMD does 
is insufficient to catch it).
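
The pattern behind this kind of false positive can be reproduced in a few lines. This is a hypothetical reduction, not the MapWritable source: a nested class declares a private field that it never reads itself, while the enclosing class reads it directly, which Java permits. The names OuterSketch, Entry and keyClassId are made up.

```java
class OuterSketch {
    /** Nested class whose private field is never read inside the class itself. */
    static class Entry {
        private int keyClassId = 7;
    }

    /** Legal Java: the enclosing class may read the nested class's private field. */
    static int read(Entry e) {
        return e.keyClassId;
    }

    public static void main(String[] args) {
        System.out.println(read(new Entry()));
    }
}
```

A checker that looks for uses of a private field only within the declaring class body would report keyClassId as unused here, even though the program depends on it. Compilers before Java 11 implement this access with a generated synthetic accessor method.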


These things would need to be marked in the code as ignorable... Do you 
want me to create a JIRA issue for this, Doug? Or should we drop the 
subject? Oh, I forgot to say this: PMD's jars add a minimum of 1MB to 
the codebase (Xerces can be reused).


D.



Patch to remove Nutch formatting from logs

2006-04-05 Thread Christopher Burkey

Hello,

   Here is a patch to change org.apache.nutch.util.LogFormatter to not 
insert itself as the default handler for the system.


   I have been using Nutch for a year and have been waiting for a 
version that I can embed into OpenEdit. The problem has been that Nutch 
inserts itself as the formatter for the Java log system and that 
interferes with OpenEdit logging.
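
The general approach, attaching a handler to a named logger instead of replacing the formatter on the system's default handlers, can be sketched with java.util.logging alone. This is an illustration under assumptions and not the attached patch: SevereFlagHandler, sawSevere and the logger name "sketch.embedded" are hypothetical.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class SevereFlagSketch {
    /** Remembers whether a SEVERE record was seen; formats and prints nothing. */
    static class SevereFlagHandler extends Handler {
        private volatile boolean sawSevere;

        @Override
        public void publish(LogRecord record) {
            if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
                sawSevere = true;
            }
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}

        boolean sawSevere() {
            return sawSevere;
        }
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("sketch.embedded");
        SevereFlagHandler handler = new SevereFlagHandler();
        // Only this logger gains a handler; the root handlers and their
        // formatters, which belong to the host application, are untouched.
        log.addHandler(handler);
        log.info("fine so far");
        log.severe("something broke");
        System.out.println(handler.sawSevere());
    }
}
```

With this arrangement the embedding application keeps control of how records are formatted, while the library can still detect that a severe error occurred.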



--
513-542-3401
[EMAIL PROTECTED]
http://www.openedit.org

diff -Naur ../java/org/apache/nutch/util/LogFormatter.java java/org/apache/nutch/util/LogFormatter.java
--- ../java/org/apache/nutch/util/LogFormatter.java 2006-03-31 13:40:50.0 -0500
+++ java/org/apache/nutch/util/LogFormatter.java    2006-04-05 16:27:59.0 -0400
@@ -16,13 +16,23 @@
 
 package org.apache.nutch.util;
 
-import java.util.logging.*;
-import java.io.*;
-import java.text.*;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.PrintStream;
+import java.io.PrintWriter;
+import java.io.StringWriter;
+import java.text.FieldPosition;
+import java.text.SimpleDateFormat;
 import java.util.Date;
-
-/** Prints just the date and the log message. */
-
+import java.util.logging.Formatter;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+import java.util.logging.Logger;
+
+/** Prints just the date and the log message. 
+ *  This was also used to stop processing as nutch crawls a web site
+ *  [EMAIL PROTECTED] changed this code to use a LogWrapper class to catch severe errors
+ * */
 public class LogFormatter extends Formatter {
   private static final String FORMAT = "yyMMdd HHmmss";
   private static final String NEWLINE = System.getProperty("line.separator");
@@ -35,20 +45,27 @@
   private static boolean showTime = true;
   private static boolean showThreadIDs = false;
 
+  protected static LogFormatter sharedformatter =  new LogFormatter();
+  protected static SevereLogHandler sharedhandler =  new SevereLogHandler(sharedformatter);
+
+  /*
   // install when this class is loaded
   static {
     Handler[] handlers = LogFormatter.getLogger("").getHandlers();
     for (int i = 0; i < handlers.length; i++) {
-      handlers[i].setFormatter(new LogFormatter());
+      handlers[i].setFormatter(sharedformatter);
       handlers[i].setLevel(Level.FINEST);
     }
   }
-
+  */
   /** Gets a logger and, as a side effect, installs this as the default
* formatter. */
   public static Logger getLogger(String name) {
 // just referencing this class installs it
-return Logger.getLogger(name);
+   Logger logr = Logger.getLogger(name);
+   logr.addHandler(sharedhandler);
+   
+   return logr;
   }
   
   /** When true, time is logged with each entry. */
@@ -60,7 +77,10 @@
   public static void setShowThreadIDs(boolean showThreadIDs) {
 LogFormatter.showThreadIDs = showThreadIDs;
   }
-
+  public void setLoggedSevere( boolean inSevere )
+  {
+ loggedSevere = inSevere;
+  }
   /**
* Format the given LogRecord.
* @param record the log record to be formatted.
diff -Naur ../java/org/apache/nutch/util/SevereLogHandler.java java/org/apache/nutch/util/SevereLogHandler.java
--- ../java/org/apache/nutch/util/SevereLogHandler.java 1969-12-31 19:00:00.0 -0500
+++ java/org/apache/nutch/util/SevereLogHandler.java    2006-04-05 16:29:20.0 -0400
@@ -0,0 +1,46 @@
+/*
+ * Created on Apr 5, 2006
+ */
+package org.apache.nutch.util;
+
+import java.util.logging.Handler;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+
+public class SevereLogHandler extends Handler
+{
+   protected LogFormatter fieldNutchFormatter;
+   
+   public SevereLogHandler(LogFormatter inFormatter)
+   {
+   setNutchFormatter(inFormatter);
+   }
+   
+   protected LogFormatter getNutchFormatter()
+   {
+   return fieldNutchFormatter;
+   }
+
+   protected void setNutchFormatter(LogFormatter inNutchFormatter)
+   {
+   fieldNutchFormatter = inNutchFormatter;
+   }
+
+   public void publish(LogRecord inRecord)
+   {
+   if ( inRecord.getLevel().intValue() == Level.SEVERE.intValue())
+   {
+   getNutchFormatter().setLoggedSevere(true);
+   }
+   }
+
+   public void flush()
+   {
+   }
+
+   public void close() throws SecurityException
+   {
+   }
+   
+
+}