Hi Lewis,

The javadoc obviously belongs to the first method,
 generate(Path, Path, int, long, long)
This method also uses the two properties generate.filter
and generate.normalise. But this method is only referenced
by Crawl#run and Benchmark.

The third method (with the javadoc) is used
by Generator#run, which does not use the two properties
but requires switching the filters off via command-line args
(-noFilter/-noNorm).
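
To illustrate the difference between the two entry points, here is a
minimal sketch. It is not the actual Nutch code: java.util.Properties
stands in for the Hadoop Configuration, String stands in for Path, the
class name GeneratorSketch is made up, and the values passed for force and
maxNumSegments are placeholders. That the short overload delegates to the
long one is also an assumption.

```java
import java.util.Properties;

class GeneratorSketch {
  static final String GENERATOR_FILTER = "generate.filter";
  static final String GENERATOR_NORMALISE = "generate.normalise";

  private final Properties conf; // stand-in for the Hadoop Configuration

  GeneratorSketch(Properties conf) {
    this.conf = conf;
  }

  // Short overload: filter/normalise come from the configuration, defaulting to true.
  String generate(String dbDir, String segments, int numLists, long topN, long curTime) {
    boolean filter = Boolean.parseBoolean(conf.getProperty(GENERATOR_FILTER, "true"));
    boolean norm = Boolean.parseBoolean(conf.getProperty(GENERATOR_NORMALISE, "true"));
    // force=false and maxNumSegments=1 are placeholder values for this sketch.
    return generate(dbDir, segments, numLists, topN, curTime, filter, norm, false, 1);
  }

  // Long overload: the caller decides; the configuration is not consulted.
  String generate(String dbDir, String segments, int numLists, long topN, long curTime,
      boolean filter, boolean norm, boolean force, int maxNumSegments) {
    // A real implementation would run the generate job; here we only report the flags.
    return "filter=" + filter + " norm=" + norm;
  }

  // Command-line path: both flags default to true and can only be switched off by args.
  String run(String[] args) {
    boolean filter = true;
    boolean norm = true;
    for (String arg : args) {
      if ("-noFilter".equals(arg)) {
        filter = false;
      } else if ("-noNorm".equals(arg)) {
        norm = false;
      }
    }
    return generate("crawldb", "segments", 1, Long.MAX_VALUE,
        System.currentTimeMillis(), filter, norm, false, 1);
  }
}
```

In this sketch, setting generate.filter=false in the configuration changes
the result of the five-argument generate but has no effect on run, which
only reacts to -noFilter/-noNorm.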

A solution would be to treat the two properties as deprecated
(as well as the Crawl class, see NUTCH-1087).

Btw, I've run a couple of regexes on the Java code to extract
the properties used and compared them to those "defined" in nutch-default.xml.
The rough counts:
 150  defined and used
  80  used but not defined
      (among them many used "temporarily" to pass options from tool to job)
   5  defined but never used
Ideally, all properties that are actually used should be explained in
nutch-default.xml. Most users won't study the sources to find out whether
there are properties which could be useful.
I would volunteer to prepare a table of all those properties as a basis for
further discussion. In the wiki?
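
For reference, the comparison can be sketched roughly as follows. This is
not the exact set of regexes I ran, just a minimal illustration: the class
name PropertyAudit, both patterns, and the sample property names marked
"hypothetical" are made up. The "used" pattern is a heuristic; it misses
keys referenced via constants and also matches unrelated get/set calls with
a string literal, so the counts it gives can only be rough.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyAudit {
  // Heuristic: any .getXxx("key" ...) or .setXxx("key" ...) call with a literal key.
  static final Pattern USED =
      Pattern.compile("\\.(?:get|set)\\w*\\(\\s*\"([^\"]+)\"");
  // <name>key</name> elements as found in a Hadoop-style configuration file.
  static final Pattern DEFINED =
      Pattern.compile("<name>\\s*([^<\\s]+)\\s*</name>");

  // Collects group 1 of every match into a sorted set.
  static Set<String> extract(Pattern pattern, String text) {
    Set<String> keys = new TreeSet<>();
    Matcher m = pattern.matcher(text);
    while (m.find()) {
      keys.add(m.group(1));
    }
    return keys;
  }

  // Elements present in both sets.
  static Set<String> intersection(Set<String> a, Set<String> b) {
    Set<String> result = new TreeSet<>(a);
    result.retainAll(b);
    return result;
  }

  // Elements of a that are not in b.
  static Set<String> difference(Set<String> a, Set<String> b) {
    Set<String> result = new TreeSet<>(a);
    result.removeAll(b);
    return result;
  }

  public static void main(String[] args) {
    // Tiny inline samples; a real run would read the source tree and nutch-default.xml.
    String javaSource =
        "boolean filter = conf.getBoolean(\"generate.filter\", true);\n"
        + "job.setLong(\"hypothetical.temp.option\", 1L);\n";
    String xml =
        "<property><name>generate.filter</name><value>true</value></property>\n"
        + "<property><name>hypothetical.unused</name><value>1</value></property>\n";

    Set<String> used = extract(USED, javaSource);
    Set<String> defined = extract(DEFINED, xml);

    System.out.println("defined and used:       " + intersection(used, defined));
    System.out.println("used but not defined:   " + difference(used, defined));
    System.out.println("defined but never used: " + difference(defined, used));
  }
}
```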

Sebastian

On 07/28/2012 07:32 PM, Lewis John Mcgibbney wrote:
> Hi,
> 
> Looking at the three Generator#generate methods I see they all accept
> varying parameters with the final one accepting
> 
> public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
>       long curTime, boolean filter, boolean norm, boolean force,
>       int maxNumSegments)
>       throws IOException {
> ...
> 
> Now this all looks OK so far, though what is concerning me is that the
> Javadoc mentions "Whether to filter URLs or not
> is read from the crawl.generate.filter property in the configuration
> files."... which does not exist. Also the Javadoc is pretty dated
> w.r.t defining the correct parameters accepted by the method e.g.
> 
>    * @param dbDir
>    *          Crawl database directory
>    * @param segments
>    *          Segments directory
>    * @param numLists
>    *          Number of reduce tasks
>    * @param topN
>    *          Number of top URLs to be selected
>    * @param curTime
>    *          Current time in milliseconds
>    *
>    * @return Path to generated segment or null if no entries were selected
>    *
>    * @throws IOException
>    *           When an I/O error occurs
> 
> What I know already is that filtering in the Generator is set by
> default to true as is normalizing and this can be overridden via CLI
> but it seems (I currently cannot however confirm) that the option to
> define this within nutch-default.xml never existed! Changing the
> Javadoc is one thing but I'm just not sure exactly what the problem
> seems to be here. Can someone also have a look and help me to
> straighten this out as I think it's important.
> 
> Thanks
> 
> Lewis
> 
