Hi Lewis,

the javadoc obviously belongs to the first method, generate(Path, Path, int, long, long). This method also uses the two properties generate.filter and generate.normalise. But this method is only referenced by Crawl#run and Benchmark.

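To illustrate what "uses the two properties" means for the short overload, here is a minimal sketch of resolving the two flags from configuration with a default of true (the default you describe in your mail). It is not the Nutch code: java.util.Properties stands in for Hadoop's Configuration so that it runs standalone, and the class name GenerateFlags and the helper getBoolean are hypothetical.

```java
import java.util.Properties;

public class GenerateFlags {
    // Property names as quoted in this thread.
    static final String GENERATOR_FILTER = "generate.filter";
    static final String GENERATOR_NORMALISE = "generate.normalise";

    /** Reads a boolean property, falling back to the default when it is unset. */
    static boolean getBoolean(Properties conf, String name, boolean defaultValue) {
        String value = conf.getProperty(name);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        // Stand-in for the job configuration (assumption: only the filter is overridden).
        Properties conf = new Properties();
        conf.setProperty(GENERATOR_FILTER, "false");

        // The short overload would resolve both flags from configuration
        // and then delegate to the long overload with explicit booleans.
        boolean filter = getBoolean(conf, GENERATOR_FILTER, true);
        boolean norm = getBoolean(conf, GENERATOR_NORMALISE, true);
        System.out.println("filter=" + filter + " norm=" + norm);
    }
}
```

Generator#run, by contrast, would set the same two booleans from -noFilter/-noNorm and never consult the properties.
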
The third method (with the javadoc) is used by Generator#run, which does not use the two properties but requires switching the filters off via command-line args (-noFilter/-noNorm). A solution would be to treat the two properties as deprecated (as well as the Crawl class, see NUTCH-1087).

Btw, I've run a couple of regexes on the Java code to extract the properties used and compared them to those "defined" in nutch-default.xml. The rough counts:

  150 defined and used
   80 used but not defined (among them many used "temporarily" to pass options from tool to job)
    5 defined but never used

Ideally, all properties that are really used should be explained in nutch-default.xml. Most users won't study the sources to find out whether there are properties which could be useful. I would volunteer to prepare a table of all those properties as a basis for further discussion. In the wiki?

Sebastian

On 07/28/2012 07:32 PM, Lewis John Mcgibbney wrote:
> Hi,
>
> Looking at the three Generator#generate methods I see they all accept
> varying parameters with the final one accepting
>
> public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
> long curTime, boolean filter, boolean norm, boolean force, int
> maxNumSegments)
> throws IOException {
> ...
>
> Now this all looks OK so far, though what is concerning me is that the
> Javadoc mentions "Whether to filter URLs or not
> is read from the crawl.generate.filter property in the configuration
> files."... which does not exist. Also the Javadoc is pretty dated
> w.r.t defining the correct parameters accepted by the method e.g.
>
> * @param dbDir
> *   Crawl database directory
> * @param segments
> *   Segments directory
> * @param numLists
> *   Number of reduce tasks
> * @param topN
> *   Number of top URLs to be selected
> * @param curTime
> *   Current time in milliseconds
> *
> * @return Path to generated segment or null if no entries were selected
> *
> * @throws IOException
> *   When an I/O error occurs
>
> What I know already is that filtering in the Generator is set by
> default to true as is normalizing and this can be overridden via CLI
> but it seems (I currently cannot however confirm) that the option to
> define this within nutch-default.xml never existed! Changing the
> Javadoc is one thing but I'm just not sure exactly what the problem
> seems to be here. Can someone also have a look and help me to
> straighten this out as I think it's important.
>
> Thanks
>
> Lewis
>
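
PS: The comparison of used versus defined properties mentioned in my reply could be sketched roughly as below. This is not the exact set of regexes I ran. The class name PropertyAudit, both patterns, and the sample inputs are illustrative assumptions, and a pattern like this only catches string literals passed directly to get/set calls, so names referenced through constants would be missed.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PropertyAudit {
    // Hypothetical pattern: a string literal as first argument of a .getXxx(/.setXxx( call.
    static final Pattern USED =
        Pattern.compile("\\.(?:get|set)\\w*\\(\\s*\"([^\"]+)\"");
    // Hypothetical pattern: the content of a <name> element in nutch-default.xml.
    static final Pattern DEFINED =
        Pattern.compile("<name>\\s*([^<\\s]+)\\s*</name>");

    /** Collects group 1 of every match into a sorted set. */
    static Set<String> extract(Pattern pattern, String text) {
        Set<String> names = new TreeSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Illustrative inputs only, not actual Nutch source or configuration.
        String java = "conf.getBoolean(\"generate.filter\", true);\n"
                    + "job.setLong(\"example.temp.option\", 1L);";
        String xml = "<name>generate.filter</name>\n<name>example.unused</name>";

        Set<String> used = extract(USED, java);
        Set<String> defined = extract(DEFINED, xml);

        Set<String> definedAndUsed = new TreeSet<>(defined);
        definedAndUsed.retainAll(used);
        Set<String> usedNotDefined = new TreeSet<>(used);
        usedNotDefined.removeAll(defined);
        Set<String> definedNotUsed = new TreeSet<>(defined);
        definedNotUsed.removeAll(used);

        System.out.println("defined and used:       " + definedAndUsed);
        System.out.println("used but not defined:   " + usedNotDefined);
        System.out.println("defined but never used: " + definedNotUsed);
    }
}
```

The three sets correspond to the three counts given above; a table for the wiki could be generated from them directly.
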