) For the command

    bin/nutch crawl url -dir mydir -depth 2 -threads 4 -topN 50 >&logs/logs1.log

I know what each parameter means. For example:

    -depth 8       the maximum depth of links crawled is 8 (8 levels down from the seed URLs)
    -topN 50000    the maximum number of links/pages that can be crawled at each depth
    -threads 16    run 16 threads simultaneously

But how do I choose a proper value for each parameter? For example, on the Craigslist web site, the usual URL for a particular car looks like this:

    http://losangeles.craigslist.org/sgv/cto/2496560420.html

But on Kbb.com, the usual URL for a particular car looks like this:

    http://www.kbb.com/volkswagen/jetta/2003-volkswagen-jetta/gls-sedan-4d/?vehicleid=348329&intent=buy-used&options=4098815|true|4098881|true&pricetype=private-party&condition=good

How do I determine the parameter values for these two examples?

2) When I check the data in Luke, in the Overview panel, I found that on the left side (the "available fields and term counts per field" table) the value for the anchor field is zero, while the value for the content field is not, and on the right side (the "top ranking terms" table) all the rank values are the same. I would like to know why the information is displayed like this.

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/some-questions-about-the-crawling-with-Nutch-tp3173828p3173828.html
Sent from the Nutch - User mailing list archive at Nabble.com.
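As a rough illustration of the parameter meanings described in question 1, the sketch below works out an upper bound on how many pages one crawl could fetch. It assumes, as stated in the question, that -topN caps the pages crawled at each depth level, so the bound is simply depth times topN. The function name max_pages_fetched is hypothetical and not part of Nutch, and -threads is left out because, as described above, it only sets how many threads run at once.

```python
def max_pages_fetched(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched in one crawl.

    Assumption (taken from the question, not from Nutch's source):
    at most top_n pages are crawled at each of the depth levels.
    """
    if depth < 0 or top_n < 0:
        raise ValueError("depth and top_n must be non-negative")
    return depth * top_n


if __name__ == "__main__":
    # Values from the command in the question: -depth 2 -topN 50
    print(max_pages_fetched(2, 50))
    # Values from the parameter description: -depth 8 -topN 50000
    print(max_pages_fetched(8, 50000))
```

This is only a ceiling: a crawl may fetch far fewer pages if a level yields fewer than topN links.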

