1) For the command

bin/nutch crawl url -dir mydir -depth 2 -threads 4 -topN 50 >&logs/logs1.log

I know what the parameters mean, for example:

-depth 8: the maximum depth of links crawled is 8 (8 levels down from the
seed URLs)

-topN 50000: at most 50000 links/pages are crawled at each depth
-threads 16: run 16 fetcher threads simultaneously
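
To make my understanding concrete, here is a rough sketch of the arithmetic as
I read it: if at most topN pages are fetched at each depth, the crawl can fetch
no more than depth * topN pages in total. The function name max_pages_fetched
is my own, purely for illustration, and the bound is an assumption based on the
parameter meanings above, not something taken from the Nutch code.

```python
def max_pages_fetched(depth: int, top_n: int) -> int:
    """Upper bound on pages fetched, assuming at most top_n pages per depth level."""
    if depth < 0 or top_n < 0:
        raise ValueError("depth and top_n must be non-negative")
    return depth * top_n


if __name__ == "__main__":
    # Values from the command above: -depth 2 -topN 50
    print(max_pages_fetched(2, 50))
    # Values from the parameter examples: -depth 8 -topN 50000
    print(max_pages_fetched(8, 50000))
```

If this is right, the command above can fetch at most 100 pages, which may be
too few for a large site.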


But how do I choose a proper value for each parameter? For example, on the
Craigslist web site, the usual URL for a particular car looks like this:
http://losangeles.craigslist.org/sgv/cto/2496560420.html


But on Kbb.com, the usual URL for a particular car looks like this:
http://www.kbb.com/volkswagen/jetta/2003-volkswagen-jetta/gls-sedan-4d/?vehicleid=348329&intent=buy-used&options=4098815|true|4098881|true&pricetype=private-party&condition=good


How do I determine the parameter values for these two examples?



2) When I check the data in Luke, in the Overview panel, I see that on the
left side (the "available fields and term counts per field" table) the term
count for the anchor field is zero, while the count for the content field is
not. On the right side (the "top ranking terms" table) all the rank values
are the same. I would like to know why the information is displayed like this.


Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/some-questions-about-the-crawling-with-Nutch-tp3173828p3173828.html
Sent from the Nutch - User mailing list archive at Nabble.com.
