depth=10 does not imply that total URLs = 12500 * 10 = 125000.
depth means that the generate/fetch/update cycle is run recursively over the
new and old URLs for 10 rounds, with topN capping how many URLs are fetched
per round. The actual number of URLs fetched depends on the content of the
web pages, i.e. how many outlinks they expose.
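A toy simulation may make this concrete (illustrative only; the link graph, function name, and counts below are made up, and real Nutch totals also depend on URL filters and fetch failures):

```python
# Toy model of why total URLs fetched can be far below topN * depth:
# each round fetches at most topN URLs, but the frontier can dry up
# long before 'depth' rounds have produced topN new URLs each.

def crawl(seed_urls, outlinks, depth, top_n):
    """Simulate 'depth' generate/fetch/update rounds, fetching at most
    top_n unfetched URLs per round. Returns the set of fetched URLs."""
    fetched = set()
    unfetched = set(seed_urls)
    for _ in range(depth):
        batch = sorted(unfetched)[:top_n]  # at most topN per round
        if not batch:
            break  # frontier exhausted: nothing left to fetch
        for url in batch:
            fetched.add(url)
            unfetched.update(outlinks.get(url, []))  # newly discovered links
        unfetched -= fetched
    return fetched

# A tiny link graph: pages link to only a few others, so the crawl
# stops after 4 URLs even with depth=10 and topN=12500.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}
total = len(crawl(["a"], graph, depth=10, top_n=12500))
print(total)  # 4, not 12500 * 10
```

The same effect explains the spreadsheet numbers: once the reachable URL space (after filtering) is exhausted, extra depth adds nothing.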

On Wed, Mar 2, 2011 at 4:37 AM, Patricio Galeas-5 [via Lucene] <
[email protected]> wrote:

> Hello,
>
> I have some questions related to the nutch statistics.
> I ran five crawls with topN=12500, depth=2,4,7,10,11, with following
> results:
>
> https://spreadsheets.google.com/ccc?key=0AvF8Ig446DzEdGNxaDNLLTgtUzdoTVNzQTJIcVFSZXc&hl=es#gid=0
>
>
> Why is the number of TOTAL URLs not equal to (db_fetched + db_unfetched +
> db_gone) ?
>
> I expected to get a value of about 125000 TOTAL URLs (using topN=12500,
> depth=10), but I got only 34000 URLs (27% of the expected total). Is this
> difference due to the regex-urlfilter rules only?
>
> Does a decrease in db_gone (for example comparing crawl2 with crawl3) mean
> that some URLs which were unavailable in the past are now being fetched?
>
> Thanks for your help!
>
> Regards
> Patricio
>
>
>
>



-- 
Kumar Anurag



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-Crawl-Through-Home-Page-but-crawling-through-inner-page-tp2601843p2611555.html
Sent from the Nutch - User mailing list archive at Nabble.com.
