Sebastian, thank you for your quick answer,
I have set generate.max.count to 1,000,000, so I think it shouldn't be a
problem; generate.count.mode is at its default (byHost), and
generate.min.score at 0. It is a fresh start and I have set the refetch
time to 90 days, so I don't blame the pages that were already fetched; I
believe they will be fetched again within those 90 days. It is the
db_unfetched pages that are bothering me. I was looking at the statistics
you mentioned, but I can't work out the cause from them. I will share
them, maybe you will notice something.
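For completeness, the relevant nutch-site.xml entries look roughly like
this (a sketch from memory, not a verbatim copy of my config; expressing
the 90-day refetch through db.fetch.interval.default in seconds assumes
the default fetch schedule):

    <!-- sketch: generator limits and refetch interval -->
    <property>
      <name>generate.max.count</name>
      <value>1000000</value>
    </property>
    <property>
      <name>generate.min.score</name>
      <value>0.0</value>
    </property>
    <property>
      <name>db.fetch.interval.default</name>
      <value>7776000</value> <!-- 90 days, in seconds -->
    </property>

generate.count.mode is not overridden, so the per-host default applies.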
readdb -stats:

2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /nutch/crawl/crawldb
2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls: 11761224
2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch interval: 91 days, 07:27:11
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval: 912 days, 21:15:04
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch interval: 24855 days, 03:14:07
2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time: Wed Mar 26 10:12:00 CET 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times: Mon Apr 14 23:27:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time: Tue Aug 12 09:01:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0: 11686131
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1: 37311
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2: 37782
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01: 0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05: 0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1: 2.8397222762258305E-9
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2: 1.5968612389262938E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25: 3.0930187090863353E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3: 5.316786869545881E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4: 1.1587939392930727E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5: 2.5814283409185258E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6: 5.908654160550735E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7: 1.7618150654886576E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75: 2.9190806586558215E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8: 8.921349358020518E-5
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9: 5.628835416129725E-4
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95: 0.007623361698306406
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99: 0.9327549852155366
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score: 0.0
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score: 0.015449582458424396
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score: 183.34019470214844
2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 9406711
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched): 1978175
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone): 98857
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 69608
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 169813
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 6045
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 32015
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics: done

Please keep in mind that "longest fetch interval: 24855 days" is not an
issue: I have a custom plugin which sets it on purpose, to prevent certain
pages from ever being fetched again.

Generate counters, first job:

2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
    File System Counters
        FILE: Number of bytes read=259336882
        FILE: Number of bytes written=731826855
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1460078564
        HDFS: Number of bytes written=66867570
        HDFS: Number of read operations=1680
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=561
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=140
        Launched reduce tasks=140
        Data-local map tasks=137
        Rack-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=16111006
        Total time spent by all reduces in occupied slots (ms)=20877184
        Total time spent by all map tasks (ms)=8055503
        Total time spent by all reduce tasks (ms)=10438592
        Total vcore-milliseconds taken by all map tasks=8055503
        Total vcore-milliseconds taken by all reduce tasks=10438592
        Total megabyte-milliseconds taken by all map tasks=65990680576
        Total megabyte-milliseconds taken by all reduce tasks=85512945664
    Map-Reduce Framework
        Map input records=11556117
        Map output records=9192906
        Map output bytes=1171410543
        Map output materialized bytes=373113013
        Input split bytes=19740
        Combine input records=0
        Combine output records=0
        Reduce input groups=2022938
        Reduce shuffle bytes=373113013
        Reduce input records=9192906
        Reduce output records=0
        Spilled Records=18385812
        Shuffled Maps =19600
        Failed Shuffles=0
        Merged Map outputs=19600
        GC time elapsed (ms)=87351
        CPU time spent (ms)=7536820
        Physical memory (bytes) snapshot=159464194048
        Virtual memory (bytes) snapshot=2564550729728
        Total committed heap usage (bytes)=602469826560
        Peak Map Physical memory (bytes)=703803392
        Peak Map Virtual memory (bytes)=9257803776
        Peak Reduce Physical memory (bytes)=1188073472
        Peak Reduce Virtual memory (bytes)=9272107008
    Generator
        MALFORMED_URL=2
        SCHEDULE_REJECTED=2363211
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1460058824
    File Output Format Counters
        Bytes Written=0

Looks fine, the output records match db_unfetched.
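A quick arithmetic check on that first job (plain subtraction of the
counters above) confirms it is internally consistent:

    11,556,117 (Map input records)
  -  2,363,211 (SCHEDULE_REJECTED)
  ------------
     9,192,906 (Map output records)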
Second job:

2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
    File System Counters
        FILE: Number of bytes read=24605358
        FILE: Number of bytes written=95719210
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=66892770
        HDFS: Number of bytes written=62854154
        HDFS: Number of read operations=580
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=140
        Launched reduce tasks=4
        Data-local map tasks=136
        Rack-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=6395528
        Total time spent by all reduces in occupied slots (ms)=106208
        Total time spent by all map tasks (ms)=3197764
        Total time spent by all reduce tasks (ms)=53104
        Total vcore-milliseconds taken by all map tasks=3197764
        Total vcore-milliseconds taken by all reduce tasks=53104
        Total megabyte-milliseconds taken by all map tasks=26196082688
        Total megabyte-milliseconds taken by all reduce tasks=435027968
    Map-Reduce Framework
        Map input records=499940
        Map output records=499940
        Map output bytes=94641374
        Map output materialized bytes=20372766
        Input split bytes=25200
        Combine input records=0
        Combine output records=0
        Reduce input groups=499914
        Reduce shuffle bytes=20372766
        Reduce input records=499940
        Reduce output records=499940
        Spilled Records=999880
        Shuffled Maps =560
        Failed Shuffles=0
        Merged Map outputs=560
        GC time elapsed (ms)=32655
        CPU time spent (ms)=1206490
        Physical memory (bytes) snapshot=68925956096
        Virtual memory (bytes) snapshot=1315029512192
        Total committed heap usage (bytes)=309841625088
        Peak Map Physical memory (bytes)=519888896
        Peak Map Virtual memory (bytes)=9196109824
        Peak Reduce Physical memory (bytes)=517226496
        Peak Reduce Virtual memory (bytes)=9233317888
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=66867570
    File Output Format Counters
        Bytes Written=62854154

Again, the input/output records (499,940) match the topN parameter (500,000).
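For context, with those options the crawl script should be invoking the
generator with something equivalent to the following (a sketch, not copied
from my logs; the exact arguments the script passes may differ between
Nutch versions):

    bin/nutch generate /nutch/crawl/crawldb /nutch/crawl/segments \
        -topN 500000 -numFetchers 4

so a single segment with roughly 500k entries is what I expect.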
Finally, the fetcher counters:

2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
    File System Counters
        FILE: Number of bytes read=1670508693
        FILE: Number of bytes written=2454067433
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=7590522
        HDFS: Number of bytes written=778632266
        HDFS: Number of read operations=156
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=840
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=4
        Launched reduce tasks=140
        Other local map tasks=4
        Total time spent by all maps in occupied slots (ms)=14400056
        Total time spent by all reduces in occupied slots (ms)=758192886
        Total time spent by all map tasks (ms)=7200028
        Total time spent by all reduce tasks (ms)=379096443
        Total vcore-milliseconds taken by all map tasks=7200028
        Total vcore-milliseconds taken by all reduce tasks=379096443
        Total megabyte-milliseconds taken by all map tasks=58982629376
        Total megabyte-milliseconds taken by all reduce tasks=3105558061056
    Map-Reduce Framework
        Map input records=59533
        Map output records=106816
        Map output bytes=3132357328
        Map output materialized bytes=743594506
        Input split bytes=636
        Combine input records=0
        Combine output records=0
        Reduce input groups=54834
        Reduce shuffle bytes=743594506
        Reduce input records=106816
        Reduce output records=106816
        Spilled Records=352386
        Shuffled Maps =560
        Failed Shuffles=0
        Merged Map outputs=560
        GC time elapsed (ms)=35203
        CPU time spent (ms)=9098390
        Physical memory (bytes) snapshot=226334003200
        Virtual memory (bytes) snapshot=1317038624768
        Total committed heap usage (bytes)=309841625088
        Peak Map Physical memory (bytes)=1892413440
        Peak Map Virtual memory (bytes)=9298661376
        Peak Reduce Physical memory (bytes)=1637662720
        Peak Reduce Virtual memory (bytes)=9212592128
    FetcherStatus
        access_denied=98
        bytes_downloaded=3062491817
        exception=1302
        gone=26
        moved=2598
        notfound=723
        redirect_count_exceeded=146
        robots_denied=1145
        robots_denied_maxcrawldelay=16
        success=47650
        temp_moved=1432
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=7589886
    File Output Format Counters
        Bytes Written=778632266
2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished, elapsed: 3444497 ms

The input/output records do not match the previous steps: the fetcher's
map input is only 59,533 records (47,650 of them fetched successfully),
although the generator wrote 499,940 records. What do I miss?

Best,
Maciej

On Fri, 28 Mar 2025 at 08:54, Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

> Hi Maciek,
>
> there are multiple configurations which set a limit on the items
> during fetch list generation:
>
> - topN (ok, it's obviously not the reason)
>
> - a limit per host is defined by the property generate.max.count
>   - the default is -1 (no limit)
>   - you may want to set a limit per host, in order to avoid that a
>     single host with an overlong fetch list slows down the overall
>     crawling
>   - via generate.count.mode this limit can instead be applied per
>     registered domain or IP address
>
> - generate.min.score (default: 0.0): only CrawlDatum items with a
>   higher score are put into fetch lists
>
> - the fetch scheduling: pages are re-fetched after a certain amount
>   of time (default: 30 days); Nutch also waits 1 day before retrying
>   a page which failed to fetch with an error (not a 404)
>
> Running the CrawlDb statistics
>
>   bin/nutch readdb crawldb -stats
>
> shows the number of items per status and retry count, and the
> distribution of scores and fetch intervals.
>
> > should generate a segment with 500k pages to fetch.
>
> Not necessarily, see above. The final size of the fetch list is shown
> by the counter
>
>   Reduce output records=NNN
>
> Note: it appears a second time in the generator log because of
> NUTCH-3059.
>
> > fetches only around 100k pages.
>
> The fetcher counters also show how many items are skipped because of
> the fetcher time limit and the like.
>
> Let us know whether you need more information. If possible, please
> share the CrawlDb statistics or the generator and fetcher counters;
> it might help to find the reason.
>
> Best,
> Sebastian
>
> On 3/27/25 18:56, Maciek Puzianowski wrote:
> > Hi,
> > I have a problem with the topN value in Apache Nutch.
> > I have 8 million+ db_unfetched pages in the crawldb. I use the crawl
> > script with the following command:
> >
> >   bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
> >     --size-fetchlist 500000 /nutch/crawl 1
> >
> > The --size-fetchlist parameter is the topN for the generate step,
> > meaning that it should generate a segment with 500k pages to fetch.
> > However, the fetcher fetches only around 100k pages. I also get a
> > SCHEDULE_REJECTED count of around 1 million in the generate step,
> > but I think it's just the pages that I have already fetched.
> >
> > I have checked the URL filters and they affect only a few pages.
> >
> > What can be causing such an issue with such a big difference?
> >
> > Best,
> > Maciej