Sebastian,

thank you for your quick answer,

I have set generate.max.count to 1,000,000, so I don't think that is the
problem; generate.count.mode is left at its default (byHost, i.e. per-host
counting) and generate.min.score is 0.
It is a fresh crawl and I have set the refetch interval to 90 days, so I
don't think the pages that were already fetched are to blame; they are not
due for re-fetch within those 90 days. It is the db_unfetched pages that
bother me.
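
For reference, this is roughly what the generate step looks like with those
limits spelled out explicitly (only a sketch: the paths refer to my crawl
directory, and I assume -D property overrides are accepted before the
positional arguments; generate.count.mode stays at its per-host default):

  bin/nutch generate \
    -Dgenerate.max.count=1000000 \
    -Dgenerate.min.score=0.0 \
    /nutch/crawl/crawldb /nutch/crawl/segments \
    -topN 500000 -numFetchers 4
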
I have looked at the statistics you mentioned, but I cannot work out the
cause from them. I will share them below, maybe you will notice something:

readdb -stats:
2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
/nutch/crawl/crawldb
2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls:   11761224
2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch interval:
     91 days, 07:27:11
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval:   912
days, 21:15:04
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch interval:
    24855 days, 03:14:07
2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time:  Wed
Mar 26 10:12:00 CET 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times:   Mon
Apr 14 23:27:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time:    Tue
Aug 12 09:01:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0:      11686131
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1:      37311
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2:      37782
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01:  0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05:  0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1:
2.8397222762258305E-9
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2:
1.5968612389262938E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25:
 3.0930187090863353E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3:
5.316786869545881E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4:
1.1587939392930727E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5:
2.5814283409185258E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6:
5.908654160550735E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7:
1.7618150654886576E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75:
 2.9190806586558215E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8:
8.921349358020518E-5
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9:
5.628835416129725E-4
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95:
 0.007623361698306406
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99:
 0.9327549852155366
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score:    0.0
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score:
 0.015449582458424396
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score:
 183.34019470214844
2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1 (db_unfetched):
     9406711
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched):
     1978175
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone):
98857
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):
    69608
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):
    169813
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6
(db_notmodified):    6045
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7 (db_duplicate):
     32015
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics: done

Please keep in mind that the "longest fetch interval: 24855 days" is not an
issue; a custom plugin of mine sets it on purpose to prevent certain pages
from ever being fetched again.
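
If it helps, I can also spot-check one of those entries; readdb with -url
prints the full CrawlDatum (status, fetch time, fetch interval) for a single
URL (the URL below is just a placeholder for one of the pages the plugin
retires):

  bin/nutch readdb /nutch/crawl/crawldb -url https://example.com/retired-page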

Generate counters:
First one:
2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
        File System Counters
                FILE: Number of bytes read=259336882
                FILE: Number of bytes written=731826855
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1460078564
                HDFS: Number of bytes written=66867570
                HDFS: Number of read operations=1680
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=561
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=140
                Launched reduce tasks=140
                Data-local map tasks=137
                Rack-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=16111006
                Total time spent by all reduces in occupied slots
(ms)=20877184
                Total time spent by all map tasks (ms)=8055503
                Total time spent by all reduce tasks (ms)=10438592
                Total vcore-milliseconds taken by all map tasks=8055503
                Total vcore-milliseconds taken by all reduce tasks=10438592
                Total megabyte-milliseconds taken by all map
tasks=65990680576
                Total megabyte-milliseconds taken by all reduce
tasks=85512945664
        Map-Reduce Framework
                Map input records=11556117
                Map output records=9192906
                Map output bytes=1171410543
                Map output materialized bytes=373113013
                Input split bytes=19740
                Combine input records=0
                Combine output records=0
                Reduce input groups=2022938
                Reduce shuffle bytes=373113013
                Reduce input records=9192906
                Reduce output records=0
                Spilled Records=18385812
                Shuffled Maps =19600
                Failed Shuffles=0
                Merged Map outputs=19600
                GC time elapsed (ms)=87351
                CPU time spent (ms)=7536820
                Physical memory (bytes) snapshot=159464194048
                Virtual memory (bytes) snapshot=2564550729728
                Total committed heap usage (bytes)=602469826560
                Peak Map Physical memory (bytes)=703803392
                Peak Map Virtual memory (bytes)=9257803776
                Peak Reduce Physical memory (bytes)=1188073472
                Peak Reduce Virtual memory (bytes)=9272107008
        Generator
                MALFORMED_URL=2
                SCHEDULE_REJECTED=2363211
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1460058824
        File Output Format Counters
                Bytes Written=0

Looks fine: the map output records (9,192,906, i.e. the 11,556,117 input
records minus the 2,363,211 SCHEDULE_REJECTED) roughly match the
db_unfetched count.

Second one:
2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
        File System Counters
                FILE: Number of bytes read=24605358
                FILE: Number of bytes written=95719210
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=66892770
                HDFS: Number of bytes written=62854154
                HDFS: Number of read operations=580
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=8
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=140
                Launched reduce tasks=4
                Data-local map tasks=136
                Rack-local map tasks=4
                Total time spent by all maps in occupied slots (ms)=6395528
                Total time spent by all reduces in occupied slots
(ms)=106208
                Total time spent by all map tasks (ms)=3197764
                Total time spent by all reduce tasks (ms)=53104
                Total vcore-milliseconds taken by all map tasks=3197764
                Total vcore-milliseconds taken by all reduce tasks=53104
                Total megabyte-milliseconds taken by all map
tasks=26196082688
                Total megabyte-milliseconds taken by all reduce
tasks=435027968
        Map-Reduce Framework
                Map input records=499940
                Map output records=499940
                Map output bytes=94641374
                Map output materialized bytes=20372766
                Input split bytes=25200
                Combine input records=0
                Combine output records=0
                Reduce input groups=499914
                Reduce shuffle bytes=20372766
                Reduce input records=499940
                Reduce output records=499940
                Spilled Records=999880
                Shuffled Maps =560
                Failed Shuffles=0
                Merged Map outputs=560
                GC time elapsed (ms)=32655
                CPU time spent (ms)=1206490
                Physical memory (bytes) snapshot=68925956096
                Virtual memory (bytes) snapshot=1315029512192
                Total committed heap usage (bytes)=309841625088
                Peak Map Physical memory (bytes)=519888896
                Peak Map Virtual memory (bytes)=9196109824
                Peak Reduce Physical memory (bytes)=517226496
                Peak Reduce Virtual memory (bytes)=9233317888
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=66867570
        File Output Format Counters
                Bytes Written=62854154

Again, the input/output record counts (499,940) match the topN parameter of
500,000.
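
As a cross-check on the segment side I will also run readseg -list, which
(if I read the SegmentReader output right) reports the number of generated
and fetched URLs per segment; the GENERATED count for the new segment should
be the same 499,940:

  bin/nutch readseg -list -dir /nutch/crawl/segments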

Finally, the fetcher counters:
2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
        File System Counters
                FILE: Number of bytes read=1670508693
                FILE: Number of bytes written=2454067433
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=7590522
                HDFS: Number of bytes written=778632266
                HDFS: Number of read operations=156
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=840
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=4
                Launched reduce tasks=140
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=14400056
                Total time spent by all reduces in occupied slots
(ms)=758192886
                Total time spent by all map tasks (ms)=7200028
                Total time spent by all reduce tasks (ms)=379096443
                Total vcore-milliseconds taken by all map tasks=7200028
                Total vcore-milliseconds taken by all reduce tasks=379096443
                Total megabyte-milliseconds taken by all map
tasks=58982629376
                Total megabyte-milliseconds taken by all reduce
tasks=3105558061056
        Map-Reduce Framework
                Map input records=59533
                Map output records=106816
                Map output bytes=3132357328
                Map output materialized bytes=743594506
                Input split bytes=636
                Combine input records=0
                Combine output records=0
                Reduce input groups=54834
                Reduce shuffle bytes=743594506
                Reduce input records=106816
                Reduce output records=106816
                Spilled Records=352386
                Shuffled Maps =560
                Failed Shuffles=0
                Merged Map outputs=560
                GC time elapsed (ms)=35203
                CPU time spent (ms)=9098390
                Physical memory (bytes) snapshot=226334003200
                Virtual memory (bytes) snapshot=1317038624768
                Total committed heap usage (bytes)=309841625088
                Peak Map Physical memory (bytes)=1892413440
                Peak Map Virtual memory (bytes)=9298661376
                Peak Reduce Physical memory (bytes)=1637662720
                Peak Reduce Virtual memory (bytes)=9212592128
        FetcherStatus
                access_denied=98
                bytes_downloaded=3062491817
                exception=1302
                gone=26
                moved=2598
                notfound=723
                redirect_count_exceeded=146
                robots_denied=1145
                robots_denied_maxcrawldelay=16
                success=47650
                temp_moved=1432
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=7589886
        File Output Format Counters
                Bytes Written=778632266
2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished, elapsed:
3444497 ms

Here the record counts no longer match the previous steps: the fetcher
reports only 59,533 map input records and 47,650 successful fetches,
although the generator wrote 499,940 fetch list entries. What am I missing?
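
One thing I still want to rule out is the fetcher time limit you mentioned
(as far as I understand, bin/crawl passes fetcher.timelimit.mins to the
fetch job, and once the limit is reached the remaining fetch queues are
emptied). A rough check from my side, with the application id below being a
placeholder for the fetch job:

  yarn logs -applicationId <application_id> | grep -iE "time ?limit"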

Best,
Maciej

On Fri, 28 Mar 2025 at 08:54, Sebastian Nagel <wastl.na...@googlemail.com.invalid>
wrote:

> Hi Maciek,
>
> there are multiple configurations which set a limit on the items
> during fetch list generations.
>
> - topN (Ok, it's obviously not the reason)
>
> - a limit per host is defined by the property generate.max.count
>    - default is -1 (no limit)
>    - you may want to set a limit per host, in order to avoid a single
>      host with an overlong fetch list slowing down the overall crawling
>    - by generate.count.mode this limit can be applied per registered
>      domain or IP address
>
> - generate.min.score (default: 0.0): only CrawlDatum items with a higher
>    score are put into fetch lists
>
> - the fetch scheduling: re-fetch pages after a certain amount of time
>    (default: 30 days), also wait 1 day for retrying a page which failed
>    to fetch with an error (not a 404)
>
>
> Running the CrawlDb statistics
>
>    bin/nutch readdb crawldb -stats
>
> shows the number of items per status, retry count, the distribution
> of scores and fetch intervals.
>
>
>  > should generate a segment with 500k pages to fetch.
>
> Not necessarily, see above. The final size of the fetch list
> is shown by the counter
>    Reduce output records=NNN
> Note: because of NUTCH-3059 this counter appears twice in the generator
> log; use its second appearance.
>
>  > fetches only around 100k pages.
>
> The fetcher counters also show how many items are skipped because of the
> fetcher time limit and the like.
>
>
> Let us know whether you need more information. If possible please share the
> CrawlDb statistics or the generator and fetcher counters. It might help
> to find the reason.
>
> Best,
> Sebastian
>
> On 3/27/25 18:56, Maciek Puzianowski wrote:
> > Hi,
> > I have a problem with topN value in Apache Nutch.
> > I have 8 million+ db_unfetched pages in crawldb. I use crawl script with
> > following command:
> > bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
> > --size-fetchlist 500000 /nutch/crawl 1
> > --size-fetchlist parameter is the topN for generate method, meaning that
> it
> > should generate a segment with 500k pages to fetch. However, the fetcher
> > fetches only around 100k pages. Also I get around 1 million
> > SCHEDULE_REJECTED counter in generate method, but I think its just pages
> > that I have already fetched.
> >
> > I have checked url filters and they affect only few pages.
> >
> > What can be causing such issue with such a big difference?
> >
> > Best,
> > Maciej
> >
>
>
