[jira] [Commented] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539616#comment-17539616
 ] 

ASF GitHub Bot commented on NUTCH-2950:
---

sebastian-nagel opened a new pull request, #731:
URL: https://github.com/apache/nutch/pull/731

   (see NUTCH-2950 and commit messages)




> UpdateHostDb: performance improvements
> --
>
> Key: NUTCH-2950
> URL: https://issues.apache.org/jira/browse/NUTCH-2950
> Project: Nutch
>  Issue Type: Improvement
>  Components: hostdb
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> This issue addresses a couple of performance improvements when creating the 
> HostDb:
> - avoid needless conversions between hostname and URL
> - improvements of HostDb serialization (write and read)
> - parametrize logging and log less on level INFO
> - do not create DNS resolver threads if DNS look-ups are not requested by 
> command-line options
> A patch/PR is ready. Depending on the chosen command-line options, a 10-20% 
> speed-up should be visible if DNS look-ups, normalization and filtering are 
> off.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-19 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539617#comment-17539617
 ] 

Sebastian Nagel commented on NUTCH-2950:


If desired I could also split the issue/PR into multiple smaller changes.



[GitHub] [nutch] sebastian-nagel opened a new pull request, #731: NUTCH-2950 UpdateHostDb: performance improvements

2022-05-19 Thread GitBox


sebastian-nagel opened a new pull request, #731:
URL: https://github.com/apache/nutch/pull/731

   (see NUTCH-2950 and commit messages)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (NUTCH-2950) UpdateHostDb: performance improvements

2022-05-19 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2950:
--

 Summary: UpdateHostDb: performance improvements
 Key: NUTCH-2950
 URL: https://issues.apache.org/jira/browse/NUTCH-2950
 Project: Nutch
  Issue Type: Improvement
  Components: hostdb
Affects Versions: 1.18
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.19


This issue addresses a couple of performance improvements when creating the 
HostDb:
- avoid needless conversions between hostname and URL
- improvements of HostDb serialization (write and read)
- parametrize logging and log less on level INFO
- do not create DNS resolver threads if DNS look-ups are not requested by 
command-line options

A patch/PR is ready. Depending on the chosen command-line options, a 10-20% 
speed-up should be visible if DNS look-ups, normalization and filtering are off.
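The last item in the list above (no DNS resolver threads unless look-ups are requested) can be pictured with the following minimal sketch. It is not taken from the patch: the class name, the method name and the boolean flag are assumptions made for illustration only.

```java
import java.util.Optional;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: create resolver threads only when DNS look-ups are requested. */
final class ResolverPoolSketch {

  /**
   * @param dnsLookupsRequested assumed flag derived from the command-line options
   * @param numThreads          number of resolver threads (must be at least 1)
   * @return a thread pool if look-ups are requested, otherwise an empty Optional
   */
  static Optional<ExecutorService> createResolverPool(boolean dnsLookupsRequested, int numThreads) {
    if (!dnsLookupsRequested) {
      // No look-ups requested: do not pay for creating and tearing down threads.
      return Optional.empty();
    }
    return Optional.of(Executors.newFixedThreadPool(numThreads));
  }

  public static void main(String[] args) {
    Optional<ExecutorService> pool = createResolverPool(args.length > 0, 4);
    System.out.println("resolver pool created: " + pool.isPresent());
    // Shut the pool down again so the JVM can exit.
    pool.ifPresent(ExecutorService::shutdown);
  }
}
```

Returning an Optional keeps the "no resolver" case explicit for callers; whether the actual change uses a null check or another mechanism is not stated in the issue.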





[jira] [Commented] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539600#comment-17539600
 ] 

Hudson commented on NUTCH-2946:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #74 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/74/])
NUTCH-2946 Fetcher: slow down fetching from hosts where requests fail 
repeatedly (snagel: 
[https://github.com/apache/nutch/commit/42ae2a34505e23319861e7b31fd9f87f1af68749])
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueues.java
NUTCH-2946 Fetcher: optionally slow down fetching from hosts with repeated 
exceptions (snagel: 
[https://github.com/apache/nutch/commit/bdbe7b330b5e7fd712f1b5126f69e2efebb194e8])
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueues.java
* (edit) conf/nutch-default.xml


> Fetcher: optionally slow down fetching from hosts with repeated exceptions
> --
>
> Key: NUTCH-2946
> URL: https://issues.apache.org/jira/browse/NUTCH-2946
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> The fetcher holds for every fetch queue a counter which counts the number of 
> observed "exceptions" seen when fetching from the host (resp. domain or IP) 
> bound to this queue.
> As an improvement to increase the politeness of the crawler, the counter 
> value could be used to dynamically increase the fetch delay for hosts where 
> requests fail repeatedly with exceptions or HTTP status codes mapped to 
> ProtocolStatus.EXCEPTION (HTTP 403 Forbidden, 429 Too many requests, 5xx 
> server errors, etc.) Of course, this should be optional. The aim is to reduce 
> the load on such hosts even before the configured max. number of 
> exceptions (property fetcher.max.exceptions.per.queue) is hit.
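
As a rough illustration of the idea described above, the per-queue exception counter could scale an additional fetch delay. This is only a sketch and not the code merged into FetchItemQueues.java: the class name, the doubling rule and the upper bound are assumptions; the issue itself does not specify the formula.

```java
/** Hypothetical sketch: grow the fetch delay with the number of exceptions seen in a queue. */
final class ExceptionBackoffSketch {

  /**
   * @param baseDelayMillis delay added after the first exception (assumed parameter);
   *                        a value of 0 or less switches the feature off
   * @param exceptions      current value of the queue's exception counter
   * @param maxDelayMillis  positive upper bound for the additional delay (assumed parameter)
   * @return additional delay in milliseconds
   */
  static long additionalDelayMillis(long baseDelayMillis, int exceptions, long maxDelayMillis) {
    if (baseDelayMillis <= 0 || exceptions <= 0) {
      return 0L;
    }
    // Double the delay for every further exception: base, 2*base, 4*base, ...
    int shift = Math.min(exceptions - 1, 30); // keep the factor well inside the long range
    long factor = 1L << shift;
    // Compare before multiplying so the product cannot overflow.
    if (baseDelayMillis > maxDelayMillis / factor) {
      return maxDelayMillis;
    }
    return Math.min(baseDelayMillis * factor, maxDelayMillis);
  }

  public static void main(String[] args) {
    for (int n = 0; n <= 5; n++) {
      System.out.println(n + " exceptions: " + additionalDelayMillis(1000, n, 10_000) + " ms");
    }
  }
}
```

With a base delay of 1000 ms and a cap of 10000 ms, the sketch yields 0, 1000, 2000, 4000, 8000 and 10000 ms for zero to five exceptions.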





Jenkins build is back to normal : Nutch » Nutch-trunk #74

2022-05-19 Thread Apache Jenkins Server
See 




[jira] [Commented] (NUTCH-2947) Fetcher: keep state of empty fetch queues unless queue feeder is finished

2022-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539564#comment-17539564
 ] 

ASF GitHub Bot commented on NUTCH-2947:
---

sebastian-nagel commented on PR #729:
URL: https://github.com/apache/nutch/pull/729#issuecomment-1131710151

   Updated to be based on the master branch after merging NUTCH-2946/#728. The 
state of a queue is also preserved if `fetcher.exceptions.per.queue.delay` > 
0.0. (In the discussion of NUTCH-2946 with Markus we agreed to define the delay 
in seconds using a float, just like the other fetcher delays. Internally the 
fetcher handles all delays in milliseconds.)
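
To make the unit handling concrete, here is a small sketch of converting a delay configured in seconds (as a float) to milliseconds. The class and method names are hypothetical, and the rounding choice is an assumption; only the property name `fetcher.exceptions.per.queue.delay` comes from the comment above.

```java
/** Hypothetical sketch: delays are configured in seconds (float) and used in milliseconds. */
final class DelayConversionSketch {

  /** Converts a delay in seconds to whole milliseconds, rounding to the nearest value. */
  static long secondsToMillis(float seconds) {
    return Math.round(seconds * 1000.0);
  }

  /** Mirrors the condition from the comment: the feature is active for delays above 0.0. */
  static boolean isDelayEnabled(float seconds) {
    return seconds > 0.0f;
  }

  public static void main(String[] args) {
    float configured = 1.5f; // stand-in for the value read from the configuration
    System.out.println(secondsToMillis(configured) + " ms, enabled: " + isDelayEnabled(configured));
  }
}
```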




> Fetcher: keep state of empty fetch queues unless queue feeder is finished
> -
>
> Key: NUTCH-2947
> URL: https://issues.apache.org/jira/browse/NUTCH-2947
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> If a fetch queue is empty (containing no fetch items) it may be removed from 
> the list of queues. This also removes the state of the fetch queue, namely the 
> next fetch time and the exception counter. If the queue feeder is still 
> active it may happen that the same queue (i.e. associated with the same 
> host/domain/IP) that was removed before is created again. In this case, certain 
> aspects of fetcher politeness cannot be guaranteed anymore:
> - the fetch delay (via earliest next fetch time) and
> - the mechanism to block fetching from the same host/domain/IP with too many 
> exceptions (NUTCH-769).
> The issue was observed while verifying NUTCH-2946 in the fetcher logs:
> {noformat}
> ... 10:19:16,912 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 10:20:16,250 * queue foo.bar >> delayed next fetch by 79248 ms after 2 
> exceptions in queue
> ... 10:21:52,675 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 10:25:40,931 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 10:27:45,066 * queue foo.bar >> delayed next fetch by 79248 ms after 2 
> exceptions in queue
> ... 10:29:40,407 * queue foo.bar >> delayed next fetch by 10 ms after 3 
> exceptions in queue
> ... 10:41:48,870 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 10:47:54,946 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 10:52:46,792 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 10:57:43,470 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:01:12,220 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:04:24,621 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:18:40,398 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:21:09,437 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:34:36,052 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:39:17,898 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:40:35,472 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:50:34,224 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:51:27,547 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:53:04,783 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 11:54:04,404 * queue foo.bar >> delayed next fetch by 79248 ms after 2 
> exceptions in queue
> ... 11:55:38,232 * queue foo.bar >> delayed next fetch by 10 ms after 3 
> exceptions in queue
> ... 11:57:37,942 * queue foo.bar >> delayed next fetch by 116096 ms after 4 
> exceptions in queue
> ... 12:01:08,619 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> ... 12:02:35,985 * queue foo.bar >> delayed next fetch by 5 ms after 1 
> exceptions in queue
> {noformat}
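
The behaviour described above can be sketched as follows: an empty queue keeps its state (next fetch time, exception counter) until the feeder has finished, and only then is it dropped. This is a simplified illustration and not the code of PR #729; all class, field and method names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: drop empty fetch queues only once no more items can arrive. */
final class QueueStateSketch {

  /** Per-queue state that must survive while the queue is temporarily empty. */
  static final class FetchQueue {
    final Queue<String> items = new ArrayDeque<>();
    long nextFetchTime;   // earliest time the next fetch is allowed
    int exceptionCounter; // exceptions observed for this host/domain/IP
  }

  private final Map<String, FetchQueue> queues = new HashMap<>();
  private boolean feederFinished;

  /** Returns the existing queue (with its state) or creates a fresh one. */
  synchronized FetchQueue getOrCreate(String queueId) {
    return queues.computeIfAbsent(queueId, id -> new FetchQueue());
  }

  /** Called once the feeder has handed over its last item. */
  synchronized void setFeederFinished() {
    feederFinished = true;
  }

  /** Removes empty queues, but only after the feeder is done. Returns the number removed. */
  synchronized int removeEmptyQueues() {
    if (!feederFinished) {
      return 0; // keep the state: the feeder may still add items for this host
    }
    int before = queues.size();
    queues.values().removeIf(q -> q.items.isEmpty());
    return before - queues.size();
  }
}
```

In this sketch a queue for foo.bar that runs empty while the feeder is still active keeps its exception counter, so a later item for the same host does not start again at "1 exceptions in queue" as seen in the log excerpt.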





[GitHub] [nutch] sebastian-nagel commented on pull request #729: NUTCH-2947 Fetcher: keep state of empty fetch queues unless queue feeder is finished

2022-05-19 Thread GitBox


sebastian-nagel commented on PR #729:
URL: https://github.com/apache/nutch/pull/729#issuecomment-1131710151

   Updated to be based on the master branch after merging NUTCH-2946/#728. The 
state of a queue is also preserved if `fetcher.exceptions.per.queue.delay` > 
0.0. (In the discussion of NUTCH-2946 with Markus we agreed to define the delay 
in seconds using a float, just like the other fetcher delays. Internally the 
fetcher handles all delays in milliseconds.)





[jira] [Resolved] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-19 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2946.

Resolution: Implemented



[jira] [Commented] (NUTCH-2946) Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539553#comment-17539553
 ] 

ASF GitHub Bot commented on NUTCH-2946:
---

sebastian-nagel merged PR #728:
URL: https://github.com/apache/nutch/pull/728






[GitHub] [nutch] sebastian-nagel merged pull request #728: NUTCH-2946 Fetcher: optionally slow down fetching from hosts with repeated exceptions

2022-05-19 Thread GitBox


sebastian-nagel merged PR #728:
URL: https://github.com/apache/nutch/pull/728





Final reminder: ApacheCon North America call for presentations closing soon

2022-05-19 Thread Rich Bowen
[Note: You're receiving this because you are subscribed to one or more
Apache Software Foundation project mailing lists.]

This is your final reminder that the Call for Presentations for
ApacheCon North America 2022 will close at 00:01 GMT on Monday, May
23rd, 2022. Please don't wait! Get your talk proposals in now!

Details here: https://apachecon.com/acna2022/cfp.html

--Rich, for the ApacheCon Planners




[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-05-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539430#comment-17539430
 ] 

ASF GitHub Bot commented on NUTCH-2936:
---

sebastian-nagel commented on PR #726:
URL: https://github.com/apache/nutch/pull/726#issuecomment-1131460482

   +1 Afaics, this PR addresses only code style, code conventions, Javadoc, 
etc. but does not change anything functionally. Maybe this should be reflected 
in the commit messages as well.




> Early registration of URL stream handlers provided by plugins may fail Hadoop 
> jobs running in distributed mode
> --
>
> Key: NUTCH-2936
> URL: https://issues.apache.org/jira/browse/NUTCH-2936
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed 
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: 
> java.io.IOException: Error generating shuffle secret key
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
> at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not 
> available
> at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
> at 
> java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
> ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) 
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication 
> job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java]
>  to flag redirects pointing to the same target URL. But I'll try to reproduce 
> it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in 
> distributed mode. Tasks are launched differently, not as NutchJob or 
> NutchTool.





[GitHub] [nutch] sebastian-nagel commented on pull request #726: NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

2022-05-19 Thread GitBox


sebastian-nagel commented on PR #726:
URL: https://github.com/apache/nutch/pull/726#issuecomment-1131460482

   +1 Afaics, this PR addresses only code style, code conventions, Javadoc, 
etc. but does not change anything functionally. Maybe this should be reflected 
in the commit messages as well.





[jira] [Created] (NUTCH-2949) Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

2022-05-19 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2949:
--

 Summary: Tasks of a multi-threaded map runner may fail because of 
slow creation of URL stream handlers
 Key: NUTCH-2949
 URL: https://issues.apache.org/jira/browse/NUTCH-2949
 Project: Nutch
  Issue Type: Bug
  Components: net, plugin, protocol
Affects Versions: 1.19
Reporter: Sebastian Nagel
 Fix For: 1.19


While running a custom Nutch job ([code 
here|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java]),
 many but not all tasks failed, exceeding the Hadoop task time-out 
(`mapreduce.task.timeout`) without generating any "heartbeat" (output, counter 
increments, log messages). Hadoop logs the stacks of all threads of the timed-out 
task. That's the basis for the excerpts below.

The job runs a MultithreadedMapper - most of the mapper threads (48 in total) 
are waiting for the URLStreamHandler in order to construct a java.net.URL 
object:

{noformat}
"Thread-11" #27 prio=5 os_prio=0 cpu=243.78ms elapsed=647.25s 
tid=0x7f3eb5b0f800 nid=0x8e651 waiting for monitor entry  
[0x7f3e84ef9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
        - waiting to lock <0x0006a1bc0630> (a java.lang.String)
        at 
org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
        at 
org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
        at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
        at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
        at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
        at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
        at 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
        at 
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
        at 
org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
        at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
        at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
        at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
        at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at 
org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
{noformat}
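
The blocked thread in the dump above waits for a monitor inside Extension.getExtensionInstance each time a java.net.URL is constructed. One general technique to reduce this kind of contention is to create the expensive instance once per key and serve later requests from a concurrent cache. The following is a generic sketch of that technique under stated assumptions; it is not the change made in Nutch, and the class name and the factory function are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: create an expensive instance once per key, then reuse it. */
final class InstanceCacheSketch<K, V> {

  private final Map<K, V> cache = new ConcurrentHashMap<>();
  private final Function<K, V> factory;

  /** @param factory creates the instance for a key; must not return null */
  InstanceCacheSketch(Function<K, V> factory) {
    this.factory = factory;
  }

  /**
   * Returns the cached instance for the key. The factory is invoked at most once
   * per key; later calls for the same key return the stored instance.
   */
  V get(K key) {
    return cache.computeIfAbsent(key, factory);
  }

  public static void main(String[] args) {
    // The key could be a protocol name such as "http"; the value stands in for a handler.
    InstanceCacheSketch<String, Object> handlers = new InstanceCacheSketch<>(protocol -> new Object());
    System.out.println(handlers.get("http") == handlers.get("http")); // same instance is reused
  }
}
```

Whether a handler instance may safely be shared between threads depends on the handler implementation, so this design choice would need to be checked against the actual plugin code.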

Only a single mapper thread is active:

{noformat}
"Thread-23" #39 prio=5 os_prio=0 cpu=5830.17ms elapsed=647.09s 
tid=0x7f3eb5b42800 nid=0x8e661 in Object.wait()  [0x7f3e842ec000]
   java.lang.Thread.State: RUNNABLE
at 
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.15/Native
 Method)
at 
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.15/NativeConstructorAccessorImpl.java:62)
at 
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.15/DelegatingConstructorAccessorImpl.java:45)
at 
java.lang.reflect.Constructor.newInstance(java.base@11.0.15/Constructor.java:490)
at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
- locked <0x0006a1bc0630> (a java.lang.String)
at 
org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
at 
org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
at 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
at 
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
at 
org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
at 
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at