Hi,
The generator time went from 8 minutes to 106 minutes a few days ago and has
stayed there since. As far as I know, I haven't made any configuration changes
recently (attached are some of the configurations I thought might be
related).
A quick CPU sampling shows that most of the time is spent in
java.util.regex.Matcher.find(). Since I'm using the default regex
configuration and my crawldb has only 3,052,412 URLs, I was wondering if this
is a known issue with nutch-1.5.1?
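In case it helps narrow things down, here is a small standalone probe I can run against the loop-breaking rule from the default regex-urlfilter.txt (the URLs are made up; this only times raw Matcher.find() calls, not Nutch's actual filter plugin):

```java
import java.util.regex.Pattern;

public class RegexProbe {
    // The repeated-segment ("loop-breaking") rule from the default filter file.
    static final Pattern LOOP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    public static void main(String[] args) {
        // Hypothetical URL with a repeated path segment: should match (and be dropped).
        String looping = "http://example.com/a/b/a/c/a/d/";

        // Hypothetical deep URL with all-distinct segments: no match,
        // so the engine must backtrack through every candidate group.
        StringBuilder deep = new StringBuilder("http://example.com");
        for (int i = 0; i < 40; i++) deep.append("/seg").append(i);

        long t0 = System.nanoTime();
        boolean m1 = LOOP.matcher(looping).find();
        boolean m2 = LOOP.matcher(deep.toString()).find();
        long t1 = System.nanoTime();

        System.out.println("looping=" + m1 + " deep=" + m2
                + " elapsed=" + (t1 - t0) / 1_000_000 + "ms");
    }
}
```

On URLs with many path segments the backreference rule forces a lot of backtracking, so if the crawldb has picked up deep or long URLs recently that could explain time piling up in Matcher.find().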
Here is some more information that might help:
===================== Generator logs
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: starting at
2012-11-09 03:14:50
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: filtering: true
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: normalizing: true
2012-11-09 03:14:50,921 INFO crawl.Generator - Generator: topN: 3000
2012-11-09 03:14:50,923 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2012-11-09 03:23:39,741 INFO crawl.Generator - Generator: Partitioning
selected urls for politeness.
2012-11-09 03:23:40,743 INFO crawl.Generator - Generator: segment:
segments/20121109032340
2012-11-09 03:23:47,860 INFO crawl.Generator - Generator: finished at
2012-11-09 03:23:47, elapsed: 00:08:56
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: starting at
2012-11-09 05:35:14
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: filtering: true
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: normalizing: true
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: topN: 3000
2012-11-09 05:35:14,037 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2012-11-09 07:21:42,840 INFO crawl.Generator - Generator: Partitioning
selected urls for politeness.
2012-11-09 07:21:43,841 INFO crawl.Generator - Generator: segment:
segments/20121109072143
2012-11-09 07:21:51,004 INFO crawl.Generator - Generator: finished at
2012-11-09 07:21:51, elapsed: 01:46:36
===================== CrawlDb statistics
CrawlDb statistics start: ./crawldb
Statistics for CrawlDb: ./crawldb
TOTAL urls:3052412
retry 0:3047404
retry 1:338
retry 2:1192
retry 3:822
retry 4:336
retry 5:2320
min score:0.0
avg score:0.015368268
max score:48.608
status 1 (db_unfetched):2813249
status 2 (db_fetched):196717
status 3 (db_gone):14204
status 4 (db_redir_temp):10679
status 5 (db_redir_perm):17563
CrawlDb statistics: done
===================== System info
Memory: 4 GB
CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
Available diskspace: 171.7 GB
OS: Ubuntu 12.10 (quantal) 64-bit
Thanks,
Mohammad
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
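
For reference, the first-matching-pattern semantics described in the comments above can be sketched like this (a simplified illustration with an abbreviated rule list, not Nutch's actual RegexURLFilter implementation):

```java
import java.util.regex.Pattern;

public class FilterSketch {
    // A subset of the rules above, in file order: '+' keeps the URL, '-' drops it.
    static final String[][] RULES = {
        {"-", "^(file|ftp|mailto):"},
        {"-", "\\.(gif|jpg|png|css|js)$"},      // abbreviated suffix rule
        {"-", "[?*!@=]"},
        {"-", ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/"}, // loop-breaking rule
        {"+", "."},                             // accept anything else
    };

    // The first rule whose regex is found in the URL decides the outcome;
    // if no rule matches, the URL is ignored.
    static boolean accepts(String url) {
        for (String[] rule : RULES) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0].equals("+");
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/page"));     // true
        System.out.println(accepts("http://example.com/logo.png")); // false
        System.out.println(accepts("ftp://example.com/file"));      // false
    }
}
```

Because evaluation stops at the first match, every URL that survives to the final `+.` rule has already been tested against the expensive backreference rule above it.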