Hi,
The generator time went from 8 minutes to 106 minutes a few days ago and has
stayed there since. As far as I know, I haven't made any configuration changes
recently (attached are some of the configurations I thought might be
related).
A quick CPU sampling shows that most of the time is spent in
java.util.regex.Matcher.find(). Since I'm using the default regex
configuration and my crawldb has only 3,052,412 URLs, I was wondering if this
is a known issue with nutch-1.5.1?
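In case it helps narrow things down, here is a small standalone probe I can run against the loop-breaking rule from the default regex-urlfilter.txt (the URLs are made up; this only times raw Matcher.find() calls, not Nutch's actual filter plugin):

```java
import java.util.regex.Pattern;

public class RegexProbe {
    // The repeated-segment ("loop-breaking") rule from the default filter file.
    static final Pattern LOOP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    public static void main(String[] args) {
        // Hypothetical URL with a repeated path segment: should match (and be dropped).
        String looping = "http://example.com/a/b/a/c/a/d/";

        // Hypothetical deep URL with all-distinct segments: no match,
        // so the engine must backtrack through every candidate group.
        StringBuilder deep = new StringBuilder("http://example.com");
        for (int i = 0; i < 40; i++) deep.append("/seg").append(i);

        long t0 = System.nanoTime();
        boolean m1 = LOOP.matcher(looping).find();
        boolean m2 = LOOP.matcher(deep.toString()).find();
        long t1 = System.nanoTime();

        System.out.println("looping=" + m1 + " deep=" + m2
                + " elapsed=" + (t1 - t0) / 1_000_000 + "ms");
    }
}
```

On URLs with many path segments the backreference rule forces a lot of backtracking, so if the crawldb has picked up deep or long URLs recently that could explain time piling up in Matcher.find().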
Here is some more information that might help:
===================== Generator logs
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: starting at
2012-11-09 03:14:50
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: filtering: true
2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: normalizing: true
2012-11-09 03:14:50,921 INFO crawl.Generator - Generator: topN: 3000
2012-11-09 03:14:50,923 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2012-11-09 03:23:39,741 INFO crawl.Generator - Generator: Partitioning
selected urls for politeness.
2012-11-09 03:23:40,743 INFO crawl.Generator - Generator: segment:
segments/20121109032340
2012-11-09 03:23:47,860 INFO crawl.Generator - Generator: finished at
2012-11-09 03:23:47, elapsed: 00:08:56
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: starting at
2012-11-09 05:35:14
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: filtering: true
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: normalizing: true
2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: topN: 3000
2012-11-09 05:35:14,037 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2012-11-09 07:21:42,840 INFO crawl.Generator - Generator: Partitioning
selected urls for politeness.
2012-11-09 07:21:43,841 INFO crawl.Generator - Generator: segment:
segments/20121109072143
2012-11-09 07:21:51,004 INFO crawl.Generator - Generator: finished at
2012-11-09 07:21:51, elapsed: 01:46:36
===================== CrawlDb statistics
CrawlDb statistics start: ./crawldb
Statistics for CrawlDb: ./crawldb
TOTAL urls:3052412
retry 0:3047404
retry 1:338
retry 2:1192
retry 3:822
retry 4:336
retry 5:2320
min score:0.0
avg score:0.015368268
max score:48.608
status 1 (db_unfetched):2813249
status 2 (db_fetched):196717
status 3 (db_gone):14204
status 4 (db_redir_temp):10679
status 5 (db_redir_perm):17563
CrawlDb statistics: done
===================== System info
Memory: 4 GB
CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
Available diskspace: 171.7 GB
OS: Ubuntu 12.10 (quantal) 64-bit
Thanks,
Mohammad
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
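
For reference, the first-matching-pattern semantics described in the comments above can be sketched like this (a simplified illustration with an abbreviated rule list, not Nutch's actual RegexURLFilter implementation):

```java
import java.util.regex.Pattern;

public class FilterSketch {
    // A subset of the rules above, in file order: '+' keeps the URL, '-' drops it.
    static final String[][] RULES = {
        {"-", "^(file|ftp|mailto):"},
        {"-", "\\.(gif|jpg|png|css|js)$"},      // abbreviated suffix rule
        {"-", "[?*!@=]"},
        {"-", ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/"}, // loop-breaking rule
        {"+", "."},                             // accept anything else
    };

    // The first rule whose regex is found in the URL decides the outcome;
    // if no rule matches, the URL is ignored.
    static boolean accepts(String url) {
        for (String[] rule : RULES) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0].equals("+");
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/page"));     // true
        System.out.println(accepts("http://example.com/logo.png")); // false
        System.out.println(accepts("ftp://example.com/file"));      // false
    }
}
```

Because evaluation stops at the first match, every URL that survives to the final `+.` rule has already been tested against the expensive backreference rule above it.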