Hi,

What makes it especially slow is when there are very long URLs. The default
behaviour does not limit URL length. See:
https://issues.apache.org/jira/browse/NUTCH-1314
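
If you can't pick that up, the cheap workaround is to reject over-long URLs
before any regex rule runs. Below is a minimal sketch of that idea in plain
Java; the 512-character threshold and the class/method names are made up here
for illustration, not taken from NUTCH-1314:

public class MaxLengthUrlGuard {

    // Illustrative threshold; tune it to the longest URLs you want to keep.
    private static final int MAX_URL_LENGTH = 512;

    // Returns the URL unchanged if it is short enough, or null to drop it
    // before the (much more expensive) regex filters ever see it.
    public static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        return url;
    }
}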

Another option is to NOT normalize/filter every time. We are running
nutchgora with a modification that only normalizes/filters when new URLs
are inserted (injected URLs and parsed outlink URLs, that is). However, this
optimization might not always be possible. (It also makes things more
complicated when the rules change, etc.)
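
To make that concrete, the change boils down to something like the sketch
below. The names here are invented for illustration; the real modification
lives inside our nutchgora inject/update code:

import java.util.function.UnaryOperator;

public class LazyFiltering {

    // Apply the normalize/filter chain only to URLs that are new to the
    // crawldb (injected URLs and freshly parsed outlinks). URLs already in
    // the db passed the rules when they were inserted, so skip the work.
    public static String maybeFilter(String url, boolean isNewUrl,
                                     UnaryOperator<String> normalizeAndFilter) {
        if (!isNewUrl) {
            return url;
        }
        // May return null, meaning "drop this URL".
        return normalizeAndFilter.apply(url);
    }
}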

On Wed, Jun 27, 2012 at 10:25 PM, Bai Shen <[email protected]> wrote:

> Currently, I have roughly 10M records in my crawldb.  I added some regexes
> to remove some URLs from my crawldb.  Nothing complicated.  However, when I
> run with filtering turned on, the updatedb job took 118 hours.
>
> Looking in the regex-urlfilter.txt file, I noticed some of the other
> regexes are pretty broad.  So I commented them out and the updatedb job
> took 6 minutes.
>
> -[?*!@=]
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> These two regexes are what cause URL filtering to be so slow.
>
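
For what it's worth, the second of those rules is the one whose cost grows
fastest with URL length, presumably because the \1 backreferences make the
engine backtrack over many candidate path segments. Here is a tiny standalone
probe (the test URLs are synthetic, invented just for this) that shows how a
single evaluation scales with the number of segments:

import java.util.regex.Pattern;

public class RegexCostProbe {
    public static void main(String[] args) {
        // The quoted exclude rule, minus the leading '-' that only marks it
        // as a "deny" line in regex-urlfilter.txt.
        Pattern rule = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        for (int segments = 50; segments <= 800; segments *= 2) {
            StringBuilder url = new StringBuilder("http://example.com");
            for (int i = 0; i < segments; i++) {
                url.append("/seg").append(i);
            }
            long start = System.nanoTime();
            boolean denied = rule.matcher(url).find();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(segments + " segments: denied=" + denied
                + " in " + micros + " us");
        }
    }
}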
