Currently, I have roughly 10M records in my crawldb.  I added some regexes
to remove some URLs from my crawldb.  Nothing complicated.  However, when I
ran with filtering turned on, the updatedb job took 118 hours.

Looking in the regex-urlfilter.txt file, I noticed some of the other
regexes are pretty broad.  So I commented them out and the updatedb job
took 6 minutes.

-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

These two regexes are what cause URL filtering to be so slow.
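
To make the two rules concrete, here is a minimal sketch of what each one excludes and a rough way to time them. It uses Python's re module rather than the Java regex engine Nutch runs on, and it assumes the rules are applied with "find anywhere in the URL" semantics. The sample URLs and the helper names (is_excluded, seconds_per_pass) are made up for illustration, so treat any timings as indicative only.

```python
import re
import time

# The two rules from regex-urlfilter.txt, minus the leading "-" that marks
# an exclude rule.
QUERY_CHARS = re.compile(r"[?*!@=]")
REPEATED_SEGMENT = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

def is_excluded(url):
    """Return True if either rule matches somewhere in the URL (assumed semantics)."""
    return bool(QUERY_CHARS.search(url) or REPEATED_SEGMENT.search(url))

def seconds_per_pass(pattern, urls, passes=1000):
    """Rough wall-clock cost of running one pattern over the URLs `passes` times."""
    start = time.perf_counter()
    for _ in range(passes):
        for url in urls:
            pattern.search(url)
    return time.perf_counter() - start

# Hypothetical sample URLs, chosen for illustration only.
samples = [
    "http://example.com/index.html",      # plain URL, kept
    "http://example.com/search?q=nutch",  # contains "?" and "=", excluded
    "http://example.com/a/b/a/c/a/",      # "/a" repeats three times, excluded
]
results = {url: is_excluded(url) for url in samples}

if __name__ == "__main__":
    for url, excluded in results.items():
        print(("exclude" if excluded else "keep   "), url)
    print("query chars rule:     ", seconds_per_pass(QUERY_CHARS, samples))
    print("repeated segment rule:", seconds_per_pass(REPEATED_SEGMENT, samples))
```

The second rule combines a leading .* with a backreference, so the engine may retry the match from many positions in each URL; running seconds_per_pass over a sample of real crawldb URLs is one way to check how much each rule costs before enabling it.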
