Hi Folks,

I found out where the issue was. Just thought it might be useful for others. 

The performance issue I was facing in parse was caused by the regular
expression URL filter (the "regex-urlfilter" plugin) combined with some
unusual URLs. One of the regular expressions was taking a very long time
to process certain URLs.

Removing the rule "-.*(/[^/]+)/[^/]+\1/[^/]+\1/" from regex-urlfilter.txt
in the conf directory saved a great deal of time on parsing.
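
For anyone who wants to see what that rule does, here is a rough Python
sketch. It is not Nutch code: Nutch applies the rule through Java's regex
engine, so absolute timings will differ. The helper names (is_rejected,
time_rule) and the example.com URLs are made up for illustration. The
leading "-" in regex-urlfilter.txt only means "exclude", so it is not part
of the pattern itself.

```python
import re
import time

# The pattern from the rule, without the leading "-" (which means "exclude").
# It matches URLs in which the same path segment shows up three times with
# one other segment in between each time, e.g. /a/b/a/c/a/.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_rejected(url: str) -> bool:
    """Return True if the rule would exclude this URL (hypothetical helper)."""
    return LOOP_RULE.search(url) is not None


def time_rule(url: str) -> float:
    """Return the seconds spent applying the rule to one URL (hypothetical helper)."""
    start = time.perf_counter()
    LOOP_RULE.search(url)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(is_rejected("http://example.com/a/b/a/c/a/"))  # repeated segment
    print(is_rejected("http://example.com/a/b/c/"))      # no repetition

    # Time the rule on progressively longer URLs that never match, so the
    # engine has to exhaust its backtracking options before giving up.
    for n in (500, 1000, 2000):
        url = "http://example.com/page?" + "x" * n
        print(n, time_rule(url))
```

The likely reason for the slowdown is that the leading ".*" and the
"[^/]+" groups give a backtracking engine many ways to split a long URL
before it can conclude there is no match, so the cost can grow much faster
than the URL length. Running time_rule on a few of the slow URLs from your
own crawl should show whether this rule is the culprit for you.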

The following thread and JIRA issue discuss a similar matter:
http://lucene.472066.n3.nabble.com/Reduce-Error-during-fetch-td609736.html
https://issues.apache.org/jira/browse/NUTCH-233

Cheers,

Ye



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Parse-benchmark-performance-tp4045827p4048185.html
Sent from the Nutch - User mailing list archive at Nabble.com.
