Hi,

> Now it is hanging again. still no error messages.
Is it really hanging? The stack trace below may point to
a performance problem in the regex URL filter. It is
known to be slow with complex rules and/or overlong
URLs (more than 100 characters). We once observed
a similar problem with a few hundred documents,
each containing 10000 long outlinks (about 200 chars).
Outlinks are filtered (and normalized) in the parse
step, which may then take hours and indeed appear
to hang.

There is no single solution (it mainly depends on your
requirements, and whether you have control over the
crawled content):
- limit the max. number of processed outlinks,
  see property db.max.outlinks.per.page (default = 100)
- use urlfilter-automaton instead of urlfilter-regex
- try to localize the problem, i.e., find the document(s)
  containing many links, and fix/exclude them
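
To help with the last point, here is a rough sketch (not Nutch code) that
times Matcher.find(), the call seen in the stack trace, for each URL and
reports the slow ones. The class SlowUrlFinder, the method findSlowUrls, the
1 ms threshold and the sample rule are assumptions for illustration only;
substitute the rules from your own regex-urlfilter.txt and the outlinks of
the suspicious documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class SlowUrlFinder {

    // Returns the URLs for which rule.matcher(url).find() took longer
    // than thresholdNanos nanoseconds.
    static List<String> findSlowUrls(Pattern rule, List<String> urls, long thresholdNanos) {
        List<String> slow = new ArrayList<>();
        for (String url : urls) {
            long start = System.nanoTime();
            rule.matcher(url).find(); // same call as in the stack trace
            long elapsed = System.nanoTime() - start;
            if (elapsed > thresholdNanos) {
                slow.add(url);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        // Sample rule only (assumption): replace with your own filter rules.
        Pattern rule = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        // Build one artificially long URL next to a short one.
        StringBuilder longUrl = new StringBuilder("http://example.com");
        for (int i = 0; i < 40; i++) {
            longUrl.append("/seg").append(i);
        }
        List<String> urls = Arrays.asList("http://example.com/a/b", longUrl.toString());

        long thresholdNanos = 1_000_000L; // 1 ms, arbitrary choice
        for (String url : findSlowUrls(rule, urls, thresholdNanos)) {
            System.out.println(url.length() + " chars: " + url);
        }
    }
}
```

Timings of a single match are noisy, so treat the output only as a hint
about which URLs (and which rule) deserve a closer look.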

Sebastian


"pool-1-thread-1" prio=10 tid=0x0000000001113800 nid=0x555d runnable [0x00007fcbef6f5000]
   java.lang.Thread.State: RUNNABLE
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:4158)
        at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
        at java.util.regex.Pattern$Start.match(Pattern.java:3408)
        at java.util.regex.Matcher.search(Matcher.java:1199)
        at java.util.regex.Matcher.find(Matcher.java:592)
        at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:100)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:129)
        at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88)
        at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:257)


On 09/14/2013 11:23 AM, Law-Firms-In.com wrote:
> update:
> 
> I let the nutch parsing run over the night and I can see it did progress
> from document/domain starting with letter "b" to a document with
> starting letter "i". This means tens of thousand more documents have
> been parsed.
> 
> Now it is hanging again. still no error messages.
> 
> 
