[ https://issues.apache.org/jira/browse/NUTCH-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882894#comment-16882894 ]
Sebastian Nagel commented on NUTCH-2710:
----------------------------------------

Hi [~markus17],

agreed: normalize before checking for external/internal links (if {{db.ignore.external.links}} or {{db.ignore.internal.links}}, respectively, is true).

- The patch just adds one more normalization and filtering pass before the internal/external link checks. But I guess it's an incomplete patch? Doing the normalization and filtering twice seems useless.
- URL filters can be expensive (e.g. if there are many regex rules): maybe it's more efficient to normalize, then skip external/internal links (if configured), and apply the URL filters only to the remaining links?


> Normalize outlinks before checking for internal or external links
> -----------------------------------------------------------------
>
>                 Key: NUTCH-2710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2710
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2710.patch
>
>
> We have a normalizer that transforms external URLs back to internal URLs. But
> those URLs are never passed to the normalizer, because they have already been
> filtered out by internal and/or external host/domain checks in
> parseOutputFormat.filterNormalize().
> This patch proposes to move the normalizers above the checks for
> internal/external hosts/domains.
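
To make the ordering suggested in the comment concrete (normalize first, then drop external/internal links if configured, then run the URL filters only on what is left), here is a minimal sketch. It is not the Nutch implementation and not the attached patch: the class {{OutlinkOrderSketch}}, the methods {{selectOutlinks}} and {{sameHost}}, and the plain-function normalizer and filter are all hypothetical stand-ins, and the internal/external check is simplified to a host comparison.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the proposed ordering; none of these names are Nutch APIs.
class OutlinkOrderSketch {

    // Simplified internal/external check: compares hosts only (an assumption;
    // the real check may also compare domains).
    static boolean sameHost(String fromUrl, String toUrl) {
        try {
            String from = URI.create(fromUrl).getHost();
            String to = URI.create(toUrl).getHost();
            return from != null && from.equalsIgnoreCase(to);
        } catch (IllegalArgumentException e) {
            return false; // unparsable URLs are treated as external
        }
    }

    static List<String> selectOutlinks(String fromUrl, List<String> outlinks,
            UnaryOperator<String> normalizer, Predicate<String> urlFilter,
            boolean ignoreExternal, boolean ignoreInternal) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            // 1. Normalize first, so a normalizer that rewrites an external
            //    URL to an internal one gets a chance to run.
            String url = normalizer.apply(link);
            if (url == null) {
                continue;
            }
            // 2. Cheap host check on the normalized URL.
            boolean internal = sameHost(fromUrl, url);
            if (ignoreExternal && !internal) {
                continue;
            }
            if (ignoreInternal && internal) {
                continue;
            }
            // 3. Potentially expensive URL filters run only on the remaining links.
            if (urlFilter.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

With this ordering, a link that the normalizer maps back onto the source host survives the external-link check, and links that are dropped by the host check never reach the URL filters.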