[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter
[ https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833943#comment-16833943 ]

Hudson commented on NUTCH-2690:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3623 (See [https://builds.apache.org/job/Nutch-trunk/3623/])
NUTCH-2690 Configurable and fast URL filter - performs fast exact (sebastian: [https://github.com/apache/nutch/commit/1fc98bf061aedb98be4453865201ce6d9f1dede6])
* (edit) src/plugin/urlfilter-regex/sample/Benchmarks.urls
* (add) src/plugin/urlfilter-fast/README.md
* (add) src/plugin/urlfilter-fast/src/test/org/apache/nutch/urlfilter/fast/TestFastURLFilter.java
* (add) src/plugin/urlfilter-fast/ivy.xml
* (edit) src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
* (add) src/plugin/urlfilter-fast/sample/Benchmarks.urls
* (edit) src/plugin/build.xml
* (edit) build.xml
* (edit) default.properties
* (add) src/plugin/urlfilter-fast/sample/fast-urlfilter-test.txt
* (edit) src/plugin/urlfilter-automaton/sample/Benchmarks.urls
* (add) src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/package-info.java
* (edit) conf/nutch-default.xml
* (add) src/plugin/urlfilter-fast/plugin.xml
* (add) conf/fast-urlfilter.txt.template
* (add) src/plugin/urlfilter-fast/sample/test.urls
* (add) src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
* (add) src/plugin/urlfilter-fast/build.xml
* (add) src/plugin/urlfilter-fast/sample/fast-urlfilter-benchmark.txt

> Configurable and fast URL filter
>
> Key: NUTCH-2690
> URL: https://issues.apache.org/jira/browse/NUTCH-2690
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
> This improvement introduces a new URL filter plugin "urlfilter-fast" (naming debatable) which has been in use at Common Crawl [since 2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd] to apply a long list of filters. The filter works in two steps:
> # an exact (suffix) match against the host name is done to retrieve host/domain-specific regex rules
> # a regular expression is applied against the path (and query) component of the URL
> What makes it faster than urlfilter-regex for common cases:
> - regexes are selected by host name or its domain suffix, so there are usually fewer rules to be checked. That's similar to NUTCH-1838 but any domain suffix can be matched including {{subdomain.domain.com}}, {{com}} or {{.}} for global rules. The selection by host name suffix is very fast.
> - regexes are applied only to the path component (optionally including the query) and not the entire URL. Matching against a shorter string can make a huge difference for more complex regular expressions.
> - the rule to deny everything from a host or domain gets special treatment to be fast
> More details about the rule format are found in the plugin's [README|https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md].

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
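The two-step lookup described in the issue (select rules by host name suffix, then match a regex against the path and optionally the query) can be sketched roughly as below. This is only an illustration in Python, not the plugin's Java code: the rule table `RULES`, the `DENY_ALL` marker and the functions `host_suffixes` and `accept` are hypothetical names, and the real rule file format is the one documented in the plugin's README.

```python
import re
from urllib.parse import urlsplit

# Hypothetical marker for "deny everything from this host or domain".
DENY_ALL = object()

# Hypothetical rule table (an assumption, not the plugin's format):
# host or domain suffix -> DENY_ALL or a list of deny regexes for the path.
# The key "." holds global rules that apply to every host.
RULES = {
    "bad.example.com": DENY_ALL,
    "example.com": [re.compile(r"^/private/")],
    "com": [re.compile(r"\.exe$")],
    ".": [re.compile(r"[?&]sessionid=")],
}


def host_suffixes(host):
    """Yield the full host, then each shorter domain suffix, then "." (global)."""
    labels = host.lower().split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])
    yield "."


def accept(url, rules=RULES, include_query=True):
    """Return False if any rule selected by a host suffix rejects the URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Only the path (optionally with the query) is matched, never the full URL.
    target = parts.path or "/"
    if include_query and parts.query:
        target += "?" + parts.query
    for suffix in host_suffixes(host):
        entry = rules.get(suffix)  # exact dictionary lookup per suffix
        if entry is None:
            continue
        if entry is DENY_ALL:
            return False  # shortcut: no regex is evaluated at all
        if any(rx.search(target) for rx in entry):
            return False
    return True
```

In this sketch a URL on `www.example.com` is checked against at most the rules stored under `www.example.com`, `example.com`, `com` and `.`, each found by one exact lookup, which is why only few regexes usually need to run.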
[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter
[ https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833919#comment-16833919 ]

ASF GitHub Bot commented on NUTCH-2690:
---

sebastian-nagel commented on pull request #433: NUTCH-2690 Configurable and fast URL filter
URL: https://github.com/apache/nutch/pull/433

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter
[ https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815355#comment-16815355 ]

Sebastian Nagel commented on NUTCH-2690:

PR updated, squashed and rebased to current master. I'll commit next week, but reviews are welcome. Thanks!

Below are the benchmark results from the unit tests. While the new plugin outperforms urlfilter-regex, the plugin urlfilter-automaton is still faster. However, the regular expressions supported by [dk.brics.automaton](https://www.brics.dk/automaton/) are less expressive, e.g. the "skip URLs with slash-delimited segment that repeats 3+ times" rule cannot be expressed because there are no back-references.

{noformat}
% ant test
...
% grep 'bench time' build/urlfilter-regex/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:46:18,776 INFO ... - bench time (50) 107ms
2019-04-11 13:46:18,845 INFO ... - bench time (100) 66ms
2019-04-11 13:46:18,961 INFO ... - bench time (200) 116ms
2019-04-11 13:46:19,192 INFO ... - bench time (400) 231ms
2019-04-11 13:46:19,663 INFO ... - bench time (800) 471ms
% grep 'bench time' build/urlfilter-fast/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:48:05,024 INFO ... - bench time (50) 72ms
2019-04-11 13:48:05,112 INFO ... - bench time (100) 84ms
2019-04-11 13:48:05,233 INFO ... - bench time (200) 121ms
2019-04-11 13:48:05,446 INFO ... - bench time (400) 213ms
2019-04-11 13:48:05,687 INFO ... - bench time (800) 241ms
% grep 'bench time' build/urlfilter-automaton/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:43:11,794 INFO ... - bench time (50) 43ms
2019-04-11 13:43:11,834 INFO ... - bench time (100) 37ms
2019-04-11 13:43:11,899 INFO ... - bench time (200) 65ms
2019-04-11 13:43:11,996 INFO ... - bench time (400) 97ms
2019-04-11 13:43:12,175 INFO ... - bench time (800) 178ms
{noformat}
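To illustrate why the "segment that repeats 3+ times" rule needs back-references, here is a minimal sketch. The pattern is an assumption modeled on the rule as described in the comment, and Python's `re` module stands in for the Java regex engine; the name `has_repeated_segment` is hypothetical.

```python
import re

# Group 1 captures one slash-delimited segment; each \1 requires exactly
# that same segment again, with one arbitrary segment in between.
# An automaton-based engine without back-references cannot refer back
# to whatever text group 1 happened to capture.
REPEATED_SEGMENT = re.compile(r"(/[^/]+)/[^/]+\1/[^/]+\1/")


def has_repeated_segment(path):
    """True if some segment occurs three times, as in crawler-trap loops."""
    return REPEATED_SEGMENT.search(path) is not None
```

For example, `/a/b/a/c/a/` is flagged because the segment `/a` occurs three times, while `/a/b/c/d/e/` is not.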
[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter
[ https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748853#comment-16748853 ]

ASF GitHub Bot commented on NUTCH-2690:
---

sebastian-nagel commented on pull request #433: NUTCH-2690 Configurable and fast URL filter
URL: https://github.com/apache/nutch/pull/433

See [README](https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md) and [NUTCH-2690](https://issues.apache.org/jira/browse/NUTCH-2690).