[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter

2019-05-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833943#comment-16833943
 ] 

Hudson commented on NUTCH-2690:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3623 (See 
[https://builds.apache.org/job/Nutch-trunk/3623/])
NUTCH-2690 Configurable and fast URL filter - performs fast exact (sebastian: 
[https://github.com/apache/nutch/commit/1fc98bf061aedb98be4453865201ce6d9f1dede6])
* (edit) src/plugin/urlfilter-regex/sample/Benchmarks.urls
* (add) src/plugin/urlfilter-fast/README.md
* (add) src/plugin/urlfilter-fast/src/test/org/apache/nutch/urlfilter/fast/TestFastURLFilter.java
* (add) src/plugin/urlfilter-fast/ivy.xml
* (edit) src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
* (add) src/plugin/urlfilter-fast/sample/Benchmarks.urls
* (edit) src/plugin/build.xml
* (edit) build.xml
* (edit) default.properties
* (add) src/plugin/urlfilter-fast/sample/fast-urlfilter-test.txt
* (edit) src/plugin/urlfilter-automaton/sample/Benchmarks.urls
* (add) src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/package-info.java
* (edit) conf/nutch-default.xml
* (add) src/plugin/urlfilter-fast/plugin.xml
* (add) conf/fast-urlfilter.txt.template
* (add) src/plugin/urlfilter-fast/sample/test.urls
* (add) src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
* (add) src/plugin/urlfilter-fast/build.xml
* (add) src/plugin/urlfilter-fast/sample/fast-urlfilter-benchmark.txt


> Configurable and fast URL filter
> 
>
> Key: NUTCH-2690
> URL: https://issues.apache.org/jira/browse/NUTCH-2690
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> This improvement introduces a new URL filter plugin "urlfilter-fast" (naming 
> debatable) which has been in use at Common Crawl [since 
> 2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd]
>  to apply a long list of filters. It works in two steps:
> # an exact (suffix) match against the host name is done to retrieve the 
> host/domain-specific regex rules
> # the selected regular expressions are applied to the path (and optionally 
> query) component of the URL
> What makes it faster than urlfilter-regex for common cases:
> - regexes are selected by host name or its domain suffix, so there are 
> usually fewer rules to be checked. That's similar to NUTCH-1838 but any 
> domain suffix can be matched, including {{subdomain.domain.com}}, {{com}} or 
> {{.}} for global rules. The selection by host name suffix is itself fast.
> - regexes are applied only to the path component (optionally including the 
> query) and not the entire URL.
>   Matching against a shorter string can make a huge difference for more 
> complex regular expressions.
> - the rule to deny everything from a host or domain gets special treatment to 
> be fast
> More details about the rule format can be found in the plugin's 
> [README|https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md].
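
The two-step lookup described above can be sketched roughly as follows. This is a minimal illustration only, not the plugin's FastURLFilter code: the class name, the method names and the in-memory rule map are hypothetical, and applying each regex with `find()` is an assumption. Rules are keyed by host name or domain suffix, with `.` standing for global rules as in the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative sketch only; NOT the plugin's FastURLFilter implementation. */
class HostSuffixRuleSketch {

  // Hypothetical rule store: host name or domain suffix ("." = global)
  // mapped to deny patterns that are applied to the path (and query).
  private final Map<String, List<Pattern>> denyRules = new HashMap<>();

  void addDenyRule(String hostOrSuffix, String pathRegex) {
    denyRules.computeIfAbsent(hostOrSuffix, k -> new ArrayList<>())
        .add(Pattern.compile(pathRegex));
  }

  /** Returns true if the URL passes, false if a deny rule matches. */
  boolean accept(String host, String pathAndQuery) {
    // Step 1: exact lookups for the host and each of its domain suffixes,
    // e.g. www.example.com -> example.com -> com.
    String suffix = host;
    while (true) {
      // Step 2: only the rules selected for this suffix are evaluated,
      // and only against the path (and query), not the entire URL.
      if (matchesAny(denyRules.get(suffix), pathAndQuery)) {
        return false;
      }
      int dot = suffix.indexOf('.');
      if (dot < 0) {
        break;
      }
      suffix = suffix.substring(dot + 1);
    }
    // Finally the global rules registered under ".".
    return !matchesAny(denyRules.get("."), pathAndQuery);
  }

  private static boolean matchesAny(List<Pattern> rules, String input) {
    if (rules == null) {
      return false;
    }
    for (Pattern p : rules) {
      if (p.matcher(input).find()) {
        return true;
      }
    }
    return false;
  }
}
```

Each lookup is a plain hash map access, so the number of regexes evaluated depends on the rules registered for the matching suffixes rather than on the total size of the rule list.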



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter

2019-05-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833919#comment-16833919
 ] 

ASF GitHub Bot commented on NUTCH-2690:
---

sebastian-nagel commented on pull request #433: NUTCH-2690 Configurable and 
fast URL filter
URL: https://github.com/apache/nutch/pull/433
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter

2019-04-11 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815355#comment-16815355
 ] 

Sebastian Nagel commented on NUTCH-2690:


PR updated, squashed and rebased to current master.
I'll commit next week, but reviews are welcome. Thanks!

Below are the benchmark results from the unit tests. While the new plugin 
outperforms urlfilter-regex on the largest run (241ms vs. 471ms at 800), the 
plugin urlfilter-automaton is still faster. 
However, the regular expressions supported by 
[dk.brics.automaton](https://www.brics.dk/automaton/) are less expressive, e.g. 
the "skip URLs with slash-delimited segment that repeats 3+ times" rule cannot 
be expressed because there are no back-references.
{noformat}
% ant test
...

% grep 'bench time' build/urlfilter-regex/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:46:18,776 INFO  ... - bench time (50) 107ms
2019-04-11 13:46:18,845 INFO  ... - bench time (100) 66ms
2019-04-11 13:46:18,961 INFO  ... - bench time (200) 116ms
2019-04-11 13:46:19,192 INFO  ... - bench time (400) 231ms
2019-04-11 13:46:19,663 INFO  ... - bench time (800) 471ms

% grep 'bench time' build/urlfilter-fast/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:48:05,024 INFO  ... - bench time (50) 72ms
2019-04-11 13:48:05,112 INFO  ... - bench time (100) 84ms
2019-04-11 13:48:05,233 INFO  ... - bench time (200) 121ms
2019-04-11 13:48:05,446 INFO  ... - bench time (400) 213ms
2019-04-11 13:48:05,687 INFO  ... - bench time (800) 241ms

% grep 'bench time' build/urlfilter-automaton/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:43:11,794 INFO  ... - bench time (50) 43ms
2019-04-11 13:43:11,834 INFO  ... - bench time (100) 37ms
2019-04-11 13:43:11,899 INFO  ... - bench time (200) 65ms
2019-04-11 13:43:11,996 INFO  ... - bench time (400) 97ms
2019-04-11 13:43:12,175 INFO  ... - bench time (800) 178ms
{noformat}
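
To illustrate the back-reference point, here is a minimal sketch of a "segment repeats 3+ times" check using `java.util.regex`. The pattern is a simplified illustration that only detects three consecutive identical segments; it is not the rule shipped in any Nutch configuration file, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; the pattern is not taken from a Nutch rule file. */
class BackReferenceSketch {

  // (/[^/]+) captures one slash-delimited segment, \1\1 requires the same
  // text twice more, and the lookahead makes sure the last repetition ends
  // at a segment boundary (next slash or end of input).
  private static final Pattern TRIPLE_REPEAT =
      Pattern.compile("(/[^/]+)\\1\\1(?=/|$)");

  static boolean hasTripleRepeat(String path) {
    return TRIPLE_REPEAT.matcher(path).find();
  }

  public static void main(String[] args) {
    System.out.println(hasTripleRepeat("/a/b/b/b/c"));
    System.out.println(hasTripleRepeat("/a/b/c"));
  }
}
```

The `\1` terms refer back to whatever text the first group captured, which is the back-reference feature the comment above says dk.brics.automaton does not offer.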




[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter

2019-01-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748853#comment-16748853
 ] 

ASF GitHub Bot commented on NUTCH-2690:
---

sebastian-nagel commented on pull request #433: NUTCH-2690 Configurable and 
fast URL filter
URL: https://github.com/apache/nutch/pull/433
 
 
   See 
[README](https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md)
 and [NUTCH-2690](https://issues.apache.org/jira/browse/NUTCH-2690).
 
