[jira] [Commented] (NUTCH-2685) Add README.md file to all exchange plugins

2019-01-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748884#comment-16748884
 ] 

ASF GitHub Bot commented on NUTCH-2685:
---

sebastian-nagel commented on pull request #429: NUTCH-2685: README.md file for 
exchange-jexl plugin.
URL: https://github.com/apache/nutch/pull/429
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add README.md file to all exchange plugins
> --
>
> Key: NUTCH-2685
> URL: https://issues.apache.org/jira/browse/NUTCH-2685
> Project: Nutch
>  Issue Type: Sub-task
>  Components: documentation, indexer
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.16
>
>
> Adding the README.md file with plugin-specific documentation to all exchange 
> plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2685) Add README.md file to all exchange plugins

2019-01-22 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748900#comment-16748900
 ] 

Hudson commented on NUTCH-2685:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3607 (See 
[https://builds.apache.org/job/Nutch-trunk/3607/])
NUTCH-2685: README.md file for exchange-jexl plugin. (r0ann3l: 
[https://github.com/apache/nutch/commit/636f576d8bf4276562d36a70e1dafb524783e503])
* (add) src/plugin/exchange-jexl/README.md


> Add README.md file to all exchange plugins
> --
>
> Key: NUTCH-2685
> URL: https://issues.apache.org/jira/browse/NUTCH-2685
> Project: Nutch
>  Issue Type: Sub-task
>  Components: documentation, indexer
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.16
>
>
> Adding the README.md file with plugin-specific documentation to all exchange 
> plugins.





Build failed in Jenkins: Nutch-trunk #3607

2019-01-22 Thread Apache Jenkins Server
See 


Changes:

[r0ann3l] NUTCH-2685: README.md file for exchange-jexl plugin.

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on H28 (ubuntu xenial) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision 6e000e1646839c7ce7b9937ea12e20a98dae0c45 (refs/remotes/origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 6e000e1646839c7ce7b9937ea12e20a98dae0c45
Commit message: "Merge pull request #429 from r0ann3l/NUTCH-2685"
 > git rev-list --no-walk 6934d52a501b7aa50f8a9d017b2bdf61490ab99f # timeout=10
[Nutch-trunk] $ /home/jenkins/tools/ant/latest/bin/ant -file build.xml -Dtest.junit.output.format=xml clean nightly javadoc
Buildfile: 
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

clean-build:
   [delete] Deleting directory 


clean-default-lib:

clean-test-lib:

clean-lib:

clean-dist:
   [delete] Deleting directory 


clean-runtime:
   [delete] Deleting directory 


clean:

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:
[mkdir] Created dir: 
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

 [copy] Copying 1 file to 

 [copy] Copying 

 to 


clean-default-lib:

resolve-default:
[ivy:resolve] :: Apache Ivy 2.4.0 - 20141213170938 :: 
http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = 

[ivy:resolve] downloading 
http://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.8.1/commons-lang3-3.8.1.jar
 ...
[ivy:resolve] ... (490kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] 
org.apache.commons#commons-lang3;3.8.1!commons-lang3.jar (34ms)
[ivy:resolve] downloading 
http://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.6/httpclient-4.5.6.jar
 ...
[ivy:resolve]  (749kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] 
org.apache.httpcomponents#httpclient;4.5.6!httpclient.jar (37ms)
[ivy:resolve] downloading 
http://repo1.maven.org/maven2/org/apache/cxf/cxf-rt-frontend-jaxws/3.2.7/cxf-rt-frontend-jaxws-3.2.7.jar
 ...
[ivy:resolve]  (338kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] 
org.apache.cxf#cxf-rt-frontend-jaxws;3.2.7!cxf-rt-frontend-jaxws.jar(bundle) 
(29ms)
[ivy:resolve] downloading 
http://repo1.maven.org/maven2/org/apache/cxf/cxf-rt-frontend-jaxrs/3.2.7/cxf-rt-frontend-jaxrs-3.2.7.jar
 ...
[ivy:resolve] .. (665kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] 
org.apache.cxf#cxf-rt-frontend-jaxrs;3.2.7!cxf-rt-frontend-jaxrs.jar(bundle) 
(34ms)
[ivy:resolve] downloading 
http://repo1.maven.org/maven2/org/apache/cxf/cxf-rt-transports-http/3.2.7/cxf-rt-transports-http-3.2.7.jar
 ...
[ivy:resolve] . (355kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] 
org.apache.cxf#cxf-rt-transports-http;3.2.7!cxf-rt-transports-http.jar(bundle) 
(29ms)
[ivy:resolve] downloading 
http://repo1.maven.org/maven2/org/apache/cxf/cxf-rt-transports-http-jetty/3.2.7/cxf-rt-transports-http-jetty-3.2.7.jar
 ...
[ivy:resolve]  (93kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] 

[jira] [Resolved] (NUTCH-2685) Add README.md file to all exchange plugins

2019-01-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2685.

Resolution: Fixed

Merged. Thanks, [~roannel]!

> Add README.md file to all exchange plugins
> --
>
> Key: NUTCH-2685
> URL: https://issues.apache.org/jira/browse/NUTCH-2685
> Project: Nutch
>  Issue Type: Sub-task
>  Components: documentation, indexer
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.16
>
>
> Adding the README.md file with plugin-specific documentation to all exchange 
> plugins.





[jira] [Updated] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-22 Thread Yossi Tamari (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yossi Tamari updated NUTCH-2691:

Description: 
Currently the scoring-depth plugin emits a "Missing depth, removing all 
outlinks from url" log message for every page that failed parsing (and does not 
have outlinks anyway).

Will provide a patch that exits immediately when there are no outlinks.

 

  was:
Currently the scoring-depth plugin emits a "Missing depth, removing all 
outlinks from url" log message for every page that failed parsing (and does not 
have outlinks anyway).

Will provide a patch that exits immediately when there is no outlinks.

 


> Improve logging from scoring-depth plugin
> -
>
> Key: NUTCH-2691
> URL: https://issues.apache.org/jira/browse/NUTCH-2691
> Project: Nutch
>  Issue Type: Improvement
>  Components: scoring
>Affects Versions: 1.15
>Reporter: Yossi Tamari
>Priority: Minor
> Fix For: 1.16
>
>
> Currently the scoring-depth plugin emits a "Missing depth, removing all 
> outlinks from url" log message for every page that failed parsing (and does 
> not have outlinks anyway).
> Will provide a patch that exits immediately when there are no outlinks.
>  





[jira] [Commented] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-22 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748878#comment-16748878
 ] 

Sebastian Nagel commented on NUTCH-2691:


+1

> Improve logging from scoring-depth plugin
> -
>
> Key: NUTCH-2691
> URL: https://issues.apache.org/jira/browse/NUTCH-2691
> Project: Nutch
>  Issue Type: Improvement
>  Components: scoring
>Affects Versions: 1.15
>Reporter: Yossi Tamari
>Priority: Minor
> Fix For: 1.16
>
>
> Currently the scoring-depth plugin emits a "Missing depth, removing all 
> outlinks from url" log message for every page that failed parsing (and does 
> not have outlinks anyway).
> Will provide a patch that exits immediately when there is no outlinks.
>  





[jira] [Created] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-22 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2691:
---

 Summary: Improve logging from scoring-depth plugin
 Key: NUTCH-2691
 URL: https://issues.apache.org/jira/browse/NUTCH-2691
 Project: Nutch
  Issue Type: Improvement
  Components: scoring
Affects Versions: 1.15
Reporter: Yossi Tamari
 Fix For: 1.16


Currently the scoring-depth plugin emits a "Missing depth, removing all 
outlinks from url" log message for every page that failed parsing (and does not 
have outlinks anyway).

Will provide a patch that exits immediately when there is no outlinks.

 





[jira] [Commented] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748867#comment-16748867
 ] 

ASF GitHub Bot commented on NUTCH-2691:
---

YossiTamari commented on pull request #434: NUTCH-2691: Improve logging from 
scoring-depth plugin
URL: https://github.com/apache/nutch/pull/434
 
 
   Exit distributeScoreToOutlinks immediately if there are no outlinks. This is 
a very small performance improvement, but more importantly it prevents the 
plugin from emitting a "Missing depth, removing all outlinks from url" warn 
message for every page that failed parsing.
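
The early exit described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the plugin's code: the class name `DepthScoringSketch`, the simplified method signature, and the `warnings` list that stands in for the logger are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the scoring-depth filter; not the real Nutch API.
class DepthScoringSketch {
  // Collects warnings in place of a real logger so the effect is observable.
  static final List<String> warnings = new ArrayList<>();

  /**
   * Simplified distributeScoreToOutlinks. The depth argument is null when the
   * page carries no depth metadata, for example because parsing failed.
   */
  static void distributeScoreToOutlinks(String fromUrl, Integer depth,
      List<String> outlinks) {
    // Early exit: with no outlinks there is nothing to do and nothing to warn about.
    if (outlinks.isEmpty()) {
      return;
    }
    if (depth == null) {
      warnings.add("Missing depth, removing all outlinks from url " + fromUrl);
      outlinks.clear();
      return;
    }
    // Further processing of the outlinks is omitted in this sketch.
  }

  public static void main(String[] args) {
    // A page that failed parsing: no depth and no outlinks, so no warning is recorded.
    distributeScoreToOutlinks("http://example.com/broken", null, new ArrayList<>());
    System.out.println(warnings.size());
  }
}
```

With the check placed first, the warning is only recorded for pages that actually have outlinks to remove.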
   
 



> Improve logging from scoring-depth plugin
> -
>
> Key: NUTCH-2691
> URL: https://issues.apache.org/jira/browse/NUTCH-2691
> Project: Nutch
>  Issue Type: Improvement
>  Components: scoring
>Affects Versions: 1.15
>Reporter: Yossi Tamari
>Priority: Minor
> Fix For: 1.16
>
>
> Currently the scoring-depth plugin emits a "Missing depth, removing all 
> outlinks from url" log message for every page that failed parsing (and does 
> not have outlinks anyway).
> Will provide a patch that exits immediately when there is no outlinks.
>  





[jira] [Commented] (NUTCH-2598) URLNormalizerChecker fails on invalid URLs in input

2019-01-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748869#comment-16748869
 ] 

ASF GitHub Bot commented on NUTCH-2598:
---

sebastian-nagel commented on pull request #435: NUTCH-2598 URLNormalizerChecker 
fails on invalid URLs in input
URL: https://github.com/apache/nutch/pull/435
 
 
   Output empty string for invalid URLs and do not exit.
 



> URLNormalizerChecker fails on invalid URLs in input
> ---
>
> Key: NUTCH-2598
> URL: https://issues.apache.org/jira/browse/NUTCH-2598
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> I use the URLNormalizerChecker (urlnormalizer-regex and urlnormalizer-basic) 
> to normalize URLs before further processing them. If one of the used 
> normalizers throws a MalformedURLException when the 
> URLNormalizer.normalize(...) method is called, this isn't caught and causes 
> the checker to exit:
> {noformat}
> Exception in thread "main" java.net.MalformedURLException: For input string: "???120810002"
> at java.net.URL.<init>(URL.java:627)
> at java.net.URL.<init>(URL.java:490)
> at java.net.URL.<init>(URL.java:439)
> at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:100)
> at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319)
> at org.apache.nutch.net.URLNormalizerChecker.process(URLNormalizerChecker.java:75)
> at org.apache.nutch.util.AbstractChecker.processStdin(AbstractChecker.java:97)
> at org.apache.nutch.util.AbstractChecker.run(AbstractChecker.java:77)
> at org.apache.nutch.net.URLNormalizerChecker.run(URLNormalizerChecker.java:71)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:80)
> Caused by: java.lang.NumberFormatException: For input string: "???120810002"
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:580)
> at java.lang.Integer.parseInt(Integer.java:615)
> at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:222)
> at java.net.URL.<init>(URL.java:622)
> ... 10 more
> {noformat}
> The URLNormalizer interface declares MalformedURLException, so it should be 
> caught in the normalizer checker:
> - log the error
> - return/output empty string
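
The proposed handling (catch the exception, log it, output an empty string) can be sketched as below. This is only an illustration under stated assumptions: `SafeNormalizeSketch` and `normalizeOrEmpty` are hypothetical names, `java.net.URL` parsing stands in for the configured normalizers, and `System.err` stands in for the logger.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper; the real checker delegates to the configured URL normalizers.
class SafeNormalizeSketch {
  /** Stand-in normalizer: parsing with java.net.URL may throw MalformedURLException. */
  static String normalize(String url) throws MalformedURLException {
    return new URL(url).toString();
  }

  /** Returns the normalized URL, or an empty string if the input is invalid. */
  static String normalizeOrEmpty(String url) {
    try {
      return normalize(url);
    } catch (MalformedURLException e) {
      // Log the error and continue instead of letting the exception end the run.
      System.err.println("Invalid URL '" + url + "': " + e.getMessage());
      return "";
    }
  }

  public static void main(String[] args) {
    String[] input = {"http://example.com/a", "http://example.com:notaport/"};
    for (String url : input) {
      // The second URL has a non-numeric port and yields an empty result.
      System.out.println("[" + normalizeOrEmpty(url) + "]");
    }
  }
}
```

Because the invalid line still produces one (empty) output line, input and output stay aligned line by line when URLs are read from stdin.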





[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter

2019-01-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748853#comment-16748853
 ] 

ASF GitHub Bot commented on NUTCH-2690:
---

sebastian-nagel commented on pull request #433: NUTCH-2690 Configurable and 
fast URL filter
URL: https://github.com/apache/nutch/pull/433
 
 
   See 
[README](https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md)
 and [NUTCH-2690](https://issues.apache.org/jira/browse/NUTCH-2690).
 



> Configurable and fast URL filter
> 
>
> Key: NUTCH-2690
> URL: https://issues.apache.org/jira/browse/NUTCH-2690
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> This improvement introduces a new URL filter plugin "urlfilter-fast" (naming 
> debatable) which is in use at Common Crawl [since 
> 2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd]
>  to apply a long list of filters. 
> # an exact (suffix) match against the host name is done to retrieve 
> host/domain-specific regex rules
> # a regular expression is applied against the path (and query) component of 
> the URL
> What makes it faster than urlfilter-regex for common cases:
> - regexes are selected by host name or its domain suffix, so there are 
> usually fewer rules to be checked. That's similar to NUTCH-1838 but any 
> domain suffix can be matched including {{subdomain.domain.com}}, {{com}} or 
> {{.}} for global rules. The selection by host name suffix is very fast.
> - regexes are applied only to the path component (optionally including the 
> query) and not the entire URL.
>   Matching against a shorter string can make a huge difference for more 
> complex regular expressions.
> - the rule to deny everything from a host or domain gets special treatment to 
> be fast
> More details about the rule format are found in the plugin's 
> [README|https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md].





[jira] [Created] (NUTCH-2690) Configurable and fast URL filter

2019-01-22 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2690:
--

 Summary: Configurable and fast URL filter
 Key: NUTCH-2690
 URL: https://issues.apache.org/jira/browse/NUTCH-2690
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Sebastian Nagel
 Fix For: 1.16


This improvement introduces a new URL filter plugin "urlfilter-fast" (naming 
debatable) which is in use at Common Crawl [since 
2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd]
 to apply a long list of filters. 

# an exact (suffix) match against the host name is done to retrieve 
host/domain-specific regex rules
# a regular expression is applied against the path (and query) component of the URL

What makes it faster than urlfilter-regex for common cases:
- regexes are selected by host name or its domain suffix, so there are usually 
fewer rules to be checked. That's similar to NUTCH-1838 but any domain suffix 
can be matched including {{subdomain.domain.com}}, {{com}} or {{.}} for global 
rules. The selection by host name suffix is very fast.
- regexes are applied only to the path component (optionally including the 
query) and not the entire URL.
  Matching against a shorter string can make a huge difference for more complex 
regular expressions.
- the rule to deny everything from a host or domain gets special treatment to 
be fast

More details about the rule format are found in the plugin's 
[README|https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md].
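
The two-step idea (select rules by host name suffix, then match regexes against the path only) can be illustrated with a small sketch. It is not the plugin's implementation and does not use its rule file format: the class `HostSuffixFilterSketch`, the `addDenyRule` method, and the deny-only rule model are assumptions made for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: names and rule handling are assumptions, not the plugin's code.
class HostSuffixFilterSketch {
  // Deny patterns keyed by host name or domain suffix; "." holds global rules.
  private final Map<String, List<Pattern>> denyRules = new HashMap<>();

  void addDenyRule(String hostSuffix, String pathRegex) {
    denyRules.computeIfAbsent(hostSuffix, k -> new ArrayList<>())
        .add(Pattern.compile(pathRegex));
  }

  /** Returns true if no deny rule selected for the URL's host matches its path. */
  boolean accept(String url) {
    URI uri = URI.create(url); // throws IllegalArgumentException on invalid input
    String host = uri.getHost();
    if (host == null) {
      return false;
    }
    String path = uri.getRawPath();
    if (uri.getRawQuery() != null) {
      path += "?" + uri.getRawQuery();
    }
    // Step 1: exact map lookups for the host and each of its domain suffixes.
    String suffix = host.toLowerCase(Locale.ROOT);
    while (true) {
      if (matchesAny(denyRules.get(suffix), path)) {
        return false;
      }
      int dot = suffix.indexOf('.');
      if (dot < 0) {
        break;
      }
      suffix = suffix.substring(dot + 1);
    }
    // Finally check the global rules stored under ".".
    return !matchesAny(denyRules.get("."), path);
  }

  // Step 2: apply the selected regexes to the path (and query) only.
  private static boolean matchesAny(List<Pattern> rules, String path) {
    if (rules == null) {
      return false;
    }
    for (Pattern p : rules) {
      if (p.matcher(path).find()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    HostSuffixFilterSketch filter = new HostSuffixFilterSketch();
    filter.addDenyRule("example.com", "^/private/");
    filter.addDenyRule(".", "\\.jpg$");
    System.out.println(filter.accept("http://www.example.com/private/x"));
    System.out.println(filter.accept("http://www.example.com/public"));
  }
}
```

For a host such as `www.example.com` only a handful of hash lookups are needed (`www.example.com`, `example.com`, `com`, and the global `.`), and only the rules found there are run against the path.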





[jira] [Resolved] (NUTCH-2686) Separate field for mime types mapped by index-more plugin

2019-01-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2686.

Resolution: Fixed

Thanks, [~roannel]! Resolving because PR is merged. The failure of unit tests 
was caused by inconsistent libs cached in the ~/.ivy/ folder on the build 
machine.

> Separate field for mime types mapped by index-more plugin
> -
>
> Key: NUTCH-2686
> URL: https://issues.apache.org/jira/browse/NUTCH-2686
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Minor
> Fix For: 1.16
>
>
> Since [NUTCH-1262|https://issues.apache.org/jira/browse/NUTCH-1262], several 
> mime types can be mapped to a different value. By default, the behavior is to 
> replace the original value with the new one. But what if we want to keep the 
> original mime type too? This issue aims to meet this requirement.





[jira] [Commented] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748783#comment-16748783
 ] 

Markus Jelsma commented on NUTCH-2689:
--

Nice catch! It is always nice to see low-hanging fruit like this pay off!

+1

> Speed up urlfilter-regex and urlfilter-automaton
> 
>
> Key: NUTCH-2689
> URL: https://issues.apache.org/jira/browse/NUTCH-2689
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of urlfilter-regex and urlfilter-automaton include a 
> benchmark. After experimenting with and benchmarking modifications, the 
> following changes seem to significantly improve the performance:
> - do not extract host and domain name from the URL if not needed (no 
> host/domain-specific rules used, cf. NUTCH-1838)
> - use non-capturing groups if possible
> - use {{(?i)}} to make the patterns case insensitive and remove uppercase 
> variants 





[jira] [Commented] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748757#comment-16748757
 ] 

ASF GitHub Bot commented on NUTCH-2689:
---

sebastian-nagel commented on pull request #432: NUTCH-2689 Speed up 
urlfilter-regex and urlfilter-automaton
URL: https://github.com/apache/nutch/pull/432
 
 
   - do not extract host and domain name from the URL if not needed
   - speed up regular expressions:
 - use non-capturing groups if possible
 - use (?i) to make the patterns case insensitive and remove uppercase 
variants to keep alternations shorter
 



> Speed up urlfilter-regex and urlfilter-automaton
> 
>
> Key: NUTCH-2689
> URL: https://issues.apache.org/jira/browse/NUTCH-2689
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of urlfilter-regex and urlfilter-automaton include a 
> benchmark. After experimenting with and benchmarking modifications, the 
> following changes seem to significantly improve the performance:
> - do not extract host and domain name from the URL if not needed (no 
> host/domain-specific rules used, cf. NUTCH-1838)
> - use non-capturing groups if possible
> - use {{(?i)}} to make the patterns case insensitive and remove uppercase 
> variants 





[jira] [Commented] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748765#comment-16748765
 ] 

Sebastian Nagel commented on NUTCH-2689:


The benchmark times before ...
{noformat}
% grep 'bench time' build/urlfilter-regex/test/*.txt
2019-01-22 14:23:42,772 INFO  api.RegexURLFilterBaseTest ... - bench time (50) 146ms
2019-01-22 14:23:42,946 INFO  api.RegexURLFilterBaseTest ... - bench time (100) 171ms
2019-01-22 14:23:43,195 INFO  api.RegexURLFilterBaseTest ... - bench time (200) 248ms
2019-01-22 14:23:43,680 INFO  api.RegexURLFilterBaseTest ... - bench time (400) 485ms
2019-01-22 14:23:44,574 INFO  api.RegexURLFilterBaseTest ... - bench time (800) 893ms

% grep 'bench time' build/urlfilter-automaton/test/*.txt
2019-01-22 14:24:17,793 INFO  api.RegexURLFilterBaseTest ... - bench time (50) 136ms
2019-01-22 14:24:17,894 INFO  api.RegexURLFilterBaseTest ... - bench time (100) 97ms
2019-01-22 14:24:18,071 INFO  api.RegexURLFilterBaseTest ... - bench time (200) 177ms
2019-01-22 14:24:18,324 INFO  api.RegexURLFilterBaseTest ... - bench time (400) 252ms
2019-01-22 14:24:18,761 INFO  api.RegexURLFilterBaseTest ... - bench time (800) 436ms
{noformat}

... and after the improvements (see PR):
{noformat}
% grep 'bench time' build/urlfilter-regex/test/*.txt
2019-01-22 15:14:19,886 INFO  api.RegexURLFilterBaseTest ... - bench time (50) 232ms
2019-01-22 15:14:20,017 INFO  api.RegexURLFilterBaseTest ... - bench time (100) 126ms
2019-01-22 15:14:20,158 INFO  api.RegexURLFilterBaseTest ... - bench time (200) 140ms
2019-01-22 15:14:20,385 INFO  api.RegexURLFilterBaseTest ... - bench time (400) 227ms
2019-01-22 15:14:20,794 INFO  api.RegexURLFilterBaseTest ... - bench time (800) 409ms

% grep 'bench time' build/urlfilter-automaton/test/*.txt
2019-01-22 15:14:37,708 INFO  api.RegexURLFilterBaseTest ... - bench time (50) 48ms
2019-01-22 15:14:37,752 INFO  api.RegexURLFilterBaseTest ... - bench time (100) 41ms
2019-01-22 15:14:37,821 INFO  api.RegexURLFilterBaseTest ... - bench time (200) 69ms
2019-01-22 15:14:37,914 INFO  api.RegexURLFilterBaseTest ... - bench time (400) 93ms
2019-01-22 15:14:38,080 INFO  api.RegexURLFilterBaseTest ... - bench time (800) 165ms
{noformat}

There is some variation when the benchmarks are repeated, but the performance 
increase seems significant.
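
To put numbers on the comparison, the ratio of "before" to "after" time can be computed per benchmark size. The sketch below only does that arithmetic on the figures from the log lines above; the class and method names are made up for the example, and a single run per configuration is of course not a rigorous measurement.

```java
// Timings (in ms) are copied from the 'bench time' log lines in this comment.
class BenchSpeedupSketch {
  static final int[] SIZES = {50, 100, 200, 400, 800};
  static final int[] REGEX_BEFORE = {146, 171, 248, 485, 893};
  static final int[] REGEX_AFTER = {232, 126, 140, 227, 409};
  static final int[] AUTOMATON_BEFORE = {136, 97, 177, 252, 436};
  static final int[] AUTOMATON_AFTER = {48, 41, 69, 93, 165};

  /** Speed-up factor: values above 1.0 mean the "after" run was faster. */
  static double speedup(int beforeMs, int afterMs) {
    return (double) beforeMs / afterMs;
  }

  public static void main(String[] args) {
    for (int i = 0; i < SIZES.length; i++) {
      System.out.printf("(%d) regex %.2fx, automaton %.2fx%n", SIZES[i],
          speedup(REGEX_BEFORE[i], REGEX_AFTER[i]),
          speedup(AUTOMATON_BEFORE[i], AUTOMATON_AFTER[i]));
    }
  }
}
```

For the largest run (800) this gives roughly 2.2x for urlfilter-regex (893ms to 409ms) and roughly 2.6x for urlfilter-automaton (436ms to 165ms). The smallest urlfilter-regex run is the one case that got slower (146ms to 232ms).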

> Speed up urlfilter-regex and urlfilter-automaton
> 
>
> Key: NUTCH-2689
> URL: https://issues.apache.org/jira/browse/NUTCH-2689
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of urlfilter-regex and urlfilter-automaton include a 
> benchmark. After experimenting with and benchmarking modifications, the 
> following changes seem to significantly improve the performance:
> - do not extract host and domain name from the URL if not needed (no 
> host/domain-specific rules used, cf. NUTCH-1838)
> - use non-capturing groups if possible
> - use {{(?i)}} to make the patterns case insensitive and remove uppercase 
> variants 





[jira] [Assigned] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2689:
--

Assignee: Sebastian Nagel

> Speed up urlfilter-regex and urlfilter-automaton
> 
>
> Key: NUTCH-2689
> URL: https://issues.apache.org/jira/browse/NUTCH-2689
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of urlfilter-regex and urlfilter-automaton include a 
> benchmark. After experimenting with and benchmarking modifications, the 
> following changes seem to significantly improve the performance:
> - do not extract host and domain name from the URL if not needed (no 
> host/domain-specific rules used, cf. NUTCH-1838)
> - use non-capturing groups if possible
> - use {{(?i)}} to make the patterns case insensitive and remove uppercase 
> variants 





[jira] [Created] (NUTCH-2689) Speed up urlfilter-regex and urlfilter-automaton

2019-01-22 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2689:
--

 Summary: Speed up urlfilter-regex and urlfilter-automaton
 Key: NUTCH-2689
 URL: https://issues.apache.org/jira/browse/NUTCH-2689
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


The unit tests of urlfilter-regex and urlfilter-automaton include a benchmark. 
After experimenting with and benchmarking modifications, the following changes 
seem to significantly improve the performance:
- do not extract host and domain name from the URL if not needed (no 
host/domain-specific rules used, cf. NUTCH-1838)
- use non-capturing groups if possible
- use {{(?i)}} to make the patterns case insensitive and remove uppercase 
variants 
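
The two regex-level changes can be illustrated with `java.util.regex`. The patterns below are made up for the example and are not taken from Nutch's rule files; `RegexRuleSketch` is a hypothetical class name.

```java
import java.util.regex.Pattern;

// Example patterns are invented for illustration; they are not Nutch's default rules.
class RegexRuleSketch {
  // Capturing group with explicit uppercase variants in the alternation.
  static final Pattern VERBOSE = Pattern.compile("\\.(gif|GIF|jpg|JPG|png|PNG)$");
  // Non-capturing group plus (?i): a shorter alternation and no group to record.
  static final Pattern COMPACT = Pattern.compile("(?i)\\.(?:gif|jpg|png)$");

  static boolean matches(Pattern pattern, String url) {
    return pattern.matcher(url).find();
  }

  public static void main(String[] args) {
    String[] urls = {"/img/a.GIF", "/img/b.jpg", "/index.html"};
    for (String url : urls) {
      System.out.println(url + " " + matches(VERBOSE, url) + " " + matches(COMPACT, url));
    }
  }
}
```

Note that the two forms are not strictly equivalent: with `(?i)` mixed-case suffixes such as `.Gif` also match, which the explicit list of lowercase and uppercase variants does not cover.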


